Overview
This guide explains advanced configuration for the Fess crawler. For basic crawler configuration, refer to Basic Crawler Configuration.
Warning
The settings on this page can affect the entire system. Thoroughly test any changes before applying them to production environments.
General Settings
Configuration File Locations
Detailed crawler settings are configured in the following files:
Main configuration: /etc/fess/fess_config.properties (or app/WEB-INF/classes/fess_config.properties)
Content length configuration: app/WEB-INF/classes/crawler/contentlength.xml
Component configuration: app/WEB-INF/classes/crawler/container.xml
Default Script
Configure the default script language for the crawler.
| Property | Description | Default |
|---|---|---|
| crawler.default.script | Crawler script language | groovy |
HTTP Thread Pool
HTTP crawler thread pool settings.
| Property | Description | Default |
|---|---|---|
| crawler.http.thread_pool.size | HTTP thread pool size | 0 |
Document Processing Settings
Basic Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.max.site.length | Maximum length of the document site field | 100 |
| crawler.document.site.encoding | Document site encoding | UTF-8 |
| crawler.document.unknown.hostname | Alternative value for unknown hostnames | unknown |
| crawler.document.use.site.encoding.on.english | Use site encoding for English documents | false |
| crawler.document.append.data | Append data to document | true |
| crawler.document.append.filename | Append filename to document | false |
Configuration Example
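As an illustration, these properties can be set in fess_config.properties; the values below are the defaults from the table above:

```properties
crawler.document.max.site.length=100
crawler.document.site.encoding=UTF-8
crawler.document.unknown.hostname=unknown
crawler.document.use.site.encoding.on.english=false
crawler.document.append.data=true
crawler.document.append.filename=false
```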
Word Processing Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.max.alphanum.term.size | Maximum alphanumeric word length | 20 |
| crawler.document.max.symbol.term.size | Maximum symbol word length | 10 |
| crawler.document.duplicate.term.removed | Remove duplicate words | false |
Configuration Example
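For example, to keep longer alphanumeric tokens intact and deduplicate repeated words (the non-default values here are illustrative):

```properties
crawler.document.max.alphanum.term.size=50
crawler.document.max.symbol.term.size=10
crawler.document.duplicate.term.removed=true
```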
Note
Increasing max.alphanum.term.size allows indexing long IDs, tokens, URLs, etc. in their complete form, but increases index size.
Character Processing Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.space.chars | Whitespace character definition | \u0009\u000A... |
| crawler.document.fullstop.chars | Period character definition | \u002e\u06d4... |
Configuration Example
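The shipped defaults list many Unicode code points; the snippet below shows only an illustrative subset, not the complete default values:

```properties
# Illustrative subset of whitespace code points
crawler.document.space.chars=\u0009\u000A\u000B\u000C\u000D\u0020
# Illustrative subset of full-stop code points
crawler.document.fullstop.chars=\u002E\u06D4\u3002
```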
Protocol Settings
Supported Protocols
| Property | Description | Default |
|---|---|---|
| crawler.web.protocols | Web crawl protocols | http,https |
| crawler.file.protocols | File crawl protocols | file,smb,smb1,ftp,storage,s3,gcs |
Configuration Example
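For instance, a hypothetical restriction that limits file crawling to local files and SMB shares:

```properties
crawler.web.protocols=http,https
crawler.file.protocols=file,smb
```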
Environment Variable Parameters
| Property | Description | Default |
|---|---|---|
| crawler.data.env.param.key.pattern | Environment variable parameter key pattern | ^FESS_ENV_.* |
robots.txt Settings
| Property | Description | Default |
|---|---|---|
| crawler.ignore.robots.txt | Ignore robots.txt | false |
| crawler.ignore.robots.tags | Robots tags to ignore | (empty) |
| crawler.ignore.content.exception | Ignore content exceptions | true |
Warning
Setting crawler.ignore.robots.txt=true may violate site terms of service. Exercise caution when crawling external sites.
Error Handling Settings
| Property | Description | Default |
|---|---|---|
| crawler.failure.url.status.codes | HTTP status codes considered failures | 404 |
System Monitoring Settings
| Property | Description | Default |
|---|---|---|
| crawler.system.monitor.interval | System monitoring interval (seconds) | 60 |
Hot Thread Settings
| Property | Description | Default |
|---|---|---|
| crawler.hotthread.ignore_idle_threads | Ignore idle threads | true |
| crawler.hotthread.interval | Snapshot interval | 500ms |
| crawler.hotthread.snapshots | Number of snapshots | 10 |
| crawler.hotthread.threads | Number of threads to monitor | 3 |
| crawler.hotthread.timeout | Timeout | 30s |
| crawler.hotthread.type | Monitoring type | cpu |
Configuration Example
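A sketch using the default values from the table above:

```properties
crawler.hotthread.ignore_idle_threads=true
crawler.hotthread.interval=500ms
crawler.hotthread.snapshots=10
crawler.hotthread.threads=3
crawler.hotthread.timeout=30s
crawler.hotthread.type=cpu
```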
Metadata Settings
| Property | Description | Default |
|---|---|---|
| crawler.metadata.content.excludes | Metadata to exclude | resourceName,X-Parsed-By... |
| crawler.metadata.name.mapping | Metadata name mapping | title=title:string... |
HTML Crawler Settings
XPath Settings
XPath settings for extracting HTML elements.
| Property | Description | Default |
|---|---|---|
| crawler.document.html.content.xpath | Content XPath | //BODY |
| crawler.document.html.lang.xpath | Language XPath | //HTML/@lang |
| crawler.document.html.digest.xpath | Digest XPath | //META[@name='description']/@content |
| crawler.document.html.canonical.xpath | Canonical URL XPath | //LINK[@rel='canonical'][1]/@href |
Configuration Example
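The defaults from the table, as they would appear in fess_config.properties:

```properties
crawler.document.html.content.xpath=//BODY
crawler.document.html.lang.xpath=//HTML/@lang
crawler.document.html.digest.xpath=//META[@name='description']/@content
crawler.document.html.canonical.xpath=//LINK[@rel='canonical'][1]/@href
```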
Custom XPath Examples
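For a hypothetical site whose main text lives in a div with id="main", the content XPath could be narrowed so navigation and sidebars are not indexed (the element id and meta attribute below are assumptions, not Fess defaults):

```properties
# Index only the main content area of the page
crawler.document.html.content.xpath=//DIV[@id='main']
# Prefer the Open Graph description for the digest
crawler.document.html.digest.xpath=//META[@property='og:description']/@content
```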
HTML Tag Processing
| Property | Description | Default |
|---|---|---|
| crawler.document.html.pruned.tags | HTML tags to remove | noscript,script,style,header,footer,aside,nav,a[rel=nofollow] |
| crawler.document.html.max.digest.length | Maximum digest length | 120 |
| crawler.document.html.default.lang | Default language | (empty) |
Configuration Example
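The defaults from the table, with one illustrative addition to the pruned tag list:

```properties
# 'iframe' added here as an illustrative extra; the rest are the defaults
crawler.document.html.pruned.tags=noscript,script,style,header,footer,aside,nav,a[rel=nofollow],iframe
crawler.document.html.max.digest.length=120
```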
URL Pattern Filters
| Property | Description | Default |
|---|---|---|
| crawler.document.html.default.include.index.patterns | URL patterns to include in index | (empty) |
| crawler.document.html.default.exclude.index.patterns | URL patterns to exclude from index | (?i).*(css|js|jpeg...) |
| crawler.document.html.default.include.search.patterns | URL patterns to include in search results | (empty) |
| crawler.document.html.default.exclude.search.patterns | URL patterns to exclude from search results | (empty) |
Configuration Example
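A hypothetical setup that indexes only pages under /docs/ and skips static assets (both patterns are illustrative regular expressions):

```properties
crawler.document.html.default.include.index.patterns=https://example\.com/docs/.*
crawler.document.html.default.exclude.index.patterns=(?i).*\.(css|js|jpe?g|png|gif)$
```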
File Crawler Settings
Basic Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.file.name.encoding | Filename encoding | (empty) |
| crawler.document.file.no.title.label | Label for files without title | No title. |
| crawler.document.file.ignore.empty.content | Ignore empty content | false |
| crawler.document.file.max.title.length | Maximum title length | 100 |
| crawler.document.file.max.digest.length | Maximum digest length | 200 |
Configuration Example
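The defaults from the table above:

```properties
crawler.document.file.no.title.label=No title.
crawler.document.file.ignore.empty.content=false
crawler.document.file.max.title.length=100
crawler.document.file.max.digest.length=200
```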
Content Processing
| Property | Description | Default |
|---|---|---|
| crawler.document.file.append.meta.content | Append metadata to content | true |
| crawler.document.file.append.body.content | Append body to content | true |
| crawler.document.file.default.lang | Default language | (empty) |
Configuration Example
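For example, to skip metadata text and force a document language (the language value is illustrative; the default is empty):

```properties
crawler.document.file.append.meta.content=false
crawler.document.file.append.body.content=true
crawler.document.file.default.lang=en
```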
File URL Pattern Filters
| Property | Description | Default |
|---|---|---|
| crawler.document.file.default.include.index.patterns | Patterns to include in index | (empty) |
| crawler.document.file.default.exclude.index.patterns | Patterns to exclude from index | (empty) |
| crawler.document.file.default.include.search.patterns | Patterns to include in search results | (empty) |
| crawler.document.file.default.exclude.search.patterns | Patterns to exclude from search results | (empty) |
Configuration Example
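A hypothetical filter that indexes only Office and PDF files and skips temporary folders (patterns are illustrative):

```properties
crawler.document.file.default.include.index.patterns=.*\.(docx?|xlsx?|pptx?|pdf)$
crawler.document.file.default.exclude.index.patterns=.*/(tmp|backup)/.*
```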
Cache Settings
Document Cache
| Property | Description | Default |
|---|---|---|
| crawler.document.cache.enabled | Enable document cache | true |
| crawler.document.cache.max.size | Maximum cache size (bytes) | 2621440 (2.5MB) |
| crawler.document.cache.supported.mimetypes | MIME types to cache | text/html |
| crawler.document.cache.html.mimetypes | MIME types to treat as HTML | text/html |
Configuration Example
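The defaults from the table above:

```properties
crawler.document.cache.enabled=true
crawler.document.cache.max.size=2621440
crawler.document.cache.supported.mimetypes=text/html
crawler.document.cache.html.mimetypes=text/html
```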
Note
Enabling the cache adds cache links to search results, allowing users to view a page's content as it was at crawl time.
JVM Options
You can configure JVM options for the crawler process.
| Property | Description | Default |
|---|---|---|
| jvm.crawler.options | Crawler JVM options | -Xms128m -Xmx512m... |
Default Settings
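A representative subset of the shipped default (the full value contains additional options; the backslash line continuations are standard properties-file syntax):

```properties
jvm.crawler.options=-Xms128m -Xmx512m \
    -XX:MaxMetaspaceSize=128m \
    -XX:+UseG1GC \
    -XX:MaxGCPauseMillis=60000 \
    -XX:-HeapDumpOnOutOfMemoryError
```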
Key Options Explained
| Option | Description |
|---|---|
| -Xms128m | Initial heap size (128MB) |
| -Xmx512m | Maximum heap size (512MB) |
| -XX:MaxMetaspaceSize=128m | Maximum Metaspace size (128MB) |
| -XX:+UseG1GC | Use G1 garbage collector |
| -XX:MaxGCPauseMillis=60000 | GC pause time goal (60 seconds) |
| -XX:-HeapDumpOnOutOfMemoryError | Disable heap dump on OutOfMemory |
Custom Configuration Examples
For crawling large files:
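A sketch that raises the maximum heap to 2 GB (the heap values are illustrative; keep the remaining default options when overriding):

```properties
jvm.crawler.options=-Xms256m -Xmx2g -XX:MaxMetaspaceSize=128m -XX:+UseG1GC
```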
For debugging:
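An illustrative override that enables heap dumps for post-mortem analysis (the dump path is a hypothetical location):

```properties
jvm.crawler.options=-Xms128m -Xmx512m -XX:+UseG1GC \
    -XX:+HeapDumpOnOutOfMemoryError \
    -XX:HeapDumpPath=/var/log/fess
```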
For details, see Memory Configuration.
Performance Tuning
Optimizing Crawl Speed
1. Adjust Thread Count
Increase parallel crawl count to improve crawl speed.
However, be mindful of load on target servers.
2. Adjust Timeouts
For slow-responding sites, adjust timeouts.
3. Exclude Unnecessary Content
Excluding images, CSS, JavaScript files, etc. improves crawl speed.
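Such exclusions are typically expressed as regular expressions in a crawl configuration's excluded-URL patterns; the patterns below are illustrative:

```
.*\.css$
.*\.js$
.*\.(png|jpe?g|gif|svg|ico)$
```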
4. Retry Settings
Adjust retry count and interval on errors.
Optimizing Memory Usage
1. Adjust Heap Size
2. Adjust Cache Size
3. Exclude Large Files
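Per-MIME-type size limits live in contentlength.xml. A sketch, assuming the component layout used by recent Fess versions (verify the names against your installed file):

```xml
<component name="contentLengthHelper"
    class="org.codelibs.fess.crawler.helper.ContentLengthHelper" instance="singleton">
  <!-- Default limit for all MIME types: 10 MB -->
  <property name="defaultMaxLength">10485760</property>
  <!-- Illustrative override: allow PDFs up to 50 MB -->
  <postConstruct name="addMaxLength">
    <arg>"application/pdf"</arg>
    <arg>52428800</arg>
  </postConstruct>
</component>
```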
For details, see Memory Configuration.
Improving Index Quality
1. Optimize XPath
Exclude unnecessary elements (navigation, ads, etc.).
2. Optimize Digest
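For example, a longer digest can be allowed for HTML pages (the value is illustrative; the default is 120):

```properties
crawler.document.html.max.digest.length=200
```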
3. Metadata Mapping
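Mappings follow the name=field:type form seen in the default value shown earlier; the mapping below is an illustrative addition, not a shipped default:

```properties
# Map a metadata entry named 'author' to the index field 'author' as a string
crawler.metadata.name.mapping=author=author:string
```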
Troubleshooting
Memory Shortage
Symptoms:
OutOfMemoryError recorded in fess_crawler.log
Crawling stops midway
Solutions:
Increase crawler heap size
Reduce parallel thread count
Exclude large files
For details, see Memory Configuration.
Slow Crawling
Symptoms:
Crawling takes too long
Frequent timeouts
Solutions:
Increase thread count (be mindful of target server load)
Adjust timeouts
Exclude unnecessary URLs
Specific Content Cannot Be Extracted
Symptoms:
Page text not extracted correctly
Important information not included in search results
Solutions:
Check and adjust XPath
Check pruned tags
For content dynamically generated by JavaScript, consider alternative methods (API crawling, etc.)
Character Encoding Issues
Symptoms:
Character encoding issues in search results
Specific languages not displayed correctly
Solutions:
Check encoding settings
Configure filename encoding
Check logs for encoding errors
Best Practices
Verify in Test Environment
Thoroughly test in a test environment before applying to production.
Gradual Adjustments
Don’t change settings drastically at once; adjust gradually and verify effectiveness.
Monitor Logs
After changing settings, monitor logs to check for errors or performance issues.
Backups
Always back up configuration files before making changes.
Documentation
Document the settings you changed and the reasons why.
S3/GCS Crawler Configuration
S3 Crawler
Configuration for crawling S3 and S3-compatible storage (such as MinIO). Add the following to “Configuration Parameters” in the file crawl settings.
| Parameter | Description | Default |
|---|---|---|
| client.endpoint | S3 endpoint URL | (Required) |
| client.accessKey | Access key | (Required) |
| client.secretKey | Secret key | (Required) |
| client.region | AWS region | us-east-1 |
| client.connectTimeout | Connection timeout (ms) | 10000 |
| client.readTimeout | Read timeout (ms) | 10000 |
Configuration Example
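Illustrative parameter lines for the "Configuration Parameters" field (the endpoint and credentials are placeholders):

```properties
client.endpoint=https://minio.example.com:9000
client.accessKey=your-access-key
client.secretKey=your-secret-key
client.region=us-east-1
client.connectTimeout=10000
client.readTimeout=10000
```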
GCS Crawler
Configuration for crawling Google Cloud Storage. Add the following to “Configuration Parameters” in the file crawl settings.
| Parameter | Description | Default |
|---|---|---|
| client.projectId | Google Cloud project ID | (Required) |
| client.credentialsFile | Service account JSON file path | (Optional) |
| client.endpoint | Custom endpoint | (Optional) |
| client.connectTimeout | Connection timeout (ms) | 10000 |
| client.writeTimeout | Write timeout (ms) | 10000 |
| client.readTimeout | Read timeout (ms) | 10000 |
Configuration Example
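Illustrative parameter lines (the project ID and file path are placeholders):

```properties
client.projectId=my-project-id
client.credentialsFile=/etc/fess/gcs-credentials.json
client.connectTimeout=10000
client.readTimeout=10000
```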
Note
If credentialsFile is omitted, the GOOGLE_APPLICATION_CREDENTIALS environment variable is used.
References
Basic Crawler Configuration
Thumbnail Configuration
Memory Configuration
Log Configuration
Search-Related Settings