Overview
This guide explains advanced configuration for the Fess crawler. For basic crawler configuration, refer to Basic Crawler Configuration.
Warning
The settings on this page can affect the entire system. Thoroughly test any changes before applying them to production environments.
General Settings
Configuration File Locations
Detailed crawler settings are configured in the following files:
Main configuration: /etc/fess/fess_config.properties (or app/WEB-INF/classes/fess_config.properties)
Content length configuration: app/WEB-INF/classes/crawler/contentlength.xml
Component configuration: app/WEB-INF/classes/crawler/container.xml
Default Script
Configure the default script language for the crawler.
| Property | Description | Default |
|---|---|---|
| crawler.default.script | Crawler script language | groovy |
HTTP Thread Pool
HTTP crawler thread pool settings.
| Property | Description | Default |
|---|---|---|
| crawler.http.thread_pool.size | HTTP thread pool size | 0 |
Document Processing Settings
Basic Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.max.site.length | Maximum length of the document site field | 100 |
| crawler.document.site.encoding | Document site encoding | UTF-8 |
| crawler.document.unknown.hostname | Alternative value for unknown hostnames | unknown |
| crawler.document.use.site.encoding.on.english | Use site encoding for English documents | false |
| crawler.document.append.data | Append data to document | true |
| crawler.document.append.filename | Append filename to document | false |
Configuration Example
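As an illustration, these properties can be set in fess_config.properties; the values below are the defaults from the table above:

```properties
crawler.document.max.site.length=100
crawler.document.site.encoding=UTF-8
crawler.document.unknown.hostname=unknown
crawler.document.use.site.encoding.on.english=false
crawler.document.append.data=true
crawler.document.append.filename=false
```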
Word Processing Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.max.alphanum.term.size | Maximum alphanumeric word length | 20 |
| crawler.document.max.symbol.term.size | Maximum symbol word length | 10 |
| crawler.document.duplicate.term.removed | Remove duplicate words | false |
Configuration Example
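For example, to keep longer alphanumeric tokens intact and deduplicate repeated words (the non-default values here are illustrative):

```properties
crawler.document.max.alphanum.term.size=50
crawler.document.max.symbol.term.size=10
crawler.document.duplicate.term.removed=true
```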
Note
Increasing max.alphanum.term.size allows indexing long IDs, tokens, URLs, etc. in their complete form, but increases index size.
Character Processing Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.space.chars | Whitespace character definition | \u0009\u000A... |
| crawler.document.fullstop.chars | Period character definition | \u002e\u06d4... |
Configuration Example
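The shipped defaults list many Unicode code points; the snippet below shows only an illustrative subset, not the complete default values:

```properties
# Illustrative subset of whitespace code points
crawler.document.space.chars=\u0009\u000A\u000B\u000C\u000D\u0020
# Illustrative subset of full-stop code points
crawler.document.fullstop.chars=\u002E\u06D4\u3002
```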
Protocol Settings
Supported Protocols
| Property | Description | Default |
|---|---|---|
| crawler.web.protocols | Web crawl protocols | http,https |
| crawler.file.protocols | File crawl protocols | file,smb,smb1,ftp,storage,s3,gcs |
Configuration Example
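For instance, a hypothetical restriction that limits file crawling to local files and SMB shares:

```properties
crawler.web.protocols=http,https
crawler.file.protocols=file,smb
```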
Environment Variable Parameters
| Property | Description | Default |
|---|---|---|
| crawler.data.env.param.key.pattern | Environment variable parameter key pattern | ^FESS_ENV_.* |
robots.txt Settings
| Property | Description | Default |
|---|---|---|
| crawler.ignore.robots.txt | Ignore robots.txt | false |
| crawler.ignore.robots.tags | Robots tags to ignore | (empty) |
| crawler.ignore.content.exception | Ignore content exceptions | true |
Warning
Setting crawler.ignore.robots.txt=true may violate site terms of service. Exercise caution when crawling external sites.
Error Handling Settings
| Property | Description | Default |
|---|---|---|
| crawler.failure.url.status.codes | HTTP status codes considered failures | 404 |
System Monitoring Settings
| Property | Description | Default |
|---|---|---|
| crawler.system.monitor.interval | System monitoring interval (seconds) | 60 |
Hot Thread Settings
| Property | Description | Default |
|---|---|---|
| crawler.hotthread.ignore_idle_threads | Ignore idle threads | true |
| crawler.hotthread.interval | Snapshot interval | 500ms |
| crawler.hotthread.snapshots | Number of snapshots | 10 |
| crawler.hotthread.threads | Number of threads to monitor | 3 |
| crawler.hotthread.timeout | Timeout | 30s |
| crawler.hotthread.type | Monitoring type | cpu |
Configuration Example
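A sketch using the default values from the table above:

```properties
crawler.hotthread.ignore_idle_threads=true
crawler.hotthread.interval=500ms
crawler.hotthread.snapshots=10
crawler.hotthread.threads=3
crawler.hotthread.timeout=30s
crawler.hotthread.type=cpu
```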
Metadata Settings
| Property | Description | Default |
|---|---|---|
| crawler.metadata.content.excludes | Metadata to exclude | resourceName,X-Parsed-By... |
| crawler.metadata.name.mapping | Metadata name mapping | title=title:string... |
HTML Crawler Settings
XPath Settings
XPath settings for extracting HTML elements.
| Property | Description | Default |
|---|---|---|
| crawler.document.html.content.xpath | Content XPath | //BODY |
| crawler.document.html.lang.xpath | Language XPath | //HTML/@lang |
| crawler.document.html.digest.xpath | Digest XPath | //META[@name='description']/@content |
| crawler.document.html.canonical.xpath | Canonical URL XPath | //LINK[@rel='canonical'][1]/@href |
Configuration Example
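The defaults from the table, as they would appear in fess_config.properties:

```properties
crawler.document.html.content.xpath=//BODY
crawler.document.html.lang.xpath=//HTML/@lang
crawler.document.html.digest.xpath=//META[@name='description']/@content
crawler.document.html.canonical.xpath=//LINK[@rel='canonical'][1]/@href
```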
Custom XPath Examples
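For a hypothetical site whose main text lives in a div with id="main", the content XPath could be narrowed so navigation and sidebars are not indexed (the element id and meta attribute below are assumptions, not Fess defaults):

```properties
# Index only the main content area of the page
crawler.document.html.content.xpath=//DIV[@id='main']
# Prefer the Open Graph description for the digest
crawler.document.html.digest.xpath=//META[@property='og:description']/@content
```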
HTML Tag Processing
| Property | Description | Default |
|---|---|---|
| crawler.document.html.pruned.tags | HTML tags to remove | noscript,script,style,header,footer,aside,nav,a[rel=nofollow] |
| crawler.document.html.max.digest.length | Maximum digest length | 120 |
| crawler.document.html.default.lang | Default language | (empty) |
Configuration Example
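The defaults from the table, with one illustrative addition to the pruned tag list:

```properties
# 'iframe' added here as an illustrative extra; the rest are the defaults
crawler.document.html.pruned.tags=noscript,script,style,header,footer,aside,nav,a[rel=nofollow],iframe
crawler.document.html.max.digest.length=120
```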
URL Pattern Filters
| Property | Description | Default |
|---|---|---|
| crawler.document.html.default.include.index.patterns | URL patterns to include in index | (empty) |
| crawler.document.html.default.exclude.index.patterns | URL patterns to exclude from index | (?i).*(css|js|jpeg...) |
| crawler.document.html.default.include.search.patterns | URL patterns to include in search results | (empty) |
| crawler.document.html.default.exclude.search.patterns | URL patterns to exclude from search results | (empty) |
Configuration Example
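A hypothetical setup that indexes only pages under /docs/ and skips static assets (both patterns are illustrative regular expressions):

```properties
crawler.document.html.default.include.index.patterns=https://example\.com/docs/.*
crawler.document.html.default.exclude.index.patterns=(?i).*\.(css|js|jpe?g|png|gif)$
```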
File Crawler Settings
Basic Settings
| Property | Description | Default |
|---|---|---|
| crawler.document.file.name.encoding | Filename encoding | (empty) |
| crawler.document.file.no.title.label | Label for files without title | No title. |
| crawler.document.file.ignore.empty.content | Ignore empty content | false |
| crawler.document.file.max.title.length | Maximum title length | 100 |
| crawler.document.file.max.digest.length | Maximum digest length | 200 |
Configuration Example
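The defaults from the table above:

```properties
crawler.document.file.no.title.label=No title.
crawler.document.file.ignore.empty.content=false
crawler.document.file.max.title.length=100
crawler.document.file.max.digest.length=200
```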
Content Processing
| Property | Description | Default |
|---|---|---|
| crawler.document.file.append.meta.content | Append metadata to content | true |
| crawler.document.file.append.body.content | Append body to content | true |
| crawler.document.file.default.lang | Default language | (empty) |
Configuration Example
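For example, to skip metadata text and force a document language (the language value is illustrative; the default is empty):

```properties
crawler.document.file.append.meta.content=false
crawler.document.file.append.body.content=true
crawler.document.file.default.lang=en
```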
File URL Pattern Filters
| Property | Description | Default |
|---|---|---|
| crawler.document.file.default.include.index.patterns | Patterns to include in index | (empty) |
| crawler.document.file.default.exclude.index.patterns | Patterns to exclude from index | (empty) |
| crawler.document.file.default.include.search.patterns | Patterns to include in search results | (empty) |
| crawler.document.file.default.exclude.search.patterns | Patterns to exclude from search results | (empty) |
Configuration Example
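A hypothetical filter that indexes only Office and PDF files and skips temporary folders (patterns are illustrative):

```properties
crawler.document.file.default.include.index.patterns=.*\.(docx?|xlsx?|pptx?|pdf)$
crawler.document.file.default.exclude.index.patterns=.*/(tmp|backup)/.*
```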
Cache Settings
Document Cache
| Property | Description | Default |
|---|---|---|
| crawler.document.cache.enabled | Enable document cache | true |
| crawler.document.cache.max.size | Maximum cache size (bytes) | 2621440 (2.5MB) |
| crawler.document.cache.supported.mimetypes | MIME types to cache | text/html |
| crawler.document.cache.html.mimetypes | MIME types to treat as HTML | text/html |
Configuration Example
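The defaults from the table above:

```properties
crawler.document.cache.enabled=true
crawler.document.cache.max.size=2621440
crawler.document.cache.supported.mimetypes=text/html
crawler.document.cache.html.mimetypes=text/html
```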
Note
Enabling the cache adds cache links to search results, allowing users to view a page's content as it was at crawl time.
JVM Options
You can configure JVM options for the crawler process.
| Property | Description | Default |
|---|---|---|
| jvm.crawler.options | Crawler JVM options | -Xms128m -Xmx512m... |
Default Settings
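A representative subset of the shipped default (the full value contains additional options; the backslash line continuations are standard properties-file syntax):

```properties
jvm.crawler.options=-Xms128m -Xmx512m \
    -XX:MaxMetaspaceSize=128m \
    -XX:+UseG1GC \
    -XX:MaxGCPauseMillis=60000 \
    -XX:-HeapDumpOnOutOfMemoryError
```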
Key Options Explained
| Option | Description |
|---|---|
| -Xms128m | Initial heap size (128MB) |
| -Xmx512m | Maximum heap size (512MB) |
| -XX:MaxMetaspaceSize=128m | Maximum Metaspace size (128MB) |
| -XX:+UseG1GC | Use G1 garbage collector |
| -XX:MaxGCPauseMillis=60000 | GC pause time goal (60 seconds) |
| -XX:-HeapDumpOnOutOfMemoryError | Disable heap dump on OutOfMemory |
Custom Configuration Examples
For crawling large files:
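A sketch that raises the maximum heap to 2 GB (the heap values are illustrative; keep the remaining default options when overriding):

```properties
jvm.crawler.options=-Xms256m -Xmx2g -XX:MaxMetaspaceSize=128m -XX:+UseG1GC
```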
For debugging:
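An illustrative override that enables heap dumps for post-mortem analysis (the dump path is a hypothetical location):

```properties
jvm.crawler.options=-Xms128m -Xmx512m -XX:+UseG1GC \
    -XX:+HeapDumpOnOutOfMemoryError \
    -XX:HeapDumpPath=/var/log/fess
```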
For details, see Memory Configuration.
Performance Tuning
Optimizing Crawl Speed
1. Adjust Thread Count
Increase parallel crawl count to improve crawl speed.
However, be mindful of load on target servers.
2. Adjust Timeouts
For slow-responding sites, adjust timeouts.
3. Exclude Unnecessary Content
Excluding images, CSS, JavaScript files, etc. improves crawl speed.
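Such exclusions are typically expressed as regular expressions in a crawl configuration's excluded-URL patterns; the patterns below are illustrative:

```
.*\.css$
.*\.js$
.*\.(png|jpe?g|gif|svg|ico)$
```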
4. Retry Settings
Adjust retry count and interval on errors.
Optimizing Memory Usage
1. Adjust Heap Size
2. Adjust Cache Size
3. Exclude Large Files
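Per-MIME-type size limits live in contentlength.xml. A sketch, assuming the component layout used by recent Fess versions (verify the names against your installed file):

```xml
<component name="contentLengthHelper"
    class="org.codelibs.fess.crawler.helper.ContentLengthHelper" instance="singleton">
  <!-- Default limit for all MIME types: 10 MB -->
  <property name="defaultMaxLength">10485760</property>
  <!-- Illustrative override: allow PDFs up to 50 MB -->
  <postConstruct name="addMaxLength">
    <arg>"application/pdf"</arg>
    <arg>52428800</arg>
  </postConstruct>
</component>
```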
For details, see Memory Configuration.
Improving Index Quality
1. Optimize XPath
Exclude unnecessary elements (navigation, ads, etc.).
2. Optimize Digest
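For example, a longer digest can be allowed for HTML pages (the value is illustrative; the default is 120):

```properties
crawler.document.html.max.digest.length=200
```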
3. Metadata Mapping
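Mappings follow the name=field:type form seen in the default value shown earlier; the mapping below is an illustrative addition, not a shipped default:

```properties
# Map a metadata entry named 'author' to the index field 'author' as a string
crawler.metadata.name.mapping=author=author:string
```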
Troubleshooting
Memory Shortage
Symptoms:
OutOfMemoryError recorded in fess_crawler.log
Crawling stops midway
Solutions:
Increase crawler heap size
Reduce parallel thread count
Exclude large files
For details, see Memory Configuration.
Slow Crawling
Symptoms:
Crawling takes too long
Frequent timeouts
Solutions:
Increase thread count (be mindful of target server load)
Adjust timeouts
Exclude unnecessary URLs
Specific Content Cannot Be Extracted
Symptoms:
Page text not extracted correctly
Important information not included in search results
Solutions:
Check and adjust XPath
Check pruned tags
For content dynamically generated by JavaScript, consider alternative methods (API crawling, etc.)
Character Encoding Issues
Symptoms:
Character encoding issues in search results
Specific languages not displayed correctly
Solutions:
Check encoding settings
Configure filename encoding
Check logs for encoding errors
Best Practices
Verify in Test Environment
Thoroughly test in a test environment before applying to production.
Gradual Adjustments
Don’t change settings drastically at once; adjust gradually and verify effectiveness.
Monitor Logs
After changing settings, monitor logs to check for errors or performance issues.
Backups
Always back up configuration files before making changes.
Documentation
Document the settings you changed and the reasons why.
S3/GCS Crawler Configuration
S3 Crawler
Configuration for crawling S3 and S3-compatible storage (such as MinIO). Add the following to “Configuration Parameters” in the file crawl settings.
| Parameter | Description | Default |
|---|---|---|
| client.endpoint | S3 endpoint URL | (Required) |
| client.accessKey | Access key | (Required) |
| client.secretKey | Secret key | (Required) |
| client.region | AWS region | us-east-1 |
| client.connectTimeout | Connection timeout (ms) | 10000 |
| client.readTimeout | Read timeout (ms) | 10000 |
Configuration Example
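Illustrative parameter lines for the "Configuration Parameters" field (the endpoint and credentials are placeholders):

```properties
client.endpoint=https://minio.example.com:9000
client.accessKey=your-access-key
client.secretKey=your-secret-key
client.region=us-east-1
client.connectTimeout=10000
client.readTimeout=10000
```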
GCS Crawler
Configuration for crawling Google Cloud Storage. Add the following to “Configuration Parameters” in the file crawl settings.
| Parameter | Description | Default |
|---|---|---|
| client.projectId | Google Cloud project ID | (Required) |
| client.credentialsFile | Service account JSON file path | (Optional) |
| client.endpoint | Custom endpoint | (Optional) |
| client.connectTimeout | Connection timeout (ms) | 10000 |
| client.writeTimeout | Write timeout (ms) | 10000 |
| client.readTimeout | Read timeout (ms) | 10000 |
Configuration Example
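Illustrative parameter lines (the project ID and file path are placeholders):

```properties
client.projectId=my-project-id
client.credentialsFile=/etc/fess/gcs-credentials.json
client.connectTimeout=10000
client.readTimeout=10000
```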
Note
If credentialsFile is omitted, the GOOGLE_APPLICATION_CREDENTIALS environment variable is used.
References
Basic Crawler Configuration
Thumbnail Configuration
Memory Configuration
Log Configuration
Search-Related Settings