About Analyzer
When creating indexes for search, it is necessary to segment documents for registration as indexes. In Fess, the functionality to break documents into words is registered as an Analyzer. An Analyzer consists of CharFilter, Tokenizer, and TokenFilter.
Basically, items smaller than the units separated by the Analyzer will not be found even when searched. For example, consider the sentence “Living in Tokyo”. Suppose this sentence is divided by the Analyzer into “Tokyo”, “in”, and “Living”. In this case, a search for “Tokyo” will produce a hit. However, a search for “Kyo” will not produce a hit.
Fess provides a dedicated Analyzer for each language. The Analyzer applied is automatically switched based on the suffix of the field name in the index (e.g. content_ja, content_en).
Analyzer Definition Files
The Analyzer has no dedicated administration screen; it is changed by directly editing configuration files. The relevant files are located under app/WEB-INF/classes/fess_indices/.
| File | Description |
|---|---|
fess_indices/fess.json | Settings for the document index. Contains definitions for CharFilter, Tokenizer, TokenFilter, and Analyzer. |
fess_indices/fess/doc.json | Mapping for the document index. Assigns the Analyzer to apply for each field name pattern such as *_ja and *_en. |
fess_indices/fess/<lang>/ | Dictionary files per language (e.g. ja/kuromoji.txt, ko/nori.txt, en/protwords.txt, en/stemmer_override.txt, and stopwords.txt for each language). |
fess_indices/fess/mapping.txt, fess_indices/fess/synonym.txt | Character mapping dictionary and synonym dictionary shared across all languages. |
The Analyzer definitions themselves (combinations of Tokenizer and TokenFilter) are specified in fess.json, while which Analyzer to apply to which field is specified in fess/doc.json.
Note
When using a managed service such as Amazon OpenSearch Service, a configuration file corresponding to the search engine type takes precedence, such as fess_indices/_aws/fess.json or fess_indices/_cloud/fess.json.
Registering Analyzers
Analyzer settings are registered by creating an index based on the configuration files described above when no search index exists at Fess startup. The index is created with a timestamped name (e.g. fess.20240101120000000), and the aliases fess.search and fess.update are assigned to it.
Placeholders such as ${fess.dictionary.path} in the configuration files are replaced with actual values when the index is created. The location where dictionary files are placed can be changed with the system property fess.dictionary.path.
If an existing index is present, the already-defined settings are reused. Therefore, if you change Analyzer definitions, you must rebuild the index to reflect those changes.
Tuning with Dictionaries
The dictionaries referenced by the Analyzer can be edited from the administration screen.
Kuromoji Dictionary - User dictionary for Japanese morphological analysis
Synonym Dictionary - Synonym dictionary
Mapping Dictionary - Character mapping
Stopwords Dictionary - Stop words
Protwords Dictionary - Protected words
Stemmer Override Dictionary - Stemming overrides
For how to configure Analyzers, refer to the OpenSearch Analyzer documentation.
Notes
Analyzer configuration has a major impact on search. When changing Analyzers, either understand how Lucene Analyzers work before implementing changes, or consult commercial support.