CSV Connector

Overview

The CSV Connector reads data from CSV files and registers it in the Fess index.

This feature requires the fess-ds-csv plugin.

Prerequisites

Plugin installation is required
Access to the CSV file is required
You must know the character encoding of the CSV file

Plugin Installation

Method 1: Place JAR file directly

# Download from Maven Central
wget https://repo1.maven.org/maven2/org/codelibs/fess/fess-ds-csv/X.X.X/fess-ds-csv-X.X.X.jar

# Place the file
cp fess-ds-csv-X.X.X.jar $FESS_HOME/app/WEB-INF/lib/
# or
cp fess-ds-csv-X.X.X.jar /usr/share/fess/app/WEB-INF/lib/

Method 2: Install from admin console

Open “System” -> “Plugins”
Upload the JAR file
Restart Fess

Configuration

Configure from the admin console via “Crawler” -> “Data Store” -> “Create New”.

Basic Settings

Item	Example
Name	Products CSV
Handler Name	CsvDataStore
Enabled	On

Parameter Settings

Local file:
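
A minimal sketch, assuming a hypothetical file at /path/to/data.csv; the parameter names are the ones documented in the parameter list:

```
# Hypothetical path; point this at your own CSV file
files=/path/to/data.csv
file_encoding=UTF-8
has_header_line=true
separator_character=,
```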

Multiple files:
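
A sketch of the comma-separated form of the files parameter, assuming two hypothetical files under /path/to/:

```
# Hypothetical paths, separated by a comma
files=/path/to/data1.csv,/path/to/data2.csv
file_encoding=UTF-8
has_header_line=true
```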

Note

Quote processing and escape processing are disabled by default. If you need to handle CSV files where fields enclosed in quotes contain delimiters or line breaks (RFC 4180 compliant), explicitly set quote_disabled=false to enable quote processing. See “Enabling Quote and Escape Processing” below for details.

Parameter List

Parameter	Required	Description
`files`	No	CSV file path (local path; multiple paths can be specified separated by commas). Either `files` or `directories` must be specified. If both are specified, `files` takes precedence. Files must have a `.csv` or `.tsv` extension; files with any other extension are skipped.
`directories`	No	Path to a directory containing CSV files (multiple paths can be specified separated by commas). Only `.csv` and `.tsv` files within the directory are processed. Used when `files` is not specified.
`file_encoding`	No	Character encoding (default: UTF-8)
`has_header_line`	No	Whether a header row exists (default: false)
`separator_character`	No	Separator character (default: comma `,`). Escape sequences such as `\t` can be specified (for tab-separated files).
`quote_character`	No	Quote character (default: double quote `"`). Note that quote processing is disabled by default (see `quote_disabled`).
`escape_character`	No	Escape character (default: backslash `\`). Note that escape processing is disabled by default (see `escape_disabled`).

Note

If both files and directories are empty, an error (DataStoreException) is raised. At least one of them must be specified.

Advanced Parameters

The following parameters provide fine-grained control over CSV parsing behaviour:

Parameter	Description
`quote_disabled`	Whether to disable quote processing (default: true). Set to `false` to handle RFC 4180 quoted fields.
`escape_disabled`	Whether to disable escape processing (default: true). Set to `false` to enable escaping via `escape_character`.
`skip_lines`	Number of leading lines to skip (default: 0)
`ignore_line_patterns`	Regular expression pattern for lines to ignore (e.g., `^#.*` to ignore comment lines)
`ignore_empty_lines`	Whether to ignore empty lines (default: false)
`ignore_trailing_whitespaces`	Whether to ignore trailing whitespace (default: false)
`ignore_leading_whitespaces`	Whether to ignore leading whitespace (default: false)
`null_string`	String value to treat as null
`break_string`	String used to replace line breaks within field values
`readInterval`	Wait time in milliseconds between processing each record (default: 0)

Script Settings

Field values are assembled from the values of the CSV columns. Columns are referenced directly in scripts as variables, with no prefix (do not add a `data.` prefix).

With header row (reference by column name):
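
A minimal sketch, assuming has_header_line=true and a header row with the hypothetical columns id, name, and description; the URL is illustrative:

```
// Hypothetical header columns: id, name, description
url="https://example.com/item/" + id
title=name
content=description
```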

Without header row (reference by column index):
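
A minimal sketch using the 1-based cell<N> variables; which column holds which value is an assumption made for illustration:

```
// cell1, cell2, cell3 are the first three columns (1-based)
url="https://example.com/item/" + cell1
title=cell2
content=cell3
```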

Available Fields

<column_name> - Reference by header row column name (only when has_header_line=true and the column name is not blank)
cell<N> - Reference by column index (1-based: cell1, cell2, …; available regardless of whether a header row is present)
csvfile - Full path of the CSV file being processed
csvfilename - File name of the CSV file being processed

Note

If a column name contains characters that are invalid as a Groovy identifier, such as spaces or hyphens, the column cannot be referenced by name. Use cell<N> instead.

CSV Format Details

Standard CSV (RFC 4180 compliant)
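
The sample for this section is not shown, so the rows below are an illustrative reconstruction. The column names and values are hypothetical, apart from the quoted field "Book, Programming" that the note in this section refers to:

```
id,title,category
1,"Book, Programming",Books
2,Wireless Mouse,Peripherals
```

With quote processing enabled, the first data row is read as three fields; with it disabled, the comma inside the quotes splits the row into four.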

Note

To include a delimiter inside a field by enclosing it in quotes, as in "Book, Programming" above, you must set quote_disabled=false to enable quote processing. When quote processing is disabled (the default), quotes are treated as ordinary characters and fields are split on the delimiter character.

Enabling Quote and Escape Processing

Quote processing and escape processing are disabled by default. Enable them explicitly as follows.

To enable quote processing:
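
Going by the quote_disabled parameter described under Advanced Parameters, this should amount to one line in the data store parameters:

```
# Turn on quote processing (it is disabled by default)
quote_disabled=false
```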

To enable escape processing:
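
Going by the escape_disabled parameter described under Advanced Parameters, this should amount to one line; the escape character itself stays at its default unless escape_character is also set:

```
# Turn on escape processing (it is disabled by default)
escape_disabled=false
```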

Changing Separator

Tab-separated (TSV):
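
A sketch based on the separator_character description, which states that the \t escape sequence is accepted:

```
# \t selects a tab as the separator
separator_character=\t
```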

Semicolon-separated:
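
A sketch of the same parameter with a semicolon as its value:

```
# Fields are split on ; instead of the default comma
separator_character=;
```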

Custom Quote Character

Single quote (quote processing must be enabled):
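
A minimal sketch combining the two parameters involved; it assumes the single quote can be given literally as the value:

```
# Use ' as the quote character; quote processing must also be enabled
quote_character='
quote_disabled=false
```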

Encoding

Non-ASCII file (Shift_JIS):
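
A sketch of the file_encoding parameter for this case:

```
# The file was saved as Shift_JIS
file_encoding=Shift_JIS
```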

Non-ASCII file (EUC-JP):
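
The same parameter with the EUC-JP charset name:

```
# The file was saved as EUC-JP
file_encoding=EUC-JP
```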

Usage Examples

Product Catalog CSV

CSV file (products.csv):

product_id,name,description,price,category,in_stock
1001,Laptop,High-performance laptop,120000,Computers,true
1002,Mouse,Wireless mouse,2500,Peripherals,true
1003,Keyboard,Mechanical keyboard,8500,Peripherals,false

Parameters:
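
A minimal sketch for products.csv, assuming the file is stored under the hypothetical directory /path/to/ and saved as UTF-8. The header row is enabled because the sample file starts with column names:

```
# Hypothetical location of products.csv
files=/path/to/products.csv
file_encoding=UTF-8
has_header_line=true
```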

Script:
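
A sketch of an unfiltered script for products.csv. The field mapping mirrors the one used in the stock-status filter in this section, so treat it as an assumption rather than the only possible mapping:

```
// Columns come from the header row of products.csv
url="https://shop.example.com/product/" + product_id
title=name
content=description
price=price
```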

Filtering by stock status:

url=in_stock == "true" ? "https://shop.example.com/product/" + product_id : null
title=in_stock == "true" ? name : null
content=in_stock == "true" ? description : null
price=in_stock == "true" ? price : null

Employee Directory CSV

CSV file (employees.csv):

emp_id,name,department,email,phone,position
E001,Taro Yamada,Sales Dept.,yamada@example.com,03-1234-5678,General Manager
E002,Hanako Sato,Engineering Dept.,sato@example.com,03-2345-6789,Manager
E003,Ichiro Suzuki,Administration Dept.,suzuki@example.com,03-3456-7890,Staff

Parameters:
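
A minimal sketch for employees.csv, assuming the file is stored under the hypothetical directory /path/to/ and saved as UTF-8:

```
# Hypothetical location of employees.csv
files=/path/to/employees.csv
file_encoding=UTF-8
has_header_line=true
```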

Script:

url="https://intranet.example.com/employee/" + emp_id
title=name + " (" + department + ")"
content="Department: " + department + "\nPosition: " + position + "\nEmail: " + email + "\nPhone: " + phone
digest=department

CSV Without Header

CSV file (data.csv):

Parameters:
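
A minimal sketch, assuming data.csv sits under the hypothetical directory /path/to/. Setting has_header_line explicitly is optional, since false is the documented default:

```
# Hypothetical path; the first row is treated as data
files=/path/to/data.csv
has_header_line=false
```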

Script:
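
A sketch using cell<N> references, assuming a hypothetical layout in which the columns are, in order, an ID, a title, a body text, and a category:

```
// Hypothetical layout: cell1 = ID, cell2 = title, cell3 = body, cell4 = category
url="https://example.com/data/" + cell1
title=cell2
content=cell3
digest=cell4
```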

Multiple CSV Files Integration

Parameters:
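
A sketch that uses the directories parameter so that every .csv and .tsv file in a folder is processed; the directory path is hypothetical:

```
# Hypothetical directory containing the CSV files to integrate
directories=/path/to/csv_dir
file_encoding=UTF-8
has_header_line=true
```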

Script:
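
A sketch that uses the csvfilename variable to keep URLs distinct when the same ID appears in more than one file. The column layout and URL pattern are assumptions made for illustration:

```
// Hypothetical layout: cell1 = ID, cell2 = title, cell3 = body
url="https://example.com/" + csvfilename + "/" + cell1
title=cell2
content=cell3
```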

Tab-Separated (TSV) File

TSV file (data.tsv):

Parameters:
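
A minimal sketch for data.tsv, assuming the file is stored under the hypothetical directory /path/to/ and starts with a header row:

```
# Hypothetical path; \t selects a tab as the separator
files=/path/to/data.tsv
separator_character=\t
has_header_line=true
```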

Script:

Troubleshooting

File Not Found

Symptom: The crawl runs but no files are processed, and a message containing “is not found” appears in the log

Check:

Verify the file path is correct (absolute path recommended)
Verify the file exists
Verify the file extension is .csv or .tsv (files with other extensions are skipped)
Verify the file has read permissions
Verify the file is accessible by the Fess process user

Character Encoding Issues

Symptom: Non-ASCII characters are not displayed correctly

Solution:

Specify the correct character encoding:
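
For example, for a file saved as Shift_JIS (the value is illustrative; use the encoding the file was actually saved in):

```
# Example value only
file_encoding=Shift_JIS
```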

Check file encoding:

Columns Not Recognized Correctly

Symptom: Column separation is not recognized correctly, or a quoted field is split

Check:

Verify the separator is correct:
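
For example, for a tab-separated file (use , or ; instead if that is what the file contains):

```
# Must match the character that actually separates the fields
separator_character=\t
```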

To handle quoted fields (fields that contain the delimiter character), enable quote processing:
```
quote_disabled=false
```
Verify the CSV file format (RFC 4180 compliant)

Header Row Handling

Symptom: The first row is recognized as data

Solution:

When a header row is present:
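
Based on the has_header_line parameter, the first row is then used for column names instead of being indexed as data:

```
# The first row contains column names
has_header_line=true
```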

When no header row is present:
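
This is also the documented default, so the line can be left out:

```
# The first row is ordinary data
has_header_line=false
```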

No Data Retrieved

Symptom: Crawl succeeds but the document count is 0

Check:

Verify the CSV file is not empty
Verify the script settings are correct (column names and cell<N> references must be used without a `data.` prefix)
Verify the column names are correct (when has_header_line=true)
Check the log for error messages

Large CSV Files

Symptom: Out of memory or timeout

Solution:

Split the CSV file into multiple smaller files
Use only the necessary columns in the script
Increase the Fess heap size
Filter out unnecessary rows

Fields with Line Breaks

In RFC 4180 format, fields containing line breaks can be handled by enclosing them in quotes. Since quote processing is disabled by default, quote_disabled=false must be specified:

Parameters:
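
A minimal sketch, assuming a hypothetical file at /path/to/data.csv whose quoted fields span several lines:

```
# Hypothetical path; quote processing lets a quoted field contain line breaks
files=/path/to/data.csv
has_header_line=true
quote_disabled=false
```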

CsvListDataStore

The fess-ds-csv plugin also includes the CsvListDataStore handler in addition to CsvDataStore.

CsvListDataStore extends CsvDataStore and provides the following additional features:

Multi-threaded processing (controlled by the numOfThreads parameter)
Automatic deletion of processed CSV files
Timestamp-based file filtering (skips files that may still be written to)

All parameters and script settings of CsvDataStore are available as-is.

Basic Settings

Item	Example
Handler Name	CsvListDataStore

Additional Parameters

Parameter	Required	Description
`timestamp_margin`	No	Minimum time in milliseconds that must have elapsed since the file’s last modification. Files modified more recently than this are assumed to still be in the process of being written and are skipped (default: 10000).
`numOfThreads`	No	Number of processing threads (default: 1)

Note

CsvListDataStore automatically deletes CSV files after processing is complete. If an error occurs during processing, the file is renamed to .txt (if renaming fails, the file is deleted).

Advanced Script Examples

Data Processing

Conditional Indexing

// Only index products with a price of 10000 or more
url=Integer.parseInt(price) >= 10000 ? "https://example.com/product/" + id : null
title=Integer.parseInt(price) >= 10000 ? name : null
content=Integer.parseInt(price) >= 10000 ? description : null
price=Integer.parseInt(price) >= 10000 ? price : null

Combining Multiple Columns
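
A sketch of string concatenation across columns, following the pattern of the employee directory script; the column names name, category, and description are hypothetical:

```
// Hypothetical columns: name, category, description
title=name + " (" + category + ")"
content=description + "\nCategory: " + category
```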

Date Formatting
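
A sketch that parses a date string with the standard java.text.SimpleDateFormat class. The column name updated_at, its yyyy-MM-dd format, and the choice of last_modified as the target field are all assumptions made for illustration:

```
// Hypothetical column updated_at holding dates such as 2024-01-31
last_modified=new java.text.SimpleDateFormat("yyyy-MM-dd").parse(updated_at)
```

A value that does not match the pattern makes parse throw an exception, so the pattern has to match the file's date format.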

Reference

Data Store Connector Overview
JSON Connector
Database Connector (Database Search)
Data Store Crawling - Data Store Configuration Guide
RFC 4180 - CSV Format