Bulk Retrieval of Search Results

Overview

In Fess normal search, only a certain number of search results are displayed via paging functionality. To retrieve all search results in bulk, use the Scroll Search feature.

This feature is useful when you need to process all search results, such as bulk data export or backup, and large-scale data analysis.

Use Cases

Scroll search is suitable for purposes such as:

Exporting all search results
Retrieving large amounts of data for analysis
Data retrieval in batch processing
Data synchronization to external systems
Data collection for report generation

Warning

Scroll search returns large amounts of data and consumes more server resources compared to normal search. Enable only when necessary.

Configuration

Enabling Scroll Search

By default, scroll search is disabled from security and performance perspectives. To enable it, change the following setting in fess_config.properties or /etc/fess/fess_config.properties.

api.search.scroll=true

Note

After changing settings, you must restart Fess.

Response Field Configuration

You can customize fields included in search result responses. By default, only basic fields are returned, but you can specify additional fields.

query.additional.scroll.response.fields=content,mimetype,filename,created,last_modified

When specifying multiple fields, list them separated by commas.

Scroll Timeout Configuration

You can configure the lifespan of scroll contexts. The default is 1 minute.

api.search.scroll.timeout=1m

Units: - s: seconds - m: minutes - h: hours

Usage

Basic Usage

Access scroll search using the following URL:

http://localhost:8080/json/scroll?q=search keyword

Search results are returned in NDJSON (Newline Delimited JSON) format. Each line outputs one document in JSON format.

Example:

curl "http://localhost:8080/json/scroll?q=Fess"

Request Parameters

The following parameters can be used with scroll search:

Parameter Name	Description
`q`	Search query (required)
`size`	Number of items to retrieve per scroll (default: 100)
`scroll`	Scroll context validity period (default: 1m)
`fields.label`	Filtering by label

Specifying Search Queries

You can specify search queries just like normal searches.

Example: Keyword search

curl "http://localhost:8080/json/scroll?q=search engine"

Example: Field-specific search

curl "http://localhost:8080/json/scroll?q=title:Fess"

Example: Retrieve all (no search conditions)

curl "http://localhost:8080/json/scroll?q=*:*"

Specifying Retrieval Count

You can change the number of items retrieved per scroll.

curl "http://localhost:8080/json/scroll?q=Fess&size=500"

Note

Setting the size parameter too large increases memory usage. It is recommended to set it in the range of 100-1000.

Filtering by Label

You can retrieve only documents belonging to a specific label.

curl "http://localhost:8080/json/scroll?q=*:*&fields.label=public"

When Authentication is Required

When using role-based search, you need to include authentication information.

curl -u username:password "http://localhost:8080/json/scroll?q=Fess"

Or use an API token:

curl -H "Authorization: Bearer YOUR_API_TOKEN" \
     "http://localhost:8080/json/scroll?q=Fess"

Response Format

NDJSON Format

Scroll search responses are returned in NDJSON (Newline Delimited JSON) format. Each line represents one document.

Example:

{"url":"http://example.com/page1","title":"Page 1","content":"..."}
{"url":"http://example.com/page2","title":"Page 2","content":"..."}
{"url":"http://example.com/page3","title":"Page 3","content":"..."}

Response Fields

Main fields included by default:

url: Document URL
title: Title
content: Body (excerpt)
score: Search score
boost: Boost value
created: Creation date/time
last_modified: Last modified date/time

Data Processing Examples

Python Processing Example

import requests
import json

# Execute scroll search
url = "http://localhost:8080/json/scroll"
params = {
    "q": "Fess",
    "size": 100
}

response = requests.get(url, params=params, stream=True)

# Process NDJSON response line by line
for line in response.iter_lines():
    if line:
        doc = json.loads(line)
        print(f"Title: {doc.get('title')}")
        print(f"URL: {doc.get('url')}")
        print("---")

Saving to File

Example of saving search results to a file:

curl "http://localhost:8080/json/scroll?q=*:*" > all_documents.ndjson

Converting to CSV

Example of converting to CSV using jq command:

curl "http://localhost:8080/json/scroll?q=Fess" | \
    jq -r '[.url, .title, .score] | @csv' > results.csv

Data Analysis

Example of analyzing retrieved data:

import json
import pandas as pd
from collections import Counter

# Read NDJSON file
documents = []
with open('all_documents.ndjson', 'r') as f:
    for line in f:
        documents.append(json.loads(line))

# Convert to DataFrame
df = pd.DataFrame(documents)

# Basic statistics
print(f"Total documents: {len(df)}")
print(f"Average score: {df['score'].mean()}")

# URL domain analysis
df['domain'] = df['url'].apply(lambda x: x.split('/')[2])
print(df['domain'].value_counts())

Performance and Best Practices

Efficient Usage

Set appropriate size parameter
- Too small increases communication overhead
- Too large increases memory usage
- Recommended: 100-1000
Optimize search conditions
- Specify search conditions to retrieve only necessary documents
- Execute full retrieval only when truly necessary
Use off-peak hours
- Retrieve large amounts of data during low system load periods
Use in batch processing
- Execute periodic data synchronization as batch jobs

Optimizing Memory Usage

When processing large amounts of data, use streaming processing to reduce memory usage.

import requests
import json

url = "http://localhost:8080/json/scroll"
params = {"q": "*:*", "size": 100}

# Process with streaming
with requests.get(url, params=params, stream=True) as response:
    for line in response.iter_lines(decode_unicode=True):
        if line:
            doc = json.loads(line)
            # Process document
            process_document(doc)

Security Considerations

Access Restrictions

Since scroll search returns large amounts of data, set appropriate access restrictions.

IP Address Restriction

Allow access only from specific IP addresses
API Authentication

Use API tokens or Basic authentication
Role-Based Restrictions

Allow access only to users with specific roles

Rate Limiting

To prevent excessive access, it is recommended to configure rate limiting with a reverse proxy.

Troubleshooting

Cannot Use Scroll Search

Verify that api.search.scroll is set to true.
Verify that Fess was restarted.
Check error logs.

Timeout Errors Occur

Increase the value of api.search.scroll.timeout.
Reduce the size parameter to distribute processing.
Narrow search conditions to reduce amount of retrieved data.

Out of Memory Errors

Reduce the size parameter.
Increase Fess heap memory size.
Check OpenSearch heap memory size.

Response is Empty

Verify that the search query is correct.
Verify that specified labels or filter conditions are correct.
Verify role-based search permission settings.

References

Search Features - Search features details
Search-Related Settings - Search-related settings
OpenSearch Scroll API

Menu

Bulk Retrieval of Search Results

Overview

Use Cases

Configuration

Enabling Scroll Search

Response Field Configuration

Scroll Timeout Configuration

Usage

Basic Usage

Request Parameters

Specifying Search Queries

Specifying Retrieval Count

Filtering by Label

When Authentication is Required

Response Format

NDJSON Format

Response Fields

Data Processing Examples

Python Processing Example

Saving to File

Converting to CSV

Data Analysis

Performance and Best Practices

Efficient Usage

Optimizing Memory Usage

Security Considerations

Access Restrictions

Rate Limiting

Troubleshooting

Cannot Use Scroll Search

Timeout Errors Occur

Out of Memory Errors

Response is Empty

References