Overview
The Box Connector provides functionality to retrieve files from Box.com cloud storage and register them in the Fess index.
This connector authenticates to the enterprise using JWT (Server Authentication) and recursively crawls the files accessible to each enterprise user by impersonating that user. The users to crawl can be narrowed down with the filter_term parameter.
This feature requires the fess-ds-box plugin.
Prerequisites
Plugin installation is required
A Box developer account and a Box application are required
JWT (JSON Web Token) authentication setup is required
Installing the Plugin
Method 1: Direct JAR file placement
Method 2: Install from admin console
Open “System” -> “Plugins”
Upload the JAR file
Restart Fess
Configuration
Configure in the admin console under “Crawler” -> “Data Store” -> “Create New”.
Basic Settings
| Item | Example |
|---|---|
| Name | Company Box Storage |
| Handler Name | BoxDataStore |
| Enabled | On |
Parameter Configuration
JWT authentication example:
Parameter List
Authentication Parameters (Required)
Crawl Parameters (Optional)
| Parameter | Default Value | Description |
|---|---|---|
| max_size | 10000000 | Maximum file size (bytes) to crawl. Default is 10 MB. |
| supported_mimetypes | .* | MIME types to crawl (regular expression). Multiple values can be specified separated by commas. |
| include_pattern | (none) | URL pattern to include in crawl targets |
| exclude_pattern | (none) | URL pattern to exclude from crawl targets |
| number_of_threads | 1 | Number of threads for crawl processing |
| ignore_folder | true | Whether to exclude folders from indexing. In the current implementation, folders themselves are not indexed (only files are targeted), so this parameter has no effect. |
| ignore_error | true | Whether to continue processing when an error occurs |
| filter_term | (none) | Filter condition to narrow down the enterprise users to crawl. If not specified, all enterprise users are targeted. |
| fields | (all fields) | Specification of fields to retrieve from the Box API |
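The interaction between max_size and supported_mimetypes can be illustrated with a small Python sketch. This is not the connector's code: the function name should_crawl is hypothetical, and it assumes that a file is skipped only when its size exceeds max_size and that each comma-separated pattern must match the whole MIME type string.

```python
import re


def should_crawl(mimetype: str, size: int,
                 supported_mimetypes: str = ".*",
                 max_size: int = 10_000_000) -> bool:
    """Return True if a file passes the size and MIME type checks."""
    # Assumption: files larger than max_size are skipped; equal size is allowed.
    if size > max_size:
        return False
    # Split the comma-separated parameter into individual regular expressions.
    patterns = [p.strip() for p in supported_mimetypes.split(",") if p.strip()]
    # Assumption: a pattern has to match the entire MIME type, not a substring.
    return any(re.fullmatch(p, mimetype) for p in patterns)


if __name__ == "__main__":
    print(should_crawl("application/pdf", 2048))
    print(should_crawl("image/png", 2048, "application/pdf,text/.*"))
```

Under these assumptions, a value such as application/pdf,text/.* would accept PDF files and any text/* type while rejecting images.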
Connection Parameters (Optional)
| Parameter | Default Value | Description |
|---|---|---|
| base_url | https://app.box.com | Base URL used to construct the URL for opening a file in a browser (file.url). It does not affect the API endpoints used by the Box SDK. |
| max_retry_count | 10 | Maximum number of retries for API calls |
| proxy_host | (none) | HTTP proxy host name |
| proxy_port | (none) | HTTP proxy port number |
| refresh_token_interval | 3540 | Token refresh interval (seconds). Default is 59 minutes. |
Script Configuration
Available Fields
Primary Fields
| Field | Description |
|---|---|
| file.url | Link to open the file in a browser |
| file.contents | Text content of the file |
| file.mimetype | File MIME type |
| file.filetype | File type |
| file.name | File name |
| file.size | File size (bytes) |
| file.created_at | Creation date and time |
| file.modified_at | Last modified date and time |
| file.download_url | Box direct download URL |
| file.id | Box item ID |
| file.description | File description |
| file.extension | File extension |
| file.sha1 | SHA1 hash of the file |
| file.path_collection | List of folder paths |
Metadata Fields
| Field | Description |
|---|---|
| file.type | Item type (“file” or “folder”) |
| file.file_version | File version information |
| file.sequence_id | Sequence ID |
| file.etag | ETag hash |
| file.trashed_at | Date and time moved to trash |
| file.purged_at | Date and time permanently deleted |
| file.content_created_at | Content creation date and time |
| file.content_modified_at | Content modified date and time |
| file.created_by | Creator information |
| file.modified_by | Modifier information |
| file.owned_by | Owner information |
| file.shared_link | Shared link information |
| file.parent | Parent folder information |
| file.item_status | Item status |
| file.version_number | Version number |
| file.comment_count | Comment count |
| file.permissions | Permission information |
| file.tags | Tag information |
| file.lock | Lock information |
| file.is_package | Package flag |
| file.is_watermark | Watermark flag |
| file.collections | Collection information |
| file.representations | Representation format information |
| file.api | Box file API object (for retrieving collaboration and permission information) |
For details, refer to Box File Object.
Box Authentication Setup
JWT Authentication Setup Steps
1. Create an Application in Box Developer Console
Access https://app.box.com/developers/console:
Click “Create New App”
Select “Custom App”
Select “Server Authentication (with JWT)” for authentication method
Enter app name and create
2. Application Configuration
In the “Configuration” tab:
Application Scopes:
Check “Read all files and folders stored in Box”
Advanced Features:
Click “Generate a Public/Private Keypair”
Download the generated JSON file (important!)
App Access Level:
Select “App + Enterprise Access”
4. Obtain Authentication Credentials
Obtain the following information from the downloaded JSON file:
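As a rough aid, the following Python sketch pulls the credential values out of the downloaded file. It assumes the JSON uses the layout Box normally generates for app settings (boxAppSettings, appAuth, enterpriseID); the function name extract_credentials is hypothetical, and public_key_id is an assumed parameter name that should be checked against the authentication parameter list.

```python
import json

# Fess parameter name -> path of keys inside the Box JSON file.
# Assumption: "public_key_id" is the parameter name for the key ID;
# verify it against the connector's authentication parameters.
FIELD_PATHS = {
    "client_id": ("boxAppSettings", "clientID"),
    "client_secret": ("boxAppSettings", "clientSecret"),
    "public_key_id": ("boxAppSettings", "appAuth", "publicKeyID"),
    "private_key": ("boxAppSettings", "appAuth", "privateKey"),
    "passphrase": ("boxAppSettings", "appAuth", "passphrase"),
    "enterprise_id": ("enterpriseID",),
}


def extract_credentials(config_text: str) -> dict:
    """Map values from the Box JSON text to Fess-style parameter names."""
    config = json.loads(config_text)
    result = {}
    for name, path in FIELD_PATHS.items():
        value = config
        # Walk down the nested objects one key at a time.
        for key in path:
            value = value[key]
        result[name] = value
    return result
```

The private_key value returned here still contains real line breaks after JSON decoding, so it needs the single-line conversion described under Private Key Format before it is pasted into the parameters.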
Private Key Format
Replace newlines in private_key with \n to make it a single line:
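A minimal Python sketch of that conversion is shown here. The function name to_single_line is hypothetical, and the sample key body is a placeholder rather than real key material.

```python
def to_single_line(pem: str) -> str:
    """Replace every real line break with the two characters backslash and n."""
    # Normalize Windows (CRLF) and bare CR line endings first.
    normalized = pem.replace("\r\n", "\n").replace("\r", "\n")
    # "\\n" is a backslash followed by the letter n, not a line break.
    return normalized.replace("\n", "\\n")


if __name__ == "__main__":
    # Placeholder key text for illustration only.
    sample = (
        "-----BEGIN ENCRYPTED PRIVATE KEY-----\n"
        "PLACEHOLDER\n"
        "-----END ENCRYPTED PRIVATE KEY-----\n"
    )
    print("private_key=" + to_single_line(sample))
```

The result is a single physical line that can be pasted as the private_key value.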
Usage Examples
Crawling Entire Company Box Storage
Parameters:
Script:
Crawling Only Specific Folders
Filtering by folder path is possible using the include_pattern parameter.
Parameters:
Script:
Crawling Only PDF Files
Filtering by MIME type is possible using the supported_mimetypes parameter.
Parameters:
Script:
Troubleshooting
Authentication Errors
Symptom: Authentication failed or Invalid grant
Check:
Verify client_id and client_secret are correct
Verify private key is correctly copied (newlines are \n)
Verify passphrase is correct
Verify app is authorized in Box admin console
Verify enterprise_id is correct
Private Key Format Errors
Symptom: Invalid private key format
Resolution:
Verify newlines are correctly converted to \n:
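One way to sanity-check a private_key value before saving it is a small Python helper like the sketch below. The function name check_private_key_value and the specific checks are assumptions for illustration; they do not reproduce the connector's own validation.

```python
def check_private_key_value(value: str) -> list:
    """Return a list of likely formatting problems; empty means none found."""
    problems = []
    # The value must fit on one physical line.
    if "\n" in value or "\r" in value:
        problems.append("value contains real line breaks")
    # Line breaks of the original key should appear as backslash + n.
    if "\\n" not in value:
        problems.append("value has no literal \\n separators")
    # A PEM-encoded key is wrapped in BEGIN and END marker lines.
    if not value.startswith("-----BEGIN"):
        problems.append("value does not start with a PEM BEGIN marker")
    if "-----END" not in value:
        problems.append("value has no PEM END marker")
    return problems
```

An empty list only means the value has the expected shape. It does not confirm that the key or passphrase is valid.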
Cannot Retrieve Files
Symptom: Crawl succeeds but 0 files
Check:
Verify “Read all files and folders” is enabled in Application Scopes
Verify App Access Level is set to “App + Enterprise Access”
Verify files actually exist in Box storage
Verify service account has appropriate permissions
When There Are a Large Number of Files
Symptom: Crawling takes a long time or times out
Resolution:
Split processing in the data store settings:
Adjust crawl interval
Divide into multiple data stores (e.g., by folder)
Increase thread count with the number_of_threads parameter
Distribute load with schedule settings
Permissions and Access Control
Reflecting Box Collaboration Permissions
Through the BoxFileAPI object provided by the file.api field, you can map Box collaboration information to Fess search roles. file.api.collaborationRoles returns a list of search roles corresponding to the users and groups that can access the file.
Set permissions in the script:
Note
file.api.collaborationRoles retrieves collaboration information for each file, which increases the number of Box API calls and may slow down crawling.
To assign a fixed role to all files, specify it as follows:
Reference Information
Data Store Connector Overview
Dropbox Connector
Google Workspace Connector
Data Store Crawling - Data Store Configuration Guide