Overview
The Git Connector provides functionality to retrieve files from Git repositories and register them in the Fess index.
This feature requires the fess-ds-git plugin.
Supported Repositories
GitHub (public/private)
GitLab (public/private)
Bitbucket (public/private)
Local Git repositories
Other Git hosting services
Prerequisites
Plugin installation is required
Authentication credentials are required for private repositories
Read access to the repository is required
Plugin Installation
Install from “System” -> “Plugins” in the admin console.
Alternatively, refer to the Plugin page for details.
Configuration
Configure from the admin console via “Crawler” -> “Data Store” -> “Create New”.
Basic Settings
| Item | Example |
|---|---|
| Name | Project Git Repository |
| Handler Name | GitDataStore |
| Enabled | On |
Parameter Settings
Public repository example:
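A sketch of the parameter settings, using the uri, base_url, and extractors parameters described later in this guide (the repository URL is illustrative):

```
uri=https://github.com/codelibs/fess.git
base_url=https://github.com/codelibs/fess/blob/master/
extractors=text/.*:textExtractor
```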
Private repository example (with authentication):
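A sketch with credentials embedded in the URI, following the standard Git-over-HTTPS convention (the angle-bracket placeholders must be replaced with real values):

```
uri=https://<username>:<token>@github.com/example/private-repo.git
base_url=https://github.com/example/private-repo/blob/main/
```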
Parameter List
Script Settings
Available Fields
| Field | Description |
|---|---|
| url | File URL |
| path | File path within repository |
| name | File name |
| content | File text content |
| contentLength | Content length |
| timestamp | Last modified date |
| mimetype | File MIME type |
| author | Last commit author information (PersonIdent) |
| committer | Committer information (PersonIdent); may differ from author |
| uri | Git repository URI |
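A minimal script sketch mapping the fields above to index fields; the index field names on the left (url, title, content, mimetype, last_modified) are assumed from common Fess data store configurations:

```
url=url
title=name
content=content
mimetype=mimetype
last_modified=timestamp
```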
Git Repository Authentication
GitHub Personal Access Token
1. Generate Personal Access Token on GitHub
Access https://github.com/settings/tokens:
Click “Generate new token” -> “Generate new token (classic)”
Enter token name (e.g., Fess Crawler)
Check “repo” in scopes
Click “Generate token”
Copy the generated token
2. Include authentication in URI
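For example, embedding the token in the HTTPS URI (the placeholder must be replaced with the generated token; owner and repo are illustrative):

```
uri=https://<your-token>@github.com/owner/repo.git
```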
GitLab Private Token
1. Generate Access Token on GitLab
GitLab User Settings -> Access Tokens:
Enter token name
Check “read_repository” in scopes
Click “Create personal access token”
Copy the generated token
2. Include authentication in URI
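For example, using GitLab's oauth2 username convention for HTTPS access with a personal access token (placeholders and the project path are illustrative):

```
uri=https://oauth2:<your-token>@gitlab.com/group/project.git
```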
SSH Authentication
When using SSH key:
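For example, an SSH-style repository URI (host and path are illustrative):

```
uri=git@github.com:owner/repo.git
```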
Note
When using SSH authentication, the SSH key must be configured for the user running Fess.
Extractor Settings
Extractors by MIME Type
Specify extractors by file type using extractors parameter:
Format: <MIME type regex>:<extractor name>, with multiple entries separated by commas.
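For example, a sketch with two entries following the format above:

```
extractors=text/.*:textExtractor,application/pdf:tikaExtractor
```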
Default Extractors
textExtractor - For text files
tikaExtractor - For binary files (PDF, Word, etc.)
Crawl Text Files Only
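A sketch that routes only text MIME types to textExtractor, so other file types are not processed:

```
extractors=text/.*:textExtractor
```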
Crawl All Files
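A sketch that routes every MIME type to tikaExtractor:

```
extractors=.*:tikaExtractor
```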
Specific File Types Only
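A sketch targeting only Markdown and PDF; the exact MIME type strings depend on how Fess detects them, so verify against your environment:

```
extractors=text/x-markdown:textExtractor,application/pdf:tikaExtractor
```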
Incremental Crawl
Crawl Only Changes Since Last Commit
After initial crawl, set prev_commit_id to the previous commit ID:
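For example (the commit hash below is a placeholder, not a real commit):

```
prev_commit_id=0123456789abcdef0123456789abcdef01234567
```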
Note
prev_commit_id is automatically updated to the latest commit ID after a successful crawl. Set it to empty for the initial crawl to process all files; subsequent crawls will only process changes.
Handling Deleted Files
When base_url is set, files detected as deleted via Git DiffEntry (ChangeType.DELETE) are automatically removed from the index.
Usage Examples
GitHub Public Repository
Parameters:
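A sketch (the repository URL is illustrative):

```
uri=https://github.com/codelibs/fess.git
base_url=https://github.com/codelibs/fess/blob/master/
```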
Script:
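A minimal mapping sketch using the fields listed under Script Settings; the index field names are assumptions based on common Fess configurations:

```
url=url
title=name
content=content
```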
GitHub Private Repository
Parameters:
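A sketch with placeholder credentials embedded in the URI (repository path is illustrative):

```
uri=https://<username>:<token>@github.com/example/private-repo.git
base_url=https://github.com/example/private-repo/blob/main/
```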
Script:
GitLab (Self-Hosted)
Parameters:
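A sketch for a self-hosted GitLab instance (host, group, and project are illustrative; the oauth2 username convention is used for token authentication):

```
uri=https://oauth2:<token>@gitlab.example.com/group/project.git
base_url=https://gitlab.example.com/group/project/-/blob/main/
```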
Script:
Crawl Documentation Only (Markdown Files)
Parameters:
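A sketch restricting extraction to Markdown files; the repository URL is illustrative and the MIME type string is an assumption that depends on how Fess detects Markdown:

```
uri=https://github.com/example/docs-repo.git
base_url=https://github.com/example/docs-repo/blob/main/
extractors=text/x-markdown:textExtractor
```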
Script:
Crawl Specific Directory Only
Filter in script:
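One possible sketch, assuming the script language is Groovy (the Fess default) and that an empty content field effectively excludes the file; verify the skip behavior in your Fess version:

```
url=url
title=name
content=path.startsWith("docs/") ? content : ""
```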
Troubleshooting
Authentication Error
Symptom: Authentication failed or Not authorized
Check:
Verify Personal Access Token is correct
Verify token has appropriate permissions (repo scope)
Verify URI format is correct
Verify token has not expired
Repository Not Found
Symptom: Repository not found
Check:
Verify repository URL is correct
Verify repository exists and is not deleted
Verify authentication credentials are correct
Verify access permissions to repository
Files Not Retrieved
Symptom: Crawl succeeds but 0 files
Check:
Verify extractors setting is appropriate
Verify repository contains files
Verify script settings are correct
Verify files exist on target branch
MIME Type Error
Symptom: Certain files are not crawled
Solution:
Adjust extractor settings:
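For example, a sketch adding a catch-all fallback so files whose MIME type matches no earlier entry are still processed by tikaExtractor:

```
extractors=text/.*:textExtractor,.*:tikaExtractor
```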
Large Repositories
Symptom: Crawl takes long time or runs out of memory
Solution:
Limit target files with extractors
Filter specific directories in script
Use incremental crawl (prev_commit_id setting)
Adjust crawl interval
Specifying Branch
To crawl a branch other than the default, specify the branch name or tag using the commit_id parameter:
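For example (the branch name is illustrative):

```
commit_id=develop
```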
URL Generation
base_url Setting Patterns
GitHub:
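A sketch of the GitHub pattern (angle-bracket parts are placeholders):

```
base_url=https://github.com/<owner>/<repo>/blob/<branch>/
```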
GitLab:
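A sketch of the GitLab pattern (angle-bracket parts are placeholders):

```
base_url=https://gitlab.com/<group>/<project>/-/blob/<branch>/
```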
Bitbucket:
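A sketch of the Bitbucket pattern (angle-bracket parts are placeholders):

```
base_url=https://bitbucket.org/<workspace>/<repo>/src/<branch>/
```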
URLs are generated by combining base_url with the file path.
URL Generation in Script
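When base_url is set, the url field already contains the combined URL, so a plain mapping should suffice (a sketch):

```
url=url
```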
Or custom URL:
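A sketch assuming Groovy script syntax; the viewer host and query format are entirely hypothetical:

```
url="https://git.example.com/view?path=" + path
```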
Reference
Data Store Connector Overview
Database Connector
Data Store Crawling - Data Store Configuration Guide