Git Connector

Overview

The Git Connector retrieves files from Git repositories and registers them in the Fess index.

This feature requires the fess-ds-git plugin.

Supported Repositories

GitHub (public/private)
GitLab (public/private)
Bitbucket (public/private)
Local Git repositories
Other Git hosting services

Prerequisites

Plugin installation is required
Authentication credentials are required for private repositories
Read access to the repository is required

Plugin Installation

Install the plugin from “System” -> “Plugins” in the admin console. For details, see the Plugin page.

Configuration

Create the data store from the admin console via “Crawler” -> “Data Store” -> “Create New”.

Basic Settings

Item	Example
Name	Project Git Repository
Handler Name	GitDataStore
Enabled	On

Parameter Settings

Public repository example:

uri=https://github.com/codelibs/fess.git
base_url=https://github.com/codelibs/fess/blob/master/
extractors=text/.*:textExtractor,application/xml:textExtractor,application/javascript:textExtractor,
prev_commit_id=

Private repository example (with authentication):

uri=https://username:personal_access_token@github.com/company/private-repo.git
base_url=https://github.com/company/private-repo/blob/master/
extractors=text/.*:textExtractor,application/xml:textExtractor,
prev_commit_id=

Parameter List

Parameter	Required	Description
`uri`	Yes	Git repository URI (for cloning)
`base_url`	No	Base URL for file viewing. If not set, URLs will be empty and automatic deletion of removed files will be disabled
`username`	No	Git authentication username. Used with `password` as an alternative to embedding credentials in URI
`password`	No	Git authentication password or token. Used with `username`
`extractors`	No	Extractor settings by MIME type
`default_extractor`	No	Fallback extractor when no MIME pattern matches (default: `tikaExtractor`)
`prev_commit_id`	No	Previous commit ID (for incremental crawl). Automatically updated after successful crawl
`commit_id`	No	Target commit ID (default: HEAD). Branch or tag can be specified
`ref_specs`	No	Git ref specs (default: `+refs/heads/*:refs/heads/*`)
`repository_path`	No	Local repository path. If not set, a temporary directory is created and cleaned up after crawl
`include_pattern`	No	File path inclusion filter (regex)
`exclude_pattern`	No	File path exclusion filter (regex)
`max_size`	No	Maximum file size to index in bytes (default: `10000000`)
`cache_threshold`	No	Threshold in bytes for switching between memory and disk buffering (default: `1000000`)
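To check include_pattern, exclude_pattern, and max_size values before a crawl, the filtering they describe can be modeled roughly as follows. This is only a sketch: the function name is hypothetical, and it assumes the patterns must match the whole file path and that files larger than max_size are skipped. The connector's actual matching rules may differ, so patterns written as `.*\.md` rather than `\.md$` are the safer choice.

```python
import re
from typing import Optional


def is_crawl_target(path: str, size: int,
                    include_pattern: Optional[str] = None,
                    exclude_pattern: Optional[str] = None,
                    max_size: int = 10_000_000) -> bool:
    """Return True if a file passes the size and path filters.

    Assumption: patterns are applied to the full path (re.fullmatch).
    """
    if size > max_size:
        return False  # too large to index
    if include_pattern and not re.fullmatch(include_pattern, path):
        return False  # not on the inclusion list
    if exclude_pattern and re.fullmatch(exclude_pattern, path):
        return False  # explicitly excluded
    return True


# Markdown files under docs/ only, skipping anything in a vendor directory.
for candidate in ["docs/guide.md", "docs/vendor/x.md", "src/Main.java"]:
    ok = is_crawl_target(candidate, 2_048,
                         include_pattern=r"docs/.*\.md",
                         exclude_pattern=r".*/vendor/.*")
    print(candidate, ok)
```

Running candidate paths from your own repository through a check like this can catch a pattern that accidentally excludes everything.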

Script Settings

url=url
host="github.com"
site="github.com/codelibs/fess/" + path
title=name
content=content
cache=""
digest=author.toExternalString()
anchor=
content_length=contentLength
last_modified=timestamp
mimetype=mimetype

Available Fields

Field	Description
`url`	File URL
`path`	File path within repository
`name`	File name
`content`	File text content
`contentLength`	Content length
`timestamp`	Last modified date
`mimetype`	File MIME type
`author`	Last commit author information (PersonIdent)
`committer`	Committer information (PersonIdent). May differ from author
`uri`	Git repository URI

Git Repository Authentication

GitHub Personal Access Token

1. Generate Personal Access Token on GitHub

Access https://github.com/settings/tokens:

Click “Generate new token” -> “Generate new token (classic)”
Enter token name (e.g., Fess Crawler)
Check “repo” in scopes
Click “Generate token”
Copy the generated token

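2. Include authentication in URI

Embed the username and the token in the repository URI, in the same form as the private repository example under Parameter Settings:

uri=https://username:personal_access_token@github.com/company/private-repo.git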

GitLab Private Token

1. Generate Access Token on GitLab

GitLab User Settings -> Access Tokens:

Enter token name
Check “read_repository” in scopes
Click “Create personal access token”
Copy the generated token

2. Include authentication in URI
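For either provider, a username or token that contains characters such as `@`, `:` or `/` has to be percent-encoded before it is embedded in the URI. The following sketch shows one way to assemble such a URI. The function name, host, and repository path are hypothetical placeholders, and it assumes the connector passes the URI to the Git client unchanged.

```python
from urllib.parse import quote


def build_auth_uri(host: str, repo_path: str, username: str, token: str) -> str:
    """Build an HTTPS clone URI with percent-encoded credentials."""
    user = quote(username, safe="")   # encode every reserved character
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@{host}/{repo_path}"


# Placeholder values; substitute your own host, repository, and token.
print(build_auth_uri("github.com", "company/private-repo.git",
                     "username", "personal_access_token"))
```

Setting the username and password parameters instead of embedding credentials in the URI avoids the encoding step altogether.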

SSH Authentication

When using an SSH key:

Note

When using SSH authentication, the SSH key must be configured for the user running Fess.

Extractor Settings

Extractors by MIME Type

Use the extractors parameter to assign an extractor to each file type:

Format: <MIME type regex>:<extractor name>,

Default Extractors

textExtractor - For text files
tikaExtractor - For binary files (PDF, Word, etc.)
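To see how an extractors value resolves for a given file, the lookup can be modeled as below. This is a sketch, not the plugin's code: the function names are hypothetical, and it assumes rules are tried in order, that the MIME regex must match the whole MIME type, and that default_extractor applies when nothing matches.

```python
import re
from typing import List, Tuple


def parse_extractors(value: str) -> List[Tuple[re.Pattern, str]]:
    """Parse '<MIME type regex>:<extractor name>,' entries into rules."""
    rules = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # the trailing comma produces an empty entry
        pattern, sep, name = entry.rpartition(":")
        if not sep:
            continue  # skip malformed entries that lack a colon
        rules.append((re.compile(pattern), name))
    return rules


def select_extractor(mimetype: str,
                     rules: List[Tuple[re.Pattern, str]],
                     default_extractor: str = "tikaExtractor") -> str:
    """Return the first matching extractor name, or the fallback."""
    for pattern, name in rules:
        if pattern.fullmatch(mimetype):
            return name
    return default_extractor


rules = parse_extractors(
    "text/.*:textExtractor,application/xml:textExtractor,"
    "application/javascript:textExtractor,"
)
print(select_extractor("text/x-java-source", rules))
print(select_extractor("application/pdf", rules))
```

Under these assumptions, a source file reported as `text/x-java-source` goes to textExtractor, while a PDF falls through to the default.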

Crawl Text Files Only

Crawl All Files

Specific File Types Only

Incremental Crawl

Crawl Only Changes Since Last Commit

After the initial crawl, set prev_commit_id to the previous commit ID:

Note

prev_commit_id is automatically updated to the latest commit ID after a successful crawl. Set it to empty for the initial crawl to process all files; subsequent crawls will only process changes.

Handling Deleted Files

When base_url is set, files detected as deleted via Git DiffEntry (ChangeType.DELETE) are automatically removed from the index.
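The interaction between incremental crawling and base_url can be pictured with a small model. The sketch below is illustrative only: the function name is hypothetical, the change types borrow the names of JGit's DiffEntry.ChangeType values, and the inputs stand in for the diff between prev_commit_id and the target commit.

```python
from typing import Iterable, List, Tuple


def plan_incremental_crawl(changes: Iterable[Tuple[str, str]],
                           base_url: str) -> Tuple[List[str], List[str]]:
    """Split diff entries into paths to index and URLs to delete.

    Deletions are only actionable when base_url is set, because an
    indexed document is identified by its URL.
    """
    to_index: List[str] = []
    to_delete: List[str] = []
    for change_type, path in changes:
        if change_type == "DELETE":
            if base_url:
                to_delete.append(base_url + path)
        else:
            to_index.append(path)  # ADD and MODIFY are (re)indexed
    return to_index, to_delete


changes = [("ADD", "docs/new.md"), ("MODIFY", "README.md"), ("DELETE", "docs/old.md")]
print(plan_incremental_crawl(changes, "https://github.com/codelibs/fess/blob/master/"))
print(plan_incremental_crawl(changes, ""))  # no base_url: nothing is deleted
```

In the second call the deleted file stays in the index, which is why base_url is worth setting even though the parameter is optional.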

Usage Examples

GitHub Public Repository

Parameters:

Script:

GitHub Private Repository

Parameters:

uri=https://username:YOUR_GITHUB_TOKEN@github.com/company/repo.git
base_url=https://github.com/company/repo/blob/main/
extractors=text/.*:textExtractor,application/xml:textExtractor,application/javascript:textExtractor,

Script:

GitLab (Self-Hosted)

Parameters:

Script:

Crawl Documentation Only (Markdown Files)

Parameters:

Script:

Crawl Specific Directory Only

Filter in script:

Troubleshooting

Authentication Error

Symptom: Authentication failed or Not authorized

Check:

Verify Personal Access Token is correct
Verify token has appropriate permissions (repo scope)

Verify URI format is correct (e.g., https://username:YOUR_GITHUB_TOKEN@github.com/company/repo.git)

Verify token has not expired

Repository Not Found

Symptom: Repository not found

Check:

Verify repository URL is correct
Verify repository exists and is not deleted
Verify authentication credentials are correct
Verify access permissions to repository

Files Not Retrieved

Symptom: Crawl succeeds but 0 files are indexed

Check:

Verify extractors setting is appropriate
Verify repository contains files
Verify script settings are correct
Verify files exist on target branch

MIME Type Error

Symptom: Certain files are not crawled

Solution:

Adjust extractor settings:

Large Repositories

Symptom: Crawl takes a long time or runs out of memory

Solution:

Limit target files with extractors
Filter specific directories in script
Use incremental crawl (prev_commit_id setting)
Adjust crawl interval

Specifying Branch

To crawl a branch other than the default, specify the branch name or tag using the commit_id parameter:

URL Generation

base_url Setting Patterns

GitHub:

GitLab:

Bitbucket:

URLs are generated by combining base_url with the file path.
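As a rough illustration of that combination, the sketch below joins a base_url and a repository-relative path. The function name is hypothetical. The GitHub pattern follows the parameter examples earlier on this page, while the GitLab and Bitbucket patterns are assumptions based on each service's usual file-view URL layout, with placeholder owner and repository names. The sketch also normalizes the slash between the two parts, which the connector itself may not do, so keep the trailing slash in base_url.

```python
def build_file_url(base_url: str, path: str) -> str:
    """Combine base_url and a repository-relative path into a file URL."""
    if not base_url:
        return ""  # without base_url, the generated URL is empty
    return base_url.rstrip("/") + "/" + path.lstrip("/")


# Assumed base_url patterns; owner, repository, and branch are placeholders.
BASE_URLS = {
    "GitHub": "https://github.com/codelibs/fess/blob/master/",
    "GitLab": "https://gitlab.com/group/project/-/blob/main/",
    "Bitbucket": "https://bitbucket.org/workspace/repo/src/main/",
}

for service, base_url in BASE_URLS.items():
    print(service, build_file_url(base_url, "docs/README.md"))
```

The branch segment in base_url should name the branch that is actually crawled, otherwise the generated links point at a different revision of the file.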

URL Generation in Script

Or custom URL:

Reference

Data Store Connector Overview
Database Connector
Data Store Crawling (Data Store Configuration Guide)
GitHub Personal Access Tokens
GitLab Personal Access Tokens