Introduction
In the previous article, we organized the concepts of semantic search. In this article, as an evolution of that approach, we build an internal AI assistant using RAG (Retrieval-Augmented Generation).
RAG is a mechanism that “finds relevant documents through search, and then an LLM (Large Language Model) generates answers based on their content.” Because answers are generated based on internal documents, RAG can handle company-specific questions that a general-purpose chat AI cannot answer.
Target Audience
Those interested in building an internal AI assistant
Those who want to learn how to implement RAG
Those who want to understand the options for LLM integration
How RAG Works
RAG Pipeline
Fess’s AI search mode operates through the following pipeline:
Intent Analysis (Intent): Analyzes the user’s question and classifies intent (search, summarization, FAQ, ambiguous)
Search: Retrieves relevant documents from Fess’s index (automatically regenerates queries and re-searches on zero hits)
Evaluate: The LLM evaluates the relevance of retrieved documents
Full-Text Fetch: Retrieves the full text of highly relevant documents
Answer Generation (Answer): The LLM generates a streaming response with citations based on document content
This pipeline mitigates “plausible but inaccurate responses (hallucinations)” from the LLM and provides answers backed by internal documents.
Fess’s AI search mode does not require vector search (embedding models). It leverages existing full-text search indexes as-is, with the LLM handling search result evaluation and answer generation. This means you can introduce RAG-based AI search immediately, without additional infrastructure preparation such as selecting embedding models or building vector databases.
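The five pipeline stages above can be sketched as a small program. This is an illustrative sketch only, not Fess's actual implementation: every function below is a stub standing in for a real search call or LLM call, and all names are assumptions.

```python
# Illustrative RAG pipeline sketch: intent -> search -> evaluate ->
# full-text fetch -> answer. All functions are stubs, not the Fess API.

def classify_intent(question: str) -> str:
    """Stand-in for intent analysis (search / summarization / faq / ambiguous)."""
    return "faq" if question.endswith("?") else "search"

def search(query: str) -> list[dict]:
    """Stand-in for a Fess full-text search over the existing index."""
    return [{"title": "Employment Regulations", "score": 0.9},
            {"title": "Cafeteria menu", "score": 0.1}]

def evaluate(docs: list[dict]) -> list[dict]:
    """Stand-in for the LLM relevance check: keep only relevant documents."""
    return [d for d in docs if d["score"] >= 0.5]

def fetch_full_text(doc: dict) -> str:
    """Stand-in for retrieving the full text of a relevant document."""
    return f"(full text of {doc['title']})"

def answer(question: str, docs: list[dict]) -> str:
    """Stand-in for LLM answer generation with citations."""
    for d in docs:
        d["body"] = fetch_full_text(d)
    sources = ", ".join(d["title"] for d in docs)
    return f"Answer to '{question}' (sources: {sources})"

def rag_pipeline(question: str) -> str:
    if classify_intent(question) == "ambiguous":
        return "Could you clarify the question?"
    docs = search(question)
    if not docs:                      # zero hits: regenerate query and retry
        docs = search(question.lower())
    return answer(question, evaluate(docs))

print(rag_pipeline("How do I apply for paid leave?"))
```

Note how the evaluation step filters out the irrelevant hit before answer generation; this filtering is what grounds the answer in relevant documents and mitigates hallucination.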
Choosing an LLM Provider
Fess supports three LLM backends: OpenAI, Gemini, and Ollama. The selection criteria for each are summarized below.
Selection Criteria
When to choose a Cloud API (OpenAI / Gemini)
You want to minimize initial costs
You cannot prepare a GPU server
You prioritize answer quality above all
Sending data externally is acceptable
When to choose Local Execution (Ollama)
Sending internal data externally is not permitted
Security and privacy requirements are strict
You can prepare a GPU server
You want to reduce running costs (for high-volume usage)
Setting Up AI Search Mode
Installing the Plugin
Install the plugin corresponding to your chosen LLM provider.
Go to [System] > [Plugins] in the admin console
Install the corresponding plugin (e.g., fess-llm-openai)
Enabling AI Search Mode
Enable AI search mode in Fess’s settings.
Set rag.chat.enabled=true in the admin console settings or the configuration file.
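For example, when using the configuration file, the setting is a single line (the property name is from this article; the file location and name may vary by installation):

```properties
# Enable the RAG-based AI search mode
rag.chat.enabled=true
```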
LLM Connection Settings
Configure the connection information for your chosen LLM provider.
For OpenAI
API key
Model to use (gpt-4o, gpt-4o-mini, etc.)
Maximum token count
For Ollama
Ollama server URL
Model name to use
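As an illustration only, the connection settings listed above might look like the following when collected in a properties file. Every key name here is a hypothetical placeholder, not an actual Fess configuration key; consult the plugin's documentation for the real keys.

```properties
# Hypothetical key names for illustration only.
# --- OpenAI ---
llm.openai.api.key=YOUR_API_KEY
llm.openai.model=gpt-4o-mini
llm.openai.max.tokens=4096
# --- Ollama ---
llm.ollama.url=http://ollama:11434
llm.ollama.model=llama3
```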
Adding a service via Docker Compose
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"   # expose the Ollama API (11434 is the default port)
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama-data:          # named volume referenced above must be declared
The AI Chat Experience
Chat UI
When enabled, AI search mode is added to the Fess search screen, and users can switch between keyword search and AI chat.
Question and Answer Example
Question: “How do I apply for paid leave?”
AI Response:
To apply for paid leave, follow these steps:
Access the “Attendance Management” page on the internal portal
Select “Paid Leave” from the “Leave Request” menu
Enter the dates, select your manager, and submit the request
Please submit your request at least 3 business days before the leave date.
Reference: [Employment Regulations, Article 15](https://portal/rules/chapter15.html)
The response includes source links, allowing users to verify the original documents.
Using the Chat API (SSE)
To use AI search mode from a program, use the Chat API. The Chat API returns streaming responses via Server-Sent Events (SSE).
For streaming responses (SSE):
GET /api/v1/chat/stream?message=How to apply for paid leave
For non-streaming JSON responses:
POST /api/v1/chat
Content-Type: application/x-www-form-urlencoded
message=How to apply for paid leave
With SSE, responses are sent to the client in real time as they are generated. Users can start reading the progressively displayed response without waiting for the full answer to be generated.
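A minimal client can consume the SSE stream with only the standard library. The endpoint path below is taken from this article, but the exact shape of each event payload is an assumption, so the sketch only extracts the `data:` lines defined by the SSE format.

```python
# Minimal SSE consumer for the Chat API; the endpoint path is from the
# article, the payload format is an assumption.
import urllib.request
from urllib.parse import quote
from typing import Iterable, Iterator

def sse_data_lines(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each 'data:' line in an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8").rstrip("\r\n")
        if line.startswith("data:"):
            yield line[len("data:"):].strip()

def stream_chat(base_url: str, message: str) -> str:
    """Stream an answer from GET /api/v1/chat/stream and return the full text."""
    url = f"{base_url}/api/v1/chat/stream?message={quote(message)}"
    with urllib.request.urlopen(url) as resp:   # HTTPResponse iterates by line
        return "".join(sse_data_lines(resp))

# The parser can be exercised without a running server:
sample = [b"data: To apply for paid leave,\n", b"\n", b"data: see the portal.\n"]
print(" ".join(sse_data_lines(sample)))
```

In a real UI you would render each yielded chunk as it arrives rather than joining them at the end, which is what gives the progressive display described above.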
Conversation History
The Chat API supports session-based conversation history. Follow-up questions based on the context of previous questions are possible.
Example:
Q1: “How do I apply for paid leave?”
A1: (Response as above)
Q2: “What should I do if I miss the application deadline?”
A2: (Response based on the context of Q1)
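To keep Q1 and Q2 in the same conversation, both requests must carry a shared session identifier. The article does not document the parameter name, so `session_id` below is purely hypothetical; the point of the sketch is only that follow-up requests reuse the same identifier.

```python
# Hypothetical follow-up call: "session_id" is an assumed parameter name
# for illustration, not the documented Chat API parameter.
from urllib.parse import urlencode

def chat_url(base_url: str, message: str, session_id: str) -> str:
    """Build a Chat API URL that ties the request to a conversation session."""
    qs = urlencode({"message": message, "session_id": session_id})
    return f"{base_url}/api/v1/chat/stream?{qs}"

# Two questions in the same session share one identifier:
q1 = chat_url("https://fess.example", "How do I apply for paid leave?", "abc123")
q2 = chat_url("https://fess.example", "What if I miss the deadline?", "abc123")
print(q1)
print(q2)
```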
Tuning RAG
Improving Answer Quality
RAG answer quality is influenced by the following factors:
Search Quality
Since RAG generates answers based on search results, search quality directly affects answer quality. Improving search quality through the tuning cycle described in Part 8 also leads to improved RAG quality.
Document Quality
If the documents being searched are outdated, inaccurate, or ambiguous, RAG answer quality will also decline. Regular updates and quality management of documents are important.
Prompt Settings
Tuning the prompts (instruction text) sent to the LLM allows you to adjust the style and accuracy of responses.
Safety Considerations
Prompt Injection Countermeasures
Fess’s RAG feature has built-in prompt injection defenses. It protects against attacks that attempt to manipulate LLM behavior through malicious inputs.
Preventing Information Leakage
Since RAG generates answers based on search results, combining it with role-based search (Part 5) ensures that only answers appropriate to the user’s permissions are generated. Content from documents the user is not authorized to access is not included in RAG responses.
Summary
In this article, we explained how to build an internal AI assistant using Fess’s AI search mode.
How the RAG pipeline works (intent analysis -> search -> evaluation -> full-text fetch -> answer generation)
Selection criteria for three LLM providers (OpenAI, Gemini, Ollama)
Setting up and experiencing AI search mode
Using the Chat API (SSE) from programs
Tuning answer quality and safety considerations
With an AI assistant based on internal documents, knowledge utilization shifts from “searching” to “asking.”
In the next article, we will cover how to integrate Fess as an MCP server into AI agents.