Introduction
In the previous articles, we have primarily focused on searching text-based documents. However, enterprise knowledge includes a great deal of content beyond text. Product photos, engineering drawings, presentation slide images, whiteboard photos – if these “images” could also be searched, the possibilities for knowledge utilization would expand significantly.
In this article, we introduce how to build a multimodal search environment that enables cross-searching of text and images.
Target Audience
Those who face challenges in searching documents that contain images
Those interested in applications of vector search
Those who want to understand the concept of multimodal AI
What Is Multimodal Search?
Multimodal search is a technology that enables cross-searching across different types of data (text, images, audio, etc.).
For example, when you search with the text “red sports car design,” images that conceptually match are displayed in the search results. It is a mechanism that allows searching for images from text, or text from images.
CLIP Model
The foundation of multimodal search is models such as CLIP (Contrastive Language-Image Pre-Training). CLIP converts text and images into the same vector space, making it possible to calculate the similarity between text and images.
Multimodal Search in Fess
Fess can achieve cross-searching of text and images through its multimodal search plugin.
Components
The components of multimodal search are as follows:
CLIP Server: Converts text and images into vectors
OpenSearch: Searches vectors using KNN (K-Nearest Neighbor)
Fess: Provides crawling, indexing, and search UI
Setup Procedure
1. Preparing the CLIP Server
Prepare a server to run the CLIP model. An environment with GPU availability is recommended.
You can add a CLIP server using Docker Compose.
2. Installing the Plugin
Install the multimodal search plugin for Fess.
3. Configuring the KNN Index
Configure the KNN index settings to perform vector search in OpenSearch. Set the vector dimensions to match the CLIP model you are using.
4. Crawl Settings
Configure directories and websites containing images as crawl targets. Image files (PNG, JPEG, GIF, etc.) are also collected as crawl targets.
Search Experience
Searching for Images with Text
When you search with text such as “product exterior photo,” “meeting whiteboard,” or “engineering drawing,” images that conceptually match are displayed in the search results.
Thumbnail images are shown in the search results, allowing you to visually find the desired images.
Mixed Results of Text and Images
In multimodal search, search results containing a mix of text documents and images are returned. Rank Fusion (see Part 18) is used to integrate the results of text search and image search.
Use Cases
Manufacturing: Searching for Parts and Product Images
In manufacturing, vast numbers of parts photos and product images are managed. By searching with text such as “round metal part” or by searching for similar parts from a photo of a particular part, past design assets can be leveraged.
Design Teams: Managing Design Assets
Design teams manage large volumes of visual assets such as logos, icons, photo materials, and mockups. Because you can search with natural language such as “blue gradient background,” asset discovery becomes easier.
Research and Development: Searching Experimental Data
R&D departments manage graphs of experimental results, microscope photos, and images of measurement data. By making these images searchable, referencing past experimental data becomes easier.
Considerations for Deployment
Hardware Requirements
Multimodal search requires computational resources to run the CLIP model.
Recommended: GPU server (NVIDIA GPU)
Minimum: Can run on CPU, but indexing speed will be reduced
Indexing time depends on model processing speed, so a GPU environment is strongly recommended when indexing a large number of images.
Supported Image Formats
Common image formats (JPEG, PNG, GIF, BMP, TIFF, etc.) are supported. Support for images within PDFs and embedded images within office documents depends on crawl settings.
Phased Deployment
Multimodal search can be deployed as an addition to an existing text search environment.
First, conduct a trial deployment targeting directories and sites with many images
Verify search quality and usage
Gradually expand the scope
Summary
In this article, we introduced cross-searching of images and text using multimodal search.
The concept of multimodal search (unified vector space for text and images via CLIP)
Components and configuration of multimodal search in Fess
The experience of searching for images with text, and searching for similar images with images
Use cases in manufacturing, design, and research and development
GPU requirements and a phased deployment approach
In the next article, we will cover knowledge visualization in organizations through search data analysis.