Use of LibreOffice
This page is generated by Machine Translation from Japanese.
Use of OpenOffice/LibreOffice
In its standard configuration, Fess crawls MS Office documents using Apache POI. Crawling Office documents with OpenOffice or LibreOffice instead allows more accurate text extraction from those documents.
How to set up
Install JodConverter on the Fess server. Download it from http://jodconverter.googlecode.com/jodconverter-core-3.0-Beta-4-Dist.zip, then extract the archive and copy the jar files to the Fess server.
$ unzip jodconverter-core-3.0-beta-4-dist.zip
$ cp jodconverter-core-3.0-beta-4/lib/juh-3.2.1.jar \
jodconverter-core-3.0-beta-4/lib/jurt-3.2.1.jar \
jodconverter-core-3.0-beta-4/lib/ridl-3.2.1.jar \
jodconverter-core-3.0-beta-4/lib/unoil-3.2.1.jar \
jodconverter-core-3.0-beta-4/lib/jodconverter-core-3.0-beta-4.jar \
fess-server-9.2.0/webapps/fess/WEB-INF/cmd/lib/
$ cd fess-server-9.2.0/
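As a quick sanity check after the copy step, the following sketch reports any of the five jar files that did not arrive in the target directory. The function name check_jars is hypothetical and not part of Fess or JodConverter; the jar names are taken from the cp command above.

```shell
#!/bin/sh
# check_jars DIR: print each expected jar that is missing from DIR.
# Returns 0 when all five jars are present, 1 otherwise.
# (check_jars is an illustrative helper, not a Fess command.)
check_jars() {
    dir=$1
    missing=0
    for jar in juh-3.2.1.jar jurt-3.2.1.jar ridl-3.2.1.jar unoil-3.2.1.jar \
               jodconverter-core-3.0-beta-4.jar; do
        if [ ! -f "$dir/$jar" ]; then
            echo "missing: $jar"
            missing=1
        fi
    done
    return "$missing"
}
```

Run from inside fess-server-9.2.0/, it could be used as: check_jars webapps/fess/WEB-INF/cmd/lib && echo "all jars present"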
Next, create s2robot_extractor.dicon.
$ vi webapps/fess/WEB-INF/classes/s2robot_extractor.dicon
Give s2robot_extractor.dicon the following contents, which enable jodExtractor.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE components PUBLIC "-//SEASAR//DTD S2Container 2.4//EN"
"http://www.seasar.org/dtd/components24.dtd">
<components>
<component name="tikaExtractor" class="org.seasar.robot.extractor.impl.TikaExtractor"/>
<component name="msWordExtractor"
class="org.seasar.robot.extractor.impl.MsWordExtractor"/>
<component name="msExcelExtractor"
class="org.seasar.robot.extractor.impl.MsExcelExtractor"/>
<component name="msPowerPointExtractor"
class="org.seasar.robot.extractor.impl.MsPowerPointExtractor"/>
<component name="msPublisherExtractor"
class="org.seasar.robot.extractor.impl.MsPublisherExtractor"/>
<component name="msVisioExtractor"
class="org.seasar.robot.extractor.impl.MsVisioExtractor"/>
<component name="pdfExtractor" class="org.seasar.robot.extractor.impl.PdfExtractor"/>
<component name="textExtractor" class="org.seasar.robot.extractor.impl.TextExtractor"/>
<component name="htmlExtractor" class="org.seasar.robot.extractor.impl.HtmlExtractor"/>
<component name="xmlExtractor" class="org.seasar.robot.extractor.impl.XmlExtractor"/>
<component name="htmlXpathExtractor"
class="org.seasar.robot.extractor.impl.HtmlXpathExtractor">
<initMethod name="addFeature">
<arg>"http://xml.org/sax/features/namespaces"</arg>
<arg>"false"</arg>
</initMethod>
</component>
<component name="officeManagerConfiguration"
class="org.artofsolving.jodconverter.office.DefaultOfficeManagerConfiguration">
</component>
<component name="jodExtractor"
class="org.seasar.robot.extractor.impl.JodExtractor">
<property name="officeManager">
officeManagerConfiguration.setOfficeHome("/usr/lib/libreoffice")
.buildOfficeManager()
</property>
</component>
<component name="extractorFactory" class="org.seasar.robot.extractor.ExtractorFactory">
<initMethod name="addExtractor">
<arg>{
"application/msword",
"application/vnd.ms-excel",
"application/vnd.ms-powerpoint",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
"application/vnd.openxmlformats-officedocument.presentationml.presentation"
}</arg>
<arg>jodExtractor</arg>
</initMethod>
...
Once these settings are in place, crawl as usual to generate the index.
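If the crawl fails to start the office process, one thing worth checking is the path passed to setOfficeHome in the configuration (/usr/lib/libreoffice in this example). The sketch below assumes a Linux installation where the office binary is at program/soffice.bin under the office home; the location varies by distribution and version, and check_office_home is a hypothetical helper name.

```shell
#!/bin/sh
# check_office_home DIR: succeed if DIR looks like an office installation.
# Assumption: on Linux the binary is DIR/program/soffice.bin.
# (check_office_home is an illustrative helper, not a Fess command.)
check_office_home() {
    if [ -f "$1/program/soffice.bin" ]; then
        echo "ok: $1"
    else
        echo "soffice.bin not found under $1/program" >&2
        return 1
    fi
}
```

For the configuration shown here it could be called as: check_office_home /usr/lib/libreoffice. If it fails, the setOfficeHome argument in s2robot_extractor.dicon likely needs to point at the actual installation directory.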