Part 7: Crawling Sites with Authentication

<<This page is generated by Machine Translation from Japanese. Pull Request is welcome!>>

It has been a while since the previous installment. This time, we explain how to crawl sites that require authentication.

Many websites have restricted areas that become accessible only after logging in. Websites authenticate users in various ways, and Fess can crawl sites protected by these methods. Fess supports Basic, Digest, NTLM, and Form authentication.

What is web authentication?

In Fess, web authentication refers to the authentication performed by a website that requires a login. A website can set up web authentication so that only certain users can access it.

There are various types of web authentication; here we briefly explain the methods supported by Fess.

Basic authentication is one of the standard authentication methods defined for HTTP. The client accesses the site by sending the credentials in the Authorization field of the HTTP request header. Digest authentication and NTLM authentication also exchange credentials through HTTP headers, similar to Basic authentication.
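To make the Authorization field concrete, here is a minimal Python sketch of how a client builds the header value for Basic authentication. It only illustrates the HTTP mechanism and is not taken from Fess itself; the function name and the credentials are placeholders chosen for this example.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of the Authorization header for Basic authentication."""
    # The credentials are joined with a colon and Base64-encoded.
    # Base64 is an encoding, not encryption, so HTTPS is still needed.
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


# Placeholder credentials for illustration only.
header = basic_auth_header("testuser", "testpass")
print("Authorization:", header)
```

Because the value can be decoded by anyone who sees the request, sites that use Basic authentication are normally served over HTTPS.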

Form authentication differs from the methods above in that the user logs in through a login form rather than through HTTP authentication headers, and the site identifies the logged-in user with cookies and similar session information. Form authentication is used in many web applications.

How to set up web authentication

This section explains how to crawl sites that use web authentication. This time, we use Fess 12.3.1. The Fess ZIP file can be obtained from the download page. Extract the ZIP file and run bin/fess.[sh|bat] to start Fess.

First, open the Fess administration screen in your browser and create a crawl configuration under “Crawl” > “Web”. Create this web crawl configuration just as you would for a normal site crawl.

Select “Crawl” > “Web Authentication” from the menu on the left to display the web authentication setting list screen.

Press the “New” button at the upper right to display the screen for creating a web authentication setting. The main setting items are as follows.

Item          Description
Hostname      Host name of the target site (any host name if omitted)
Port          Port number of the target site (any port number if omitted)
Realm         Realm of the target site (any realm name if omitted)
Scheme        Authentication method
Username      User name used to log in to the target site
Password      Password used to log in to the target site
Parameters    Additional settings required to log in to the target site, if any
Web Settings  Name of the web crawl configuration that crawls the authenticated site

The following are example settings for crawling sites with Basic, Digest, NTLM, and Form authentication.

Basic authentication

Consider crawling a site that is protected by Basic authentication and has the following settings.

Item      Value
URL       https://basic.codelibs.org/
Username  testuser
Password  testpass

If the web crawl configuration is named BasicAuth Example, configure web authentication as follows.

Item          Value
Hostname      basic.codelibs.org
Port          (omitted)
Realm         (omitted)
Scheme        Basic
Username      testuser
Password      testpass
Parameters    (not entered)
Web Settings  BasicAuth Example

The host name can be omitted. However, if a single crawl configuration needs more than one web authentication setting, specify the host name in each setting so that the correct credentials are used for each site.
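The effect of specifying or omitting the host name and port can be pictured with a small Python sketch. This is an illustration under assumptions, not Fess's actual implementation: the WebAuth class, the select_auth function, and the rule of preferring the most specific entry are all hypothetical, as are the second user's credentials.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WebAuth:
    """Hypothetical stand-in for one web authentication setting."""
    username: str
    password: str
    hostname: Optional[str] = None  # None means "any host name"
    port: Optional[int] = None      # None means "any port number"

    def matches(self, host: str, port: int) -> bool:
        # An omitted field acts as a wildcard.
        host_ok = self.hostname is None or self.hostname == host
        port_ok = self.port is None or self.port == port
        return host_ok and port_ok


def select_auth(entries: List[WebAuth], host: str, port: int) -> Optional[WebAuth]:
    """Return the matching entry with the most fields specified, or None."""
    candidates = [e for e in entries if e.matches(host, port)]
    # Preferring the most specific entry is an assumption made for this sketch.
    return max(
        candidates,
        key=lambda e: (e.hostname is not None) + (e.port is not None),
        default=None,
    )


entries = [
    WebAuth("testuser", "testpass", hostname="basic.codelibs.org"),
    WebAuth("otheruser", "otherpass", hostname="digest.codelibs.org"),  # hypothetical
]
chosen = select_auth(entries, "basic.codelibs.org", 443)
print(chosen.username if chosen else "no credentials")
```

If both entries omitted the host name, either one would match every site, which is why the host name matters once a crawl configuration covers several authenticated sites.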

Digest authentication

Consider crawling a site that is protected by Digest authentication and has the following settings.

Item      Value
URL       https://digest.codelibs.org/
Username  testuser
Password  testpass

If the web crawl configuration is named DigestAuth Example, configure web authentication as follows.

Item          Value
Scheme        Digest
Username      testuser
Password      testpass
Parameters    (not entered)
Web Settings  DigestAuth Example
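For background on what happens with these credentials, the following Python sketch computes the response value that a client sends in Digest authentication when the server requests MD5 with qop=auth, as described in RFC 2617. It illustrates the protocol only and is not Fess code; the function names are chosen for this example, and the input values are the sample values from the RFC rather than values from the site above.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, password, realm, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the Digest 'response' value for the MD5 algorithm with qop=auth."""
    # HA1 covers the user's identity; the password itself is never sent.
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    # HA2 covers the request being made.
    ha2 = md5_hex(f"{method}:{uri}")
    # The server-supplied nonce and the client nonce tie the hash to this exchange.
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


# Sample values from the example in RFC 2617.
response = digest_response(
    username="Mufasa",
    password="Circle Of Life",
    realm="testrealm@host.com",
    method="GET",
    uri="/dir/index.html",
    nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093",
    nc="00000001",
    cnonce="0a4f113b",
)
print(response)
```

The realm and nonce are supplied by the server in its challenge, which is why the crawler only needs the user name and password in the setting.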

NTLM authentication

Consider crawling a site that is protected by NTLM authentication and has the following settings.

Item      Value
URL       https://ntlm.codelibs.org/
Username  testuser
Password  testpass

If the web crawl configuration is named NTLMAuth Example, configure web authentication as follows.

Item          Value
Scheme        NTLM
Username      testuser
Password      testpass
Parameters    Fill in as needed
Web Settings  NTLMAuth Example

For NTLM authentication, the workstation name and domain name can be set with the workstation and domain parameters, respectively. Set these values according to the target environment, and enter them in the Parameters field as follows.

workstation=HOGE
domain=FUGA

Form authentication

Sites implement Form authentication in various ways; as an example, we explain how to crawl Redmine, a web application for project management. Assume that Redmine is available with the following settings.

Item      Value
URL       https://redmine.codelibs.org/
Username  testuser
Password  testpass

If the web crawl configuration is named Redmine Example, configure web authentication as follows.

Item          Value
Scheme        Form
Username      testuser
Password      testpass
Parameters    encoding=UTF-8
              token_method=GET
              token_url=https://redmine.codelibs.org/login
              token_pattern=name="authenticity_token" +value="([^"]+)"
              token_name=authenticity_token
              login_method=POST
              login_parameters=username=${username}&password=${password}
Web Settings  Redmine Example

Redmine uses authenticity_token as a transaction token, so the token must be sent together with the login information when logging in. The authenticity_token value can be obtained from the Redmine login screen. The parameters that start with token_ tell Fess how to retrieve the token, and Fess uses them to get the value of authenticity_token.

The parameters that start with login_ specify the information required to log in to the site. In login_url, specify the URL that processes the login request, and in login_parameters, specify the request parameters required for login. ${username} and ${password} are replaced with the Username and Password values of the web authentication setting.

Using the above information, Fess automatically logs in when crawling and can crawl the site protected by Form authentication.
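The token and login steps can be sketched in Python to show how the parameters relate to each other. This is a rough illustration under assumptions, not how Fess is implemented: the parse_parameters helper, the sample HTML fragment, the token value abc123, and the way the token is appended to the login request body are all hypothetical.

```python
import re
from string import Template
from urllib.parse import quote


def parse_parameters(text: str) -> dict:
    """Parse key=value lines, splitting only on the first '=' of each line."""
    params = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            params[key.strip()] = value.strip()
    return params


params = parse_parameters(r'''
token_pattern=name="authenticity_token" +value="([^"]+)"
token_name=authenticity_token
login_parameters=username=${username}&password=${password}
''')

# Hypothetical fragment of a login page; a real page must be inspected by hand.
login_page = '<input type="hidden" name="authenticity_token" value="abc123" />'

# Step 1: the first capture group of token_pattern is the token value.
match = re.search(params["token_pattern"], login_page)
token = match.group(1) if match else ""

# Step 2: fill ${username} and ${password} in login_parameters.
body = Template(params["login_parameters"]).substitute(
    username=quote("testuser"), password=quote("testpass")
)

# Step 3: send the token under token_name (assumed here to join the same body).
body += "&" + params["token_name"] + "=" + quote(token)
print(body)
```

Running the pattern against the saved HTML of the real login page in this way is a quick check that token_pattern captures the value before the setting is entered in Fess.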

Form authentication methods vary from website to website. When crawling a site with Form authentication, you need to check the HTML and HTTP headers on the login page and set the appropriate parameters.

Summary

This time, we introduced how to crawl sites that use various kinds of web authentication with Fess. Many sites require authentication, such as internal company sites and membership sites, and you will often want to make them searchable as well. Because Fess also supports Form authentication, you can build a search environment for many situations.