CrawlerConfiguration.xml

The CrawlerConfiguration.xml contains the configuration of the crawler.

<startlist> tag

role: A list of URLs the crawler should start with. You may add any number of start URLs, which lets you combine documents from several sources in one index.

values: Any number of <start> child tags.

example:

<startlist>
   <start> ... </start>
   <start> ... </start>
   ...
</startlist>

<start> tag

Child tag of the <startlist> tag.

role: A URL at which the crawler starts.

attribute parse: Specifies whether the document should be parsed for URLs. (values: true or false)

attribute index: Specifies whether a found document should be indexed. (values: true or false)

values: A URL.

example:

<start parse="true" index="false">http://www.murfman.de/bla/blubb</start>

or

<start parse="true" index="true">file://P:/documents</start>

<whitelist> tag

role: The white list. Contains the prefixes or regular expressions a URL must match to be processed.

values: Any number of <prefix> or <regex> child tags.

example:

<whitelist>
  <prefix>http://www.mydomain.de</prefix>
</whitelist>

<prefix> tag

Child tag of the <whitelist> tag.

role: Specifies a URL prefix a URL must have in order to be processed by the crawler.

attribute name: The name of the white list entry. This is needed for partial indexing. (Optional, values: A unique name)

values: A URL prefix.

example:

<prefix>http://www.murfman.de/bla/blubb</prefix>

or

<prefix name="docs">file://c:/MyDocuments</prefix>

<regex> tag

Child tag of the <whitelist> tag.

role: Specifies a regular expression a URL must match in order to be processed by the crawler.

values: A regular expression.

since version: 1.1 Beta 5

example:

<regex>/docs/[^/]*$</regex>

<blacklist> tag

role: The black list. Contains the prefixes or regular expressions a URL must not match to be processed.

values: Any number of <prefix> or <regex> child tags.

example:

<blacklist>
  <prefix>http://www.mydomain.de/some/dynamic/content/</prefix>
</blacklist>

<prefix> tag

Child tag of the <blacklist> tag.

role: Specifies a URL prefix a URL must not have in order to be processed by the crawler.

values: A URL prefix.

example:

<prefix>http://www.murfman.de/dynamic/content</prefix>

or

<prefix>file://c:/MyDocuments/stuff/I/won't/search/for</prefix>

<regex> tag

Child tag of the <blacklist> tag.

role: Specifies a regular expression a URL must not match in order to be processed by the crawler.

values: A regular expression.

since version: 1.1 Beta 5

example:

<regex>/backup/[^/]*$</regex>
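
Taken together, the two lists act as a filter: a URL is processed only if it matches a white list entry and matches no black list entry. The following sketch combines the tags described above. The host www.mydomain.de and the dynamic content path come from the examples; the entry name "main" and the \.bak$ pattern are made up for illustration.

```xml
<!-- Only URLs below this prefix are considered at all -->
<whitelist>
  <prefix name="main">http://www.mydomain.de</prefix>
</whitelist>

<!-- URLs matching one of these entries are skipped again -->
<blacklist>
  <prefix>http://www.mydomain.de/some/dynamic/content/</prefix>
  <!-- Hypothetical pattern: skip anything ending in .bak -->
  <regex>\.bak$</regex>
</blacklist>
```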

<proxy> tag

role: Sets the HTTP proxy that should be used by the crawler.

values: Knows the following child tags:

  • <host>: The host name of the proxy.
  • <port>: The port of the proxy.
  • <user>: The user name if authentication is desired.
  • <password>: The password if authentication is desired.

All these fields may be left empty if no proxy or no authentication is desired.

If you don't know what to set here, use the settings of your web browser.

example:

<proxy>
  <host>proxy.something.de</host>
  <port>3128</port>
  <user>karl34</user>
  <password>gkxy23</password>
</proxy>
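
If the proxy requires no authentication, the user and password fields can stay empty, as stated above. A minimal sketch, reusing the host and port from the example. Writing the empty tags out rather than omitting them is a choice made here for illustration; the tag descriptions below say they may also be left out.

```xml
<proxy>
  <host>proxy.something.de</host>
  <port>3128</port>
  <!-- No proxy authentication: both fields left empty -->
  <user></user>
  <password></password>
</proxy>
```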

<host> tag

Child tag of the <proxy> tag.

role: The host name of the proxy. May be left out if no proxy is desired.

values: A host name.

example:

<host>proxy.something.de</host>

or

<host>10.10.10.53</host>

<port> tag

Child tag of the <proxy> tag.

role: The port number of the proxy. May be left out if no proxy is desired.

values: A number.

example:

<port>8080</port>

<user> tag

Child tag of the <proxy> tag.

role: The user name for proxy authentication. May be left out if no proxy authentication is desired.

values: A user name.

example:

<user>karl34</user>

<password> tag

Child tag of the <proxy> tag.

role: The password for proxy authentication. May be left out if no proxy authentication is desired.

values: A password.

example:

<password>i56dFg2s</password>

<searchIndex> tag

role: Contains all settings for the search index.

values: Knows the following child tags:

  • <dir>: The directory where the search index should be created.
  • <buildIndex>: Specifies whether to create an index at all.
  • <analyzerType>: The analyzer to use.
  • <breakpointInterval>: The interval between two breakpoints in minutes.
  • <writeAnalysisFiles>: Specifies whether to create analysis files.
  • <maxFailedDocuments>: Specifies the maximum percentage of failed documents.
  • <stopwordList>: A list of words that should not be indexed.
  • <exclusionList>: A list of words that shouldn't be changed by the analyzer before indexing.

example:

<searchIndex>
  <dir>C:\regain\index</dir>
  <buildIndex>true</buildIndex>
  <analyzerType>german</analyzerType>
  <writeAnalysisFiles>false</writeAnalysisFiles>
  <maxFailedDocuments>100</maxFailedDocuments>
  <stopwordList> ... </stopwordList>
  <exclusionList> ... </exclusionList>
</searchIndex>

<dir> tag

Child tag of the <searchIndex> tag.

role: The directory where the search index should be created.

values: A directory name.

example:

<dir>C:\regain\index</dir>

<buildIndex> tag

Child tag of the <searchIndex> tag.

role: Specifies whether to create an index at all.

By setting this to false you can use the crawler to detect dead links or to fill a server side cache.

values: true or false

example:

<buildIndex>true</buildIndex>

<analyzerType> tag

Child tag of the <searchIndex> tag.

role: The analyzer to use.

values: standard or german

example:

<analyzerType>standard</analyzerType>

<breakpointInterval> tag

Child tag of the <searchIndex> tag.

role: Specifies the interval between two breakpoints in minutes. If set to 0, no breakpoints will be created.

values: A number.

since version: 1.1 Beta 6

example: With the following the crawler will create a breakpoint every 10 minutes:

<breakpointInterval>10</breakpointInterval>

<writeAnalysisFiles> tag

Child tag of the <searchIndex> tag.

role: Specifies whether to create analysis files.

values: true or false

example:

<writeAnalysisFiles>false</writeAnalysisFiles>

<maxFailedDocuments> tag

Child tag of the <searchIndex> tag.

role: Specifies the maximum percentage of failed documents.

values: A floating point number between 0 and 100.

example:

<maxFailedDocuments>9.5</maxFailedDocuments>

<stopwordList> tag

Child tag of the <searchIndex> tag.

role: A list of words that should not be indexed. This list is called stopword list.

This list helps keep the index small. Specify frequent words that no one would search for.

values: A space separated list of words.

example:

<stopwordList>we and or without with at in of</stopwordList>

<exclusionList> tag

Child tag of the <searchIndex> tag.

role: A list of words that shouldn't be changed by the analyzer before indexing.

The analyzer tries to reduce every word to its base form. For some words this may cause problems. Put these words in this list.

values: A space separated list of words.

example:

<exclusionList>veryProblematicWord evenMoreProblematicWord</exclusionList>
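
For reference, here is a sketch of a complete <searchIndex> block that fills in every child tag described above, including <breakpointInterval>, which the overview example leaves out. All values are taken from the individual examples; the order of the child tags is an assumption.

```xml
<searchIndex>
  <dir>C:\regain\index</dir>
  <buildIndex>true</buildIndex>
  <analyzerType>standard</analyzerType>
  <!-- Create a breakpoint every 10 minutes (0 disables breakpoints) -->
  <breakpointInterval>10</breakpointInterval>
  <writeAnalysisFiles>false</writeAnalysisFiles>
  <!-- Maximum percentage of failed documents, between 0 and 100 -->
  <maxFailedDocuments>9.5</maxFailedDocuments>
  <!-- Space-separated words that are not indexed -->
  <stopwordList>we and or without with at in of</stopwordList>
  <!-- Space-separated words the analyzer leaves unchanged -->
  <exclusionList>veryProblematicWord evenMoreProblematicWord</exclusionList>
</searchIndex>
```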

<preparatorList> tag

role: Contains the preparators in the order they should be applied. Preparators that aren't listed here will be applied after the listed ones.

You can use this list

  • to define the priority (order) of the preparators
  • to disable preparators
  • to configure preparators

values: Any number of <preparator> child tags.

example:

<preparatorList>
  <preparator> ... </preparator>
  <preparator> ... </preparator>
  ...
</preparatorList>

<preparator> tag

Child tag of the <preparatorList> tag.

role: Contains the settings for one preparator.

attribute enabled: Specifies whether the preparator is enabled. (value: true or false, default: true)

values: Knows the following child tags:

  • <class>: The class name of the preparator.
  • <urlPattern>: The regular expression a URL must match.
  • <config>: The configuration of the preparator.

example:

<preparator>
  <class>de.murfman.mypackage.MyPreparator</class>
  <config> ... </config>
</preparator>

<class> tag

Child tag of the <preparator> tag.

role: The class name of the preparator.

values: A fully qualified class name. If the class is in the package net.sf.regain.crawler.preparator, the package may be abbreviated with a dot.

example:

<class>de.murfman.mypackage.MyPreparator</class>

or

<class>.HtmlPreparator</class>

<urlPattern> tag

Child tag of the <preparator> tag.

role: The regular expression a URL must match in order to be prepared by this preparator. If specified, it overrides the regular expression used internally by the preparator.

values: A regular expression.

since version: 1.1 Beta 5

example:

<urlPattern>\.(html|jsp)$</urlPattern>
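
The three uses of the preparator list (ordering, disabling, configuring) can be sketched in one fragment. It assumes that the <config> child may be omitted when a preparator needs no configuration; de.murfman.mypackage.MyPreparator is the placeholder class from the examples above, not a real preparator.

```xml
<preparatorList>
  <!-- Listed first, so it is applied first; the pattern overrides
       the preparator's internal one -->
  <preparator>
    <class>.HtmlPreparator</class>
    <urlPattern>\.(html|jsp)$</urlPattern>
  </preparator>

  <!-- Placeholder custom preparator, switched off -->
  <preparator enabled="false">
    <class>de.murfman.mypackage.MyPreparator</class>
  </preparator>
</preparatorList>
```

Preparators that are not listed are still applied, after the listed ones.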

<config> tag

Child tag of the <preparator> tag.

role: The configuration of the preparator.

attribute file: If this attribute is set, the preparator configuration is loaded from an extra file (values: A file name).

values: Any number of <section> child tags.

example:

<config>
  <section> ... </section>
  <section> ... </section>
  ...
</config>

<section> tag

Child tag of the <config> tag.

role: One section of the preparator configuration.

Which sections are provided depends on the particular preparator.

attribute name: The name of the section (values: A section name).

values: Any number of <param> child tags.

example:

<section>
  <param name="parameter1">value1</param>
  <param name="parameter2">value2</param>
  ...
</section>

<param> tag

Child tag of the <section> tag.

role: A parameter of the preparator configuration.

Which parameters are provided depends on the particular preparator.

attribute name: The name of the parameter (values: A parameter name).

values: The configured value.

example:

<param name="aParam">the value</param>
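
Put together, a configured preparator nests <config>, <section> and <param> as follows. This is only a sketch: the section name "mySection" and the parameter names are placeholders, since the real names depend on the particular preparator.

```xml
<preparator>
  <class>de.murfman.mypackage.MyPreparator</class>
  <config>
    <!-- "mySection", "parameter1" and "parameter2" are placeholder names -->
    <section name="mySection">
      <param name="parameter1">value1</param>
      <param name="parameter2">value2</param>
    </section>
  </config>
</preparator>
```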

<crawlerPluginList> tag

role: Contains the crawler plugins in the order they should be applied. Plugins that aren't listed here, but are present in the folder /plugins, will be registered at the end.

You can use this list

  • to define the priority (order) of the plugins
  • to disable plugins
  • to configure plugins

values: Any number of <crawlerPlugin> child tags.

since version: 1.7.8 Preview

example:

<crawlerPluginList>
  <crawlerPlugin> ... </crawlerPlugin>
  <crawlerPlugin> ... </crawlerPlugin>
  ...
</crawlerPluginList>

<crawlerPlugin> tag

Child tag of the <crawlerPluginList> tag.

role: Contains the settings for one crawler plugin.

attribute enabled: Specifies whether the plugin is enabled. (value: true or false, default: true)

values: Knows the following child tags:

  • <class>: The class name of the plugin.
  • <config>: The configuration of the plugin.

since version: 1.7.8 Preview

example:

<crawlerPlugin>
  <class>de.murfman.mypackage.MyPlugin</class>
  <config> ... </config>
</crawlerPlugin>

<class> tag

Child tag of the <crawlerPlugin> tag.

role: The class name of the plugin.

values: A fully qualified class name.

since version: 1.7.8 Preview

example:

<class>de.murfman.mypackage.MyPlugin</class>

<config> tag

Child tag of the <crawlerPlugin> tag.

role: The configuration of the plugin.

attribute file: If this attribute is set, the plugin configuration is loaded from an extra file (values: A file name).

values: Any number of <section> child tags.

since version: 1.7.8 Preview

example:

<config>
  <section> ... </section>
  <section> ... </section>
  ...
</config>

<section> tag

Child tag of the <config> tag.

role: One section of the plugin configuration.

Which sections are provided depends on the particular plugin.

attribute name: The name of the section (values: A section name).

values: Any number of <param> child tags.

since version: 1.7.8 Preview

example:

<section>
  <param name="parameter1">value1</param>
  <param name="parameter2">value2</param>
  ...
</section>

<param> tag

Child tag of the <section> tag.

role: A parameter of the plugin configuration.

Which parameters are provided depends on the particular plugin.

attribute name: The name of the parameter (values: A parameter name).

values: The configured value.

since version: 1.7.8 Preview

example:

<param name="aParam">the value</param>
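
A complete plugin list might look like the sketch below. Both class names and the section and parameter names are placeholders; it is also assumed that <config> may be omitted for a plugin that needs no settings.

```xml
<crawlerPluginList>
  <!-- Applied first, with its own configuration -->
  <crawlerPlugin>
    <class>de.murfman.mypackage.MyPlugin</class>
    <config>
      <section name="mySection">
        <param name="aParam">the value</param>
      </section>
    </config>
  </crawlerPlugin>

  <!-- Listed only to switch it off -->
  <crawlerPlugin enabled="false">
    <class>de.murfman.mypackage.MyOtherPlugin</class>
  </crawlerPlugin>
</crawlerPluginList>
```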

<auxiliaryFieldList> tag

role: A list of auxiliary fields.

The search index may be extended by auxiliary fields that are generated from the document's URL.

Example: Assume you have a directory with a subdirectory for every project. Then you can generate an auxiliary field with the project name. When searching for "offer project:otto23" you will only get results from that project directory.

The following tag generates an auxiliary field project with the value otto23 from the URL file://c:/projects/otto23/docs/Spez.doc:

<auxiliaryField name="project" regexGroup="1">
  ^file://c:/projects/([^/]*)
</auxiliaryField>

values: Any number of <auxiliaryField> child tags.

example:

<auxiliaryFieldList>
  <auxiliaryField> ... </auxiliaryField>
  <auxiliaryField> ... </auxiliaryField>
  ...
</auxiliaryFieldList>

<auxiliaryField> tag

Child tag of the <auxiliaryFieldList> tag.

role: The definition of an auxiliary field.

In order to define the value the auxiliary field should get, use either the attribute value (for fixed values) or the attribute regexGroup (for values extracted from the URL), not both.

attribute name: The name of the auxiliary field. (value: A name, example: project or system)

attribute value: The value the auxiliary field should get. (value: A string, example: letters, since version: 1.1 Beta 6)

attribute regexGroup: The number of the regular expression group that contains the value the auxiliary field should get. (value: A number)

attribute toLowerCase: Specifies whether the value extracted by regexGroup should be converted to lower case. (value: true or false, optional, default: true, since version: 1.1 Beta 6)

values: A regular expression.

example (value extracted from the URL):

<auxiliaryField name="project" regexGroup="1">^file://c:/projects/([^/]*)</auxiliaryField>

example (fixed value):

<auxiliaryField name="doctype" value="letters">^file://c:/docs/letters</auxiliaryField>
<auxiliaryField name="doctype" value="images">^file://c:/docs/(images|cliparts)</auxiliaryField>
<auxiliaryField name="doctype" value="photos">^file://c:/docs/photos</auxiliaryField>
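
Because toLowerCase defaults to true, an extracted value is lower-cased unless you say otherwise. The following variant of the project example is a sketch that keeps the original spelling; the directory name Otto23 is made up for illustration.

```xml
<!-- For file://c:/projects/Otto23/docs/Spez.doc the field "project"
     would get the value Otto23 instead of otto23 -->
<auxiliaryField name="project" regexGroup="1" toLowerCase="false">^file://c:/projects/([^/]*)</auxiliaryField>
```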

<loadUnparsedUrls> tag

role: Specifies whether URLs should be loaded that are neither parsed nor indexed.

By this means dead links can be detected. This function may also be used to fill the cache of the server.

values: true or false

example:

<loadUnparsedUrls>false</loadUnparsedUrls>
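
Both this tag and <buildIndex> are described as a way to detect dead links or fill a server-side cache. A crawl meant only for that purpose could combine them as in this sketch; whether both settings are needed together is an assumption, and the other <searchIndex> child tags are omitted for brevity.

```xml
<searchIndex>
  <dir>C:\regain\index</dir>
  <!-- Crawl only; do not create an index -->
  <buildIndex>false</buildIndex>
</searchIndex>

<!-- Also load URLs that are neither parsed nor indexed -->
<loadUnparsedUrls>true</loadUnparsedUrls>
```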

<httpTimeout> tag

role: Specifies the maximum time in seconds an HTTP download may take in total.

values: A number.

example:

<httpTimeout>180</httpTimeout>

<userAgent> tag

role: Sets the user agent the crawler should use to identify itself to HTTP servers.

values: A string.

since version: 1.1 Beta 6

example: Use the following userAgent to identify the crawler to web servers as Internet Explorer.

<userAgent>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)</userAgent>

<useLinkTextAsTitleList> tag

role: Contains a list of regular expressions. A document whose URL matches one of them uses the text of the link that pointed to it as its title instead of the real document title.

values: Any number of <urlPattern> tags.

example:

<useLinkTextAsTitleList>
  <urlPattern> ... </urlPattern>
</useLinkTextAsTitleList>

<urlPattern> tag

Child tag of the <useLinkTextAsTitleList> tag.

role: A regular expression that matches the URL of a document that should use the link text as title.

values: A regular expression.

example:

<urlPattern>^http://.*\.(pdf|xls|doc|rtf)$</urlPattern>
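
Filled in, the list could look like this sketch. The first pattern is the one from the example; the second, for local text files, is made up for illustration.

```xml
<useLinkTextAsTitleList>
  <!-- Office and PDF documents fetched over HTTP -->
  <urlPattern>^http://.*\.(pdf|xls|doc|rtf)$</urlPattern>
  <!-- Hypothetical second entry: plain text files on disk -->
  <urlPattern>^file://.*\.txt$</urlPattern>
</useLinkTextAsTitleList>
```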

<controlFiles> tag

role: Contains the names of the control files.

The control files may be used, for instance, to control a script.

optional: This tag is optional.

values: Knows the following child tags:

  • <finishedWithoutFatalsFile>: The control file for successful index creation.
  • <finishedWithFatalsFile>: The control file for failed index creation.

example:

<controlFiles>
  <finishedWithoutFatalsFile>C:\lucene\control\NoFatals</finishedWithoutFatalsFile>
  <finishedWithFatalsFile>C:\lucene\control\WithFatals</finishedWithFatalsFile>
</controlFiles>

<finishedWithoutFatalsFile> tag

Child tag of the <controlFiles> tag.

role: The control file for successful index creation.

This file is created if an index was successfully created.

optional: This tag is optional.

values: A file name.

example:

<finishedWithoutFatalsFile>C:\lucene\control\NoFatals</finishedWithoutFatalsFile>

<finishedWithFatalsFile> tag

Child tag of the <controlFiles> tag.

role: The control file for failed index creation.

This file is created if an index creation failed.

optional: This tag is optional.

values: A file name.

example:

<finishedWithFatalsFile>C:\lucene\control\WithFatals</finishedWithFatalsFile>

<crawlerAccessController> tag

role: The CrawlerAccessController to use.

This is a part of the access rights management that ensures that only those documents are shown in the search results that the user is allowed to read.

If you specify a CrawlerAccessController, don't forget to specify the SearchAccessController counterpart in the SearchConfiguration.xml!

values: Knows the following child tags:

  • <class>: The class name of the CrawlerAccessController.
  • <config>: The configuration of the CrawlerAccessController.

since version: 1.1 Beta 4

example:

<crawlerAccessController>
  <class jar="myAccess.jar">mypackage.MyCrawlerAccessController</class>
  <config> ... </config>
</crawlerAccessController>

<class> tag

Child tag of the <crawlerAccessController> tag.

role: The class name of the CrawlerAccessController.

attribute jar: The name of the jar file that contains the CrawlerAccessController class. May be omitted if the class is already on the classpath.

values: A fully qualified class name.

since version: 1.1 Beta 4

example:

<class jar="myAccess.jar">mypackage.MyCrawlerAccessController</class>

<config> tag

Child tag of the <crawlerAccessController> tag.

role: The configuration of the CrawlerAccessController.

values: Any number of <param> child tags.

since version: 1.1 Beta 4

example:

<config>
  <param name="param1">value1</param>
  <param name="param2">value2</param>
</config>

<param> tag

Child tag of the <config> tag.

role: A parameter of the CrawlerAccessController configuration.

attribute name: The name of the parameter (values: A parameter name).

values: The configured value.

since version: 1.1 Beta 4

example:

<param name="scriptPath">c:\regain\access\getGroups.cmd</param>
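
Assembled from the pieces above, a complete entry might look like the following sketch. The jar, class and script names are the placeholders used in the examples; which parameters are actually understood depends on the CrawlerAccessController class in use.

```xml
<crawlerAccessController>
  <class jar="myAccess.jar">mypackage.MyCrawlerAccessController</class>
  <config>
    <!-- Parameters go directly below <config>; there are no sections here -->
    <param name="scriptPath">c:\regain\access\getGroups.cmd</param>
  </config>
</crawlerAccessController>
```

Note that, unlike the preparator and plugin configuration, this <config> holds <param> tags directly.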

<htmlParserPatternList> tag

role: A list of patterns that are used for identifying URLs when an HTML document is parsed.

values: Any number of <pattern> child tags.

example:

<htmlParserPatternList>
  <pattern> ... </pattern>
  <pattern> ... </pattern>
  ...
</htmlParserPatternList>

<pattern> tag

Child tag of the <htmlParserPatternList> tag.

role: A pattern that finds a URL in an HTML document.

attribute parse: Specifies whether a found document should be parsed for URLs as well. (values: true or false)

attribute index: Specifies whether a found document should be indexed. (values: true or false)

attribute regexGroup: The number of the regular expression group that contains the URL. (values: A number)

values: A regular expression.

example:

<pattern parse="false" index="true" regexGroup="1">="([^"]*\.(doc|pdf|rtf))"</pattern>
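
The parse and index attributes let different kinds of links be treated differently. In the sketch below, the first pattern is an illustrative assumption and the second is the example from above; in both, group 1 captures the URL between the quotes.

```xml
<htmlParserPatternList>
  <!-- HTML pages: index them and search them for further links -->
  <pattern parse="true" index="true" regexGroup="1">href="([^"]*\.html)"</pattern>
  <!-- Documents: index them, but do not search them for links -->
  <pattern parse="false" index="true" regexGroup="1">="([^"]*\.(doc|pdf|rtf))"</pattern>
</htmlParserPatternList>
```
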

config/crawlerconfiguration.xml.txt · Last modified: 2014/10/29 10:22 (external edit)