

Crawler Plugins

Crawler Plugins hook into the crawling process in order to add advanced functionality.

What can crawler plugins do?

Some examples:

  • Modify the results of preparators
    • by supplying default values when the chosen preparator does not fill in a certain field (onBeforePrepare)
    • by overriding or modifying the results of whichever preparator was chosen (onAfterPrepare)
  • Modify how documents are stored in the Lucene index
  • Do something at every start or end of the crawling process (e.g. inform the administrator via email)

How to create a crawler plugin

  1. Create a class that implements CrawlerPlugin.
  2. Package it (and all its dependencies) as a .jar file.
    • In the manifest file, the attribute Plugin-Class must be set to the fully qualified class name of the implementing class.
  3. Drop it into the plugins directory.
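The packaging step above can be sketched as follows. The class name com.example.MyCrawlerPlugin and the directory layout are placeholders, not part of regain itself:

```shell
# Write a manifest whose Plugin-Class attribute names the implementing
# class (com.example.MyCrawlerPlugin is a placeholder for your class).
mkdir -p build
printf 'Plugin-Class: com.example.MyCrawlerPlugin\n' > build/MANIFEST.MF
cat build/MANIFEST.MF

# Then bundle the compiled classes with that manifest, e.g.:
#   jar cfm myplugin.jar build/MANIFEST.MF -C build/classes .
# and copy myplugin.jar into the plugins directory.
```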

Crawler Plugin API

onStartCrawling

void onStartCrawling(Crawler crawler)

Called before the crawling process starts (Crawler::run()).

This may be called multiple times during the lifetime of a plugin instance, but onFinishCrawling() is always called in between.
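The alternation guarantee can be illustrated with a small simulation. Crawler and CrawlerPlugin below are simplified stand-ins for regain's types (reduced to the two lifecycle callbacks), not the real interfaces:

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {
    // Stand-in for regain's Crawler; only used as a callback argument here.
    static class Crawler {}

    // Simplified stand-in for the CrawlerPlugin interface, reduced to
    // the two lifecycle callbacks under discussion.
    interface CrawlerPlugin {
        void onStartCrawling(Crawler crawler);
        void onFinishCrawling(Crawler crawler);
    }

    // Plugin that records the order in which its callbacks fire.
    static class LoggingPlugin implements CrawlerPlugin {
        final List<String> calls = new ArrayList<>();
        public void onStartCrawling(Crawler c)  { calls.add("start"); }
        public void onFinishCrawling(Crawler c) { calls.add("finish"); }
    }

    // Simulates two crawl runs against one plugin instance and returns
    // the recorded call sequence.
    static String run() {
        LoggingPlugin plugin = new LoggingPlugin();
        Crawler crawler = new Crawler();
        for (int i = 0; i < 2; i++) {
            plugin.onStartCrawling(crawler);   // Crawler.run() begins
            plugin.onFinishCrawling(crawler);  // run finished or aborted
        }
        return String.join(",", plugin.calls);
    }

    public static void main(String[] args) {
        // start and finish strictly alternate; never two starts in a row.
        System.out.println(run()); // start,finish,start,finish
    }
}
```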

Parameters:

Parameter Name Description
crawler The crawler instance that is about to begin crawling

onFinishCrawling

void onFinishCrawling(Crawler crawler)

Called after the crawling process has finished or was aborted (because of an exception).

This may be called multiple times during the lifetime of a plugin instance.

Parameters:

Parameter Name Description
crawler The crawler instance that has just finished (or aborted) crawling

onAcceptURL

void onAcceptURL(String url, CrawlerJob job)

Called during the crawling process when a new URL is added to the processing queue.

Note that ignored URLs (that is, URLs that were already accepted or declined before) do not appear here.

Parameters:

Parameter Name Description
url URL that was just accepted
job CrawlerJob that was created as a consequence

onDeclineURL

void onDeclineURL(String url)

Called during the crawling process when a new URL is declined to be added to the processing queue.

As the queue is filled recursively, these calls can occur between prepare calls.

Parameters:

Parameter Name Description
url URL that was just declined
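A typical use of these two callbacks is collecting crawl statistics. The sketch below uses a stand-in CrawlerJob class (the real one carries the job details) and shows only the two URL callbacks, not the full plugin interface:

```java
public class UrlStatsSketch {
    // Stand-in for regain's CrawlerJob; the real class carries the
    // job details for the accepted URL.
    static class CrawlerJob {}

    // Plugin fragment that tallies accepted vs. declined URLs.
    static class UrlStatsPlugin {
        int accepted, declined;
        void onAcceptURL(String url, CrawlerJob job) { accepted++; }
        void onDeclineURL(String url)                { declined++; }
        String summary() { return accepted + " accepted, " + declined + " declined"; }
    }

    public static void main(String[] args) {
        UrlStatsPlugin p = new UrlStatsPlugin();
        p.onAcceptURL("http://example.com/a.html", new CrawlerJob());
        p.onAcceptURL("http://example.com/b.html", new CrawlerJob());
        p.onDeclineURL("http://example.com/skip.pdf");
        System.out.println(p.summary()); // 2 accepted, 1 declined
    }
}
```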

Existing Plugins

components/crawler_plugins.1312016816.txt.gz · Last modified: 2014/10/29 10:21 (external edit)