Crawler Plugins

Crawler Plugins hook into the crawling process in order to add advanced functionality.

What can crawler plugins do?

Some examples:

  • Modify the result of preparators
    • by specifying default values if the chosen preparator does not fill in a certain field (onBeforePrepare)
    • by overriding or modifying the results of whatever preparator was chosen (onAfterPrepare)
  • Modify how results are stored in the Lucene index
  • Do something at the start or end of every crawling process (e.g. inform the administrator via email)

How to create a crawler plugin

  1. Create a class that implements CrawlerPlugin.
  2. Package it (and all its dependencies) as a .jar.
    • In the manifest file, the attribute Plugin-Class must be set to the fully qualified class name of the implementing class.
  3. Drop it into the plugins directory.
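The steps above can be sketched as a minimal plugin class. Note that the Crawler and CrawlerPlugin types below are simplified stand-ins for regain's real interfaces (whose packages and full method lists are not reproduced on this page), and MyCrawlerPlugin is a hypothetical name; treat this as a sketch of a plugin's shape, not a drop-in implementation.

```java
// Minimal sketch of a crawler plugin. Crawler and CrawlerPlugin are
// stand-ins for regain's real types; only the two lifecycle hooks from
// this API reference are stubbed here.
interface Crawler { }

interface CrawlerPlugin {
    void onStartCrawling(Crawler crawler);
    void onFinishCrawling(Crawler crawler);
}

public class MyCrawlerPlugin implements CrawlerPlugin {
    int runs = 0; // number of crawl runs this plugin instance has seen

    @Override
    public void onStartCrawling(Crawler crawler) {
        runs++; // called once before each crawl (Crawler::run())
    }

    @Override
    public void onFinishCrawling(Crawler crawler) {
        // called after each crawl finishes or aborts;
        // e.g. notify the administrator here
    }
}
```

Packaged as a .jar, the manifest would then contain a line such as `Plugin-Class: MyCrawlerPlugin` (using the class's fully qualified name).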

Crawler Plugin API

onStartCrawling

void onStartCrawling(Crawler crawler)

Called before the crawling process starts (Crawler::run()).

This may be called multiple times during the lifetime of a plugin instance, but onFinishCrawling() is always called in between.

Parameters:

Parameter Name Description
crawler The crawler instance that is about to begin crawling

onFinishCrawling

void onFinishCrawling(Crawler crawler)

Called after the crawling process has finished or aborted (because of an exception).

This may be called multiple times during the lifetime of a plugin instance.

Parameters:

Parameter Name Description
crawler The crawler instance that has just finished (or aborted) crawling

onAcceptURL

void onAcceptURL(String url, CrawlerJob job)

Called during the crawling process when a new URL is added to the processing queue.

Because the queue is filled recursively, these calls can occur between prepare calls.

Parameters:

Parameter Name Description
url The URL that was just accepted
job The CrawlerJob that was created as a consequence
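As a sketch of this hook, a plugin can simply record every accepted URL, keeping in mind that these calls may interleave with prepare calls. CrawlerJob appears only as an opaque stand-in type here, since its real API is not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for regain's CrawlerJob; its real API is not shown on this page.
class CrawlerJob { }

public class UrlLogPlugin {
    final List<String> accepted = new ArrayList<>();

    // Called whenever a new URL enters the processing queue. Because the
    // queue fills recursively, this may interleave with prepare calls.
    public void onAcceptURL(String url, CrawlerJob job) {
        accepted.add(url);
    }
}
```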

onDeclineURL

void onDeclineURL(String url)

Called during the crawling process when a new URL is declined, i.e. not added to the processing queue.

Note that ignored URLs (that is, URLs that were already accepted or declined before) do not appear here.

Parameters:

Parameter Name Description
url The URL that was just declined

onCreateIndexEntry

void onCreateIndexEntry(Document doc, IndexWriter index)

Called when a document is added to the index.

This may be a newly indexed document, or a document that has changed since the last crawl and is therefore reindexed.

Parameters:

Parameter Name Description
doc Document to write
index Lucene Index Writer
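For illustration, a plugin might stamp every entry with the time it was (re)indexed. The Document and IndexWriter classes below are simplified stand-ins for Lucene's (whose field-construction API differs between Lucene versions); only the idea of adding an extra field before the document is written is shown.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for Lucene's Document and IndexWriter.
class Document {
    final Map<String, String> fields = new HashMap<>();
    void add(String name, String value) { fields.put(name, value); }
}

class IndexWriter { }

public class IndexedAtPlugin {
    // Stamp each entry (new or reindexed) with the time it was written.
    public void onCreateIndexEntry(Document doc, IndexWriter index) {
        doc.add("indexed_at", String.valueOf(System.currentTimeMillis()));
    }
}
```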

onDeleteIndexEntry

void onDeleteIndexEntry(Document doc, IndexReader index)

Called when a document is deleted from the index.

Note that when a document is replaced by a newer version (“update index”), the new version is added to the index first; deleting the old entry is part of the clean-up phase at the end.

Parameters:

Parameter Name Description
doc Document to read
index Lucene Index Reader

onBeforePrepare

void onBeforePrepare(RawDocument document, WriteablePreparator preparator)

Called before a document is prepared for addition to the index. (This is a good point to fill in default values.)

Parameters:

Parameter Name Description
document Regain document that will be analysed
preparator Preparator that was chosen to analyse this document
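A sketch of filling in a default value at this point: seed a title before the preparator runs, so documents whose preparator never sets one still end up with something useful. RawDocument and WriteablePreparator below are stand-ins, and getTitle/setTitle are hypothetical accessor names (the real method names are not documented on this page).

```java
// Hypothetical stand-ins for regain's RawDocument and WriteablePreparator.
class RawDocument {
    final String url;
    RawDocument(String url) { this.url = url; }
    String getUrl() { return url; }
}

class WriteablePreparator {
    private String title; // null until a plugin or preparator sets it
    String getTitle() { return title; }
    void setTitle(String title) { this.title = title; }
}

public class DefaultTitlePlugin {
    // Called before the preparator runs: seed a default title that the
    // preparator may later overwrite with a properly extracted one.
    public void onBeforePrepare(RawDocument document, WriteablePreparator preparator) {
        if (preparator.getTitle() == null) {
            preparator.setTitle(document.getUrl());
        }
    }
}
```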

onAfterPrepare

void onAfterPrepare(RawDocument document, WriteablePreparator preparator)

Called after a document has been prepared for addition to the index. Here you can override the results of the preparator, if necessary.

Parameters:

Parameter Name Description
document Regain document that was analysed
preparator Preparator that has analysed this document
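A sketch of overriding a preparator's result at this point: trim stray whitespace from the extracted title after preparation has run. The getTitle/setTitle accessors below are hypothetical stand-ins for WriteablePreparator's real API, which is not documented on this page.

```java
// Hypothetical stand-ins for regain's RawDocument and WriteablePreparator.
class RawDocument { }

class WriteablePreparator {
    private String title;
    String getTitle() { return title; }
    void setTitle(String title) { this.title = title; }
}

public class TitleCleanupPlugin {
    // Called after the preparator has run: override its result by
    // trimming whitespace it may have left in the extracted title.
    public void onAfterPrepare(RawDocument document, WriteablePreparator preparator) {
        String title = preparator.getTitle();
        if (title != null) {
            preparator.setTitle(title.trim());
        }
    }
}
```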

Existing Plugins

components/crawler_plugins.1312021657.txt.gz · Last modified: 2014/10/29 10:21 (external edit)