====== Crawler Plugins ======

Crawler plugins hook into the [[project_info:crawling_process|crawling process]] in order to add advanced functionality.

==== What can crawler plugins do? ====

Some examples:

  * Modify the result of preparators
    * by specifying default values if the chosen preparator does not fill in a certain field (''onBeforePrepare'')
    * by overriding or modifying the results of whatever preparator was chosen (''onAfterPrepare'')
  * Modify how documents are stored in the Lucene index
  * Do something at every start or end of the crawling process (e.g. inform the administrator via email)

==== How to create a crawler plugin ====

  - Create a class that implements ''CrawlerPlugin''.
  - Package it (and all its dependencies) as a .jar file.
    * In the manifest file, the attribute ''Plugin-Class'' must be set to the fully qualified name of the implementing class.
  - Drop it into the ''plugins'' directory.

A minimal skeleton is sketched below.
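The following sketch is illustrative only: the import paths for the regain classes are assumptions (this page does not state them), the class name ''HelloPlugin'' is made up, and the interface is assumed to consist of exactly the eight hooks documented in the next section.

<code java>
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

import net.sf.regain.crawler.Crawler;                      // assumed path
import net.sf.regain.crawler.CrawlerJob;                   // assumed path
import net.sf.regain.crawler.document.RawDocument;         // assumed path
import net.sf.regain.crawler.document.WriteablePreparator; // assumed path
import net.sf.regain.crawler.plugin.CrawlerPlugin;         // assumed path

public class HelloPlugin implements CrawlerPlugin {

  // Announce begin and end of every crawl run.
  public void onStartCrawling(Crawler crawler) {
    System.out.println("HelloPlugin: crawl started");
  }

  public void onFinishCrawling(Crawler crawler) {
    System.out.println("HelloPlugin: crawl finished (or aborted)");
  }

  // Hooks this plugin does not use are implemented with empty bodies.
  public void onAcceptURL(String url, CrawlerJob job) {}
  public void onDeclineURL(String url) {}
  public void onCreateIndexEntry(Document doc, IndexWriter index) {}
  public void onDeleteIndexEntry(Document doc, IndexReader index) {}
  public void onBeforePrepare(RawDocument document, WriteablePreparator preparator) {}
  public void onAfterPrepare(RawDocument document, WriteablePreparator preparator) {}
}
</code>

The manifest of the .jar would then contain a line such as:

<code>
Plugin-Class: HelloPlugin
</code>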
==== Crawler Plugin API ====

=== onStartCrawling ===

''void onStartCrawling(Crawler crawler)''

Called before the crawling process starts (''Crawler::run()''). This may be called multiple times during the lifetime of a plugin instance, but [[Crawler Plugins#onFinishCrawling|onFinishCrawling()]] is always called in between.

**Parameters**:
^ Parameter Name ^ Description ^
| crawler | The crawler instance that is about to begin crawling |

=== onFinishCrawling ===

''void onFinishCrawling(Crawler crawler)''

Called after the crawling process has finished or aborted (because of an exception). This may be called multiple times during the lifetime of a plugin instance.

**Parameters**:
^ Parameter Name ^ Description ^
| crawler | The crawler instance that has just finished crawling |

=== onAcceptURL ===

''void onAcceptURL(String url, CrawlerJob job)''

Called during the crawling process when a new URL is added to the processing queue. As the queue is filled recursively, these calls can occur between prepare calls. (A sketch that uses both URL hooks appears at the end of this page.)

**Parameters**:
^ Parameter Name ^ Description ^
| url | URL that was just accepted |
| job | CrawlerJob that was created as a consequence |

=== onDeclineURL ===

''void onDeclineURL(String url)''

Called during the crawling process when a new URL is declined to be added to the processing queue. Note that ignored URLs (that is, URLs that were already accepted or declined before) do not appear here.

**Parameters**:
^ Parameter Name ^ Description ^
| url | URL that was just declined |

=== onCreateIndexEntry ===

''void onCreateIndexEntry(Document doc, IndexWriter index)''

Called when a document is added to the index. This may be a newly indexed document, or a document that has changed since the last crawl and is therefore re-indexed.

**Parameters**:
^ Parameter Name ^ Description ^
| doc | Document to write |
| index | Lucene index writer |

=== onDeleteIndexEntry ===

''void onDeleteIndexEntry(Document doc, IndexReader index)''

Called when a document is deleted from the index. Note that when a document is replaced by another one ("index update"), the new document is added to the index first; deleting the old entry is part of the clean-up phase at the end.

**Parameters**:
^ Parameter Name ^ Description ^
| doc | Document to read |
| index | Lucene index reader |
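For example, a plugin can use ''onCreateIndexEntry'' to attach extra data to every entry before it is written. A minimal sketch, assuming Lucene's pre-4.0 ''Field'' API; the field name ''indexed-at'' is made up for illustration:

<code java>
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class TimestampPlugin /* implements CrawlerPlugin; other hooks omitted */ {

  // Stamp every new or re-indexed entry with the indexing time,
  // stored as an extra (not analyzed) field.
  public void onCreateIndexEntry(Document doc, IndexWriter index) {
    doc.add(new Field("indexed-at",
        Long.toString(System.currentTimeMillis()),
        Field.Store.YES, Field.Index.NOT_ANALYZED));
  }
}
</code>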
=== onBeforePrepare ===

''void onBeforePrepare(RawDocument document, WriteablePreparator preparator)''

Called before a document is prepared to be added to the index. (A good point to fill in default values.)

**Parameters**:
^ Parameter Name ^ Description ^
| document | Regain document that will be analysed |
| preparator | Preparator that was chosen to analyse this document |

=== onAfterPrepare ===

''void onAfterPrepare(RawDocument document, WriteablePreparator preparator)''

Called after a document has been prepared to be added to the index. Here you can override the results of the preparator, if necessary.

**Parameters**:
^ Parameter Name ^ Description ^
| document | Regain document that was analysed |
| preparator | Preparator that was chosen to analyse this document |
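A sketch of the second case: if the chosen preparator did not extract a title, fall back to the document's URL. The accessor names used here (''getTitle'' and ''setTitle'' on ''WriteablePreparator'', ''getUrl'' on ''RawDocument'') are assumptions for illustration; check the regain API for the actual methods.

<code java>
import net.sf.regain.crawler.document.RawDocument;         // assumed path
import net.sf.regain.crawler.document.WriteablePreparator; // assumed path

public class TitleFallbackPlugin /* implements CrawlerPlugin; other hooks omitted */ {

  // If the preparator left the title empty, use the URL instead,
  // so search results never show an empty title.
  public void onAfterPrepare(RawDocument document, WriteablePreparator preparator) {
    String title = preparator.getTitle();       // assumed accessor
    if (title == null || title.trim().length() == 0) {
      preparator.setTitle(document.getUrl());   // assumed accessors
    }
  }
}
</code>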
==== Existing Plugins ====

  * Create thumbnails of indexed documents: https://github.com/benjamin4ruby/java-thumbnailer
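Finally, the sketch referenced in the ''onAcceptURL''/''onDeclineURL'' sections above: a plugin that counts accepted and declined URLs and reports the totals at the end of each run. Apart from the assumed regain import paths, it uses only standard Java.

<code java>
import net.sf.regain.crawler.Crawler;    // assumed path
import net.sf.regain.crawler.CrawlerJob; // assumed path

public class UrlStatsPlugin /* implements CrawlerPlugin; other hooks omitted */ {

  private int accepted;
  private int declined;

  // Reset the counters: the same plugin instance may see
  // several crawl runs during its lifetime.
  public void onStartCrawling(Crawler crawler) {
    accepted = 0;
    declined = 0;
  }

  public void onAcceptURL(String url, CrawlerJob job) {
    accepted++;
  }

  public void onDeclineURL(String url) {
    declined++;
  }

  public void onFinishCrawling(Crawler crawler) {
    System.out.println("URLs accepted: " + accepted + ", declined: " + declined);
  }
}
</code>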