====== Differences ====== This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
components:crawler_plugins [2011/07/30 11:03] benjamin |
components:crawler_plugins [2024/09/18 08:31] (current) |
||
---|---|---|---|
Line 33: | Line 33: | ||
**Parameters**: | **Parameters**: | ||
- | ^ Paramter Name ^ Description ^ | + | ^ Parameter Name ^ Description ^ |
| crawler | The crawler instance that is about to begin crawling | | | crawler | The crawler instance that is about to begin crawling | | ||
Line 45: | Line 45: | ||
**Parameters**: | **Parameters**: | ||
- | ^ Paramter Name ^ Description ^ | + | ^ Parameter Name ^ Description ^ |
| crawler | The crawler instance that is about to begin crawling | | | crawler | The crawler instance that is about to begin crawling | | ||
+ | |||
+ | === checkDynamicBlacklist === | ||
+ | |||
+ | ''boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)'' | ||
+ | |||
+ | Allows to blacklist specific URLs. | ||
+ | |||
+ | This function is called when the URL would normally be accepted, i.e. included in whitelist, not included in blacklist. | ||
+ | |||
+ | **Parameters**: | ||
+ | ^ Parameter Name ^ Description ^ | ||
+ | | url | URL of the crawling job that should normally be added. | | ||
+ | | sourceUrl | The URL where the url above has been found (<a>-Tag, PDF or similar) | | ||
+ | | sourceLinkText | The label of the URL in the document where the url above has been found. | | ||
+ | |||
+ | **Returns**: | ||
+ | |||
+ | ''True'': blacklist this URL. ''False'': Allow this URL. | ||
+ | |||
+ | If at least one of the crawler plugins returns true, the file will be treated as blacklisted. | ||
+ | |||
+ | === onAcceptURL === | ||
+ | |||
+ | ''void onAcceptURL(String url, CrawlerJob job)'' | ||
+ | |||
+ | Called during the crawling process when a new URL is added to the processing Queue. | ||
+ | |||
+ | As the queue is filled recursively, these calls can come between prepare Calls. | ||
+ | |||
+ | **Parameters**: | ||
+ | ^ Parameter Name ^ Description ^ | ||
+ | | url | URL that just was accepted | | ||
+ | | job | CrawlerJob that was created as a consequence | | ||
+ | |||
+ | === onDeclineURL === | ||
+ | |||
+ | ''void onDeclineURL(String url)'' | ||
+ | |||
+ | Called during the crawling process when a new URL is declined to be added to the processing Queue. | ||
+ | |||
+ | Note that ignored URLs (that is, URL that were already accepted or declined before), do not appear here. | ||
+ | |||
+ | **Parameters**: | ||
+ | ^ Parameter Name ^ Description ^ | ||
+ | | url | URL that just was declined | | ||
+ | |||
+ | === onCreateIndexEntry === | ||
+ | |||
+ | ''void onCreateIndexEntry(Document doc, IndexWriter index)'' | ||
+ | |||
+ | Called when a document as added to the index. | ||
+ | |||
+ | This may be a newly indexed document, or a document that has changed since | ||
+ | and, thus, is reindexed. | ||
+ | |||
+ | **Parameters**: | ||
+ | ^ Parameter Name ^ Description ^ | ||
+ | | doc | Document to write | | ||
+ | | index | Lucene Index Writer | | ||
+ | |||
+ | === onDeleteIndexEntry === | ||
+ | |||
+ | ''void onDeleteIndexEntry(Document doc, IndexReader index)'' | ||
+ | |||
+ | Called when a document is deleted from the index. | ||
+ | |||
+ | Note that when being replaced by another document ("update index"), | ||
+ | the old document is added to index first, deleting is part of the cleaning-up-at-the-end-Phase. | ||
+ | |||
+ | **Parameters**: | ||
+ | ^ Parameter Name ^ Description ^ | ||
+ | | doc | Document to read | | ||
+ | | index | Lucene Index Reader | | ||
+ | |||
+ | === onBeforePrepare === | ||
+ | |||
+ | ''void onBeforePrepare(RawDocument document, WriteablePreparator preparator)'' | ||
+ | |||
+ | Called before a document is being prepared to be added to the index. | ||
+ | (Good point to fill in default values.) | ||
+ | |||
+ | **Parameters**: | ||
+ | ^ Parameter Name ^ Description ^ | ||
+ | | document | Regain document that will be analysed | | ||
+ | | preparator | Preperator that was chosen to analyse this document | | ||
+ | |||
+ | === onAfterPrepare === | ||
+ | |||
+ | ''void onAfterPrepare(RawDocument document, WriteablePreparator preparator)'' | ||
+ | |||
+ | Called after a document is being prepared to be added to the index. | ||
+ | Here you can override the results of the preperator, if necessary. | ||
+ | |||
+ | |||
+ | **Parameters**: | ||
+ | ^ Parameter Name ^ Description ^ | ||
+ | | document | Regain document that was analysed | | ||
+ | | preparator | Preperator that has analysed this document | | ||
==== Existing Plugins ==== | ==== Existing Plugins ==== | ||
- | * Create Thumbnails of indexed documents: https://github.com/benjamin4ruby/java-thumbnailer | + | * **FilesizeFilterPlugin** (included in regain): Blacklist files that have a filesize below or above a certain treshold |
+ | * [[https://github.com/benjamin4ruby/java-thumbnailer|JavaThumbnailer]]: Create Thumbnails of indexed documents | ||