components:crawler_plugins

====== Differences ======

This shows you the differences between two versions of the page.

components:crawler_plugins [2011/07/30 12:26]
benjamin
components:crawler_plugins [2024/09/18 08:31] (current)
Line 33: Line 33:
  
 **Parameters**:
-^ Paramter Name       ^ Description                                              ^
+^ Parameter Name      ^ Description                                              ^
 | crawler             | The crawler instance that is about to begin crawling     |
  
Line 45: Line 45:
  
 **Parameters**:
-^ Paramter Name       ^ Description                                              ^
+^ Parameter Name      ^ Description                                              ^
 | crawler             | The crawler instance that is about to begin crawling     |
 +
 +=== checkDynamicBlacklist ===
 +
 +''boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)''
 +
 +Allows blacklisting of specific URLs.
 +
 +This function is called when the URL would normally be accepted, i.e. it is included in the whitelist and not included in the blacklist.
 +
 +**Parameters**:
 +^ Parameter Name       ^ Description                                              ^
 +| url             | URL of the crawling job that would normally be added.     |
 +| sourceUrl             | The URL where the url above was found (<a> tag, PDF or similar)     |
 +| sourceLinkText | The link text of the url above in the document where it was found. |
 +
 +**Returns**:
 +
 +''true'': blacklist this URL. ''false'': allow this URL.
 +
 +If at least one of the crawler plugins returns ''true'', the file is treated as blacklisted.
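As an illustration of this hook, here is a minimal sketch of a ''checkDynamicBlacklist'' implementation that blacklists URLs by file extension. Only the method signature comes from the page above; the class name and the blocked extensions are assumptions for the example, and a real plugin would also implement the rest of the crawler plugin interface.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical example plugin: the class name and the blocked extensions
// are assumptions; only the checkDynamicBlacklist signature is from the page.
public class ExtensionBlacklistPlugin {

    // File extensions we do not want in the index (assumed example values).
    private static final List<String> BLOCKED_EXTENSIONS =
            Arrays.asList(".exe", ".zip", ".iso");

    /**
     * Returns true to blacklist the URL, false to allow it.
     * Only called for URLs that would otherwise be accepted.
     */
    public boolean checkDynamicBlacklist(String url, String sourceUrl,
                                         String sourceLinkText) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String ext : BLOCKED_EXTENSIONS) {
            if (lower.endsWith(ext)) {
                return true;  // treat as blacklisted
            }
        }
        return false;  // allow the URL
    }

    public static void main(String[] args) {
        ExtensionBlacklistPlugin plugin = new ExtensionBlacklistPlugin();
        System.out.println(plugin.checkDynamicBlacklist(
                "http://example.com/setup.exe", "http://example.com/", "Download"));
        System.out.println(plugin.checkDynamicBlacklist(
                "http://example.com/manual.html", "http://example.com/", "Manual"));
    }
}
```

Because any single plugin returning ''true'' blacklists the file, a filter like this composes safely with other installed plugins.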
  
 === onAcceptURL ===
Line 57: Line 77:
  
 **Parameters**:
-^ Paramter Name       ^ Description                                              ^
+^ Parameter Name      ^ Description                                              ^
 | url             | URL that was just accepted     |
 | job             | CrawlerJob that was created as a consequence     |
Line 70: Line 90:
  
 **Parameters**:
-^ Paramter Name       ^ Description                                              ^
+^ Parameter Name      ^ Description                                              ^
 | url             | URL that was just declined     |
  
Line 83: Line 103:
  
 **Parameters**:
-^ Paramter Name       ^ Description                                              ^
+^ Parameter Name      ^ Description                                              ^
 | doc             | Document to write     |
 | index             | Lucene IndexWriter     |
Line 97: Line 117:
  
 **Parameters**:
-^ Paramter Name       ^ Description                                              ^
+^ Parameter Name      ^ Description                                              ^
 | doc             | Document to read     |
 | index             | Lucene IndexReader     |
Line 109: Line 129:
  
 **Parameters**:
-^ Paramter Name       ^ Description                                              ^
+^ Parameter Name      ^ Description                                              ^
 | document | Regain document that will be analysed |
 | preparator | Preparator that was chosen to analyse this document |
Line 119: Line 139:
 Called after a document has been prepared for addition to the index.
 Here you can override the results of the preparator, if necessary.
 +
  
 **Parameters**:
-^ Paramter Name       ^ Description                                              ^
-| document | Regain document that will be analysed |
-| preparator | Preperator that was chosen to analyse this document |
+^ Parameter Name      ^ Description                                              ^
+| document | Regain document that was analysed |
+| preparator | Preparator that has analysed this document |
  
 ==== Existing Plugins ====
  
-  * Create Thumbnails of indexed documents: https://github.com/benjamin4ruby/java-thumbnailer
+  * **FilesizeFilterPlugin** (included in regain): blacklists files whose filesize is below or above a certain threshold
+  * [[https://github.com/benjamin4ruby/java-thumbnailer|JavaThumbnailer]]: creates thumbnails of indexed documents
  
  
components/crawler_plugins.1312021610.txt.gz · Last modified: 2024/09/18 08:31 (external edit)