====== Crawling Process ======

=== 1. Creating the job queue ===
  
   * According to the Crawling [[features:white_and_black_list|Whitelist and Blacklist]], the start URLs are added to the job queue (Crawler::addJob). ''__onAcceptURL(String url, CrawlerJob job)__'' or ''__onDeclineURL(String url)__'' is called to inform the plugin (see the sketch after this list).
      * If a job would be accepted, the crawler plugins are asked whether they want to blacklist it anyway (''__boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)__'' - if at least one of the plugins returns true, the file isn't indexed).
   * Recursively, all entries are checked (Crawler::run):
      * If they are folders (and should be indexed), their content is added to the job queue.
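
To make these plugin notifications concrete, here is a minimal sketch of a crawler plugin built only from the hook signatures quoted above. The ''CrawlerPlugin'' interface and the ''CrawlerJob'' stub are reconstructed assumptions, not necessarily the project's actual plugin API.

<code java>
// Assumed plugin interface, reconstructed from the quoted hook signatures;
// the real plugin API may declare additional hooks.
interface CrawlerPlugin {
  void onAcceptURL(String url, CrawlerJob job);
  void onDeclineURL(String url);
  boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText);
}

// Hypothetical stub for the crawler's job class.
class CrawlerJob { }

public class LoggingPlugin implements CrawlerPlugin {

  public void onAcceptURL(String url, CrawlerJob job) {
    System.out.println("accepted: " + url);
  }

  public void onDeclineURL(String url) {
    System.out.println("declined: " + url);
  }

  // Returning true vetoes a job that the whitelist would have accepted.
  public boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
    return url.contains("/tmp/");  // example rule: skip temporary folders
  }
}
</code>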
   * Create a new Index entry (IndexWriterManager::createNewIndexEntry).
   * First the document is prepared for indexation (DocumentFactory::createDocument):
      * [[features:auxiliary_fields|Auxiliary Fields]] are calculated.
      * The [[features:access_rights_management|Crawler Access Controller]] (if available) is asked to retrieve the allowed groups.
      * The MIME type is identified (org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier::identify()).
      * All preparators are collected which accept this MIME type (see the sketch after this list).
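
As an illustration of this preparation step, the sketch below identifies a document's MIME type with Aperture and then collects the preparators that accept it. The ''Preparator'' interface is a simplified, hypothetical stand-in for the actual preparator API, and the code assumes Aperture's ''identify(byte[], String, URI)'' signature.

<code java>
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier;
import org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier;

// Hypothetical, simplified stand-in for the preparator API.
interface Preparator {
  boolean accepts(String mimeType);  // does this preparator handle the MIME type?
}

public class PreparatorSelection {

  // Collect all preparators that accept the identified MIME type.
  static List<Preparator> collectPreparators(List<Preparator> all, String mimeType) {
    List<Preparator> matching = new ArrayList<Preparator>();
    for (Preparator p : all) {
      if (p.accepts(mimeType)) {
        matching.add(p);
      }
    }
    return matching;
  }

  public static void main(String[] args) throws Exception {
    MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();

    // Read just enough leading bytes for magic-number detection
    // (short reads are ignored for brevity).
    byte[] firstBytes = new byte[identifier.getMinArrayLength()];
    FileInputStream in = new FileInputStream("document.pdf");
    in.read(firstBytes);
    in.close();

    // File name and URI are optional hints; the signature is assumed here.
    String mimeType = identifier.identify(firstBytes, "document.pdf", null);
    System.out.println("Identified MIME type: " + mimeType);
    // A real crawler would now call collectPreparators(allPreparators, mimeType).
  }
}
</code>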
   * Then it is added to the [[http://lucene.apache.org|Lucene]] [[components:search_index|index]], after notification of the plugins (''__onCreateIndexEntry(Document doc, IndexWriter index)__''); a sketch of this step follows.
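
For orientation, adding an entry to the index looks roughly like this with the Lucene 3.x API; the field names and index directory are invented for illustration and do not reflect the actual index schema.

<code java>
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AddIndexEntry {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(
        Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("searchindex")), config);

    // One Lucene Document per crawled file; field names are illustrative.
    Document doc = new Document();
    doc.add(new Field("url", "http://example.com/page.html",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", "extracted plain text of the page",
        Field.Store.NO, Field.Index.ANALYZED));

    writer.addDocument(doc);  // the step the plugins were notified about
    writer.close();
  }
}
</code>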
  
At the end, ''__onFinishCrawling(Crawler)__'' is called.