====== Crawling Process ======

=== 1. Creating the job queue ===
  
  * According to the Crawling [[features:white_and_black_list|Whitelist and Blacklist]], the start URLs are added to the job queue (Crawler::addJob) (''__onAcceptURL(String url, CrawlerJob job)__'' or ''__onDeclineURL(String url)__'' is called to inform the plugin)
    * If a job would be accepted, the crawler plugins are asked if they want to blacklist it anyway (''__boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)__'' - if at least one of the plugins returns true, the file isn't indexed)
  * Recursively, all entries are checked: (Crawler::run)
    * If they are folders (and should be indexed), their content is added to the job queue
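The acceptance logic above can be sketched roughly as follows. Only ''checkDynamicBlacklist'' is taken from this page; the class name, the prefix-matching rule, and the queue type are illustrative assumptions, not regain's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Illustrative sketch of how URLs might be filtered into the job queue. */
public class JobQueueSketch {

    /** Minimal stand-in for a crawler plugin's dynamic blacklist hook. */
    public interface Plugin {
        boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText);
    }

    private final List<String> whitelist;   // allowed URL prefixes (assumption)
    private final List<String> blacklist;   // forbidden URL prefixes (assumption)
    private final List<Plugin> plugins = new ArrayList<>();
    final Queue<String> jobQueue = new ArrayDeque<>();

    public JobQueueSketch(List<String> whitelist, List<String> blacklist) {
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    public void addPlugin(Plugin plugin) {
        plugins.add(plugin);
    }

    /** Loosely mirrors Crawler::addJob: queue a URL only if it passes all filters. */
    public boolean addJob(String url, String sourceUrl, String sourceLinkText) {
        boolean accepted = whitelist.stream().anyMatch(url::startsWith)
                && blacklist.stream().noneMatch(url::startsWith);
        if (accepted) {
            // Even an accepted URL may still be vetoed by a plugin's dynamic blacklist.
            for (Plugin plugin : plugins) {
                if (plugin.checkDynamicBlacklist(url, sourceUrl, sourceLinkText)) {
                    accepted = false;
                    break;
                }
            }
        }
        if (accepted) {
            jobQueue.add(url);
        }
        return accepted;
    }
}
```

Note that the plugin veto is only consulted for URLs the static lists would accept, matching the "if a job would be accepted" wording above.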
  * Create a new Index entry (IndexWriterManager::createNewIndexEntry)
  * First the document is prepared for indexation (DocumentFactory::createDocument)
    * [[features:auxiliary_fields|Auxiliary Fields]] are calculated.
    * The [[features:access_rights_management|Crawler Access Controller]] (if available) is asked to retrieve the allowed groups.
    * The MIME type is identified (org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier::identify()).
    * All preparators that accept this MIME type are collected.
    * The preparator with the highest priority is executed (DocumentFactory::createDocument).
      * Before (''__onBeforePrepare(RawDocument document, WriteablePreparator preparator)__'') and after (''__onAfterPrepare(RawDocument document, WriteablePreparator preparator)__'') the actual preparation, the plugins are called.
    * If the preparator fails, an empty document is returned (DocumentFactory::createSubstituteDocument).
  * Then it is added to the [[http://lucene.apache.org|Lucene]] [[components:search_index|index]], after notification of the plugins (''__onCreateIndexEntry(Document doc, IndexWriter index)__'').
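The preparator selection described above can be sketched like this; the ''Preparator'' class, its priority field, and the ''choose'' helper are illustrative assumptions modeled on the steps listed, not regain's real classes:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Illustrative sketch of choosing a preparator by MIME type and priority. */
public class PreparatorSelectionSketch {

    /** Minimal stand-in for a regain preparator (names are assumptions). */
    public static class Preparator {
        final String name;
        final List<String> acceptedMimeTypes;
        final int priority;

        Preparator(String name, List<String> acceptedMimeTypes, int priority) {
            this.name = name;
            this.acceptedMimeTypes = acceptedMimeTypes;
            this.priority = priority;
        }

        boolean accepts(String mimeType) {
            return acceptedMimeTypes.contains(mimeType);
        }
    }

    /**
     * Collect all preparators that accept the MIME type, then pick the one
     * with the highest priority (loosely mirrors DocumentFactory::createDocument).
     */
    public static Optional<Preparator> choose(List<Preparator> all, String mimeType) {
        return all.stream()
                .filter(p -> p.accepts(mimeType))
                .max(Comparator.comparingInt(p -> p.priority));
    }
}
```

An empty result here corresponds to the substitute-document case above: when no preparator handles the MIME type (or preparation fails), an empty document is indexed instead.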
  
At the end, ''__onFinishCrawling(Crawler)__'' is called.
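Taken together, the callbacks named on this page suggest a plugin lifecycle along these lines. Only the method signatures come from this page; the interface name, the placeholder types, and the default no-op bodies are assumptions for illustration, not regain's exact declarations:

```java
/**
 * Sketch of the plugin lifecycle implied by the callbacks on this page.
 * The nested placeholder classes stand in for regain's real types; default
 * no-op bodies are an assumption so a plugin need only override what it uses.
 */
public interface CrawlerPluginSketch {

    // Placeholder types standing in for regain's real classes (assumptions).
    class Crawler {}
    class CrawlerJob {}
    class RawDocument {}
    class WriteablePreparator {}
    class Document {}
    class IndexWriter {}

    // Job queue phase: accept/decline notification and the dynamic blacklist veto.
    default void onAcceptURL(String url, CrawlerJob job) {}
    default void onDeclineURL(String url) {}
    default boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
        return false;
    }

    // Preparation phase: called around the actual preparator run.
    default void onBeforePrepare(RawDocument document, WriteablePreparator preparator) {}
    default void onAfterPrepare(RawDocument document, WriteablePreparator preparator) {}

    // Indexing phase and end of the crawl.
    default void onCreateIndexEntry(Document doc, IndexWriter index) {}
    default void onFinishCrawling(Crawler crawler) {}
}
```

A plugin overriding only ''checkDynamicBlacklist'', for example, could veto URLs by pattern while leaving every other hook untouched.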
project_info/crawling_process · Last modified: 2024/09/18 08:31