====== Crawling Process ======

How does the crawling process work? Where do the [[components:crawler_plugins|Crawler Plugins]] interact?

At the beginning, ''__onStartCrawling(Crawler)__'' is called for all plugins.
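As a minimal sketch of this plugin lifecycle: the interface and ''Crawler'' class below are simplified stand-ins so the example is self-contained; the real ''CrawlerPlugin'' interface declares the additional hooks described further down this page, and the class names here are illustrative, not taken from the regain source.

<code java>
// Minimal sketch of the plugin lifecycle. CrawlerPlugin and Crawler are
// simplified stand-ins; the real interface has more hooks (see below).
interface CrawlerPlugin {
  void onStartCrawling(Crawler crawler);
  void onFinishCrawling(Crawler crawler);
}

class Crawler { /* placeholder for the real crawler class */ }

public class LoggingPlugin implements CrawlerPlugin {
  public void onStartCrawling(Crawler crawler) {
    // Called once for every plugin before the job queue is built.
    System.out.println("Crawl started at " + new java.util.Date());
  }

  public void onFinishCrawling(Crawler crawler) {
    // Called once after all jobs have been processed.
    System.out.println("Crawl finished at " + new java.util.Date());
  }
}
</code>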
  
=== 1. Creating the job queue ===
  
   * According to the Crawling [[features:white_and_black_list|Whitelist and Blacklist]], the start URLs are added to the job queue (Crawler::addJob). (''__onAcceptURL(String url, CrawlerJob job)__'' or ''__onDeclineURL(String url)__'' is called to inform the plugins.)
      * If a job would be accepted, the crawler plugins are asked whether they want to blacklist it anyway (''__boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)__''; if at least one plugin returns true, the file isn't indexed). See the sketch after this list.
   * Recursively, all entries are checked (Crawler::run):
      * If they are folders (and should be indexed), their content is added to the job queue.
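For illustration, a dynamic-blacklist rule could decline links whose anchor text suggests a destructive or session-ending action. The rule below is made up for this example; only the hook signature comes from the plugin interface, and the surrounding plugin class is omitted.

<code java>
// Hypothetical checkDynamicBlacklist implementation: blacklist any URL
// whose source link text looks like a logout or delete action, so the
// crawler never follows such links even if the whitelist accepts them.
public boolean checkDynamicBlacklist(String url, String sourceUrl,
    String sourceLinkText) {
  if (sourceLinkText == null) {
    return false; // no link text available: don't blacklist
  }
  String text = sourceLinkText.toLowerCase();
  // Returning true from any plugin blacklists the job.
  return text.contains("logout") || text.contains("delete");
}
</code>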
=== 2. Indexing a single document ===
  
   * (Note that a document (RawDocument) may be a file, a file on a network share (smb), an HTTP request, or an IMAP message.)
   * First, check whether the document already exists in the index (IndexWriterManager::addToIndex).
      * If so, check whether it was indexed recently (less than a day ago) by comparing the lastModified times.
         * If it was indexed recently, stop this process and continue with the next document.
         * If it was indexed more than a day ago, or this cannot be checked (HTTP), then delete the current entry and proceed. (Note: it is not deleted immediately, but at the end of the crawling process (IndexWriterManager::removeObsoleteEntries). That is also when ''__onDeleteIndexEntry(Document doc, IndexReader index)__'' is called, just before deletion.)
   * Create a new index entry (IndexWriterManager::createNewIndexEntry).
   * First, the document is prepared for indexing (DocumentFactory::createDocument):
      * [[features:auxiliary_fields|Auxiliary Fields]] are calculated.
      * The [[features:access_rights_management|Crawler Access Controller]] (if available) is asked to retrieve the allowed groups.
      * The MIME type is identified (org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier::identify()).
      * All preparators that accept this MIME type are collected.
      * The preparator with the highest priority is executed (DocumentFactory::createDocument).
         * Before (''__onBeforePrepare(RawDocument document, WriteablePreparator preparator)__'') and after (''__onAfterPrepare(RawDocument document, WriteablePreparator preparator)__'') the actual preparation, the plugins are called (see the sketch after this list).
      * If the preparator fails, an empty document is returned (DocumentFactory::createSubstituteDocument).
   * Then it is added to the [[http://lucene.apache.org|Lucene]] [[components:search_index|index]], after the plugins have been notified (''__onCreateIndexEntry(Document doc, IndexWriter index)__'').
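As an illustration of these indexing-phase hooks, a plugin might time each preparation run and tag every new entry with the crawl date. This is a sketch under assumptions: ''RawDocument'' is assumed to expose a ''getUrl()'' accessor, the ''crawl-date'' field name is made up, and the ''Field'' constructor used is the pre-4.0 Lucene API that regain's bundled Lucene is assumed to provide.

<code java>
// Hypothetical indexing-phase hooks (plugin class and remaining hooks omitted).
// Assumes RawDocument has a getUrl() accessor and a pre-4.0 Lucene API.
private final java.util.Map<String, Long> startTimes =
    new java.util.concurrent.ConcurrentHashMap<String, Long>();

public void onBeforePrepare(RawDocument document, WriteablePreparator preparator) {
  // Remember when preparation of this document started.
  startTimes.put(document.getUrl(), System.nanoTime());
}

public void onAfterPrepare(RawDocument document, WriteablePreparator preparator) {
  // Report how long the preparator needed for this document.
  Long start = startTimes.remove(document.getUrl());
  if (start != null) {
    long millis = (System.nanoTime() - start) / 1000000L;
    System.out.println("prepared " + document.getUrl() + " in " + millis + " ms");
  }
}

public void onCreateIndexEntry(org.apache.lucene.document.Document doc,
    org.apache.lucene.index.IndexWriter index) {
  // Tag the entry just before it is added to the index ("crawl-date" is a
  // made-up field name, not one regain defines).
  doc.add(new org.apache.lucene.document.Field("crawl-date",
      new java.text.SimpleDateFormat("yyyy-MM-dd").format(new java.util.Date()),
      org.apache.lucene.document.Field.Store.YES,
      org.apache.lucene.document.Field.Index.NOT_ANALYZED));
}
</code>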
  
At the end, ''__onFinishCrawling(Crawler)__'' is called.