====== Crawling Process ======
How does the crawling process work? Where do the [[components:crawler_plugins|Crawler Plugins]] interact?

At the beginning, ''__onStartCrawling(Crawler)__'' is called for all plugins.

=== 1. Creating the job queue ===
  * According to the Crawling [[features:white_and_black_list|Whitelist and Blacklist]], the start URLs are added to the job queue (Crawler::addJob). ''__onAcceptURL(String url, CrawlerJob job)__'' or ''__onDeclineURL(String url)__'' is called to inform the plugins.
  * If a job would be accepted, the crawler plugins are additionally asked whether they want to blacklist it anyway (''__boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)__''). If at least one plugin returns true, the file is not indexed.
  * Recursively, all entries are checked (Crawler::run):
    * If they are folders (and should be indexed), their content is added to the job queue.
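
The dynamic blacklist step can be sketched as follows. This is a minimal, self-contained illustration: the ''CrawlerPlugin'' interface and the ''isBlacklisted'' helper below are stand-ins for the crawler's actual plugin API, not its real signatures.

<code java>
import java.util.List;

public class DynamicBlacklistDemo {

    // Stand-in for the plugin hook described above (illustrative only).
    public interface CrawlerPlugin {
        boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText);
    }

    // A job survives only if no plugin vetoes it: if at least one plugin
    // returns true, the URL is treated as blacklisted and not indexed.
    public static boolean isBlacklisted(List<CrawlerPlugin> plugins,
                                        String url, String sourceUrl, String sourceLinkText) {
        for (CrawlerPlugin plugin : plugins) {
            if (plugin.checkDynamicBlacklist(url, sourceUrl, sourceLinkText)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Example plugin that vetoes everything under /tmp/.
        CrawlerPlugin tmpFilter = (url, src, text) -> url.contains("/tmp/");
        List<CrawlerPlugin> plugins = List.of(tmpFilter);

        System.out.println(isBlacklisted(plugins, "http://example.com/tmp/x.html", "http://example.com", "x"));
        System.out.println(isBlacklisted(plugins, "http://example.com/docs/a.html", "http://example.com", "a"));
    }
}
</code>

Note the veto semantics: a single ''true'' from any plugin is enough to drop the job, so plugins can only narrow the whitelist, never widen it.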
=== 2. Indexing a single document ===
Note that a document (RawDocument) may be a file, a file on a network share (SMB), an HTTP request, or an IMAP message.

  * First, verify whether the document already exists in the index (IndexWriterManager::addToIndex).
    * If so, check whether it was indexed recently (less than a day ago) by comparing the lastModified times.
      * If it was indexed recently, stop this process and continue with the next document.
      * If it was indexed more than a day ago, or this cannot be checked (HTTP), then delete the current entry and proceed. (Note: it is not deleted immediately, but rather at the end of the crawling process (IndexWriterManager::removeObsoleteEntries). That is also when ''__onDeleteIndexEntry(Document doc, IndexReader index)__'' is called, just before deletion.)
  * Create a new index entry (IndexWriterManager::createNewIndexEntry).
    * First, the document is prepared for indexing (DocumentFactory::createDocument).
      * [[features:auxiliary_fields|Auxiliary Fields]] are calculated.
      * The [[features:access_rights_management|Crawler Access Controller]] (if available) is asked to retrieve the allowed groups.
      * The MIME type is identified (org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier::identify()).
      * All preparators that accept this MIME type are collected.
      * The preparator with the highest priority is executed (DocumentFactory::createDocument).
        * Before (''__onBeforePrepare(RawDocument document, WriteablePreparator preparator)__'') and after (''__onAfterPrepare(RawDocument document, WriteablePreparator preparator)__'') the actual preparation, the plugins are called.
      * If it fails, an empty substitute document is returned (DocumentFactory::createSubstituteDocument).
    * Then the document is added to the [[http://lucene.apache.org|Lucene]] [[components:search_index|index]], after notification of the plugins (''__onCreateIndexEntry(Document doc, IndexWriter index)__'').
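
The preparator selection described above (collect every preparator that accepts the MIME type, then run the one with the highest priority) can be sketched like this. The ''Preparator'' interface, ''SimplePreparator'' record, and ''choose'' helper are illustrative stand-ins, not the real preparator API:

<code java>
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class PreparatorSelectionDemo {

    // Stand-in for a preparator (illustrative only).
    public interface Preparator {
        boolean accepts(String mimeType);
        int getPriority();
        String getName();
    }

    // Simple fixed-MIME-type preparator used for the demo.
    public record SimplePreparator(String name, String mime, int priority) implements Preparator {
        public boolean accepts(String mimeType) { return mime.equals(mimeType); }
        public int getPriority() { return priority; }
        public String getName() { return name; }
    }

    // Collect all preparators accepting the MIME type and pick the one with
    // the highest priority; empty means a substitute document would be created.
    public static Optional<Preparator> choose(List<Preparator> preparators, String mimeType) {
        return preparators.stream()
                .filter(p -> p.accepts(mimeType))
                .max(Comparator.comparingInt(Preparator::getPriority));
    }

    public static void main(String[] args) {
        List<Preparator> preps = List.of(
                new SimplePreparator("GenericTextPreparator", "text/html", 0),
                new SimplePreparator("HtmlPreparator", "text/html", 10),
                new SimplePreparator("PdfPreparator", "application/pdf", 10));
        System.out.println(choose(preps, "text/html").map(Preparator::getName).orElse("substitute"));
        System.out.println(choose(preps, "image/png").map(Preparator::getName).orElse("substitute"));
    }
}
</code>

The empty ''Optional'' models the failure branch above: when no preparator accepts the MIME type, an empty substitute document stands in for the real content.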
At the end, ''__onFinishCrawling(Crawler)__'' is called.
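
Putting it together, the plugin hooks on this page fire in a fixed order over one crawl. The sketch below simulates that order for a set of accepted documents; only the hook names come from this page, the driver loop itself is purely illustrative:

<code java>
import java.util.ArrayList;
import java.util.List;

public class PluginLifecycleDemo {

    // Record the hook names in the order a crawl would invoke them for
    // a list of accepted URLs (illustrative driver, not the real crawler).
    public static List<String> crawl(List<String> acceptedUrls) {
        List<String> calls = new ArrayList<>();
        calls.add("onStartCrawling");
        for (String url : acceptedUrls) {
            calls.add("onAcceptURL:" + url);
            calls.add("onBeforePrepare:" + url);
            calls.add("onAfterPrepare:" + url);
            calls.add("onCreateIndexEntry:" + url);
        }
        calls.add("onFinishCrawling");
        return calls;
    }

    public static void main(String[] args) {
        crawl(List.of("http://example.com/a.html")).forEach(System.out::println);
    }
}
</code>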