====== Crawling Process ======

=== 1. Creating the job queue ===
  
   * According to the Crawling [[features:white_and_black_list|Whitelist and Blacklist]], the start URLs are added to the job queue (Crawler::addJob). ''__onAcceptURL(String url, CrawlerJob job)__'' or ''__onDeclineURL(String url)__'' is called to inform the plugin (see the sketch after this list).
      * If a job would be accepted, the crawler plugins are asked whether they want to blacklist it anyway (''__boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)__'' - if at least one of the plugins returns true, the file isn't indexed).
   * Recursively, all entries are checked (Crawler::run):
      * If they are folders (and should be indexed), their content is added to the job queue.
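
To make these plugin notifications concrete, here is a minimal sketch of a crawler plugin built only from the hook signatures quoted above. The ''CrawlerPlugin'' interface and the ''CrawlerJob'' stub are reconstructed assumptions, not necessarily the project's actual plugin API.

<code java>
// Assumed plugin interface, reconstructed from the quoted hook signatures;
// the real plugin API may declare additional hooks.
interface CrawlerPlugin {
  void onAcceptURL(String url, CrawlerJob job);
  void onDeclineURL(String url);
  boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText);
}

// Hypothetical stub for the crawler's job class.
class CrawlerJob { }

public class LoggingPlugin implements CrawlerPlugin {

  public void onAcceptURL(String url, CrawlerJob job) {
    System.out.println("accepted: " + url);
  }

  public void onDeclineURL(String url) {
    System.out.println("declined: " + url);
  }

  // Returning true vetoes a job that the whitelist would have accepted.
  public boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
    return url.contains("/tmp/");  // example rule: skip temporary folders
  }
}
</code>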
   * Create a new Index entry (IndexWriterManager::createNewIndexEntry).
   * First the document is prepared for indexation (DocumentFactory::createDocument):
      * [[features:auxiliary_fields|Auxiliary Fields]] are calculated.
      * The [[features:access_rights_management|Crawler Access Controller]] (if available) is asked to retrieve the allowed groups.
      * The MIME type is identified (org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier::identify()).
      * All preparators are collected which accept this MIME type (see the sketch after this list).
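
As an illustration of this preparation step, the sketch below identifies a document's MIME type with Aperture and then collects the preparators that accept it. The ''Preparator'' interface is a simplified, hypothetical stand-in for the actual preparator API, and the code assumes Aperture's ''identify(byte[], String, URI)'' signature.

<code java>
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.List;

import org.semanticdesktop.aperture.mime.identifier.MimeTypeIdentifier;
import org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier;

// Hypothetical, simplified stand-in for the preparator API.
interface Preparator {
  boolean accepts(String mimeType);  // does this preparator handle the MIME type?
}

public class PreparatorSelection {

  // Collect all preparators that accept the identified MIME type.
  static List<Preparator> collectPreparators(List<Preparator> all, String mimeType) {
    List<Preparator> matching = new ArrayList<Preparator>();
    for (Preparator p : all) {
      if (p.accepts(mimeType)) {
        matching.add(p);
      }
    }
    return matching;
  }

  public static void main(String[] args) throws Exception {
    MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();

    // Read just enough leading bytes for magic-number detection
    // (short reads are ignored for brevity).
    byte[] firstBytes = new byte[identifier.getMinArrayLength()];
    FileInputStream in = new FileInputStream("document.pdf");
    in.read(firstBytes);
    in.close();

    // File name and URI are optional hints; the signature is assumed here.
    String mimeType = identifier.identify(firstBytes, "document.pdf", null);
    System.out.println("Identified MIME type: " + mimeType);
    // A real crawler would now call collectPreparators(allPreparators, mimeType).
  }
}
</code>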
   * Then it is added to the [[http://lucene.apache.org|Lucene]] [[components:search_index|index]], after notification of the plugins (''__onCreateIndexEntry(Document doc, IndexWriter index)__''); a sketch of this step follows.
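
For orientation, adding an entry to the index looks roughly like this with the Lucene 3.x API; the field names and index directory are invented for illustration and do not reflect the actual index schema.

<code java>
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class AddIndexEntry {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(
        Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("searchindex")), config);

    // One Lucene Document per crawled file; field names are illustrative.
    Document doc = new Document();
    doc.add(new Field("url", "http://example.com/page.html",
        Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", "extracted plain text of the page",
        Field.Store.NO, Field.Index.ANALYZED));

    writer.addDocument(doc);  // the step the plugins were notified about
    writer.close();
  }
}
</code>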
  
At the end, ''__onFinishCrawling(Crawler)__'' is called.