Crawling Process

How does the crawling process work? At which points do the crawler plugins hook in?

At the beginning, onStartCrawling(Crawler) is called for all plugins.
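
All of these hooks are methods of a crawler plugin. The following minimal sketch shows the two lifecycle hooks with the signatures named on this page; the base class CrawlerPluginAdapter (assumed to provide empty default implementations) and the package net.sf.regain.crawler are assumptions and may differ from the actual plugin API.

  // Minimal plugin sketch using the start/finish hooks (assumed base class).
  import net.sf.regain.crawler.Crawler;

  public class CrawlTimerPlugin extends CrawlerPluginAdapter {

      private long startTime;

      @Override
      public void onStartCrawling(Crawler crawler) {
          // Called for every plugin before the job queue is built.
          startTime = System.currentTimeMillis();
      }

      @Override
      public void onFinishCrawling(Crawler crawler) {
          // Called for every plugin after the whole crawl has finished.
          long seconds = (System.currentTimeMillis() - startTime) / 1000;
          System.out.println("Crawl finished after " + seconds + " s");
      }
  }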

1. Creating the job queue

  • According to the Crawling Whitelist and Blacklist, the start URLs are added to the job queue (Crawler::addJob). To inform the plugins, either onAcceptURL(String url, CrawlerJob job) or onDeclineURL(String url) is called.
    • If a job would be accepted, the crawler plugins are additionally asked whether they want to blacklist it anyway (boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText)): if at least one plugin returns true, the file is not indexed. A sketch of these hooks follows this list.
  • Then all entries are processed recursively (Crawler::run):
    • If they are folders (and should be indexed), their contents are added to the job queue.
    • If they are files (and should be parsed), their content is analysed in order to add new URLs to the job queue.
    • If they are files (and should be indexed), they are indexed as described in the next section.
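
The job-queue hooks from this step could look like the following sketch (further methods of the hypothetical plugin class above). The signatures are the ones named in the list; the concrete blacklist rule, skipping links whose anchor text is "Logout", is only an illustrative assumption.

  @Override
  public void onAcceptURL(String url, CrawlerJob job) {
      // The URL passed the whitelist/blacklist and was added to the job queue.
      System.out.println("queued:  " + url);
  }

  @Override
  public void onDeclineURL(String url) {
      // The URL was rejected by the whitelist/blacklist.
      System.out.println("skipped: " + url);
  }

  @Override
  public boolean checkDynamicBlacklist(String url, String sourceUrl, String sourceLinkText) {
      // Returning true from any plugin keeps the URL from being indexed.
      // Example rule (assumption): never follow "Logout" links.
      return sourceLinkText != null && sourceLinkText.equalsIgnoreCase("Logout");
  }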

2. Indexing a single document

  • (Note that a document (RawDocument) may be a local file, a file on a network share (smb), an HTTP request, or an IMAP message.)
  • First, it is checked whether the document already exists in the index (IndexWriterManager::addToIndex).
    • If so, it is checked whether the document was indexed recently (less than a day ago) by comparing the lastModified times.
      • If it was indexed recently, processing stops and the crawler continues with the next document.
      • If it was indexed more than a day ago, or this cannot be checked (HTTP), the current entry is deleted and the process continues. (Note: the entry is not deleted immediately, but at the end of the crawling process (IndexWriterManager::removeObsoleteEntries). That is also when onDeleteIndexEntry(Document doc, IndexReader index) is called, just before the actual deletion.)
  • Create a new Index entry (IndexWriterManager::createNewIndexEntry)
  • First, the document is prepared for indexing (DocumentFactory::createDocument):
    • Auxiliary Fields are calculated.
    • The Crawler Access Controller (if available) is asked to retrieve the allowed groups.
    • The MIME-Type is identified (org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier::identify()).
    • All preparators which accept this MIME-Type are collected.
    • The preparator with the highest priority is executed (DocumentFactory::createDocument).
      • The plugins are called before (onBeforePrepare(RawDocument document, WriteablePreparator preparator)) and after (onAfterPrepare(RawDocument document, WriteablePreparator preparator)) the actual preparation.
    • If the preparator fails, an empty document is returned (DocumentFactory::createSubstituteDocument).
  • The document is then added to the Lucene index, after the plugins have been notified (onCreateIndexEntry(Document doc, IndexWriter index)). A sketch of the preparation and index-entry hooks follows this list.
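
The preparation and index-entry hooks from this step, again as methods of the hypothetical plugin class above. Document, IndexWriter and IndexReader are the standard Lucene classes; the field name "url" read from the Lucene document is an assumption.

  @Override
  public void onBeforePrepare(RawDocument document, WriteablePreparator preparator) {
      // Called just before the chosen preparator processes this document.
  }

  @Override
  public void onAfterPrepare(RawDocument document, WriteablePreparator preparator) {
      // Called right after preparation, while the extracted content is still
      // held by the preparator.
  }

  @Override
  public void onCreateIndexEntry(Document doc, IndexWriter index) {
      // Called just before the prepared Lucene document is added to the index.
      System.out.println("indexing: " + doc.get("url"));
  }

  @Override
  public void onDeleteIndexEntry(Document doc, IndexReader index) {
      // Called at the end of the crawl, just before an obsolete entry is
      // removed (IndexWriterManager::removeObsoleteEntries).
      System.out.println("removing: " + doc.get("url"));
  }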

At the end, onFinishCrawling(Crawler) is called.
