

====== Crawling Process ======

How does the crawling process work?

=== 1. Creating the job queue ===

  * According to the crawling whitelist (and blacklist), the start URLs are added to the job queue (Crawler::addJob).
  * All entries are then checked recursively (Crawler::run):
    * If an entry is a folder (and should be indexed), its contents are added to the job queue.
    * If an entry is a file (and should be parsed), its content is analysed in order to add new URLs to the job queue.
    * If an entry is a file (and should be indexed), it is indexed as described in the next section.

=== 2. Indexing a single document ===

Note that a document (RawDocument) may be a file, a file on a network share (SMB), an HTTP request, or an IMAP message.

  * First, verify whether the document already exists in the index (IndexWriterManager::addToIndex).
    * If so, check whether it was indexed recently (less than a day ago) by comparing their lastModified times.
      * If it was indexed recently, stop this process and continue with the next document.
      * If it was indexed more than one day ago, or this cannot be checked (HTTP), then delete the current entry and proceed. <ref>It is not deleted immediately, but rather at the end of the crawling process (IndexWriterManager::removeObsoleteEntries).</ref>
  * Create a new index entry (IndexWriterManager::createNewIndexEntry).
    * First, the document is prepared for indexing (DocumentFactory::createDocument):
      * The MIME type is identified (org.semanticdesktop.aperture.mime.identifier.magic.MagicMimeTypeIdentifier::identify()).
      * All preparators that accept this MIME type are collected.
      * The preparator with the highest priority is executed (DocumentFactory::createDocument).
      * If it fails, an empty document is returned (DocumentFactory::createSubstituteDocument).
    * Then it is added to the [[Lucene]] [[Search index|index]].
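
The two stages described above can be illustrated with a small, self-contained Java sketch. It is not the project's actual code: the class ''CrawlSketch'', the ''Action'' enum, the ''Preparator'' holder and the methods ''crawl'', ''decide'' and ''select'' are all hypothetical names chosen for illustration. The sketch also makes some assumptions: the whitelist/blacklist filtering is left out, each URL is queued at most once, "indexed recently" is read as "the existing entry is less than one day old", and an entry that is exactly one day old is re-indexed (the text above does not cover that case). The fallback to a substitute document when a preparator fails is not modelled.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustrative only: every name in this class is hypothetical,
// not the real Crawler / IndexWriterManager / DocumentFactory API.
class CrawlSketch {

    /** What to do with a document that may already be in the index. */
    enum Action { ADD_NEW, SKIP, REINDEX }

    /** Minimal stand-in for a preparator: a priority and the MIME types it accepts. */
    static final class Preparator {
        final String name;
        final int priority;
        final Set<String> mimeTypes;

        Preparator(String name, int priority, Set<String> mimeTypes) {
            this.name = name;
            this.priority = priority;
            this.mimeTypes = mimeTypes;
        }
    }

    // Stage 1: work through a job queue seeded with the start URLs.
    // 'links' stands in for listing a folder or parsing a file for new URLs.
    // Assumption: a URL that was already queued is not queued again.
    static List<String> crawl(List<String> startUrls, Map<String, List<String>> links) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        for (String url : startUrls) {
            if (seen.add(url)) {
                queue.add(url);
            }
        }
        List<String> processed = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            processed.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return processed;
    }

    // Stage 2, first step: decide whether to skip, re-index or add a document.
    // entryAge is empty when the age cannot be checked (the HTTP case).
    static Action decide(boolean inIndex, Optional<Duration> entryAge) {
        if (!inIndex) {
            return Action.ADD_NEW;
        }
        if (entryAge.isPresent() && entryAge.get().compareTo(Duration.ofDays(1)) < 0) {
            return Action.SKIP; // indexed less than a day ago
        }
        return Action.REINDEX; // one day or older, or age unknown
    }

    // Stage 2, second step: among the preparators that accept the MIME type,
    // pick the one with the highest priority (empty if none accepts it).
    static Optional<Preparator> select(List<Preparator> all, String mimeType) {
        return all.stream()
                .filter(p -> p.mimeTypes.contains(mimeType))
                .max(Comparator.comparingInt((Preparator p) -> p.priority));
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "file:///docs", List.of("file:///docs/a.txt", "file:///docs/b.html"),
                "file:///docs/b.html", List.of("file:///docs/a.txt"));
        System.out.println(crawl(List.of("file:///docs"), links));
        System.out.println(decide(true, Optional.of(Duration.ofHours(3))));
        System.out.println(decide(true, Optional.empty()));
    }
}
```

Using a queue plus a "seen" set keeps the traversal iterative and protects against link cycles, which is one plausible way to realise the recursive check described in section 1.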

project_info/crawling_process.1307116884.txt.gz · Last modified: 2024/09/18 08:31 (external edit)