====== Features ======
  * and much more. You can find more information about the search syntax [[http://jakarta.apache.org/lucene/docs/queryparsersyntax.html|here]].
  * [[:features:Multi index search]]: Search multiple indexes with one search mask, totally transparent to the user.
  * URL-Rewriting: You can use URL rewriting in your search. This enables you to index documents from ''<nowiki>file://c:/www-data/intranet/docs</nowiki>'' and show them in the browser as ''<nowiki>http://intranet.murfman.de/docs</nowiki>''.
  * Advanced search: All values that are in the index for one field may be provided as a drop-down list on the search page. This is particularly useful together with [[:features:auxiliary fields]].
  * [[:features:File-to-http-bridge]]: For security reasons, some browsers do not load file links from HTTP pages. Therefore all documents in the index are also provided over HTTP. Of course this may be switched off, and in the [[:project_info:variant_comparison|desktop search]] these documents are only accessible from the local host.
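The URL rewriting and file-to-HTTP bridge above boil down to prefix mapping: an indexed ''file://'' prefix is swapped for a public HTTP prefix before the result is shown. A minimal Python sketch, assuming a simple prefix-to-prefix rule table — the function name and rule format are illustrative, not regain's actual XML configuration:

```python
# Illustrative sketch only: regain configures rewriting in XML,
# not with a Python dict like this.
def rewrite_url(url, rules):
    """Replace an indexed URL prefix with its public HTTP prefix."""
    for indexed_prefix, public_prefix in rules.items():
        if url.startswith(indexed_prefix):
            return public_prefix + url[len(indexed_prefix):]
    return url  # no rule matched: leave the URL unchanged

# Hypothetical rule mirroring the example above.
rules = {"file://c:/www-data/intranet/docs": "http://intranet.murfman.de/docs"}
print(rewrite_url("file://c:/www-data/intranet/docs/manual.pdf", rules))
# → http://intranet.murfman.de/docs/manual.pdf
```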
  
  
  * [[:features:White and black list]]: With a white list and a black list you can specify precisely which documents the crawler should process. E.g. you may index everything from ''<nowiki>http://www.murfman.de</nowiki>'' except for ''<nowiki>http://www.murfman.de/dynamiccontent</nowiki>''.
  * Several sources in one index: You may index documents from different file systems and/or web sites in the same search index.
  * Partial indexing: Suppose your search index contains documents from a network drive (file server) and a web site. You may update only the documents from the network drive. This way you can update some sources every hour and others only every week.
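The white/black list behaviour above can be sketched as two prefix checks: a URL is crawled only if it matches the white list and is not caught by the black list. A hedged sketch assuming plain prefix rules; the function and variable names are illustrative:

```python
def crawler_accepts(url, whitelist, blacklist):
    """A URL is processed only if it matches some whitelist prefix
    and no blacklist prefix (illustrative prefix semantics)."""
    allowed = any(url.startswith(p) for p in whitelist)
    blocked = any(url.startswith(p) for p in blacklist)
    return allowed and not blocked

# The example from the list above.
white = ["http://www.murfman.de"]
black = ["http://www.murfman.de/dynamiccontent"]
print(crawler_accepts("http://www.murfman.de/about.html", white, black))  # → True
print(crawler_accepts("http://www.murfman.de/dynamiccontent/a.jsp", white, black))  # → False
```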
  
  
===== Indexing =====
  
  * Hot deployment: Switch to a new search index without restarting your servlet engine (e.g. Tomcat).
  * Stopword list: Words on a defined list are not indexed.
  * [[:features:Analysis files]]: If desired, all intermediate steps of the indexing process can be written out as files. This allows you to see exactly what is in the search index.
  * Content extraction for HTML: Index only the actual content of your web pages. regain removes the navigation and footer from your HTML documents.
  * Path extraction for HTML: Shows the navigation path of your web pages in the search results.
  * Dead link detection: As a by-product, all dead links found (links to non-existing documents) are written out.
  * [[:features:Breakpoint]]s: The crawler creates periodic breakpoints. When doing so, the current state of the search index is copied into a separate directory. If the index update is cancelled (e.g. if the computer is shut down), the crawler will resume from the last breakpoint the next time it is started.
  * [[:features:Auxiliary fields]]: The index may also be extended by auxiliary fields that are extracted from a document's URL. For example, assuming that you have a directory containing a sub-directory for each project, you can generate an auxiliary field holding the project name. This allows you to get only documents from individual projects (e.g. only documents from the project directory "otto23" when searching for "Offer project:otto23").
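The auxiliary-field idea above can be sketched as a pattern match on the document URL. This sketch assumes a hypothetical directory layout where each project lives below a ''projects/'' directory; the pattern, field name, and sample URL are illustrative, not taken from regain's configuration:

```python
import re

# Hypothetical layout: each project lives in a directory below "projects/".
PROJECT_PATTERN = re.compile(r"/projects/([^/]+)/")

def auxiliary_fields(url):
    """Derive extra index fields from a document's URL."""
    match = PROJECT_PATTERN.search(url)
    return {"project": match.group(1)} if match else {}

print(auxiliary_fields("file://server/projects/otto23/offer.doc"))
# → {'project': 'otto23'}
```

A query such as "Offer project:otto23" can then restrict hits to that one project directory.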
  
  * [[:components:Preparator]]s: The preparation of a certain file format is done by so-called preparators. Thus you are able to specify which preparators regain should use. In addition, regain may easily be extended to support more file formats.
  * [[:components:the search mask jsp pages|Tag Library for the search]]: regain offers a tag library for creating the Java Server Page for the search. This makes adapting the search page to your web site's design particularly easy.
  * [[:config|Configuration]]: regain is highly adaptable. The whole configuration of the crawler is in one XML file.
  * [[:features:Access rights management]]: It is possible to integrate an access rights management that ensures a user only sees results for documents he has reading rights for.
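The access-rights integration above amounts to filtering hits by the reader's rights before they are displayed. A minimal sketch, assuming each indexed document carries the set of groups allowed to read it — the data layout and names are hypothetical, not regain's actual API:

```python
def visible_results(hits, user_groups):
    """Drop hits the user has no reading rights for; each hit carries the
    set of groups with read access (hypothetical data layout)."""
    return [hit for hit in hits if hit["allowed_groups"] & user_groups]

hits = [
    {"url": "offer.doc",  "allowed_groups": {"sales"}},
    {"url": "salary.xls", "allowed_groups": {"hr"}},
]
print([h["url"] for h in visible_results(hits, {"sales"})])  # → ['offer.doc']
```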
  
features.1236003754.txt.gz · Last modified: 2024/09/18 08:30 (external edit)