Regular expressions

regain is configured using regular expressions. Regular expressions (regex for short) are very powerful wildcards that are well suited to describing string patterns such as URLs.

If you are not familiar with this technique, you will find a very detailed description here. If you want a more compact introduction, just search the web for "regex"; there are plenty of them.

Note: regain uses the regex dialect of Java, which closely follows that of Perl.
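
To illustrate how a regex describes a set of URLs, here is a minimal sketch. It assumes the standard java.util.regex package; the class name UrlPatternDemo, the method isDocsPage and the example.com URLs are hypothetical and not taken from regain itself.

```java
import java.util.regex.Pattern;

class UrlPatternDemo {
    // Hypothetical pattern: every URL below http://www.example.com/docs/
    // that ends in ".html". The dots are escaped because an unescaped dot
    // matches any character.
    static final Pattern DOCS_HTML =
        Pattern.compile("^http://www\\.example\\.com/docs/.*\\.html$");

    // Returns true if the whole URL matches the pattern.
    static boolean isDocsPage(String url) {
        return DOCS_HTML.matcher(url).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDocsPage("http://www.example.com/docs/intro.html")); // true
        System.out.println(isDocsPage("http://www.example.com/images/logo.png")); // false
    }
}
```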

Important: In the XML configuration files CrawlerConfiguration.xml and SearchConfiguration.xml, all XML special characters such as &, < or > must be replaced by the corresponding entities (&amp;, &lt; or &gt;)!

Example: The regex <a[^>]*>The&nbsp;link</a> must be specified as &lt;a[^&gt;]*&gt;The&amp;nbsp;link&lt;/a&gt;. (This example is extreme, of course…)
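
The escaping can also be done mechanically. The following is a minimal sketch, not part of regain: the class RegexXmlEscaper and the method escapeForXml are hypothetical helpers that convert a regex into the form expected inside an XML file.

```java
class RegexXmlEscaper {
    // Replaces the XML special characters &, < and > by their entities.
    // The ampersand must be handled first; otherwise the ampersands of the
    // freshly inserted entities would be escaped a second time.
    static String escapeForXml(String regex) {
        return regex.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints: &lt;a[^&gt;]*&gt;The&amp;nbsp;link&lt;/a&gt;
        System.out.println(escapeForXml("<a[^>]*>The&nbsp;link</a>"));
    }
}
```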

Regex groups

Parts of a regular expression may be combined into groups. A group is delimited by a left and a right parenthesis. Each group has a unique number that identifies it.

Groups are numbered by their left parentheses: the whole regex has number 0, the group with the first left parenthesis in the string has number 1, the group with the second left parenthesis has number 2, and so on.

Example:

a(b(a(b|c)a)a(b|e)*)c   - Number 0
 (b(a(b|c)a)a(b|e)*)    - Number 1
   (a(b|c)a)            - Number 2
     (b|c)              - Number 3
             (b|e)      - Number 4
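
The numbering can be checked with a short program. This is a sketch that assumes the standard java.util.regex package; the class GroupNumberingDemo, the method groups and the input string ababaabec are chosen for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns the text captured by every group (index 0 = whole match),
    // or an empty array if the input does not match the regex as a whole.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("a(b(a(b|c)a)a(b|e)*)c", "ababaabec");
        for (int i = 0; i < g.length; i++) {
            System.out.println("Number " + i + ": " + g[i]);
        }
    }
}
```

For this input, group 0 is ababaabec, group 1 is babaabe, group 2 is aba and group 3 is b. Group 4 sits inside a repetition, so it holds only the text of its last iteration, here e.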
config/regular_expression.txt · Last modified: 2014/10/29 10:22 (external edit)