Creating or editing Segmentation Rules

Created on 28th November 2016 at 18:02 by Jamie O'Connell

Segmentation rules are sets of regular expressions (regex) that can be applied to specific languages.

A segmentation rule consists of 'before' and 'after' patterns, as well as a possible 'break' command.

If the 'break' command is set, a segmentation break is inserted wherever the 'before' text pattern is immediately followed by the 'after' text pattern.
If the 'break' command is not set, the rule acts as an exception: no segmentation break is inserted where its patterns match.

For the exceptions to take effect, they must occur higher in the list than the breaks. This is because the rules are evaluated sequentially from the top: as soon as a rule matches, the system moves on to the next characters and starts again from the top, so any rules lower in the list are not considered for that position.

For example, to segment text into sentences, we want a segmentation break to occur after a full-stop/period. Such patterns are marked with a break and appear lower in the list. However, we do not want a segmentation break to occur after an abbreviation, which also ends with a full-stop/period. Such patterns are marked without a break and appear higher in the list.
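The evaluation order described above can be sketched in a few lines of Python. This is only an illustration of the "first matching rule decides" idea, not the one2edit™ implementation: the function name `segment`, the rule tuples, and the abbreviation list are assumptions made for the example.

```python
import re

# Each rule is (before_pattern, after_pattern, is_break).
# Illustrative patterns only, not the one2edit defaults.
RULES = [
    (r"\b(?:Dr|Mr|Mrs|etc)\.", r"\s", False),  # exception: abbreviation, no break
    (r"\.", r"\s", True),                      # break: full stop followed by whitespace
]

def segment(text, rules):
    """Split text wherever the first matching rule is a break rule."""
    compiled = [(re.compile(f"(?:{before})\\Z"), re.compile(after), is_break)
                for before, after, is_break in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for before, after, is_break in compiled:
            # 'before' must match the text ending at pos, 'after' the text starting at pos.
            if before.search(text[:pos]) and after.match(text[pos:]):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule decides; lower rules are skipped
    segments.append(text[start:].strip())
    return segments

print(segment("Dr. Smith arrived. He sat down.", RULES))
# ['Dr. Smith arrived.', 'He sat down.']
```

Without the exception rule at the top, the same text would also be split after 'Dr.'.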

NOTE:
This lesson can be used to alter the default Segmentation Rules, or ones that have been imported (via SRX) from an external source. It is not necessary to start from scratch completely.

NOTE:
Regular expressions are a standard way to 'match' strings of text.

You can find out more about them online.

Create new Segmentation Rule set

To create a new set of segmentation rules, click the 'New' button in the 'Segmentation Rules' window.
Alternatively, select an existing set of segmentation rules and click the 'Edit' button.

Name it and add a language

Name the set.
Click the green 'plus' symbol to add a new language. The new language will have a placeholder name and pattern, as shown.

Specify a language

When applying segmentation rules to a document in one2edit™, you are asked to choose a language. This will determine which rules are used to segment the text in that document.

For this reason, the new language needs both a name and a language pattern, so that the correct rules are used for the language that is selected.

In the example above, we have given the language the name 'English'. This field is a free-text field and you can type in whatever you want in order to identify these rules.
In the 'Language Pattern' field, we have entered 'EN.*'. This is itself a regular expression: '.*' matches any sequence of characters, so the rules under this language will be applied to any language whose code starts with the letters 'EN' (or 'en', as the match is case-insensitive). For example, both 'en_UK' and 'en_US' will fall under this umbrella.
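To check a language pattern before entering it, you can try it against a few language codes. The sketch below assumes that the pattern is compared against the whole code and ignores case, as described above; the helper name `rules_apply` is hypothetical and not part of one2edit™.

```python
import re

def rules_apply(language_pattern, language_code):
    """Return True if the language pattern matches the whole language code."""
    # Assumption: matching is case-insensitive and covers the full code.
    return re.fullmatch(language_pattern, language_code, flags=re.IGNORECASE) is not None

for code in ["en_UK", "en_US", "EN", "de_DE"]:
    print(code, rules_apply("EN.*", code))
# en_UK True
# en_US True
# EN True
# de_DE False
```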

NOTE:
The language codes are defined by ISO 639. The language pattern is a regular expression that is matched against these codes.

Add a rule

Click the green 'plus' button in the lower area to add a rule for this language.

The new rule will have some placeholder regular expression patterns, as shown.

Enter your regular expressions

Next, change the placeholder patterns into the expressions you need for your rule set. Simply click on a placeholder pattern to edit it.

Example:
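In the 'Before Pattern', we are defining the abbreviation 'St.', which can mean 'Saint' or 'Street'. In a regular expression the dot is a special character that matches any single character, so it needs to be 'escaped' using the backslash character ('St\.') to match a literal full-stop/period.
In the 'After Pattern', we are defining a whitespace character (commonly written '\s' in regular expressions).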

NOTE:
We do not check the 'Break' box in this case, because we are defining an exception for this particular abbreviation.
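The effect of escaping the dot can be seen in a short Python check. The patterns 'St\.' and '\s' are written here as they would typically appear in a regular expression; the exact text shown in the one2edit™ fields may differ.

```python
import re

unescaped = re.compile(r"St.")   # '.' matches any single character
escaped = re.compile(r"St\.")    # '\.' matches a literal full stop only

print(bool(unescaped.search("Stop here")))  # True: 'Sto' matches by accident
print(bool(escaped.search("Stop here")))    # False: no literal 'St.' present
print(bool(escaped.search("St. Ives")))     # True: the abbreviation is found
print(bool(re.match(r"\s", " Ives")))       # True: whitespace follows the abbreviation
```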

Repeat to enter all required rules

Repeat the above steps to add more rules to your set.

In the above example, a segmentation break WILL occur wherever one or more full-stops/periods, question marks, or exclamation marks are followed by a whitespace character.

However, if the text before the whitespace is the abbreviation 'St.', then no segmentation break is inserted. This is because the 'exception' occurs higher up the list than the 'break', meaning that when it matches, the system will jump to the next character and start from the top again.
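The importance of the list order can be checked for a single position in the text. In the sketch below, the patterns 'St\.', '[.?!]+' and '\s' are reconstructed from the description above and are assumptions, as is the helper name `breaks_at`.

```python
import re

# (before_pattern, after_pattern, is_break), reconstructed from the description.
EXCEPTION = (r"St\.", r"\s", False)
BREAK = (r"[.?!]+", r"\s", True)

def breaks_at(text, pos, rules):
    """Return True if the first rule matching at pos is a break rule."""
    for before, after, is_break in rules:
        if re.search(f"(?:{before})\\Z", text[:pos]) and re.match(after, text[pos:]):
            return is_break
    return False  # no rule matched, so no break

text = "Visit St. Ives today! Then go home."
after_st = text.index("St.") + 3      # position just after 'St.'
after_bang = text.index("!") + 1      # position just after '!'

print(breaks_at(text, after_st, [EXCEPTION, BREAK]))    # False: exception wins
print(breaks_at(text, after_st, [BREAK, EXCEPTION]))    # True: break rule shadows the exception
print(breaks_at(text, after_bang, [EXCEPTION, BREAK]))  # True: normal sentence break
```

With the exception listed below the break rule, it is never reached for 'St.', which is why exceptions belong at the top of the list.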

Building patterns using regular expressions may take some time, but it is a very powerful tool.

Click 'Save'

When you have finished creating your set of segmentation rules, click the 'Save' button.

New segmentation rules created

You will now see your set of segmentation rules in the list.
