Create or Edit Segmentation Rules

Created on 24th May 2024 at 10:47 by Jamie O'Connell

Segmentation Rules are sets of language-specific regular expressions (regex) that define how to segment the text of a document (e.g. into sentences, paragraphs, etc.).

A segmentation rule consists of a before pattern and an after pattern, as well as an optional break flag.

If the break flag is set, the text will be split wherever the before pattern is followed by the after pattern.
If the break flag is not set, the rule is an exception to the above, so the text will not be split where these particular patterns match.

Exceptions must always be defined higher in the list than the breaks. This is because Segmentation Rules are parsed sequentially: as soon as the patterns in a rule are matched, the system moves on to the next character in the text and starts again from the top of the rule list. A break rule placed above an exception would therefore be matched first, and the exception would never be reached.

For example:

Define a rule to break on a full stop (period) followed by whitespace and an upper-case letter.
Then define an exception rule higher in the list for the pattern Mr. followed by whitespace and an upper-case letter (i.e. for names).
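
The ordering behaviour described above can be sketched in Python. This is a minimal illustration only, not one2edit™'s actual implementation: the Rule class, the segment function and the sample sentence are hypothetical, and Python's regex dialect is assumed to be close enough to the one used by one2edit™ for these simple patterns.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    before: str      # regex for the text immediately before the candidate position
    after: str       # regex for the text immediately after the candidate position
    is_break: bool   # True = break rule, False = exception


def segment(text, rules):
    """Split text using an ordered rule list; the first matching rule wins."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for rule in rules:
            # The before pattern must end exactly at pos; the after pattern must start there.
            before_ok = re.search(f"(?:{rule.before})\\Z", text[:pos])
            after_ok = re.match(rule.after, text[pos:])
            if before_ok and after_ok:
                if rule.is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # stop at the first match and move on to the next position
    segments.append(text[start:])
    return segments


# Hypothetical rules mirroring the example: an exception for "Mr." and a full-stop break.
exception = Rule(before=r"Mr\.", after=r"\s[A-Z]", is_break=False)
full_stop = Rule(before=r"\.", after=r"\s[A-Z]", is_break=True)

text = "I met Mr. Smith today. He was late."
correct_order = segment(text, [exception, full_stop])
wrong_order = segment(text, [full_stop, exception])
```

With the exception listed first, correct_order contains two segments and "Mr. Smith" stays intact. With the order reversed, wrong_order contains three segments, because the break rule matches after "Mr." before the exception is ever consulted. In this sketch the whitespace after a break stays at the start of the following segment.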

NOTE: This lesson can be used to alter the default one2edit™ Segmentation Rules, or ones that have been imported (via SRX) from an external source. It is not necessary to define your rules from scratch.

NOTE: Regular expressions are a standard way to match strings of text. You can find out more about them online.

Create new Segmentation Rules set

To create a new set of segmentation rules, click the + (plus) button in the Segmentation Rules dialog.
Alternatively, you can edit an existing set of segmentation rules by clicking Properties in its option menu, or double-clicking the rule set.

Name it and add a new language rule

Name the set.
Click the + (plus) button to add a new language rule. The new language will have a placeholder name and pattern, as shown.

Define the language

When applying segmentation rules to a document in one2edit™, you are asked to choose a language. This can be used to determine which rules are used to segment the text in that document (i.e. for language-specific segmentation).

For this reason, the new language can be given both a name and a Language Pattern.

In the example above, we have given the language the name English.
In the Language Pattern field, we have entered EN.*. These rules will be applied to any language code that starts with the letters EN (or en – it is case-insensitive), because the .* is a wildcard. For example, both en_GB and en_US will fall under this umbrella.
You may also click the Default button to import the default one2edit™ Segmentation Rules as a starting point.

NOTE: Rules under a Language Pattern of .* will always be applied, no matter which language has been selected.

NOTE: The language codes (i.e. patterns) are defined by ISO 639.
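
The Language Pattern matching described above can be approximated as follows. This is a sketch under two assumptions that are not confirmed by one2edit™ documentation: that the pattern must match the whole language code, and that Python's re module behaves like the regex engine one2edit™ uses. The function name pattern_applies is hypothetical.

```python
import re


def pattern_applies(language_pattern, language_code):
    """Return True if the rules under language_pattern would apply to language_code.

    Assumes a case-insensitive match against the whole language code.
    """
    return re.fullmatch(language_pattern, language_code, flags=re.IGNORECASE) is not None


english_gb = pattern_applies("EN.*", "en_GB")   # EN.* covers any code starting with EN/en
english_us = pattern_applies("EN.*", "en_US")
german = pattern_applies("EN.*", "de_DE")       # a German code does not start with EN
catch_all = pattern_applies(".*", "fr_FR")      # .* matches every language code
```

Here english_gb, english_us and catch_all are True, while german is False.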

Enter Patterns (Regular Expressions)

The lower section of the dialog contains the segmentation rules, as defined through regular-expression patterns.

Double-click on a pattern to edit it. Update the Before Pattern and After Pattern to reflect your segmentation break or exception.
Check the Break box if a segmentation break should occur between the patterns. Leave it blank for an exception.

A translation segment will typically be a sentence of text. In many languages, sentences are defined by a dot (full stop, period) followed by whitespace and then an upper-case character. This pattern should be set up as a segment break.

However, these same languages often also use the dot at the end of an abbreviation, and that abbreviation may itself be followed by whitespace and then an upper-case character. The patterns for these abbreviations should be set up as exceptions that do not cause a segment break.

The example above defines:

Before Pattern of Mt. (the abbreviation of Mount)
After Pattern of whitespace followed by an upper-case character in the range A-Z
Break is not checked, because this will be an exception and not the end of a sentence

This means that, if the text contains the words Mt. Everest, the segmentation rules will not insert a segment break at that point.
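
The Mt. Everest check can be tried out in Python before entering the patterns in the dialog. This is only a sketch: the sample sentence is made up, the exact patterns in the screenshot are not reproduced here, and Python's regex dialect is assumed to match one2edit™'s for these simple cases. It also shows why the dot in an abbreviation is normally escaped: in a regular expression an unescaped dot matches any character.

```python
import re

text = "They climbed Mt. Everest."
pos = text.index("Mt.") + len("Mt.")   # candidate break position, just after the dot

before = re.compile(r"Mt\.\Z")   # escaped dot: a literal full stop, ending at the position
after = re.compile(r"\s[A-Z]")   # whitespace followed by an upper-case letter A-Z

# Both patterns match around the position, so the exception rule applies here.
exception_matches = bool(before.search(text[:pos])) and bool(after.match(text, pos))

# An unescaped dot is a wildcard, so it would also accept other characters after "Mt".
loose = re.compile(r"Mt.\Z")
loose_hits_mtn = bool(loose.search("Mtn"))
strict_hits_mtn = bool(before.search("Mtn"))
```

Here exception_matches and loose_hits_mtn are True and strict_hits_mtn is False, so the escaped form is the safer choice when only the literal abbreviation should be treated as an exception.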

NOTE: The rules are parsed from top to bottom, so you must place all of your exception patterns at the top of your list.

Repeat for all Rules and Languages

Repeat the above steps to add all required languages and their rules to your set.

Click the + (plus) buttons to add new languages and rules.
Edit the Language Pattern for each language.
Edit the patterns for the exceptions and breaks.

Building patterns using regular expressions may take some time, but it is a very powerful tool.

In the above example, a segmentation break will occur if one or more dots, question marks, or exclamation marks are followed by whitespace and then an upper-case character or digit.

However, if the text before the dot is Mt or etc, then the exception pattern is matched, and no segmentation break is inserted.

Remember, the exceptions must be placed higher up the list than the breaks, so that they are matched first.

NOTE: The Default button will import the default one2edit™ rules, giving you a very good starting point.

NOTE: Multiple SRX (Segmentation Rules eXchange) files are available online.
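
For orientation, an SRX file stores the same information as the dialog: language rules containing before/after patterns with a break flag, plus a map from language patterns to rule names. The fragment below is a hand-written illustration based on the SRX 2.0 element names, not a file exported from one2edit™, and it has not been validated against the SRX schema. The patterns are hypothetical equivalents of the rules described above, and the load_rules helper is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only (assumed SRX 2.0 layout).
SRX_SAMPLE = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\b(Mt|etc)\.</beforebreak>
          <afterbreak>\s[A-Z0-9]</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]+</beforebreak>
          <afterbreak>\s[A-Z0-9]</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="EN.*" languagerulename="English"/>
    </maprules>
  </body>
</srx>"""

NS = {"srx": "http://www.lisa.org/srx20"}


def load_rules(root):
    """Return {rule set name: [(is_break, before, after), ...]} in document order."""
    result = {}
    for lang in root.findall(".//srx:languagerule", NS):
        result[lang.get("languagerulename")] = [
            (
                rule.get("break", "yes") == "yes",   # assumed default: break
                rule.findtext("srx:beforebreak", default="", namespaces=NS),
                rule.findtext("srx:afterbreak", default="", namespaces=NS),
            )
            for rule in lang.findall("srx:rule", NS)
        ]
    return result


root = ET.fromstring(SRX_SAMPLE)
rules = load_rules(root)
language_maps = [
    (m.get("languagepattern"), m.get("languagerulename"))
    for m in root.findall(".//srx:languagemap", NS)
]
```

In this sample the exception (break="no") is listed before the break rule, matching the top-to-bottom ordering required in the dialog.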

Save Changes

When you have finished creating your set of segmentation rules, click Save.

New segmentation rules created

You will now see your set of Segmentation Rules in the dialog.
