Implement syllable-matching logic for Abugida scripts (Thai, Lao, Khmer, Burmese, etc.)

Abugida scripts are a group of phonemic writing systems, including Thai, Lao, Khmer, Burmese, Tibetan and all Indic scripts. These writing systems share most of the properties below:

* Vowel sounds are marked using symbols placed on any side of the initial consonant or consonant cluster (e.g. Thai `เ` (e) is placed before the consonant, and `า` (a) is placed after the consonant);
* Some vowels are _inherent_ and are not written out;
* Syllable boundaries are not marked in the script. In some cases even word boundaries are not delimited;
* Certain letters are shared by different phonemes (e.g. Lao `ວ` can be either the vowel sound `oua` or the consonant `v` depending on context).

There may also be other language-specific irregularities, e.g. a single letter shared by two syllables, or vowel sounds that require multiple glyphs to represent.

Due to the nature of these scripts, syllable boundaries can be ambiguous and require lexical information for disambiguation. There are cases where a string can be segmented into syllables in multiple ways, all of which are valid pronunciations according to the spelling rules. Example: (Thai) `เปลา` has four letters, in writing order `e`, `p`, `l`, `a`. When `e` and `a` appear in the same syllable, the two glyphs together represent the rhyme `ao`. So the same sequence of letters can be read as `pe-la` or `plao` according to Thai spelling rules, but `plao` is the only correct pronunciation in this case. Another word with a similar structure, `เพลา` (`e`, `ph`, `l`, `a`), can be either `phe-la` or `phlao`.
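To make the ambiguity concrete, here is a minimal Python sketch that enumerates both readings for the two example words. The romanization table, the cluster list and the function name `candidate_readings` are illustrative assumptions, not part of any existing map, and the parser only handles the four-letter pattern discussed above.

```python
# Toy romanization table: only the consonants needed for the example (assumption).
CONSONANTS = {"ป": "p", "พ": "ph", "ล": "l"}
# Consonant pairs treated as valid onset clusters in this toy model (assumption).
CLUSTERS = {("ป", "ล"), ("พ", "ล")}


def candidate_readings(word):
    """Return every reading of a four-letter เ-C1-C2-า string allowed by the toy rules."""
    if len(word) != 4 or word[0] != "เ" or word[3] != "า":
        raise ValueError("toy parser only handles the pattern เ + C1 + C2 + า")
    c1, c2 = word[1], word[2]
    readings = []
    # Reading 1: เ belongs to C1 and า belongs to C2, giving two syllables.
    readings.append(f"{CONSONANTS[c1]}e-{CONSONANTS[c2]}a")
    # Reading 2: C1 + C2 form one onset cluster, and เ...า together spell "ao".
    if (c1, c2) in CLUSTERS:
        readings.append(f"{CONSONANTS[c1]}{CONSONANTS[c2]}ao")
    return readings


print(candidate_readings("เปลา"))
print(candidate_readings("เพลา"))
```

Both candidates satisfy the spelling rules, so the rules alone cannot decide which one is right; that decision has to come from lexical information.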

In most cases, syllable segmentation cannot be easily implemented by simple substitution rules. At the very least, a dictionary lookup will be needed for transcription, and words not found in the dictionary need to be parsed into syllables using more sophisticated (either rule-based or probabilistic) matching logic.
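One possible shape for "dictionary first, parser second" is a greedy longest-match pass over the unspaced text, handing any run that the dictionary does not cover to a fallback. The sketch below is in Python for illustration only; `transcribe`, the `lexicon` mapping and the `fallback` callable are hypothetical names, and longest-match is just one common segmentation strategy, not something this issue prescribes.

```python
def transcribe(text, lexicon, fallback):
    """Greedy longest-match lookup; uncovered runs are passed to `fallback`."""
    out = []
    max_len = max(map(len, lexicon), default=0)
    i = 0
    unknown_start = None  # start index of the current run with no dictionary match
    while i < len(text):
        match_end = None
        # Try the longest dictionary entry starting at position i first.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match_end = j
                break
        if match_end is None:
            # No entry starts here: extend the unknown run by one character.
            if unknown_start is None:
                unknown_start = i
            i += 1
            continue
        if unknown_start is not None:
            # Flush the pending unknown run through the fallback parser.
            out.append(fallback(text[unknown_start:i]))
            unknown_start = None
        out.append(lexicon[text[i:match_end]])
        i = match_end
    if unknown_start is not None:
        out.append(fallback(text[unknown_start:]))
    return out


# Hypothetical one-entry lexicon; the fallback here only flags the unknown run.
lexicon = {"เปลา": "plao"}
print(transcribe("เปลาxyz", lexicon, lambda run: f"?{run}?"))
```

In a real implementation the fallback would be the rule-based or probabilistic syllable parser rather than a placeholder.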

At least these features will be needed in order to achieve higher accuracy:
* Dictionary lookup (has been implemented for Korean, see #240)
* Frequency lookup (check syllable frequency or bi-gram / tri-gram frequency table)
* Control structure (if..then..else)
* Variables
* A built-in way to mark syllabic boundaries

These features have been discussed in fuller detail in ~~#202~~ (~~moved to interscript/lcs#2~~ back at #718).
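As a rough illustration of how the frequency lookup and the syllable-boundary marker from the list above could work together, the Python sketch below scores candidate segmentations against a syllable frequency table and joins the winner with an explicit boundary character. The function names, the `-` marker and all counts are invented for the example; the scoring is a plain add-one-smoothed unigram model, chosen for brevity rather than accuracy.

```python
import math


def score(syllables, counts):
    """Sum of add-one-smoothed log probabilities of the syllables."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen syllables
    return sum(
        math.log((counts.get(s, 0) + 1) / (total + vocab)) for s in syllables
    )


def best_segmentation(candidates, counts, boundary="-"):
    """Pick the highest-scoring candidate and mark its syllable boundaries."""
    best = max(candidates, key=lambda syls: score(syls, counts))
    return boundary.join(best)


# Invented counts, for illustration only.
counts = {"plao": 50, "pe": 5, "la": 20}
print(best_segmentation([["pe", "la"], ["plao"]], counts))
```

A unigram product tends to favour segmentations with fewer syllables, which is one reason a bi-gram or tri-gram table may be preferable in practice.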

Implement syllable-matching logic for Abugida scripts (Thai, Lao, Khmer, Burmese, etc.) #253