Implement syllable-matching logic for Abugida scripts (Thai, Lao, Khmer, Burmese, etc.) #253
Description
Abugida scripts are a group of phonemic writing systems, including Thai, Lao, Khmer, Burmese, Tibetan and all Indic scripts. These writing systems share most of the properties below:
- Vowel sounds are marked using symbols on all sides of the initial consonant or consonant cluster (e.g. Thai เ (e) is placed before the consonant, and า (a) is placed after the consonant);
- Some vowels are inherent and are not written out;
- Syllable boundaries are not marked in the script. In some cases even word boundaries are not delimited.
- Certain letters are shared by different phonemes (e.g. Lao ວ can be either the vowel sound oua or the consonant v, depending on context).
There can also be other language-specific irregularities, e.g. a single letter shared by two syllables, or vowel sounds that require multiple glyphs to represent.
Due to the nature of these scripts, syllable boundaries can be ambiguous and require lexical information for disambiguation. In some cases there are multiple ways to segment a string into syllables, all of which are valid pronunciations according to the spelling rules. Example: (Thai) เปลา has four letters, in writing order e, p, l, a. When e and a appear in the same syllable, the two glyphs together represent the rhyme ao. So the same sequence of letters can be read as pe-la or plao according to Thai spelling rules, but plao is the only correct pronunciation in this case. Another word with a similar structure, เพลา (e, ph, l, a), can be either phe-la or phlao.
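To make the ambiguity concrete, here is a minimal Python sketch that enumerates every reading of the romanized letter sequence under a toy rule set. The rule set is an assumption made for illustration (one pre-vowel `e`, one post-vowel `a`, three consonants, two onset clusters) and is nowhere near complete Thai orthography; the names `onsets_at`, `syllables_at` and `segmentations` are hypothetical and not part of Interscript.

```python
# Toy inventory (assumed for illustration only).
PREVOWEL = "e"                        # written before the onset (Thai เ)
POSTVOWEL = "a"                       # written after the onset (Thai า)
CONSONANTS = {"p", "ph", "l"}
CLUSTERS = {("p", "l"), ("ph", "l")}  # onset clusters this sketch permits


def onsets_at(letters, i):
    """Yield (onset, next_index) for each onset that can start at index i."""
    if i < len(letters) and letters[i] in CONSONANTS:
        yield letters[i], i + 1
        if i + 1 < len(letters) and (letters[i], letters[i + 1]) in CLUSTERS:
            yield letters[i] + letters[i + 1], i + 2


def syllables_at(letters, i):
    """Yield (romanized_syllable, next_index) for each syllable starting at i."""
    if i < len(letters) and letters[i] == PREVOWEL:
        for onset, j in onsets_at(letters, i + 1):
            yield onset + "e", j                   # e + onset     -> "Ce"
            if j < len(letters) and letters[j] == POSTVOWEL:
                yield onset + "ao", j + 1          # e + onset + a -> "Cao"
    else:
        for onset, j in onsets_at(letters, i):
            if j < len(letters) and letters[j] == POSTVOWEL:
                yield onset + "a", j + 1           # onset + a     -> "Ca"


def segmentations(letters, i=0):
    """Return every way to split letters[i:] into syllables (may be empty)."""
    if i == len(letters):
        return [[]]
    results = []
    for syllable, j in syllables_at(letters, i):
        for rest in segmentations(letters, j):
            results.append([syllable] + rest)
    return results


print(segmentations(["e", "p", "l", "a"]))   # [['pe', 'la'], ['plao']]
print(segmentations(["e", "ph", "l", "a"]))  # [['phe', 'la'], ['phlao']]
```

Both readings survive the spelling rules, so the rules alone cannot choose between them; that choice has to come from a lexicon or from frequency data.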
In most cases, syllable segmentation cannot be easily implemented by substitution alone. At the very least, a dictionary lookup will be needed for transcription, and words not found in the dictionary need to be parsed into syllables using more sophisticated (either rule-based or probabilistic) matching logic.
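The lookup-then-fallback flow described here could be sketched in Python roughly as follows. This is only an illustration of the control flow, not Interscript's actual API: `transcribe`, `LEXICON` and `placeholder_parser` are hypothetical names, the lexicon holds just the one reading given above, and the fallback is a stub standing in for a real rule-based or probabilistic parser.

```python
from typing import Callable, Dict, List


def transcribe(word: str,
               lexicon: Dict[str, List[str]],
               fallback: Callable[[str], List[str]]) -> List[str]:
    """Return the syllables of `word`, preferring the lexicon over the parser."""
    entry = lexicon.get(word)
    if entry is not None:
        return entry           # known word: the dictionary decides
    return fallback(word)      # unknown word: hand over to the parser


# Hypothetical one-entry lexicon; the reading is the one stated in this issue.
LEXICON = {"เปลา": ["plao"]}


def placeholder_parser(word: str) -> List[str]:
    # Stub for a real parser: returns the word as a single unparsed unit
    # so that the fallback path is visible in the output.
    return [word]


print(transcribe("เปลา", LEXICON, placeholder_parser))  # ['plao']
print(transcribe("เพลา", LEXICON, placeholder_parser))  # ['เพลา']
```

Passing the fallback in as a function keeps the dictionary step independent of whichever matching logic is eventually chosen.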
At least these features will be needed in order to achieve higher accuracy:
- Dictionary lookup (has been implemented for Korean, see #240)
- Frequency lookup (check syllable frequency or bi-gram / tri-gram frequency table)
- Control structure (if..then..else)
- Variables
- A built-in way to mark syllabic boundaries
These features have been discussed in fuller detail in #202 (moved to interscript/lcs#2 back at #718).
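As a rough illustration of how a frequency lookup and a boundary marker might work together, the Python sketch below ranks candidate segmentations by summed log relative frequency of their syllables (with add-one smoothing) and joins the winner with a separator. Everything here is an assumption: the function names are hypothetical, the counts are invented purely for the demonstration, and a real system would more likely use bi-gram or tri-gram tables.

```python
import math
from typing import Dict, List


def score(syllables: List[str], counts: Dict[str, int]) -> float:
    """Sum of log relative syllable frequencies, with add-one smoothing."""
    total = sum(counts.values())
    denominator = total + len(counts) + 1   # +1 leaves room for unseen syllables
    return sum(math.log((counts.get(s, 0) + 1) / denominator) for s in syllables)


def best_segmentation(candidates: List[List[str]],
                      counts: Dict[str, int]) -> List[str]:
    """Pick the candidate segmentation with the highest score."""
    return max(candidates, key=lambda c: score(c, counts))


def mark_boundaries(syllables: List[str], sep: str = ".") -> str:
    """Join syllables with an explicit boundary marker."""
    return sep.join(syllables)


# Invented counts, for demonstration only; not taken from any corpus.
COUNTS = {"plao": 50, "pe": 5, "la": 40}
CANDIDATES = [["pe", "la"], ["plao"]]

print(best_segmentation(CANDIDATES, COUNTS))          # ['plao']
print(mark_boundaries(["phe", "la"]))                 # phe.la
```

Note that summing log-probabilities favours segmentations with fewer syllables, so a practical scorer would probably need length normalization or n-gram context.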