Language Processing Modules

Encoding

  • Descender-less YO YING and THO THAN in Pali-Sanskrit text.
    • OpenType: use REQ (required) rule for Pali and Sanskrit
      • Nothing to do with encoding, but needs language markup
      • What about plain text?
    • Encoding schemes
      • New code points for descender-less YO YING and THO THAN?
      • ZWJ (U+200D) as descender remover?
        • Unicode 5.1: 16.2 - Special Areas and Format Characters - Layout Controls
        • Abuse of cursive/ligature? Is cursive form in use for Thai?
      • Variation selector (U+FE00 - U+FE0F)?
        • Unicode 5.1: 16.4 - Special Areas and Format Characters - Variation Selectors

Explicit Breaks

Define precisely which kind of explicit break to use, and when, in Thai-language text (or, where applicable, text in other languages that use the Thai script).

"Explicit breaks" are Zero-Width Space, Non-breaking Space, …

Input Method (Editing)

Update WTT 2.0 Input/Output Method

  • SARA AM should no longer be allowed without base consonant.
    • Rationale: SARA AM is usually rendered decomposed in modern rendering engines.
    • Add new class 'AM'
  • NIKHAHIT should allow composition with tone mark.
    • Example: In old writing "ลํ๋ตายหายกว่า"; Lao "ຂໍ້​ມູນ"
    • Add new class 'AD4'
  • Leading vowels should allow composition with PHINTHU.
    • Rationale: For Kuy and So languages written with Thai script, such as โฺทร (So language)
    • Fix the table
  • MAITAIKHU should be able to function as upper vowel.
    • Rationale: For Bru language written with Thai script, such as แต็่ง
  • MAITRI should be able to function as upper vowel.
    • Rationale: For Bru language written with Thai script, such as โจ๊่
  • NIKHAHIT should be able to follow long-sound vowel as well.
    • Rationale: For minority language in Northern Khmer written with Thai script, such as มูํย
  • Add a class for below-base Lao consonant.
    • Example: ຫຼົງໃຫຼ
    • Add new class 'BCONS'
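As an illustration of the first item above (SARA AM no longer allowed without a base consonant), here is a minimal sketch of such a check. It is not the WTT table itself: the function name is made up, the consonant range U+0E01..U+0E2E is a simplification (it also contains ฤ and ฦ), and allowing one tone mark between the consonant and SARA AM is an assumption.

```python
SARA_AM = "\u0e33"
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"  # MAI EK, MAI THO, MAI TRI, MAI CHATTAWA


def is_thai_consonant(ch):
    # Simplified: the whole block range U+0E01..U+0E2E (assumption, see above).
    return "\u0e01" <= ch <= "\u0e2e"


def sara_am_has_base(text):
    """Return True if every SARA AM follows a consonant, optionally via one tone mark."""
    for i, ch in enumerate(text):
        if ch != SARA_AM:
            continue
        j = i - 1
        if j >= 0 and text[j] in TONE_MARKS:
            j -= 1  # skip the tone mark sitting between consonant and SARA AM
        if j < 0 or not is_thai_consonant(text[j]):
            return False
    return True
```

An input method could reject the keystroke when this returns False; whether to reject or to auto-correct is left open here.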

Extension Proposal

Input Sequence Correction

A few standard keyboard shortcuts?

Especially to control word/line breaking and/or ligatures/forms.

  • Non-breaking space - ctrl+space (from OOo)
  • No-width optional-break - ctrl+/ (from OOo)
  • No-width no break (from OOo)

I use "no-width no break" in OOo very often to control line breaks for words not in the dictionary, but currently there is no shortcut for it. If we implement the functionality in other systems like GNOME, we should use the same key for all the systems.
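For reference, a small sketch of how these three controls look in plain text. The code points are my understanding of what OOo inserts (an assumption, not taken from this page), and the helper name is made up. It only honours explicit breaks; dictionary-based breaks are ignored.

```python
# Assumed code points for the three OOo formatting marks listed above.
NBSP = "\u00a0"  # non-breaking space
ZWSP = "\u200b"  # no-width optional break (ZERO WIDTH SPACE)
WJ = "\u2060"    # no-width no break (WORD JOINER)


def explicit_break_chunks(text):
    """Split text at explicit break opportunities (ZWSP) only.

    WORD JOINER is removed from each chunk because it is invisible and
    only forbids a break; NBSP is kept because it is a visible space.
    """
    return [chunk.replace(WJ, "") for chunk in text.split(ZWSP)]
```

For example, a word followed by WJ and then MAIYAMOK stays in one chunk, while a ZWSP after it starts the next chunk.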

Word Prediction

Research on Thai word prediction has been carried out by NECTEC and TCL, e.g. Intelligent Key Prediction by N-grams and Error-correction Rules (2001).

A more recent Thai word prediction development at NECTEC is called i-key; it is used in the Sansarn search engine and other applications.

Output Method (Display)

  • Display cells could still be clusterized based on input method level 1.
  • Should we change the recommendation for displaying incorrect sequences to prefer dotted circle?
  • Explicitly describe how to display Zero-Width Space - also related to line-breaking behavior

Font Design

OpenType Technology

Text decoration

GUI Interaction

Don't know yet the best name of this section.
This is partially input, partially output, and some other things about GUI behavior related to text.

Highlighting/Selection

  • Selection boundaries should be allowed to fall only on character cluster boundaries

Cut/Paste

  • You can cut/copy part of a cluster from other, non-conforming applications, e.g. ู้ from "กู้", which results in an incorrect sequence in the clipboard
  • When you paste into a WTT 3.0 application, what should it do? Automatic sequence correction?
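One possible policy, sketched below, is to detect a pasted fragment that starts in the middle of a cluster and give it a dotted circle (U+25CC) as a placeholder base. This is only an illustration of the question above, not a recommendation; the function name is made up, and "combining" is approximated by the Unicode general categories Mn and Me.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"


def repair_leading_marks(fragment):
    """Prefix a dotted circle if the fragment starts with a combining mark."""
    if fragment and unicodedata.category(fragment[0]) in ("Mn", "Me"):
        return DOTTED_CIRCLE + fragment
    return fragment
```

Other policies (dropping the orphan marks, or refusing the paste) would use the same detection step.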

Locale

  • Should we keep CLDR (Common Locale Data Repository) in mind?
    • http://unicode.org/cldr/
    • used in ICU, OpenOffice.org, Solaris, and more, with more to come.
    • has a Thai section, chiefly maintained by IBM Thailand's National Language Development and Translation Services Center and Samphan Raruenrom
    • streamlining is possible here: take the latest CLDR as a jump start for WTT 3.0 Locale development, then periodically push each WTT Locale recommendation back to CLDR
  • iso-codes as a resource for country names, language names, monetary units, etc.

Characters for Listing / Enumeration

Define a (recommended) set of characters to be used for listing?

ก. …
ข. …
ค. …

For example, should characters like ซ ฏ ฎ ฃ ฅ ฆ … be allowed?
These characters share one visual property, a *zigzag* at the head, and *that is the only visual property that distinguishes them from their twins* (ช ข ค ม …). In small print it is difficult to notice and to tell them apart.

And how about the obsolete "ฦ", which should always be excluded? All of the above are included in the Thai-letter "Bullet and Numbering" sequence in OpenOffice.org because no reference could be found at the time.

Used in word processors and office suites.

In thailatex and gnome-doc-utils, the skipped characters are ฃ ฅ ฆ ฤ ฦ.
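A minimal sketch of a label generator with a configurable skip set, using the thailatex/gnome-doc-utils skip set above as the default. The function name and the rollover to double characters (กก, กข, …) after the single letters are exhausted are assumptions for illustration, not taken from either package.

```python
# Thai letters KO KAI (U+0E01) .. HO NOKHUK (U+0E2E), in code point order.
THAI_LETTERS = [chr(cp) for cp in range(0x0E01, 0x0E2F)]

# Skip set as in thailatex and gnome-doc-utils:
# KHO KHUAT, KHO KHON, KHO RAKHANG, RU, LU.
SKIPPED = frozenset("\u0e03\u0e05\u0e06\u0e24\u0e26")


def list_labels(count, skipped=SKIPPED):
    """Return the first `count` labels: single letters first, then doubled ones."""
    alphabet = [c for c in THAI_LETTERS if c not in skipped]
    labels = []
    for n in range(1, count + 1):
        label = ""
        while n > 0:
            # Bijective numeration: there is no "zero" letter.
            n, rem = divmod(n - 1, len(alphabet))
            label = alphabet[rem] + label
        labels.append(label)
    return labels
```

Passing a different `skipped` set lets the other criteria discussed on this page be compared side by side.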

In the page numbering of the preamble of the Royal Institute dictionary, 2525 B.E. edition, the skipped characters are ฃ ฅ but not ฆ. (Last page is ด.) However, the 2542 B.E. edition skips none. (Last page is ฒ.)

Recommended Criteria?

  • should only be consonants? - so ฤ and ฦ (vowels) are excluded
  • should be easily distinguished by shape? - zigzag twins are excluded: ข but not ฃ, ค but not ฅ
    • By this criterion, is ต also skipped in favor of ด?
    • Probably, narrowing this criterion to skip only the obsolete ฃ and ฅ should be enough.
  • should have a different read-out? - ค but not ฆ, ถ but not ฐ, ท but not ฒ, ต but not ฏ, ด but not ฎ, ย but not ญ, ช but not ฌ
    • This may apply only to short lists, like answer choices in school tests. Long lists, like page numbers, would quickly exhaust the set.
  • cons: with these limitations, we will have a smaller set of characters and will quickly need double characters like กก กข กค ….
  • Skipping some consonants may matter only for the first ones. Thus ฆ could arguably be skipped, to be consistent with school test answer choices. But for later items in the sequence, people may not care about the difference between individual items. Rather, completeness for countability may become more important as the list grows.
  • Examples of listing currently in use
    • Page number in official documents
    • Item number in the constitution, laws
    • Car plate (car license)
    • School test answer choices

Segmentation

For information and illustration in this area, please advise:

Character segmentation

Character breaks are boundaries of combining character sequences.

An example in Thai is ก็, where a character break should never fall between ก and ็

Character boundary analysis is used in text editors for:

  • left/right arrowing
  • up/down arrowing
  • hit test - the algorithm to calculate the position in a block of text when you single-click the mouse on it
  • text search
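A rough sketch of such a boundary analysis, assuming a cluster is a base character followed by any combining marks (Unicode general category Mn or Me). This is a simplification, not the WTT cell definition: SARA AM (U+0E33) and the leading vowels are category Lo, so they come out as clusters of their own here.

```python
import unicodedata


def clusters(text):
    """Split text into base character + combining mark sequences."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Me"):
            out[-1] += ch  # attach the mark to the preceding cluster
        else:
            out.append(ch)  # start a new cluster
    return out
```

Arrow keys and hit testing would then move between the boundaries of these clusters rather than between code points.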

Word segmentation

Word breaks are useful for word selection (e.g. double click, Ctrl+<arrow>) and word count (for reports, readability tests).

Currently, NECTEC and partners run the BEST (BEnchmarkS for Thai language) project.
The project has a recommendation/guideline on manual Thai word segmentation: http://hlt.nectec.or.th/best/

Do we need different word segmentation specification for Thai full-text search?

Line segmentation (Line wrapping)

Line breaks here mean logically possible line breaks (break opportunities); actual line breaks are usually determined by the display width. Line breaks are used for wrapping text.

AFAIK, one pattern is problematic in the UAX #14 definition. I'm not sure whether UAX #14 breaks "นาง(สาว)" before "(", but for "ดี ๆ" it will break before "ๆ". This should be specified precisely in our standard, including the interaction with no-width optional break and no-width no break.

  • note: this is display-oriented, and not intended for general text processing
  • A line-break iterator (segmenter) may propose more than one break position; it is the job of the page layout engine to decide which position is best.
  • Related standard: UAX #14 - Unicode Line Breaking Properties
  • #text-wrap property, CSS Text Level 3

Hyphenation

In line segmentation, especially for multi-column pages, it is sometimes difficult to find an optimum line break that falls exactly on a word break. In this case, the line break may fall in the middle of a word. This is where hyphenation comes in.

Good hyphenation: represen- (newline) tative

Bad hyphenation: repres- (newline) -entative

Should WTT 3.0 cover hyphenation?

Thai syllable segmentation will of course be useful for this task.

Sentence/Clause segmentation

Note: the reason we don't simply call this "sentence segmentation" is that some linguists are still skeptical about whether Thai has a construct like the "sentence". See Thoughts on Word and Sentence Segmentation in Thai http://pioneer.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf which suggests that "clause" may be the more plausible unit.

Question: is it possible to do this with a rule-based approach? Corpus-based approaches have been explored, e.g. The Automatic Thai Sentence Extraction (2000) and Sentence Break Disambiguation for Thai, but would they be too heavy and impractical? It would of course be a problem to push a corpus-based implementation into any international OSS project. (Mozilla once rejected dictionary-based word segmentation from Samphan.)

One answer: it can possibly be done by rules at a clearer level of text tokens, such as the phrase. The corpus-based approach is in any case more applicable because of its deterministic nature. At present, however, it is mostly domain-dependent due to the limited Thai language resources available for training. It is still an important research topic, not only in engineering but also in linguistics, at several research institutes. It is hence difficult to put into a standard.

Apart from specifications, should WTT provide recommendations or suggestions on things that can't be put into a standard (like this)? That would help developers decide which direction to take.

Sorting/Collation

Searching/Matching/Comparison

Any issues with Thai searching? Does the find dialog box currently work well for you?

One requirement for Thai (or any CTL) searching (from the Unicode specification) is that the matched text must begin and end on cluster boundaries. That is, searching for "ที" will not match "ที่". This is the normal behavior in CTL-enabled word processors like OpenOffice.org or Microsoft Office.

A further issue is whether searching for "กา" should match "เกา".
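The cluster-boundary requirement can be sketched as a filter over plain substring matches. The function name is made up, and "cluster boundary" is approximated as "not immediately before a combining mark (category Mn or Me)". The "กา" vs "เกา" question is not addressed by this sketch.

```python
import unicodedata


def _is_mark(ch):
    return unicodedata.category(ch) in ("Mn", "Me")


def find_on_cluster_boundaries(text, pattern):
    """Return start offsets of matches that begin and end on cluster boundaries."""
    hits = []
    if not pattern:
        return hits
    start = text.find(pattern)
    while start != -1:
        end = start + len(pattern)
        starts_ok = not _is_mark(text[start])
        ends_ok = end == len(text) or not _is_mark(text[end])
        if starts_ok and ends_ok:
            hits.append(start)
        start = text.find(pattern, start + 1)
    return hits
```

With this filter, searching for "ที" in "ที่" finds nothing, because the match would end just before the tone mark.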

Soundex

related to Phonetic mark-up?

Phonetic mark-up supports soundex search well, but soundex does not necessarily require phonetic mark-up if we have automatic phonetic generation. Soundex can be standardized using one of several methods:

  • Phonetic transcription method (see Phonetic mark-up)
  • Romanization method (see Thai transliteration/romanization)

Loose match

related to Normalization

  • Normalize text before matching?
  • Ignore tone marks
    • กูเกิ้ล ~= กูเกิล
  • Ignore the long-vowel/short-vowel distinction
    • วิดีโอ (correct spelling) ~= วีดิโอ (wrong spelling)
  • Ignore doubled vowels (mistyped), e.g.
    • treat SARA I + SARA I as single SARA I
    • กิิน ~= กิน
  • Match frequently misspelled character pairs e.g.
    • LAKKHANGYAO (U+0E45) and SARA AA (U+0E32)
    • SARA AI MAIMUAN (U+0E43) and SARA AI MAIMALAI (U+0E44)
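Some of the rules above can be sketched as a "loose key" function: two strings match loosely if their keys are equal. This covers only tone marks, doubled above/below vowels, and the two character pairs listed; the long/short vowel rule is left out because it needs more than a character mapping. All names here are made up for illustration.

```python
import re

TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"  # MAI EK, MAI THO, MAI TRI, MAI CHATTAWA

# Frequently confused pairs, folded to one member of each pair.
CONFUSABLE = {
    "\u0e45": "\u0e32",  # LAKKHANGYAO -> SARA AA
    "\u0e43": "\u0e44",  # SARA AI MAIMUAN -> SARA AI MAIMALAI
}

# MAI HAN-AKAT and SARA I .. SARA UU typed twice or more in a row.
_DOUBLED_VOWEL = re.compile("([\u0e31\u0e34-\u0e39])\\1+")


def loose_key(text):
    """Return a key for loose comparison of Thai strings."""
    text = text.translate(str.maketrans("", "", TONE_MARKS))  # drop tone marks
    text = text.translate(str.maketrans(CONFUSABLE))          # fold confusable pairs
    return _DOUBLED_VOWEL.sub(r"\1", text)                    # collapse doubled vowels
```

For example, `loose_key("กูเกิ้ล") == loose_key("กูเกิล")` holds under these rules.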

Loose match (NECTEC service) can be related to both

  • Word approximation - using e.g. edit distance
  • Soundex - using similar or exact-matched sound

Normalization

  • NIKHAHIT (U+0E4D) + SARA AA (U+0E32) -> SARA AM (U+0E33) (this combination is already in Unicode table)
  • NIKHAHIT + tone mark + SARA AA -> tone mark + SARA AM
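The two rules above can be sketched with one regular expression. The function name is made up, and the tone mark range U+0E48..U+0E4B is my reading of "tone mark" here.

```python
import re

NIKHAHIT = "\u0e4d"
SARA_AA = "\u0e32"
SARA_AM = "\u0e33"

# NIKHAHIT, an optional tone mark (MAI EK .. MAI CHATTAWA), then SARA AA.
_AM_SEQUENCE = re.compile(NIKHAHIT + "([\u0e48-\u0e4b]?)" + SARA_AA)


def compose_sara_am(text):
    """Rewrite NIKHAHIT (+ tone mark) + SARA AA as (tone mark +) SARA AM."""
    return _AM_SEQUENCE.sub(lambda m: m.group(1) + SARA_AM, text)
```

The first rule is the reverse of the compatibility decomposition of U+0E33 in the Unicode data; the second moves the tone mark in front of the composed SARA AM.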

Phonetic mark-up

phonetic mark-up specifications

  • for basic sound matching
  • for precision text-to-speech rendering
  • we can create a multi-precision scheme for the country!

criteria:

  • should the symbols used in the mark-up be limited to characters available on a typical keyboard, for ease of typing?

algorithms:

Thai Transliteration/Romanization algorithm

related to Phonetic mark-up?

Should this be in WTT 3.0 scope?
The Thai romanization method was codified by the Thai Royal Institute many years ago. It has been used in, for example, transliteration engines at NECTEC and CU. Romanization by RI is based mainly on phonetic transcription with a phonetic-to-roman character mapping table.

Thatsanee Charoenporn, Ananlada Chotimongkol, Virach Sornlertlamvanich, 1999. "Automatic romanization for Thai".
A Unified Model of Thai Romanization and Word Segmentation
and a romanization service

Standard word lists

Should WTT 3.0 provide some royalty-free standard word lists?
This would ease the legal side of the implementation process.

  • public domain, or royalty-free under the least restrictive free software/Creative Commons license.
  • list for word segmentation purpose
  • list for hyphenation purpose
  • list for word prediction purpose
  • list for word correction (spelling check) purpose

What criteria for words are we looking for?

  • Statistic or linguistic
    • Most frequently used words (require a very large resource for general domain)
    • Standard dictionary words (e.g. Royal Institute)
  • Conceptual or morphological
    • Longer word for concept expression
    • Shorter word for segmentation purposes

Printer and Display optimization transitional codes

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License