Friday, September 7, 2012

How I fixed the IBus Cangjie IME table for good on Fuduntu

I type a lot of Chinese on my computer. I first learned Cangjie in my teens, and it has been my primary input method ever since. After reinstalling Fuduntu on my new laptop, imagine my despair when, as I typed up Chinese passages, Cangjie codes for common characters, ones I had mostly taken for granted, suddenly went unrecognized. To add insult to injury, the Cangjie 3 table in SCIM has all of these codes.

On Linux, Cangjie is supported by SCIM (and later IBus) through their table engines. There are three Cangjie tables: Cangjie 3, Cangjie 5, and "Cangjie-big", which covers characters in the Unicode Extension B block. Many written Cantonese characters belong there, and I handle a lot of those. So at one point I had all three tables loaded and had to flip between them in the course of typing up a single passage. It is a nuisance and it hampers my speed.

So I have a solution: Merge them into one master table and load it instead. Bring everything under one umbrella.

Unfortunately, for some reason, the Fuduntu repo was missing the ibus-table package, which is needed to build new table-based IMEs, so I had to compile my own. The good news is that you can get the source RPM from the newest Fedora, rebuild it on Fuduntu, and install it. I'll leave that as an exercise for you, for now.


To build this master Cangjie table, I downloaded the standalone ibus-table-chinese source code and extracted it. Once unpacked, the Cangjie tables are in the subfolder ibus-table-chinese-1.3.5-Source/tables/cangjie. I then ran these commands in that folder to combine them:

$ grep -n "BEGIN_TABLE" *.txt 
cangjie3.txt:115:BEGIN_TABLE 
cangjie5.txt:127:BEGIN_TABLE 
cangjie-big.txt:134:BEGIN_TABLE 
$ tail -n +115 cangjie3.txt > master 
$ tail -n +127 cangjie5.txt >> master 
$ tail -n +134 cangjie-big.txt >> master

The line numbers 115, 127 and 134 may change when new versions of these tables are released. Use the values that grep reports when you run it.

This produces a raw merged file of all Cangjie tables, minus the headers.
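
Because those line numbers drift between releases, the table body can also be cut out by its markers instead of by position. The following is a minimal sketch, not part of the original procedure: table_body is a hypothetical helper name, and it assumes each file contains one BEGIN_TABLE line and one END_TABLE line, each starting at the beginning of a line.

```shell
# table_body FILE (hypothetical helper): print only the rows that sit
# strictly between the BEGIN_TABLE and END_TABLE marker lines of FILE.
table_body() {
    awk '/^END_TABLE/   { inside = 0 }   # stop before the closing marker
         inside         { print }        # emit rows while inside the table
         /^BEGIN_TABLE/ { inside = 1 }   # start after the opening marker
        ' "$1"
}
```

Called once per file, for example table_body cangjie3.txt > master followed by table_body cangjie5.txt >> master, it yields the merged body without either marker. Rows commented out with "###" are still present and need to be filtered separately.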

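Now clean it up by removing the commented-out codes and the "BEGIN_TABLE" and "END_TABLE" tags, which the commands above leave in place (tail -n +N starts at line N itself, so each BEGIN_TABLE line is still in master):

$ grep -v "###" master | grep -v -e "BEGIN_TABLE" -e "END_TABLE" > master2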

Sort and deduplicate the intermediate file:

$ sort -k3,3nr master2 | sort -s -u -k1,2 > master

The first sort puts the rows with the highest frequency (field 3 of the table entries) first, comparing the frequencies as numbers so that 1000 ranks above 999. The second sort is stable and keeps only the first row it sees for each code and character pair (fields 1 and 2), which is therefore the row with the highest frequency.
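
The keep-the-highest-frequency idea can be tried on a few sample rows before running it on the real tables. This is a minimal sketch under assumptions: dedupe_table is a hypothetical helper name, the rows are tab-separated code, character and frequency, and sort is GNU sort, whose -u keeps the first row of each run of equal keys.

```shell
# dedupe_table (hypothetical helper): read tab-separated
# "code<TAB>character<TAB>frequency" rows on stdin and keep, for each
# code/character pair, the row with the highest numeric frequency.
dedupe_table() {
    tab=$(printf '\t')
    # Pass 1: highest frequency first, compared as a number.
    LC_ALL=C sort -t "$tab" -k3,3nr |
    # Pass 2: stable sort on code + character; -u keeps the first row
    # of each pair, which is the highest-frequency one after pass 1.
    LC_ALL=C sort -t "$tab" -s -u -k1,2
}
```

Fed two rows for the same code and character with frequencies 5 and 1000, it keeps the row with 1000. A plain text comparison of the frequency field would rank "5" above "1000", which is why the field is sorted numerically here.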

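Apply the header from cangjie-big.txt to this new master table (I like its clean "倉" icon), and complete it with the "END_TABLE" tag at the end. The header is everything up to and including the BEGIN_TABLE line (line 134 here), and grep -v keeps any BEGIN_TABLE line still sitting in master from being repeated:

$ head -n 134 cangjie-big.txt > cangjie-master.txt
$ grep -v "BEGIN_TABLE" master >> cangjie-master.txt
$ echo "END_TABLE" >> cangjie-master.txt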
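
Before installing, it may be worth checking that the assembled file has the shape the table engine expects. The following is a rough sanity check, not something from the original procedure: check_table is a hypothetical helper name, and it only tests that there is exactly one BEGIN_TABLE line, exactly one END_TABLE line, and that END_TABLE is the last line.

```shell
# check_table FILE (hypothetical helper): succeed only if FILE contains
# exactly one BEGIN_TABLE line, exactly one END_TABLE line, and ends
# with END_TABLE.
check_table() {
    [ "$(grep -c '^BEGIN_TABLE' "$1")" -eq 1 ] &&
    [ "$(grep -c '^END_TABLE' "$1")" -eq 1 ] &&
    [ "$(tail -n 1 "$1")" = "END_TABLE" ]
}
```

For example, check_table cangjie-master.txt && echo "looks sane" prints the message only when all three conditions hold. It says nothing about whether the individual rows are valid.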

Install this new table as root:

# ibus-table-createdb -n /usr/share/ibus-table/tables/cangjie-master.db -s cangjie-master.txt

Now restart IBus and add the cangjie-master table under Preferences, Input Method. It appears as a Chinese input method.

I just did this today and so far I am able to enter all the Chinese I need to.

I could use the same approach to build myself a master Cantonese/Jyutping table; I just have no immediate need for one.

Happy Chinese typing!


2 comments:

  1. Hi, I just found out about your article.

    I'm the lead developer of IBus Cangjie, an effort to bring IBus users a great Cangjie/Quick input method.

    I'd be very interested if you could try it out, and let me know what you think. :)

    We've tried hard to offer **by default** all the characters that are important in Hong Kong... but if you use the 1.0 release that is packaged in Fedora, Debian or Ubuntu, then you'll realize that we completely failed. :)

    We're close to a 2.0 release though that (we think) will provide these by default, so if you want to try it out, grab it from Git.

    More details are at http://cangjians.github.io/

  2. Is there a way to do this as well for Fcitx?
