May 11th, 2008
Similar Posts v.2.5b28 has just been posted.
Working on Similar Posts I have learned more than I care to know about the vagaries of MySQL, PHP, and Unicode. One particular issue that has so far resisted my attempts has been the satisfactory handling of content in Chinese, Korean, or Japanese (CJK).
Similar Posts uses the full-text indexes provided by MySQL to compare one post with another and the MySQL index is word-based. The CJK languages (I am told) are not based on discrete words — at least not words delimited by ‘white space’ — so they pose a big problem to full-text indexing.
My workaround (hack?/fiddle?/trick?) is to separate the CJK text into individual characters (while leaving single-byte encoded text alone) and use them as the basis for similarity matching. It is clearly not an ideal solution but I would love to hear from the users of WordPress blogs in Chinese, Korean, or Japanese if it is better than no solution at all.
The experiment has a couple of limitations. For now the method only works on blogs using UTF-8, even though UTF-8 is not the ideal encoding for CJK languages. Also, to get around MySQL’s habit of ignoring words shorter than 4 characters (the default minimum word length for its full-text indexes), each CJK ‘word’ is padded to that length, making for a rather large index.
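For the curious, here is a rough sketch of the idea in Python. It is not the plugin's actual code (the plugin is written in PHP), and two things in it are my assumptions for illustration only: the Unicode ranges treated as CJK, and padding by repeating the character. The function name pad_cjk is made up too.

```python
import re

# Assumed CJK ranges for this sketch: Hiragana and Katakana,
# CJK Unified Ideographs, and Hangul Syllables.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")

# MySQL's default minimum indexed word length (ft_min_word_len).
MIN_WORD_LEN = 4


def pad_cjk(text, min_len=MIN_WORD_LEN):
    """Turn each CJK character into its own padded 'word'.

    Padding by repetition is an assumption of this sketch; any scheme
    that brings each character up to min_len would serve.
    """
    def as_word(match):
        # Surround with spaces so the character becomes a separate word.
        return " " + match.group(0) * min_len + " "

    spaced = CJK_CHAR.sub(as_word, text)
    # Collapse the extra white space introduced above.
    return " ".join(spaced.split())


print(pad_cjk("hello 世界"))  # hello 世世世世 界界界界
```

Text without CJK characters passes through unchanged, while every CJK character becomes a four-character word, which is why the index grows so much.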
To try this approach, use the Settings | Similar Posts | Manage the Index screen, set the option, and re-index. This setting overrides the other settings on that screen.
The reference to Pentecost in this post’s title is because today’s Feast celebrates the undoing of Babel.