Similar Posts and Pentecost – All Things Seen and Unseen

Similar Posts v.2.5b28 has just been posted.

Working on Similar Posts I have learned more than I care to know about the vagaries of MySQL, PHP, and Unicode. One particular issue that has so far resisted my attempts has been the satisfactory handling of content in Chinese, Korean, or Japanese (CJK).

Similar Posts uses the full-text indexes provided by MySQL to compare one post with another, and those indexes are word-based. The CJK languages (I am told) are not written as discrete words, or at least not as words delimited by ‘white space’, so they pose a big problem for full-text indexing.

My workaround (hack?/fiddle?/trick?) is to separate the CJK text into individual characters (while leaving single-byte encoded text alone) and use them as the basis for similarity matching. It is clearly not an ideal solution but I would love to hear from the users of WordPress blogs in Chinese, Korean, or Japanese if it is better than no solution at all.

The experiment has a couple of limitations. First, although UTF-8 is not the ideal encoding for CJK languages, the method works for now only on blogs that use UTF-8. Second, to get around MySQL’s habit of ignoring words shorter than 4 characters, each CJK ‘word’ is padded to that length, which makes for a rather large index.
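To make the idea concrete, here is a minimal sketch of the split-and-pad step. It is written in Python purely for illustration and is not the plugin's PHP code. The function name `pad_cjk`, the character ranges, and the choice to pad by repeating each character are all assumptions; the post does not say how the plugin actually pads.

```python
import re

# Assumed, non-exhaustive CJK ranges: Japanese kana, CJK ideographs
# (including Extension A), and Hangul syllables.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")

MIN_WORD_LEN = 4  # MySQL's default minimum word length (ft_min_word_len)


def pad_cjk(text, min_len=MIN_WORD_LEN):
    """Turn each CJK character into its own space-delimited 'word'.

    Padding by repetition is a hypothetical scheme chosen for this sketch.
    Non-CJK words are kept as they are, although runs of whitespace
    between words are collapsed to single spaces.
    """
    padded = CJK_CHAR.sub(lambda m: " " + m.group(0) * min_len + " ", text)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(padded.split())


print(pad_cjk("WordPress 中文 blog"))
# WordPress 中中中中 文文文文 blog
```

Every CJK character in a post becomes a four-character word plus a delimiter, which is why an index built this way grows so quickly.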

To try this approach, use the Settings | Similar Posts | Manage the Index screen, set the option, and re-index. This setting overrides the other settings on that screen.

The title refers to Pentecost because today’s Feast celebrates the undoing of Babel.

2 replies on “Similar Posts and Pentecost”

Gregory says:

May 11th, 2008 at 8:38 pm

Brilliant!

Just stumbled across your blog looking for a caching solution for my various WordPress blogs. A Jesuit priest writes WordPress plugins — I was intrigued. All the more so when you attempted to tie full-text indexing to Pentecost. Well done.

Also — and I don’t know how much this helps you — but MySQL only sets the minimum index length to four characters by default. To change it, modify the ft_min_word_len variable in the my.cnf file on your server. (Some hosting providers lock this down, however, because of the drain on SQL resources.)

Anyway, thanks for the plugins — and the homilies.

Rob says:

May 11th, 2008 at 9:50 pm

Gregory: You described the problem — most shared hosting won’t let you near ft_min_word_len.

My health has barred me from preaching these days so I content myself with thinking the easier thoughts computers think 😉
