Proposed "fix" for character encoding problems

Submitted by Anonymous on Sun, 10/07/2012 - 01:45
Written by
Quix0r

I have encountered a lot of character problems, especially with German umlauts and the sharp s (ß). To "fix" this I had to hack the little function unhtml() in sources/functions.php:
- Add a new first line: global $lang;
- Replace $string = htmlspecialchars($string); with: $string = htmlspecialchars(recode('iso8859-1..' . $lang['character_encoding'], $string), ENT_XHTML | ENT_SUBSTITUTE);
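Put together, the two changes above might look roughly like the sketch below. This is not the actual UseBB function: the name unhtml_sketch(), the sample $lang array and the iconv() fallback are assumptions added so the snippet runs on its own, also where the recode extension is not installed.

```php
<?php
// Hypothetical stand-in for UseBB's language array; only the
// 'character_encoding' key is taken from the post.
$lang = array('character_encoding' => 'UTF-8');

// Sketch of the patched function (hypothetical name, not the real unhtml()).
function unhtml_sketch($string) {
    global $lang;
    $target = $lang['character_encoding'];

    if (function_exists('recode')) {
        // The call from the post; requires the recode extension.
        $string = recode('iso8859-1..' . $target, $string);
    } elseif (function_exists('iconv')) {
        // Assumption: iconv as a substitute when recode is missing.
        $string = iconv('ISO-8859-1', $target, $string);
    }

    // ENT_XHTML and ENT_SUBSTITUTE need PHP 5.4 or later.
    return htmlspecialchars($string, ENT_XHTML | ENT_SUBSTITUTE);
}

// "\xFC" is "ü" in ISO-8859-1.
$out = unhtml_sketch("\xFC<b>");
```

Note that this assumes the input really is ISO-8859-1; text that is already UTF-8 would be converted a second time.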

This takes care of any selected character set, and I'm currently changing all tables to utf8_general_ci to avoid trouble with "special" characters.

Hope you find this hack useful.

I assume you use UTF-8 as the character encoding. This may work as long as you don't apply any primitive string operations that don't support multibyte encodings. Plain PHP's substr() and similar functions are not Unicode-aware and treat every byte as one character, potentially cutting UTF-8 characters that take two or more bytes in half.
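A small sketch of the problem, assuming a UTF-8 string and no mbstring function overloading; the variable names are made up for illustration:

```php
<?php
// "Grüße" in UTF-8: 5 characters, 7 bytes ("ü" and "ß" take 2 bytes each).
$word = "Gr\xC3\xBC\xC3\x9Fe";

$bytes = strlen($word);        // counts bytes, not characters
$cut   = substr($word, 0, 3);  // "Gr" plus only the first byte of "ü"

// preg_match() with the /u modifier returns 1 only for valid UTF-8 input.
$cut_is_valid = (preg_match('//u', $cut) === 1);

// mbstring is optional, so only call it when it is available.
if (function_exists('mb_substr')) {
    $safe = mb_substr($word, 0, 3, 'UTF-8'); // "Grü", cut on a character boundary
}
```

Here $cut ends in a lone lead byte, which is exactly the kind of broken sequence that later trips up htmlspecialchars().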

This fix, however, should work with single-byte character sets, right? So it fixes some problems: they happen when htmlspecialchars() finds a byte sequence that is invalid in the given encoding and ENT_SUBSTITUTE is not used, in which case it returns an empty string.
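To illustrate the failure mode, here is a minimal sketch assuming PHP 5.4 or later and an ISO-8859-1 string passed off as UTF-8; the variable names are hypothetical:

```php
<?php
// "für" in ISO-8859-1; the byte 0xFC is not valid UTF-8.
$bad = "f\xFCr";

// Without ENT_SUBSTITUTE the whole result is an empty string.
$plain = htmlspecialchars($bad, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE the invalid byte becomes U+FFFD
// (the replacement character) and the rest of the text survives.
$subst = htmlspecialchars($bad, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

$has_replacement = (strpos($subst, "\xEF\xBF\xBD") !== false);
```

So ENT_SUBSTITUTE only keeps the output from vanishing; the character itself is still lost unless the string is converted first.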

And then you could fully switch to the equivalent mb_foo() functions to get safe multibyte support (don't roll your own; use PHP's mbstring functions).

Haven't tested this change yet, but I just noticed ENT_SUBSTITUTE is PHP 5.4+ only, and we support PHP 4.3. I will check if this can be included anyway without breaking compatibility.
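One possible way to keep older PHP versions working is to add the newer flags only when they are defined. This is just a sketch, not tested against UseBB; the helper name compat_html_flags() is made up, and ENT_COMPAT is used as the base because it is the historical default of htmlspecialchars().

```php
<?php
// Hypothetical helper, not part of UseBB: builds the flag set for
// htmlspecialchars() from whatever constants this PHP version knows.
function compat_html_flags() {
    $flags = ENT_COMPAT; // default flag on old PHP versions

    // Both constants were added in PHP 5.4; skip them on older versions.
    if (defined('ENT_XHTML')) {
        $flags |= ENT_XHTML;
    }
    if (defined('ENT_SUBSTITUTE')) {
        $flags |= ENT_SUBSTITUTE;
    }
    return $flags;
}

$escaped = htmlspecialchars('<a href="x">', compat_html_flags());
```

On versions without ENT_SUBSTITUTE this falls back to the old behaviour, so invalid input would still need to be converted beforehand.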

About the mbstring functions: this was discussed before. Lots of setups didn't have mbstring installed, so the code was never migrated. I hate(d) this too, but it has been like this for years.

I am not going to change this now, keeping in mind that UseBB 1 is not actively developed. I would be happier with a new major version or a new project later on, while stopping v1 maintenance completely.