Character Encoding Issues - Webmaster Sherpa

If you are working with certain non-English languages, you will invariably have to consider the issue of character encoding. I myself run a website for Japanese language study, so I have dealt with this issue a bit, though I am far from truly knowledgeable about it. Still, maybe some of what I have learned will be of use to you.

Encoding Basics

Note: Most of this was drawn from the useful tutorial on character code issues by Jukka “Yucca” Korpela. Please reference that site if you want more detailed information.

Generally, data on the Web is presented as octets (bytes), a small unit of data with a numerical value between 0 and 255, inclusively. Octets consist of eight bits (hence the name, from Latin octo ‘eight’), but that isn’t terribly important. What is important is that different conventions can be established for how an octet or a sequence of octets presents some data. In particular, we care about the presentation of character data.

In the simplest case, one octet corresponds to one character according to some mapping table (encoding). This means that at most 256 different characters can be represented and is sometimes referred to as single-byte or 8-bit encoding. Well-known single-byte encoding schemes are ASCII and the ISO Latin family of encodings and, in fact, previously the ASCII encoding was usually assumed by default. Nowadays ISO Latin 1 (ISO 8859-1)[1], which can be regarded as an extension of ASCII, is often the default, though that is starting to change.

If you are reading this then you no doubt realize that certain languages have characters that aren’t represented in the popular single-byte encoding schemes. These languages usually have their own encoding schemes (there are multiple when it comes to Japanese and Chinese) which use more than one octet/byte for representation, thus being referred to as multibyte encoding. One increasingly popular scheme which can handle most languages is a form of Unicode[2] known as UTF-8. In UTF-8, “character codes less than 128 (effectively, the ASCII repertoire) are presented “as such”, using one octet for each code (character) All other codes are presented, according to a relatively complicated method, so that one code (character) is presented as a sequence of two to four octets, each of which is in the range 128 – 255. This means that in a sequence of octets, octets in the range 0 – 127 (“bytes with most significant bit set to 0”) directly represent ASCII characters, whereas octets in the range 128 – 255 (“bytes with most significant bit set to 1″) are to be interpreted as encoded presentations of characters.”

Naturally, you are free to use any encoding scheme you wish, but my recommendation is to use UTF-8. Regardless, there are at least three issues to consider: character input, character output and database manipulation. The first, which is a function of the keyboard, operating system, text editor, etc., I leave to you to figure out. I presume you already have if you are even using a multibyte language.

Specifying Character Encoding (“charset”) Output

In order for your website to display information correctly you must tell the browser which encoding scheme you are using. Such information should be sent by the Web server along with the document itself, using so-called HTTP headers (MIME headers). The technical term used to denote a character encoding in the Internet media type context is “character set”, abbreviated “charset”.[3] In a webpage header, the complete line would be:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

The above example specifies, in addition to saying that the media type is text and subtype is html, that the character encoding is UTF-8.

Databases and Encoding

When it comes to character encoding, using a database adds two steps to the process: (1) encoding data when you populate the database and (2) decoding data when you retrieve it. Assuming you are using MySQL, addressing these steps requires an understanding of character sets and collations. I’ve basically just discussed character sets above; though do note that the actual names seem to differ sometimes between what you would specify for MySQL and what you would specify for a browser. A collation is a set of rules for comparing characters in a character set. While I haven’t studied this carefully, it seems logical that the safe bet is to ensure that the character set and the collation are always the same. You can read more about MySQL character sets and collations in section 8.1.1 of their manual.

MySQL server character set and collation defaults depend on the options used when you start mysqld. You can use --character-set-server for the character set. Along with it, you can add --collation-server for the collation. If you don’t specify a character set, that is the same as saying --character-set-server=latin1. If you specify only a character set (for example, latin1) but not a collation, that is the same as saying --character-set-server=latin1 --collation-server=latin1_swedish_ci because latin1_swedish_ci is the default collation for latin1.[4] Unless you are running a dedicated server, the default will probably have been set for you by your web host. You can use the popular phpMyAdmin program (which I recommend) to check the defaults. You can also use it to specify collation and character set for individual tables and fields as well. To change the default character set for a table via an SQL statement, use the following:

ALTER TABLE tbl_name DEFAULT CHARACTER SET charset_name;

The word DEFAULT is optional. The default character set is the character set that will be used if you do not specify the character set for columns that you add to a table later (for example, with ALTER TABLE ... ADD column).

VERY IMPORTANT: it is important to understand that even though you can specify a character set and/or collation for your MySQL tables, you can actually add data to those tables that is encoded with an entirely different encoding scheme! I think this is where many databases get into encoding trouble.

Working with Encoding, MySQL and PHP[5]

If you’ve read above, it should be clear that the ideal situation is that your database character set and collation as well as your browser character set are all the same (and that your table data is all encoded in that same character set). But, what do you do if they aren’t or if you need to work with more than one character set in your PHP scripts? Well, you can use two PHP statements (typically when you create your database connection) to affect the connection character sets when you run your script:

SET NAMES 'charset_name' SET CHARACTER SET charset_name

SET NAMES indicates what character set the client will use to send SQL statements to the server. Thus, SET NAMES ‘UTF8’ tells the server “future incoming messages from this client are in character set UTF-8.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)

A SET NAMES 'x' statement is equivalent to these three statements:

SET character_set_client = x; SET character_set_results = x; SET character_set_connection = x;

Setting character_set_connection to x also sets collation_connection to the default collation for x. It is not necessary to set that collation explicitly. To specify a particular collation for the character sets, use the optional COLLATE clause:

SET NAMES 'charset_name' COLLATE 'collation_name'

SET CHARACTER SET is similar to SET NAMES but sets character_set_connection and collation_connection to character_set_database and collation_database. A SET CHARACTER SET x statement is equivalent to these three statements:

SET character_set_client = x; SET character_set_results = x; SET collation_connection = @@collation_database;

Setting collation_connection also sets character_set_connection to the character set associated with the collation (equivalent to executing SET character_set_connection = @@character_set_database). It is not necessary to set character_set_connection explicitly.

Now, as I mentioned, your MySQL will have a default character set and collation, usually set by your web host. If that matches the data in your database and the character set you are specifying in your headers then you shouldn’t need the SET commands above. Since so many sites use English or Western European languages that are supported with the ISO Latin encoding and since that is more or less the typical default for most MySQL implementations, few people run into a problem. But, once you switch to UTF-8 or another multibyte encoding, chances are the MySQL default will still be in ISO Latin and you will most definitely need the SET commands.

Another case where you will likely need those SET commands is if you are trying to support multiple encodings within one script. Since you can really only specify headers once, you will have to choose one encoding, typically the one that is most flexible, UTF-8. But, if you try to display ISO Latin text (e.g., pulled from a database in that format) in UTF-8 certain characters might not display properly. Thus, you would want to use one of the SET commands to retrieve the data in its ISO Latin encoding and then use the mb_convert_encoding PHP command to convert the retrieved data inside your script. You may know that there is also a utf8_encode function in php but I have found it doesn’t work as reliably so I prefer to stick with mb_convert_encoding and make sure I specify both the from and to encoding formats. For example:

mb_convert_encoding($content, "UTF-8", "iso-2022-jp");

Importing/Converting Backup Database Data[6]

If your MySQL database uses a default encoding scheme like ISO Latin (most likely latin1_swedish_ci), you may at some point want to (or be forced to) switch to something else, typically UTF-8. But, you may want to keep some or all of your tables in their original character set. In that case, be careful about how you perform an import to accomplish this since typically an import will use the default character set and collation of the database you export from. You can override this however, by adding the following at the end of your CREATE TABLE statements:

DEFAULT CHARACTER SET latin1 COLLATE latin1_general_ci

Note, of course, that you could use latin1_swedish_ci or whatever is relevant to your situation.

But, what if you actually do want to switch to the new (UTF-8) encoding? I think there are two choices. First, you can try to convert your data when you import it. I haven’t done this myself, but apparently if your PHP installation supports iconv then you can use something like:

iconv -f utf-8 -t utf-8 < olddatabase.sql

Of course, you would certainly want to have confidence that the data you are importing is clean and all of the same encoding scheme.

The second choice is to import your data in its original encoding and then convert afterwards. You could do this either by writing your own conversion script using mb_convert_encoding (see sample script below) or you can use a SQL statement like this:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;

For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, a TEXT column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For a latin1 TEXT column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted to utf8, each character might require up to 3 bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length will not fit in a TEXT column’s length bytes, so MySQL will convert the data type to MEDIUMTEXT, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, a VARCHAR column might be converted to MEDIUMTEXT.

To avoid data type changes of the type just described, do not use CONVERT TO CHARACTER SET. Instead, use MODIFY to change individual columns. For example:

ALTER TABLE t MODIFY latin1_text_col TEXT CHARACTER SET utf8; ALTER TABLE t MODIFY latin1_varchar_col VARCHAR(M) CHARACTER SET utf8;

Warning: The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

ALTER TABLE t1 CHANGE c1 c1 BLOB; ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;

The reason this works is that there is no conversion when you convert to or from BLOB columns.

If you choose to go the route of using a custom script to do your conversion, below is one I wrote that might be of use as a starting point. In my case all the entries were of the same (EUC) encoding so it was fairly straightforward. I think that if it hadn’t been, the script could be modified to first convert all entries into the supposed original encoding and then convert them to the target encoding, but I haven’t tested that.

<html>
<head>
<title>Convert Database Script</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>

<?php
set_time_limit(120);
error_reporting(E_ERROR | E_WARNING | E_PARSE);
ini_set("display_errors","1");

include ("db.php");

$db->query("SET NAMES 'UTF8'");
$db->query("SET character_set_results='latin1'");
$db->query("SET character_set_client = 'UTF8'");

function convert ($table, $idfield="ID", $fields, $fromcharset="Shift_JIS", $tocharset="UTF-8") {
	global $db;
	$sql = "SELECT * FROM $table";
	//echo $sql . "<br />";
	$get = $db->get_results($sql);
	foreach ($get as $row) {
		$id = $row->$idfield;
		$update_sql = "UPDATE $table SET";
		foreach ($fields as $field) {
			$value = $row->$field;
			$fieldconverted = mb_convert_encoding($value, $tocharset, $fromcharset);
			$update_sql .= " " . $field . "="" . addslashes(stripslashes($fieldconverted)) . "",";
		}
		$update_sql = substr($update_sql,0,strlen($update_sql)-1);
		$update_sql .= " WHERE " . $idfield . "='" . $id . "'";
		echo $update_sql . "<hr>";
		//$db->query($update_sql);
	}
}

convert ("tablename","ID",array("field1","field2"),"EUC-JP");
?>

</body>
</html>

<html>

<head>

<title>Convert Database Script</title>

</head>

<body>

<?php

set_time_limit(120);

error_reporting(E_ERROR | E_WARNING | E_PARSE);

ini_set("display_errors","1");

include ("db.php");

$db->query("SET NAMES 'UTF8'");

$db->query("SET character_set_results='latin1'");

$db->query("SET character_set_client = 'UTF8'");

function convert ($table, $idfield="ID", $fields, $fromcharset="Shift_JIS", $tocharset="UTF-8") {

global $db;

$sql = "SELECT * FROM $table";

//echo $sql . "<br />";

$get = $db->get_results($sql);

foreach ($get as $row) {

$id = $row->$idfield;

$update_sql = "UPDATE $table SET";

foreach ($fields as $field) {

$value = $row->$field;

$fieldconverted = mb_convert_encoding($value, $tocharset, $fromcharset);

$update_sql .= " " . $field . "="" . addslashes(stripslashes($fieldconverted)) . "",";

}

$update_sql = substr($update_sql,0,strlen($update_sql)-1);

$update_sql .= " WHERE " . $idfield . "='" . $id . "'";

echo $update_sql . "<hr>";

//$db->query($update_sql);

}

convert ("tablename","ID",array("field1","field2"),"EUC-JP");

</body>

</html>

Regardless of what method you use to convert encoding in your database, the biggest potential problem in any conversion occurs if you have more than one encoding used in a table. In that case, the process is much more difficult. I think in such a situation the best thing to do is use a custom conversion script, but it could potentially get complicated. To see some advice on dealing with this, reference “Turning MySQL data in latin1 to utf8 utf-8” by Derek Sivers.

[1] In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 – 255, and they are:
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

[2] Unicode defines a character repertoire and character code intended to be fully compatible with ISO 10646, an international standard which defines UCS (Universal Character Set), which itself is a very large and growing character repertoire. Currently tens of thousands of characters have been defined, and new amendments are defined fairly often.

[3] The official registry of “charset” (i.e., character encoding) names, with references to documents defining their meanings, is kept by IANA. You can also reference a tabular presentation of the registry, ordered alphabetically by “charset” name and accompanied with some hypertext references.

[4] 8.1.3.1. Server Character Set and Collation

[5] information on setting MySQL settings are taken from 8.1.4 Connection Character Sets and Collations

[6] Reference the MySQL manual, 11.1.2. ALTER TABLE Syntax

Subject: Site Design (Architecture, Coding, UI, etc.)

Like this content? Why not share it?