When we display text, the data that represents each character is often contained within a single byte, or eight bits. However, there are many languages (and special characters!) in the world that require more than eight bits per character. These use multi-byte characters, and to display them, we'll need to store them in our database in a way that is compatible.
For the sake of this article, I've set up a test site that uses a database configuration that isn't compatible with multi-byte characters. In this case, it's using the MySQL default latin1 character set with collation set to latin1_swedish_ci. It works great for English, but let's try adding some Traditional Chinese characters into the Edit Content window:
Looks fine so far. When we update the block, though, and store our text in the database, all we see are question marks:
This tells us there's likely something wrong with the database's character set and collation: our latin1 character set and its collation can't describe the multi-byte characters in our Chinese language text, so a question mark is stored in place of each one.
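To see the problem in isolation, a small experiment like the one below can be run from any MySQL prompt. It is only a sketch: the hex literal X'E7B981' is the UTF-8 byte sequence for the single character 繁 (U+7E41), chosen here purely as an example.

```sql
-- One character, three bytes: LENGTH counts bytes, CHAR_LENGTH counts characters.
SELECT
  LENGTH(_utf8 X'E7B981')      AS byte_length,   -- 3
  CHAR_LENGTH(_utf8 X'E7B981') AS char_length;   -- 1

-- latin1 has no equivalent for this character, so the conversion
-- should come back as a question mark, just like the block content did.
SELECT CONVERT(_utf8 X'E7B981' USING latin1) AS stored_as_latin1;
```

The substitution is lossy: once a question mark has been written to the table, the original character cannot be recovered from it.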
Collation? Character Sets?
A character set is the group of characters, together with the way each one is encoded as bytes, that the database uses to store your data. These sets include the symbols for specific languages or regions. Collation refers to the rules that the database uses to compare and sort that data. Each of these settings can be specified in your MySQL database.
While the default MySQL database encoding (latin1 / latin1_swedish_ci) works great for sites containing only single-byte English text, it can't handle the multi-byte characters in the Traditional Chinese example above.
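To find out which character set and collation a site is currently using, queries along these lines can help. This is a minimal sketch; db_name is a placeholder for the name of your own database.

```sql
-- Server-wide and connection-level character set and collation settings
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Default character set and collation of a single database
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'db_name';
```

If the second query reports latin1 and latin1_swedish_ci, the database is in the same state as the test site described here.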
What encoding is best for my database?
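The accepted standard for multilingual websites is UTF-8. The collation option we want for MySQL is utf8_general_ci. This will allow you to display a wide range of international languages and symbols. Note that MySQL's utf8 character set stores at most three bytes per character; characters that need four bytes in UTF-8, such as emoji, require the utf8mb4 character set instead.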
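To check which UTF-8 character sets and collations your own server offers, something like the following should work. It is a sketch rather than a required step, and the exact names listed depend on the MySQL version (newer releases may report utf8 as utf8mb3).

```sql
-- The Maxlen column shows the maximum number of bytes per character
SHOW CHARACTER SET LIKE 'utf8%';

-- List the matching general-purpose, case-insensitive collations
SHOW COLLATION LIKE 'utf8%general_ci';
```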
How do I change how my database is encoded?
Before you alter anything in your database, it's a good idea to make a backup copy, just in case something goes awry. Once you're ready, you can use either command-line SQL or a more user-friendly interface like phpMyAdmin to adjust collation settings.
In command-line MySQL, you'd run this command (replacing db_name with the name of your actual database):
ALTER DATABASE db_name CHARACTER SET utf8 COLLATE utf8_general_ci;
Using phpMyAdmin, you'd browse to your database and open the Operations tab. At the bottom of the window, you'll see a dropdown menu that allows you to easily switch collation:
Once we've got UTF-8 encoding set, our Traditional Chinese characters look much better:
Other things to check
If switching your database collation to utf8_general_ci doesn't fix the problem, you might want to check the collation on individual tables' fields, all of which should be set to utf8_general_ci as well. Changing the database setting only changes the default for tables created afterwards; existing tables and fields keep the collation they already had:
In the image above, we see that my arHandle field has the correct utf8_general_ci collation; if it were set to latin1_swedish_ci (or something else), we would need to correct that.
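The same check can be done without phpMyAdmin. The sketch below lists any text fields that are still on a non-UTF-8 collation and then converts one table; db_name and table_name are placeholders, and it assumes you have already taken a backup.

```sql
-- List text columns whose collation is not a UTF-8 one
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'db_name'
  AND COLLATION_NAME IS NOT NULL        -- skip non-text columns
  AND COLLATION_NAME NOT LIKE 'utf8%';

-- Convert one table, including all of its text columns
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
```

CONVERT TO re-encodes the data already in the table, but text that was previously saved as question marks will stay that way and has to be entered again.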