PHP and MySQL: getting ready for multi-language applications with UTF-8

Thursday, 08 December 2011

This explains how to write PHP applications ready for international use.

UTF-8 is a widely used and well-supported character encoding that can represent every Unicode character, including accented characters, characters with umlauts, cedillas and so on.

Here are a few steps to follow to ensure your data gets stored correctly in UTF-8 encoding:

Place this at the very top of your 'entry' PHP script or your config PHP script (to make sure the code gets executed site-wide):

<?php // set the internal encoding for PHP's multibyte string and regexp functions
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
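A quick way to see what the internal encoding changes is to compare a byte-oriented string function with its multibyte-aware counterpart. This is only a small sketch: it assumes the mbstring extension is loaded, and the sample word is just an illustration.

```php
<?php
// Assumes the mbstring extension is available.
mb_internal_encoding('UTF-8');

// "façade": the ç (U+00E7) takes two bytes in UTF-8.
$word = "fa\u{00E7}ade";

$bytes = strlen($word);         // counts bytes
$chars = mb_strlen($word);      // counts characters, using the internal encoding
$upper = mb_strtoupper($word);  // case conversion that understands multibyte characters

echo "$bytes bytes, $chars characters, upper: $upper\n";
// prints "7 bytes, 6 characters, upper: FAÇADE"
```

The plain string functions (strlen, substr, strtoupper) keep working on bytes, so the mb_* variants are the ones to reach for when text may contain non-ASCII characters.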


When outputting HTML, make sure you declare the document encoding, either by sending an HTTP header from PHP:

<?php header('Content-Type: text/html; charset=UTF-8');

or by adding a meta tag to the <head> section of your HTML:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
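Nothing stops you from doing both. Here is a rough sketch of a page that sends the header and also carries the meta tag. The renderPage() helper and the sample strings are made up for this example; they are not part of any library.

```php
<?php
// Hypothetical helper: wraps a body fragment in a minimal HTML page that declares UTF-8.
function renderPage(string $title, string $bodyHtml): string
{
    // Tell htmlspecialchars() that the input is UTF-8 so multibyte characters survive escaping.
    $safeTitle = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n"
        . "<title>$safeTitle</title>\n</head>\n<body>\n$bodyHtml\n</body>\n</html>\n";
}

// The header has to go out before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo renderPage('Crème brûlée & co', '<p>Ünïcödé works.</p>');
```

When both are present, browsers give the HTTP header priority over the meta tag, so it is worth making sure the two agree.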

Now for the MySQL part:

MySQL uses character sets to determine how the database stores strings, and so-called 'collations' to determine how it sorts and compares them.

The catch here is that you may have configured the tables to store text fields in utf8, but on the other hand the default server collation may be set to something else (for example, 'latin1_swedish_ci'). Note the '_ci' suffix: it stands for 'case-insensitive'.

What I do is set the schema collation to utf8_unicode_ci, which sorts strings well in most languages.
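As a rough sketch, the statements for that could be generated like this. The utf8SchemaStatements() helper and the 'myapp' and 'articles' names are placeholders invented for the example; the ALTER TABLE statements assume the schema is already selected as the current database.

```php
<?php
// Hypothetical helper: builds the statements that switch a schema and its
// existing tables to utf8 / utf8_unicode_ci.
function utf8SchemaStatements(string $schema, array $tables): array
{
    $quote = function (string $name): string {
        // Wrap identifiers in backticks, doubling any backtick inside the name.
        return '`' . str_replace('`', '``', $name) . '`';
    };

    // Sets the default for tables created in this schema from now on.
    $sql = ['ALTER DATABASE ' . $quote($schema)
        . ' CHARACTER SET utf8 COLLATE utf8_unicode_ci'];

    // Existing tables keep their old character set, so convert each one explicitly.
    foreach ($tables as $table) {
        $sql[] = 'ALTER TABLE ' . $quote($table)
            . ' CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci';
    }

    return $sql;
}

// 'myapp' and 'articles' are placeholder names.
foreach (utf8SchemaStatements('myapp', ['articles']) as $statement) {
    echo $statement, ";\n";
}
```

Changing the schema default alone does not touch tables that already exist, which is why the sketch converts them one by one.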

As for PHP, I make sure the very first of my SQL statements (after the database connection has been set up) is

SET NAMES utf8;

This explicitly sets the character set of the MySQL connection, and with it the connection collation. So now we have UTF-8 encoded PHP code, a UTF-8 regex engine for PHP, we output UTF-8 HTML, and we get input from HTML forms (you guessed it) in UTF-8.

Setting NAMES to utf8 ensures we pass data back and forth to MySQL in the right encoding. If you don't set NAMES, you may connect with a different character set and collation (the server's default), and malformed text will come your way, because transcoding happens automatically at the database level (e.g. from the 'utf8' to the 'latin1' character set, or God knows what).
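Here is a minimal sketch of how that could look with PDO. The buildSetNames() and connectUtf8() helpers are hypothetical names made up for this example, and the DSN and credentials in the usage comment are placeholders.

```php
<?php
// Hypothetical helper: builds the SET NAMES statement for a charset and optional collation.
function buildSetNames(string $charset, ?string $collation = null): string
{
    // Only plain identifier characters are allowed, since the values go straight into SQL.
    foreach ([$charset, $collation] as $name) {
        if ($name !== null && !preg_match('/^[A-Za-z0-9_]+$/', $name)) {
            throw new InvalidArgumentException("Invalid charset or collation name: $name");
        }
    }

    $sql = "SET NAMES '$charset'";
    if ($collation !== null) {
        $sql .= " COLLATE '$collation'";
    }

    return $sql;
}

// Hypothetical helper: opens a connection and makes SET NAMES the first statement sent.
function connectUtf8(string $dsn, string $user, string $password): PDO
{
    $pdo = new PDO($dsn, $user, $password, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
    $pdo->exec(buildSetNames('utf8', 'utf8_unicode_ci'));

    return $pdo;
}

// Usage, with placeholder connection details:
// $pdo = connectUtf8('mysql:host=localhost;dbname=myapp', 'user', 'secret');
```

If you need characters outside the Basic Multilingual Plane (emoji, for example), MySQL 5.5.3 and later offer utf8mb4, which can be passed to the helper in place of utf8 together with a matching collation such as utf8mb4_unicode_ci.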

Thanks, and have fun!

