php - UTF-8 all the way through


I'm setting up a new server and want to support UTF-8 fully in my web application. I have tried this in the past on existing servers and always seem to end up having to fall back to ISO-8859-1.

Where exactly do I need to set the encoding/charsets? I'm aware that I need to configure Apache, MySQL, and PHP to do this. Is there a standard checklist I can follow, or perhaps a way to troubleshoot where the mismatches occur?

This is for a new Linux server, running MySQL 5, PHP 5 and Apache 2.

Data storage:

  • Specify the utf8mb4 character set on all tables and text columns in your database. This makes MySQL physically store and retrieve values encoded natively in UTF-8. Note that MySQL will implicitly use the utf8mb4 encoding if a utf8mb4_* collation is specified (without any explicit character set).

  • In older versions of MySQL (< 5.5.3), you'll unfortunately be forced to use plain utf8, which only supports a subset of Unicode characters. I wish I were kidding.
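As a rough illustration of the storage point, the sketch below returns a CREATE TABLE statement that declares utf8mb4 at both the table and column level. The table name articles, its columns, the utf8mb4_unicode_ci collation and the helper name are all assumptions made for the example; adapt them to your own schema.

```php
<?php
// Hypothetical helper: returns DDL for an example table that stores text as utf8mb4.
// The table name, column names and collation are illustrative assumptions.
function utf8mb4_example_table_sql()
{
    return <<<'SQL'
CREATE TABLE articles (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
    body TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
SQL;
}

// Usage, assuming $dbh is an already connected PDO handle:
// $dbh->exec(utf8mb4_example_table_sql());
```

Declaring the character set on the table as well as on each text column is redundant here, but it makes the intent visible if a column is later added without an explicit charset.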

Data access:

  • In your application code (e.g. PHP), in whatever DB access method you use, you'll need to set the connection charset to utf8mb4. This way, MySQL does no conversion from its native UTF-8 when it hands data off to your application and vice versa.

  • Some drivers provide their own mechanism for configuring the connection character set, which both updates the driver's own internal state and informs MySQL of the encoding to be used on the connection. This is usually the preferred approach. In PHP:

    • If you're using the PDO abstraction layer with PHP ≥ 5.3.6, you can specify the charset in the DSN:

      $dbh = new PDO('mysql:charset=utf8mb4');
    • If you're using mysqli, you can call set_charset():

      $mysqli->set_charset('utf8mb4');       // object oriented style
      mysqli_set_charset($link, 'utf8mb4');  // procedural style
    • If you're stuck with plain mysql but happen to be running PHP ≥ 5.2.3, you can call mysql_set_charset.

  • If the driver does not provide its own mechanism for setting the connection character set, you may have to issue a query to tell MySQL how your application expects data on the connection to be encoded: SET NAMES 'utf8mb4'.

  • The same consideration regarding utf8mb4/utf8 applies as above.
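To make the PDO route a little more concrete, here is a minimal sketch that builds a DSN with an explicit charset and shows, in comments, one way to check what MySQL actually negotiated. The helper name, host, database name and credentials are placeholders invented for the example.

```php
<?php
// Hypothetical helper that builds a PDO MySQL DSN with an explicit connection charset.
// The host and database names passed in are placeholders for your own settings.
function build_mysql_dsn($host, $dbname, $charset = 'utf8mb4')
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = build_mysql_dsn('localhost', 'example_db');

// Usage, assuming placeholder credentials and a reachable MySQL server:
// $dbh = new PDO($dsn, 'example_user', 'example_password', array(
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ));
// $charset = $dbh->query('SELECT @@character_set_connection')->fetchColumn();
// If the DSN setting took effect, $charset should be 'utf8mb4'.
```

Querying @@character_set_connection (or running SHOW VARIABLES LIKE 'character_set%') is also a handy first step when troubleshooting a mismatch between what the application sends and what MySQL thinks it is receiving.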

Output:

  • If your application transmits text to other systems, they will also need to be informed of the character encoding. With web applications, the browser must be informed of the encoding in which data is sent (through HTTP response headers or HTML metadata).

  • In PHP, you can use the default_charset php.ini option, or manually issue the Content-Type MIME header yourself, which is just more work but has the same effect.
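For the manual route, something along these lines should work. The helper name is made up for the example, and the snippet assumes an HTML response.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML response.
function html_content_type_header($charset = 'UTF-8')
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Send it before any output; headers can no longer be changed once the body has started.
if (!headers_sent()) {
    header(html_content_type_header());
}
```

Setting default_charset = "UTF-8" in php.ini achieves the same thing for every script without touching the code, which is why it is usually the less error-prone option.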

Input:

  • Unfortunately, you should verify every received string as being valid UTF-8 before you try to store it or use it anywhere. PHP's mb_check_encoding() does the trick, but you have to use it religiously. There's really no way around this, as malicious clients can submit data in whatever encoding they want, and I haven't found a trick to get PHP to do this for you reliably.

  • From my reading of the current HTML spec, the following sub-bullets are not necessary or even valid anymore for modern HTML. My understanding is that browsers will work with and submit data in the character set specified for the document. However, if you're targeting older versions of HTML (XHTML, HTML4, etc.), these points may still be useful:

    • For HTML before HTML5 only: you want all data sent to you by browsers to be in UTF-8. Unfortunately, the only way to reliably do this is to add the accept-charset attribute to all your <form> tags: <form ... accept-charset="UTF-8">.
    • For HTML before HTML5 only: note that the W3C HTML spec says that clients "should" default to sending forms back to the server in whatever charset the server served, but this is apparently only a recommendation, hence the need for being explicit on every single <form> tag.
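A minimal sketch of the validation step, assuming the mbstring extension is loaded. The helper name and the choice to return null for bad input are assumptions; you might prefer to reject the whole request instead.

```php
<?php
// Hypothetical helper: returns the value unchanged if it is a valid UTF-8 string,
// otherwise null so the caller can reject it.
function require_valid_utf8($value)
{
    if (!is_string($value) || !mb_check_encoding($value, 'UTF-8')) {
        return null;
    }
    return $value;
}

$ok  = require_valid_utf8("caf\xC3\xA9");  // "café" encoded as UTF-8
$bad = require_valid_utf8("caf\xE9");      // "café" encoded as ISO-8859-1, not valid UTF-8

// Usage with request data (the field name is a placeholder):
// $name = require_valid_utf8(isset($_POST['name']) ? $_POST['name'] : null);
```

The is_string() check matters because request parameters can arrive as arrays, which should not be passed to the encoding check as if they were text.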

Other code considerations:

  • Obviously enough, all files you'll be serving (PHP, HTML, JavaScript, etc.) should be encoded in valid UTF-8.

  • You need to make sure that every time you process a UTF-8 string, you do so safely. This is, unfortunately, the hard part. You'll probably want to make extensive use of PHP's mbstring extension.

  • PHP's built-in string operations are not UTF-8 safe by default. There are some things you can safely do with normal PHP string operations (like concatenation), but for most things you should use the equivalent mbstring function.

  • To know what you're doing (read: not mess it up), you really need to know UTF-8 and how it works on the lowest possible level. Check out any of the links from utf8.com for some good resources to learn everything you need to know.
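To show the difference between the byte-oriented built-ins and their mbstring counterparts, here is a small sketch (assuming the mbstring extension is loaded; the sample strings are arbitrary):

```php
<?php
$text = "na\xC3\xAFve";  // "naïve" in UTF-8: 5 characters, 6 bytes

$byteLength = strlen($text);              // counts bytes
$charLength = mb_strlen($text, 'UTF-8');  // counts characters

$byteCut = substr($text, 0, 3);              // cuts through the middle of "ï"
$charCut = mb_substr($text, 0, 3, 'UTF-8');  // keeps the first three characters intact

$joined = $text . " caf\xC3\xA9";  // concatenating two valid UTF-8 strings stays valid
```

Passing the encoding explicitly to each mb_* call avoids depending on whatever mb_internal_encoding() happens to be set to on a given server.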

