Detect bad encoding javascript html utf-8

Wednesday, January 31, 2007

(More technical stuff, thanks for your patience.)

Problem: you have a page containing accented characters that is supposed to open in the UTF-8 charset encoding, but sometimes doesn't. (Maybe I'm the only person who ever had this problem; in my case it was the HTML editor Xinha, opened in a popup.)

How can you at least warn your users that there is a problem? Can JavaScript detect the encoding of the page in which it is placed?

Well, no, and yes.

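This is the solution I found, which is rather elegant (I feel):
if ('Ã©'.length == 2) alert('Houston, we have a problem');

(The 'Ã©' above is how the two UTF-8 bytes of 'é', 0xC3 0xA9, look when displayed as single-byte text. If your editor saves the file as UTF-8, type a plain 'é' inside the quotes instead; it ends up as the same two bytes.)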

How does it work?

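I may be wrong, but as I understand it, 'normal' text (a single-byte encoding such as ISO-8859-1) takes up one byte per character. Unicode stored as UTF-16 takes up two bytes for most characters. UTF-8 takes up one byte for plain ASCII characters, but for 'problem' characters such as accented letters it uses two bytes (and up to four for rarer ones).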
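To make those byte counts concrete, here is a minimal sketch. The function names `utf8Bytes` and `utf16Bytes` are mine, not from any library, and it assumes an environment that provides the standard `TextEncoder` (current browsers and Node.js do).

```javascript
// Number of bytes a string occupies when stored as UTF-8.
function utf8Bytes(text) {
  return new TextEncoder().encode(text).length; // TextEncoder always emits UTF-8
}

// Number of bytes when stored as UTF-16.
function utf16Bytes(text) {
  return text.length * 2; // .length counts 16-bit code units
}

// Characters are written as escapes so this file's own encoding does not matter.
const samples = { a: 'a', eAcute: '\u00e9', euro: '\u20ac', emoji: '\u{1F600}' };
for (const [name, ch] of Object.entries(samples)) {
  console.log(name, utf8Bytes(ch), utf16Bytes(ch));
}
```

The 'é' comes out as two bytes in UTF-8, which is the property the test relies on.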

So if a UTF-8 encoded page is wrongly interpreted as being 'normal', an 'é' character (which takes up two bytes) will be interpreted as two one-byte characters.
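That misreading can be simulated outside the browser. The sketch below is only an illustration: `misreadUtf8AsLatin1` is a made-up helper name, and it assumes the standard `TextEncoder` is available. It encodes a string to UTF-8 and then treats every byte as one ISO-8859-1 character.

```javascript
// Encode a string to UTF-8 bytes, then read each byte back as one
// ISO-8859-1 character, which is what a wrongly decoded page ends up with.
function misreadUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text); // always UTF-8
  let out = '';
  for (const b of bytes) {
    out += String.fromCharCode(b); // byte value == ISO-8859-1 code point
  }
  return out;
}

const garbled = misreadUtf8AsLatin1('\u00e9'); // '\u00e9' is 'é'
console.log(garbled.length); // 2: the bytes 0xC3 and 0xA9 became two characters
```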

Conversely, if the browser thinks the page is utf-8, it will interpret the 'Ã©' combination as one character.

So if JavaScript tells you the length is two, you know that the page was not read as UTF-8.
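The check can be wrapped in a small reusable function. This is a sketch under assumptions, not a definitive implementation: `pageLooksMisdecoded` is a hypothetical name, and the optional second argument is my addition. It reads `document.characterSet` (or the legacy alias `document.charset`), which browsers expose as the encoding they actually used for the page.

```javascript
// `probe` should be a literal 'é' typed into a script that is saved as UTF-8.
// `doc` is optional: pass `document` in a browser, or nothing elsewhere.
function pageLooksMisdecoded(probe, doc) {
  // Secondary signal (my addition): the encoding the browser says it used.
  const used = doc && (doc.characterSet || doc.charset);
  if (used && used.toLowerCase() !== 'utf-8') return true;
  // Primary signal from the post: one character has turned into several.
  return probe.length !== 1;
}
```

In a page it could be called as `if (pageLooksMisdecoded('é', document)) alert('Houston, we have a problem');`, with the script itself saved as UTF-8.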

(Last detail, and here I'm in unknown territory: contrary to the postscript in my previous post about utf-8, you shouldn't save web pages as UTF-8 but as normal files, with the charset meta tag set to utf-8. I think I'm right about that, but I have no idea why...)

3 comments:

Anonymous, March 5, 2007 at 7:08 AM
Ah, I know how you got into this problem.... :-)
Cédric, February 4, 2010 at 8:38 AM
The information given in the meta tag or header is only there to tell the browser what to expect.

If you set your meta tag to utf-8 but your file is saved as ANSI, the browser will not be able to display it correctly.
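A quick way to see this effect, as a sketch that assumes the standard `TextDecoder` is available: take the bytes of "café" as an ANSI (ISO-8859-1 / windows-1252) file would store them, and decode them as UTF-8 the way a browser would if the meta tag claimed utf-8.

```javascript
// Byte 0xE9 is 'é' in ISO-8859-1 / windows-1252, but on its own it is
// not a valid UTF-8 sequence.
const ansiBytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]); // "café" saved as ANSI

// Decode as UTF-8; by default invalid input is replaced with U+FFFD.
const shown = new TextDecoder('utf-8').decode(ansiBytes);

console.log(shown === 'caf\ufffd'); // the 'é' was lost to the replacement character
```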
Anonymous, May 4, 2011 at 1:45 PM
astring.match(/[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2}/) is a viable method to detect a UTF-8 string that has been decoded as ISO-8859-1.
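For reference, the pattern above can be wrapped in a helper like this. It is only a sketch: `looksLikeMisdecodedUtf8` is a hypothetical name, and it assumes each UTF-8 byte became exactly one character in the range U+0080 to U+00FF. Browsers generally treat the iso-8859-1 label as windows-1252, where bytes 0x80 to 0x9F map to other code points, so some sequences will be missed. Genuine text that happens to contain such a pair would also be flagged.

```javascript
// Matches any well-formed multi-byte UTF-8 sequence whose bytes were each
// turned into one character (pattern copied from the comment above).
const MISDECODED_UTF8 = /[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2}/;

function looksLikeMisdecodedUtf8(text) {
  return MISDECODED_UTF8.test(text);
}

console.log(looksLikeMisdecodedUtf8('caf\u00c3\u00a9')); // "café" read wrongly
console.log(looksLikeMisdecodedUtf8('caf\u00e9'));       // "café" read correctly
```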