encodings-discussion.txt

download original
08:47 < multi_io> what kind of encoding is this: utf-8, with each multi-byte
                  sequence preceded by c2 81?
08:51 < dalias> wtf
08:51 < dalias> c2 81 = U+0081
08:51 < dalias> which is a control character
08:51 < multi_io> yeah, Emacs saved the file that way
08:52 < dys`> maybe an utf-8 multi-byte sequence interpreted as latin-1 and
              then encoded in utf-8 a second time?
08:52 < dalias> dys`, nope, shouldn't cause that
08:52 < dalias> multi_io, can you give a better description / more context?
08:52 < dalias> if you described it slightly wrong dys` might be correct
08:54 < multi_io> I don't know the exact circumstances, unfortunately
08:55 < dalias> i mean just show me some more bytes of context around the
                corrupt data
08:56 < multi_io> well, the locale was en_US.UTF-8, the file was only ever
                  edited with Emacs, no other tools, and I didn't do anything
                  unusual [tm]
08:56 < multi_io> dalias: wait
08:56 < multi_io> dalias: can you read German?
08:56 < dalias> i can't tell the meaning, but i can examine the characters..
08:57 < dalias> and i might could guess the meaning
09:02 < multi_io> well, dys` is prolly right somewhat
09:03 < dalias> if he's right then repairing it is not so hard
09:03 < dalias> but i could tell you better if i could see an excerpt
09:03 < multi_io> the word is "Einflussgrößen" ("influencing variables"), the
                  sequence is 45 69 6e 66 6c 75 73 73 67 72 c2 81 c3 b6 c2 81
                  c3 9f 65 6e
09:03 < multi_io> without the two c2 81's that would be the correct
                  utf-8 AFAICT
09:04 < dalias> hmm it looks like your system is NOT utf-8
09:04 < dalias> because the text you sent over irc was latin1..
09:04 < multi_io> um yeah, I typed it manually
09:04 < dalias> ?
09:05 < multi_io> oh, that you mean
09:05 < dalias> ya
09:05 < multi_io> nah, the IRC client runs on a FreeBSD with locale
                  de_DE.ISO8859-15
09:05 < multi_io> but that's not where the Emacs runs
09:05 < dalias> ah
09:07 < multi_io> I think I did this: I copied the word from a file that
                  already looked broken in Emacs into a new buffer to "examine
                  it seperately"! When saving that new buffer, the c2 81's
                  appeared
09:07 < dalias> ah
09:08 < dalias> well what are you trying to do now?
09:08 < dalias> just discover what happened?
09:08 < dalias> or fix important data?
09:09 < multi_io> in the original file, the word was encoded as 45 69 6e 66 6c
                  75 73 73 67 72 81 f6 81 df 65 6e
09:09 < multi_io> which is totally bogus, AFAICS
09:10 < multi_io> it's not legal UTF-8 according to
http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder
09:10 < dalias> that has nothing to do with utf-8
09:10 < dalias> it's latin-1 with bogus 0x81 characters inserted
09:12 < multi_io> I copied the file from another system (on which it looks good
                  in Emacs), then opened it here in Emacs, and the word is
                  displayed as Einflussgr\201ö\201ßen
09:12 < multi_io> it's not important data, it'S just an annoyance :-P
09:15 < dys`> i get the above byte sequence by yanking "Einflussgrößen" into a
              new buffer and saving it with C-x RET c emacs-mule RET C-x C-s
09:16 < dalias> multi_io, bleh send utf-8 on irc not latin-1 :P
09:16 < dalias> anyway
09:16 < dalias> i figured out what the encoding is
09:16 < dalias> 0x81 is the C1 control code for "high octet present"
09:17 < dalias> that is, in a (nasty broken legacy) environment where bit 8
                means meta, 0x81 is the meta-control code indicating that the
                next byte should be interpreted as a character with the high
                bit set, and not as meta+lowchar
09:18 < multi_io> dys`: thanks, I can reproduce that
09:19 < dalias> ah does emacs-mule encoding use that nastiness?
09:19 < dys`> dalias: looks like it
09:19 < multi_io> dalias: thanks. Does that mean that "emacs-mule" is a "nasty
                  broken legacy environment" ? :-P
09:19 < dalias> yes
09:19 < dalias> very much so
09:25 < multi_io> alright then, thanks. Now, could I convert that file from
                  emacs-mule to UTF-8?
09:32 < dalias> multi_io, sure i don't see why not
09:32 < dalias> as long as you make sure emacs _loads_ it as emacs-mule encoding
09:33 < dalias> and not as latin-1
09:33 < dalias> if emacs loads it as latin-1, then 0x81 is just U+0081
09:34 < dalias> and it will get saved like that when you save as utf-8
09:35 < multi_io> heh, opening it with C-x RET c emacs-mule C-x C-f filename
                  works
09:35 < dalias> :)
09:35 < dalias> and please, let's let emacs-mule encoding die... :) we have
                utf-8 for a reason
09:36 < multi_io> well, I never requested it!  I use a UTF-8 locale like a good
                  boy :-P
09:36 < dalias> heh
09:37 < dalias> the history of character encoding is really a nasty matter
09:38 < dalias> back around the beginning of the 90's there were two equally
                idiotic philosophies
09:38 < dalias> the iso-2022/mule/etc. one of using escape codes and stateful
                encoding to "switch" between different national character sets
09:39 < dalias> that was also the ucs(iso10646) philosophy at the time
09:40 < dalias> and the unicode philosophy, of throwing away the whole internet
                and unix and whatnot and redefining "char" to be 16bit
09:41 < dalias> all it took to fix the idiocy of both was the brilliance of ken
                thompson
09:41 < dalias> but both continued to fight against sanity and simplicity for
                the next ~10 years...
09:42 < dalias> it's a real shame. had they realized back then, linux would
                have been all-utf-8 from day one
09:42 < dalias> oh i'm sorry, day 365 or so
09:43 < dalias> linux is about a year older than utf-8
back to emacs