unicode - Is there a non-heuristic way of finding the encoding of a string (i.e. list)?

- April 15, 2012

For IoDevices one can use io:getopts/1, for example, but I couldn't find a method for plain strings.

For example,

    Manpage = os:cmd("man ls").
    %   [76,83,40,49,41,32,32,32,32,32,32,32,32,32,32,32,32,32,32,
    %   32,32,32,32,32,32,32,32,32,32|...]

    io:format("~p~n",[Manpage]).
    %   [76,83,40,49,41,(...)

    io:format("~ts~n",[Manpage]).
    %   LS(1)                   User Commands         LS(1)
    %   NAME
    %          ls - list directory contents

The documentation on using Unicode in Erlang mentions heuristic ways, but it may be out of date, because according to its examples io_lib:format/2 with the ~ts control sequence produces UTF-8 output. Trying it with Erlang 18.0:

    Bullet = "\x{2022}".
    %   [8226]

    io:format("~ts~n", [Bullet]).
    %   •
    %   ok
    io:format("~ts~n", ["•"]).
    %   •
    %   ok

    io_lib:format("~ts~n", [Bullet]).
    %   [[8226],"\n"]

I know I could just use unicode:characters_to_binary/{1,2,3}, because it accepts both latin1 and utf8 encoded input and spits out Unicode encoded output, but I'm curious whether there is another way.

Interestingly, unicode:characters_to_binary/1 works fine whereas unicode:characters_to_list/1 does not (or I'm just misusing it).

    unicode:characters_to_binary(Manpage).
    %   <<"LS(1)   User Commands   LS(1)\n\n\n\nNAME\n  "...>>

    unicode:characters_to_list(Manpage).
    %   [76,83,40|...]

    unicode:characters_to_list(Manpage, latin1).
    %   {error,"LS(1)   User Commands  LS(1",      [8208,10,32|...]}

There are only heuristic ways to determine a character encoding, unfortunately. There is a brief explanation of why here.
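To make "heuristic" concrete, here is a minimal sketch of the kind of guess one can make for a flat list of integers. The module name `enc_guess`, the function `guess/1` and the returned atoms are made up for illustration and are not part of OTP. It only uses `list_to_binary/1` and `unicode:characters_to_list/2`, and it cannot prove anything: a byte sequence that happens to be valid UTF-8 may still have been meant as Latin-1.

```erlang
-module(enc_guess).
-export([guess/1]).

%% Hypothetical helper: classifies a flat list of integers.
%% This is a guess, not a proof of the encoding.
guess(List) when is_list(List) ->
    IsByte = fun(C) -> is_integer(C) andalso C >= 0 andalso C =< 255 end,
    case lists:all(IsByte, List) of
        false ->
            %% Some element is not a byte (for example 8226), so the list
            %% cannot be raw encoded bytes; it may already be code points.
            not_bytes;
        true ->
            case lists:all(fun(C) -> C < 128 end, List) of
                true ->
                    %% Pure ASCII reads the same in Latin-1 and UTF-8.
                    ascii;
                false ->
                    %% Try to decode the bytes as UTF-8. Errors and
                    %% incomplete sequences fall through to the Latin-1 guess.
                    case unicode:characters_to_list(list_to_binary(List), utf8) of
                        Decoded when is_list(Decoded) -> probably_utf8;
                        _ -> probably_latin1
                    end
            end
    end.
```

For instance, `enc_guess:guess([226,128,162])` (the three UTF-8 bytes of the bullet character) gives `probably_utf8`, while `enc_guess:guess([8226])` gives `not_bytes`, because 8226 does not fit in a byte.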

That said, in the particular case you describe above, the real question is what encoding the system shell (not Erlang) is set to. You can find this out by checking the environment directly (though this is going to be a platform-specific solution; I'm writing this on a Debian-derived system that uses Bash):

    1> Lang = os:cmd("echo $LANG").
    "ja_JP.UTF-8\n"
    2> {_, Enc} = lists:split(6, Lang).
    {"ja_JP.","UTF-8\n"}
    3> Encoding = string:strip(Enc, right, $\n).
    "UTF-8"

This is, however, a rather crap solution. It is totally non-portable, and there is no guarantee that the environment follows the rules and puts a 5-character language/region, a dot, and the encoding in the $LANG environment variable. I'm pretty sure this doesn't work, for example, on at least some versions of Solaris, and on AIX I think the way to get at the encoding is by checking $LC_CTYPE or something similar (or maybe that's backwards... or... see, the fact that I don't remember the quirk here is indication enough that this is unreliable).

Another way is to use the locale command and have it give you the charset directly:

    4> os:cmd("locale charmap").
    "UTF-8\n"

That trailing newline is annoying me, so...

    5> string:strip(os:cmd("locale charmap"), right, $\n).
    "UTF-8"

That said, the locale command does not exist everywhere. In any case, some combination of checking the locale output and the environment variables should do the trick, though to make this portable you need to arm your system with a few ways of doing it. Fortunately most systems are UTF-8 by default now, with the exception of Windows, but at least Windows is mostly internally standardized.
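One way such a combination could look is sketched below. The module name `sys_charset` and its functions are hypothetical, and the sketch assumes OTP 20 or later for `string:split/2` and `string:trim/1` (on Erlang 18 you would need `string:tokens/2` and `string:strip/3` instead). It checks LC_ALL, LC_CTYPE and LANG in the usual POSIX order of precedence, and only then falls back to `locale charmap` if the command can be found. It is still a guess about the environment, not about any particular string.

```erlang
-module(sys_charset).
-export([charset/0, from_locale_var/1]).

%% Hypothetical helper: returns a charset name such as "UTF-8",
%% or the atom 'unknown' when nothing usable is found.
charset() ->
    Vars = ["LC_ALL", "LC_CTYPE", "LANG"],
    case first_charset([os:getenv(V) || V <- Vars]) of
        unknown -> from_locale_cmd();
        Charset -> Charset
    end.

%% Walk the variable values in order; os:getenv/1 gives 'false' if unset.
first_charset([]) -> unknown;
first_charset([false | Rest]) -> first_charset(Rest);
first_charset([Value | Rest]) ->
    case from_locale_var(Value) of
        unknown -> first_charset(Rest);
        Charset -> Charset
    end.

%% Take the part after the first dot and drop any "@modifier" suffix.
%% Values without a dot (such as "C") carry no charset.
from_locale_var(Value) ->
    case string:split(Value, ".") of
        [_, Rest] ->
            case hd(string:split(Rest, "@")) of
                "" -> unknown;
                Charset -> Charset
            end;
        _ -> unknown
    end.

%% Fallback: ask the locale command, but only if it exists on this system.
from_locale_cmd() ->
    case os:find_executable("locale") of
        false -> unknown;
        _Path ->
            case string:trim(os:cmd("locale charmap")) of
                "" -> unknown;
                Out -> Out
            end
    end.
```

Returning `unknown` instead of defaulting to UTF-8 leaves it to the caller to decide what to assume on systems where neither source gives an answer.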

(If you're dealing with man pages... keep in mind that man pages have control characters embedded in them as markup, so while the text-only output of a man page is what you expect, the actual manpage data that man interprets is marked up. Depending on what you are doing, it may be easier to manipulate the manpage archive data directly.)
