unicode - Is there a non-heuristic way of finding the encoding of a string (i.e. a list)?
For IoDevices one can use io:getopts/1, for example, but I couldn't find a method for plain strings.
For example,
ManPage = os:cmd("man ls").
% [76,83,40,49,41,32,32,32,32,32,32,32,32,32,32,32,32,32,32,
%  32,32,32,32,32,32,32,32,32,32|...]
io:format("~p~n",[ManPage]).
% [76,83,40,49,41,(...)
io:format("~ts~n",[ManPage]).
% LS(1) User Commands LS(1)
% NAME
% ls - list directory contents
The documentation on using Unicode in Erlang mentions heuristic ways, but it may be out of date because, according to the examples, io_lib:format/2 with the ~ts control sequence produces UTF-8 output. Trying it with Erlang 18.0:
Bullet = "\x{2022}".
% [8226]
io:format("~ts~n", [Bullet]).
% •
% ok
io:format("~ts~n", ["•"]).
% •
% ok
io_lib:format("~ts~n", [Bullet]).
% [[8226],"\n"]
I know that I can use unicode:characters_to_binary/{1,2,3} because it accepts latin1 or utf8 encoded input and spits out Unicode-encoded output, but I'm curious whether there is another way.
Interestingly, unicode:characters_to_binary/1 works fine whereas unicode:characters_to_list/1 does not (or I'm misusing it).
unicode:characters_to_binary(ManPage).
% <<"LS(1) User Commands LS(1)\n\n\n\nNAME\n "...>>
unicode:characters_to_list(ManPage).
% [76,83,40|...]
unicode:characters_to_list(ManPage, latin1).
% {error,"LS(1) User Commands LS(1", [8208,10,32|...]}
There are only heuristic ways to determine character encoding, unfortunately. There is a brief explanation of why here.
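To make "heuristic" concrete, here is a minimal sketch of the kind of guess one can make from a plain list of integers. The module name, function names, and returned atoms are all invented for this illustration, and the result is only a guess: a list whose elements all fall in 128..255 can be Latin-1 text, UTF-8 bytes, or code points, and nothing in the data settles it.

```erlang
-module(guess_enc).
-export([classify/1]).

%% Heuristic only: report what a flat list of integers *could* be.
%% The atoms returned here are made up for this sketch.
classify(List) when is_list(List) ->
    IsChar = fun(C) -> is_integer(C) andalso C >= 0 end,
    case lists:all(IsChar, List) of
        false -> not_flat_charlist;
        true  -> classify_ints(List)
    end.

classify_ints(List) ->
    Max = lists:max([0 | List]),
    if
        Max < 128 -> ascii;               % same in Latin-1 and UTF-8
        Max > 255 -> unicode_codepoints;  % cannot be a list of bytes
        true      -> bytes_or_codepoints(List)
    end.

%% Every element is in 0..255 and at least one is >= 128: ambiguous.
%% If the bytes happen to decode as UTF-8, that is the likelier reading.
bytes_or_codepoints(List) ->
    case unicode:characters_to_list(list_to_binary(List), utf8) of
        Decoded when is_list(Decoded) -> probably_utf8_bytes;
        _ErrorOrIncomplete            -> latin1_or_codepoints
    end.
```

By this reading, the man page list in the question would come out as unicode_codepoints, because it contains 8208, which is consistent with the latin1 conversion failing at exactly that element.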
That said, in the particular case you specify above, the real question is what encoding the system (not Erlang) shell is set to. You can find that out by checking the environment directly (though this is going to be a platform-specific solution; I'm writing this on a Debian-derived system that uses Bash):
1> Lang = os:cmd("echo $LANG").
"ja_JP.UTF-8\n"
2> {_, Enc} = lists:split(6, Lang).
{"ja_JP.","UTF-8\n"}
3> Encoding = string:strip(Enc, right, $\n).
"UTF-8"
This is, however, a rather crap solution. It is totally non-portable, and there is no guarantee that the environment follows the rules and puts a 5-character language/region, a dot, and then the encoding in the $LANG environment variable. I'm pretty sure this doesn't work, for example, on at least some versions of Solaris, and on AIX I think the way to get at the encoding is checking $LC_CTYPE or similar (or maybe that's backwards... or... see, the fact that I don't remember the quirk is indication enough that this is unreliable).
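A slightly less brittle way to pull the encoding out of such a value is to split on the dot instead of counting six characters. The sketch below assumes the common language[_territory][.codeset][@modifier] layout; the module and function names are invented for the example, and it does nothing about the portability problem itself (a value like "C" simply yields undefined).

```erlang
-module(locale_name).
-export([codeset/1]).

%% Extract the codeset part of a locale name such as "ja_JP.UTF-8".
%% Returns undefined when the value carries no codeset.
codeset(false) ->
    undefined;                              % os:getenv/1 result when unset
codeset(Locale) when is_list(Locale) ->
    %% Cut at a trailing newline or at an "@modifier" suffix.
    Keep = fun(C) -> C =/= $@ andalso C =/= $\n end,
    Name = lists:takewhile(Keep, Locale),
    %% Whatever follows the first dot is taken as the codeset.
    case lists:dropwhile(fun(C) -> C =/= $. end, Name) of
        [$. | Codeset] when Codeset =/= [] -> Codeset;
        _                                  -> undefined
    end.
```

It can be fed either the os:cmd/1 output above or os:getenv("LANG") directly, which avoids spawning a shell just to read a variable. The codeset is returned as written, so "utf8" and "UTF-8" are not normalized.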
Another way is to use the locale command and have it give you the charset directly:
4> os:cmd("locale charmap").
"UTF-8\n"
That trailing newline is annoying me, so...
5> string:strip(os:cmd("locale charmap"), right, $\n).
"UTF-8"
That said, the locale command does not exist everywhere. In any case, a combination of checking the locale output and the environment variables should do the trick, though to make it portable you will need to arm your system with a few ways of doing this. Fortunately most systems are UTF-8 by default now, with the exception of Windows, but at least Windows is mostly internally standardized.
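As a rough sketch of what "a few ways" could look like, the following tries locale charmap when the command is on the PATH and otherwise reports the first locale variable that is set. The module name and the shapes of the return values are invented here. The lookup order LC_ALL, LC_CTYPE, LANG follows the usual POSIX precedence, and none of this is expected to help on Windows.

```erlang
-module(sys_charset).
-export([detect/0]).

%% Best-effort lookup; the return values are made up for this sketch:
%%   {charmap, "UTF-8"} | {locale, VarName, Value} | unknown
detect() ->
    Vars = ["LC_ALL", "LC_CTYPE", "LANG"],
    case os:find_executable("locale") of
        false ->
            from_env(Vars);
        _Path ->
            %% os:cmd/1 may also return error text; keep the first line only.
            case first_line(os:cmd("locale charmap")) of
                ""      -> from_env(Vars);
                Charmap -> {charmap, Charmap}
            end
    end.

%% Report the first variable that is set to a non-empty value.
from_env([]) ->
    unknown;
from_env([Name | Rest]) ->
    case os:getenv(Name) of
        false -> from_env(Rest);
        ""    -> from_env(Rest);
        Value -> {locale, Name, Value}
    end.

first_line(Str) ->
    lists:takewhile(fun(C) -> C =/= $\n andalso C =/= $\r end, Str).
```

The {locale, ...} result still has to be parsed, and the value it carries is whatever the environment claims, which may not match the bytes a given program actually emits.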
(If you're dealing with man pages... keep in mind that man pages have control characters embedded in them as markup, and while the text-only output of a man page is what you would expect, the actual manpage data as interpreted by man is marked up. Depending on what you are doing, it may be easier to manipulate the manpage archive data directly.)