summaryrefslogtreecommitdiff
path: root/doc/bugs/json_should_be_utf8_regardless_of_locale.mdwn
blob: ac7275a77ab2aa41cafa701aaaaf77820f3cba61 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
json is defined as always utf-8. However, when LANG=C,
git-annex --json currently outputs "file":"���������"
instead of "file":"äöü東" for that utf-8 filename. --[[Joey]]

This can also affect keys when they contain some non-utf8 from eg the
extension. And metadata keys and values can contain non-utf8 and also get
converted to json with similar results.

Note that git-annex can operate on non-utf8 filenames and keys; 
it's not defined what the json contains then, and it currently contains
similar garbage.

This happens because aeson's instance of ToJSON for Char uses
Text.singleton, and Text does not handle ghc's filesystem encoding
for String. Instead it defaults to `\65533` for each byte encoded with the
filesystem encoding.

So, git-annex will need to convert filenames and keys and anything else
that might use the filesystem encoding to Text itself in some
way that does respect the filesystem encoding. Ie, use encodeBS to convert
it to a ByteString and then Data.Text.Encoding.decodeUtf8. 

> [[done]] that. --[[Joey]]

What about git-annex commands that take json as input,
when run in a non-utf8 locale? Tested that, it is handled ok. --[[Joey]]