Ticket #3783 (closed defect: fixed)
mcview: file interpreted as latin1 instead of utf8
Reported by: | egmont | Owned by: | andrew_b |
---|---|---|---|
Priority: | major | Milestone: | 4.8.20 |
Component: | mcview | Version: | 4.8.17 |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Branch state: | merged | Votes for changeset: | committed-master |
Description
In a fully UTF-8 environment, create this file which contains some Latin-1 characters:
echo $'copyright \xA9 plusminus \xB1' > latin1.txt
Verify that indeed simply sending out the file's contents to the terminal results in replacement symbols, as expected:
$ cat latin1.txt copyright � plusminus � $ cat latin1.txt | iconv --from latin1 --to utf8 copyright © plusminus ±
Now execute
mcview latin1.txt
and notice that the file's contents are displayed according to Latin-1, that is:
copyright © plusminus ±
Press Alt-E to confirm that the codeset is indeed UTF-8.
This is incorrect, if UTF-8 is chosen then replacement symbols should be shown instead of the copyright and plus-minus signs.
mcedit, as well as the standard panels (if filenames contain such bytes) don't suffer from this bug.
Attachments
Change History
comment:2 Changed 8 years ago by egmont
4d65a731c28d53a536a044a85e82223a7aa46bd5 is the first bad commit
mcview: refactoring of mcview_get_utf().
comment:3 follow-up: ↓ 6 Changed 8 years ago by egmont
@aborodin:
How much do you remember this change? As far as I can see, I believe your only intent was to swap the last (out) parameter and the return value, other than that you meant to leave the behavior unaltered, am I right?
comment:4 Changed 8 years ago by zaytsev
I'm not Andrew, but I think that this was indeed his only intent, only I can't yet see what's wrong here and I need to run to cook for tomorrow... :-/
comment:5 Changed 8 years ago by egmont
I couldn't spot the bug either just by simply looking... will need to dig deeper... but not now :)
comment:6 in reply to: ↑ 3 Changed 8 years ago by andrew_b
Replying to egmont:
How much do you remember this change? As far as I can see, I believe your only intent was to swap the last (out) parameter and the return value, other than that you meant to leave the behavior unaltered, am I right?
Yes, you are. I wanted to make unification of function prototypes: mcview_get_utf() with other mcview_get_*().
comment:7 Changed 8 years ago by egmont
Cool, thanks. I like this intent, the new interface is definitely nicer. I'll try to find where it went wrong (unless someone is faster than me :))
comment:8 Changed 8 years ago by zaytsev
... so, if only there were enough unit tests for this function... ;-) ...
comment:9 Changed 8 years ago by egmont
Rather than (or in addition to) unit tests, it's the docs for a really nontrivial case that's missing ;)
4.8.16:
src/viewer/datasource.c, mcview_get_utf(), the last "if (res < 0)" branch contains:
ch = *str;
str is a signed char* containing the lone invalid UTF-8 byte, e.g. in case of the copyright symbol it's -87. It's then assigned to the signed integer (/me wonders why not gunichar, nevermind) and then returned, so it's still -87.
Then in src/viewer/ascii.c mcview_display_line(), after the "Nonprintable, or lonely spacing mark" comment it is detected as nonprintable and hence replaced with a dot.
4.8.17:
That line was modified to:
*ch = (unsigned char) (*str);
which interprets the lone invalid UTF-8 byte as unsigned 169 and is assigned to the Unicode character, this is semantically the Latin-1 -> Unicode conversion. The rest goes on as if this was read as a valid 1-byte character.
The pattern of denoting and carrying invalid UTF-8 bytes as negative numbers is really weird to me. I'm inclined to say that the old design worked "accidentally". At least I don't recall ever thinking about it during my viewer rewrite.
Removing the explicit cast fixes the bug, since it makes the new code equivalent to the old one. It's good enough for now I guess. Plus, I really think this weird design should be documented :)
comment:11 Changed 8 years ago by andrew_b
- Owner set to andrew_b
- Status changed from new to accepted
- Votes for changeset set to andrew_b
- Version changed from 4.8.19 to 4.8.17
- Branch state changed from no branch to on review
Branch: 3783_mcview_invalid_utf8
changeset:5e238fae950cbe3a5cf36aa0393c89b8adee9bc5
comment:13 Changed 8 years ago by andrew_b
- Status changed from accepted to testing
- Votes for changeset changed from andrew_b to committed-master
- Resolution set to fixed
- Branch state changed from approved to merged
Merged to master: [ed0ed4ac7212c4704a4a59582b52991fd24816f0].
comment:15 Changed 7 years ago by zaytsev
Ticket #3844 has been marked as a duplicate of this ticket.
Broke somewhere between 4.8.16 and 4.8.17. Running git bisect...