Ticket #3783 (closed defect: fixed)

Opened 5 months ago

Last modified 2 months ago

mcview: file interpreted as latin1 instead of utf8

Reported by: egmont Owned by: andrew_b
Priority: major Milestone: 4.8.20
Component: mcview Version: 4.8.17
Keywords: Cc:
Blocked By: Blocking:
Branch state: merged Votes for changeset: committed-master

Description

In a fully UTF-8 environment, create this file which contains some Latin-1 characters:

echo $'copyright \xA9 plusminus \xB1' > latin1.txt

Verify that indeed simply sending out the file's contents to the terminal results in replacement symbols, as expected:

$ cat latin1.txt
copyright � plusminus �
$ cat latin1.txt | iconv --from latin1 --to utf8
copyright © plusminus ±

Now execute

mcview latin1.txt

and notice that the file's contents are displayed according to Latin-1, that is:

copyright © plusminus ±

Press Alt-E to confirm that the codeset is indeed UTF-8.

This is incorrect, if UTF-8 is chosen then replacement symbols should be shown instead of the copyright and plus-minus signs.

mcedit, as well as the standard panels (if filenames contain such bytes) don't suffer from this bug.

Attachments

mc-3783-mcview-invalid-utf8.patch (1.3 KB) - added by egmont 3 months ago.

Change History

comment:1 Changed 5 months ago by egmont

Broke somewhere between 4.8.16 and 4.8.17. Running git bisect...

comment:2 Changed 5 months ago by egmont

4d65a731c28d53a536a044a85e82223a7aa46bd5 is the first bad commit

mcview: refactoring of mcview_get_utf().

comment:3 follow-up: ↓ 6 Changed 5 months ago by egmont

@aborodin:

How much do you remember this change? As far as I can see, I believe your only intent was to swap the last (out) parameter and the return value, other than that you meant to leave the behavior unaltered, am I right?

comment:4 Changed 5 months ago by zaytsev

I'm not Andrew, but I think that this was indeed his only intent, only I can't yet see what's wrong here and I need to run to cook for tomorrow... :-/

comment:5 Changed 5 months ago by egmont

I couldn't spot the bug either just by simply looking... will need to dig deeper... but not now :)

comment:6 in reply to: ↑ 3 Changed 5 months ago by andrew_b

Replying to egmont:

How much do you remember this change? As far as I can see, I believe your only intent was to swap the last (out) parameter and the return value, other than that you meant to leave the behavior unaltered, am I right?

Yes, you are. I wanted to make unification of function prototypes: mcview_get_utf() with other mcview_get_*().

comment:7 Changed 5 months ago by egmont

Cool, thanks. I like this intent, the new interface is definitely nicer. I'll try to find where it went wrong (unless someone is faster than me :))

comment:8 Changed 5 months ago by zaytsev

... so, if only there were enough unit tests for this function... ;-) ...

comment:9 Changed 5 months ago by egmont

Rather than (or in addition to) unit tests, it's the docs for a really nontrivial case that's missing ;)

4.8.16:

src/viewer/datasource.c, mcview_get_utf(), the last "if (res < 0)" branch contains:

ch = *str;

str is a signed char* containing the lone invalid UTF-8 byte, e.g. in case of the copyright symbol it's -87. It's then assigned to the signed integer (/me wonders why not gunichar, nevermind) and then returned, so it's still -87.

Then in src/viewer/ascii.c mcview_display_line(), after the "Nonprintable, or lonely spacing mark" comment it is detected as nonprintable and hence replaced with a dot.

4.8.17:

That line was modified to:

*ch = (unsigned char) (*str);

which interprets the lone invalid UTF-8 byte as unsigned 169 and is assigned to the Unicode character, this is semantically the Latin-1 -> Unicode conversion. The rest goes on as if this was read as a valid 1-byte character.

The pattern of denoting and carrying invalid UTF-8 bytes as negative numbers is really weird to me. I'm inclined to say that the old design worked "accidentally". At least I don't recall ever thinking about it during my viewer rewrite.

Removing the explicit cast fixes the bug, since it makes the new code equivalent to the old one. It's good enough for now I guess. Plus, I really think this weird design should be documented :)

Last edited 5 months ago by egmont (previous) (diff)

Changed 3 months ago by egmont

comment:10 Changed 3 months ago by egmont

  • Milestone changed from Future Releases to 4.8.20

comment:11 Changed 3 months ago by andrew_b

  • Status changed from new to accepted
  • Owner set to andrew_b
  • Votes for changeset set to andrew_b
  • Version changed from 4.8.19 to 4.8.17
  • Branch state changed from no branch to on review

Branch: 3783_mcview_invalid_utf8
changeset:5e238fae950cbe3a5cf36aa0393c89b8adee9bc5

comment:12 Changed 2 months ago by andrew_b

  • Branch state changed from on review to approved

comment:13 Changed 2 months ago by andrew_b

  • Status changed from accepted to testing
  • Votes for changeset changed from andrew_b to committed-master
  • Resolution set to fixed
  • Branch state changed from approved to merged

comment:14 Changed 2 months ago by andrew_b

  • Status changed from testing to closed
Note: See TracTickets for help on using tickets.