Ticket #3259 (closed defect: fixed)

Opened 3 years ago

Last modified 2 years ago

mcview hex edit: UTF-8 chars not updated

Reported by: egmont Owned by: andrew_b
Priority: minor Milestone: 4.8.14
Component: mcview Version: master
Keywords: Cc:
Blocked By: Blocking:
Branch state: merged Votes for changeset: committed-master


mcview hex edit mode, in fully UTF-8 environment

UTF-8 characters show up nicely in the right column (followed by a space to align with the number of bytes).

However, as soon as you edit the file (either by the hex codes, or by moving to the right column and entering an UTF-8 character), that character doesn't show up there.

Expected: during editing the file, the right hand side should continuously be updated to look exactly as if that was the original contents of the file.


mc-3259-hex-update.patch (2.4 KB) - added by egmont 3 years ago.
Quick fix
mc-3259-hex-update-v2.patch (6.9 KB) - added by egmont 3 years ago.
Fix v2
mc-3259-hex-update-v3.patch (10.1 KB) - added by egmont 3 years ago.
Fix v3

Change History

comment:1 Changed 3 years ago by egmont

hex.c mcview_display_hex() does the following: It takes the next Unicode character from the original file, then overwrites the bytes from the edit buffer and composes the Unicode again. This works as long as character boundaries don't change, but fail when they do.

The logic is incorrect: handling whether the given byte should be taken from the original file or from the edit buffer should be underneath any call that assembles UTF-8 bytes into Unicode.

Changed 3 years ago by egmont

Quick fix

comment:2 Changed 3 years ago by egmont

Quick fix attached.

The intermediate bytes of UTF-8 are skipped because the UTF-8 parser returns -1 and tty_print_anychar() does nothing. This in turn means that the correct attributes (background color or underline) are not applied. This would be nice to fix (although a bit problematic with CJK).

Also, it would be nice to make a difference between continuation bytes of UTF-8 (drawn as spaces, as currently) and actual invalid UTF-8 in the file (should be displayed with dots). This would require to look back by a few bytes from the beginning of the viewport.

Anyway, my patch is already a significant improvement over the current code.

Changed 3 years ago by egmont

Fix v2

comment:3 Changed 3 years ago by egmont

Second fix attached. This one knows if a byte is a contination byte of a previously started UTF-8 sequence (prints space, with a correct attribute if the byte was modified) or invalid UTF-8 (prints dot, again with the correct attribute if the byte was modified). It takes care not to print a space over the second byte of a CJK, as that would wipe out the CJK char.

Changed 3 years ago by egmont

Fix v3

comment:4 Changed 3 years ago by egmont

One more change. If you changed only a single byte of a UTF-8 (e.g. C3 A9 -> C3 AD), only the corresponding cell on the right was marked as modified (in this case "é " -> "í ", whereas only the space was marked as modified [underlined, different color (not really visible in case of a space :)) etc. according to the skin], the preceding letter which changed from é to í was not marked so).

Also, there was no way of marking on the right side whether the second byte of a CJK changed, since the corresponding cell is occupied by the wide character, marking if the first byte was changed.

I modified the code to mark the whole string representing a UTF-8 character (e.g. "í ") as modified if any of its bytes were modified. I think this makes more sense if you're requesting UTF-8 view anyways, and is also consistent in CJK.

If this is not the behavior someone's looking for, they probably don't want UTF-8 either but an 8-bit charset (which they get with Alt+E). Even in UTF-8 mode, the left hand hex view part still shows on a per-byte basis if they were modified. In UTF-8 mode, the right hand side shows characters (rather than bytes), and hence also shows for the characters (rather than bytes) if they were modified.

comment:5 Changed 2 years ago by egmont

Friendly ping... a patch has been attached for 6 months now. Could you please take a look? It would be nice to release this fix along with the mcviewer rewrite.

comment:6 Changed 2 years ago by andrew_b

  • Status changed from new to accepted
  • Owner set to andrew_b
  • Votes for changeset set to andrew_b
  • Branch state changed from no branch to approved
  • Milestone changed from Future Releases to 4.8.14

Thanks! Applied with some trivial modifications: indentation and reduce variable scope.

Branch: 3259_hexedit_utf8

comment:7 Changed 2 years ago by andrew_b

  • Status changed from accepted to testing
  • Votes for changeset changed from andrew_b to committed-master
  • Resolution set to fixed
  • Branch state changed from approved to merged

comment:8 Changed 2 years ago by andrew_b

  • Status changed from testing to closed
Note: See TracTickets for help on using tickets.