Ticket #3843 (closed defect: fixed)
mcedit, 8-bit terminal, encoding=UTF-8: characters between 0x80 and 0xff are broken
Reported by: | lzsiga | Owned by: | zaytsev |
---|---|---|---|
Priority: | major | Milestone: | 4.8.20 |
Component: | mcedit | Version: | master |
Keywords: | mcedit utf8 | Cc: | egmont |
Blocked By: | Blocking: | ||
Branch state: | merged | Votes for changeset: | committed-master |
Description
Hi, this problem is present in mc-4.8.19 (reproduced on Debian/Linux? and AIX), it affects the editor.
Problem description:
I use a 8-bit terminal-emulator with LC_CYTPE=hu_HU.ISO-8859-2 (for Hungarian language), and it works properly; in mcedit's 'Choose encoding' dialog I select 'UTF-8' (as I want to create a file in UTF-8).
Then I try some accented letters in the editor, such as á é ő ű. They are all in ISO-8859-2 (codes e1, e9, f5, fb), but only first two are in ISO-8859-1; the unicodes are: U+E1, U+E9, U+0151, U+0171
The problem is that only the two latter characters are properly displayed and stored in file; for the two former, editor displays dots instead of them; and saving into file, instead of UTF8-sequences (c3e1, c3e9) it stores single-bytes (the ISO-8859-2 codes: e1 e9)
I think I found the source of the problem in src/editor/edit.c, line 3559
if (char_for_insertion > 255 && !mc_global.utf8_display)
It ignores characters between 128 and 255 even if 'UTF-8' is selected (it is mc_global.source_codepage==12 in my case)
The change I suggest is this:
if ((char_for_insertion > 255 || (char_for_insertion > 127 && str_isutf8 (get_codepage_id (mc_global.source_codepage)))) && !mc_global.utf8_display) {
I tested it on linux and AIX in different contexts (8-bit emulator vs unicode-emulator; 8-bit file-encoding vs unicode), and it seemed working in all cases.
(I admit, the method of checking whether mc_global.source_codepage is UTF-8 or not is a bit clumsy, but I couldn't find a simpler method.)
Attachments
Change History
Changed 7 years ago by lzsiga
- Attachment edit.patch added
comment:1 Changed 7 years ago by zaytsev
- Cc egmont added
- Priority changed from trivial to major
At first I was skeptical about this patch, but then it occurred to me that it actually does make complete sense. If target encoding is UTF-8, then everything above 0x9F must be encoded, even if it has the code point < 255 and representable in the current encoding as a character between 0xA0 and 0xFF.
I also managed to reproduce it using CP1251 (Cyrillic) locale, there characters « & » are representable in the higher table and get inserted literally when target is UTF-8, although they should be encoded.
Egmont, WDYT?
comment:2 Changed 7 years ago by egmont
I can take a look next week. Feel free to ping me if I forget about it.
comment:3 Changed 7 years ago by zaytsev
- Milestone changed from Future Releases to 4.8.20
Ok, then I'm retargeting to this milestone, so that I won't forget to ping you :)
comment:4 Changed 7 years ago by zaytsev
I've tested the patch on master while I was at it, and it works, but now I'm wondering whether the original condition makes sense at all??? I have no idea how the editor works, but it feels like if charset conversions are enabled AND current display encoding is NOT UTF-8, it encodes everything >255 as UTF-8 no matter what target encoding is.
To my mind, the logical behavior should be actually encode >127 as UTF-8 if the target encoding is UTF-8, and if it's singlebyte (we don't have other multibytes on the list), insert a replacement character (\0 ?) for everything >255. I still don't completely understand what it has to do with display encoding, but sadly no more time to investigate now...
comment:5 Changed 7 years ago by zaytsev
Okay, I guess checking for display encoding has to do with input already being UTF-8, so the correct patch seems to be as follows. Still, apparently there is an independent bug:
- Switch locale to CP1251
- Edit a file with a target encoding of UTF-8
- Position the cursor at the end of the file
- Insert characters
- They are inserted at the wrong place, because the bytes are not counted correctly.
Now off to work!
comment:6 Changed 7 years ago by lzsiga
Hi,
I'd like attach a table about the four possibilities when arriving to the line in question (editor.c:3359)
terminal file mc_global. mc_global. char_to_insert .utf8_display .source_codepage -------- ---- ------------- ---------------- -------------- 8-bit 8-bit 0 2 iso-code 00..ff 8-bit utf8 0 12 unicode 00..10ffff utf8 8-bit 1 2 iso-code 00..ff utf8 utf8 1 12 utf8 1-4 bytes 00..ff
In the last case function edit_execute_cmd is called 1-4 times
(once for each byte of the utf8-sequence)
comment:7 Changed 7 years ago by zaytsev
Egmont, ping! I think my patch is in line with the table by lzsiga.
comment:9 Changed 7 years ago by egmont
Sorry I'm absolutely overloaded these days. Do you have any particular concern or question? If you're confident enough with your patch, just would prefer to have a second eye take a look at it, you should rather go ahead and commit :)
comment:10 Changed 7 years ago by zaytsev
Sure I have a particular concern: I think the patch is fine, but I completely lack the understanding of how the editor is supposed to work, so I feel uncomfortable committing it without a second opinion.
If I read the code correctly, currently, for code points > 255, if the current display encoding is NOT utf-8, they are encoded as utf-8, no matter what is actually the target encoding of the file being edited. This sounds crazy to me for two reasons:
- Why encoding into utf-8 if target encoding is NOT utf-8?
- Even if you decide to encode into utf-8, why miss code points 127 < x <= 255?
My patch changes it such that code points > 127 will be encoded as utf-8 only if the target encoding is utf-8 AND display encoding is NOT utf-8 (because otherwise they would have been already encoded).
It is not clear to me what exactly would happen under other scenarios after my changes, but I hope that at least it will not make the matters worse than it already is.
It's on those two points (did I fix it correctly for non-utf displays and did I break anything else) that I'd like to get some input.
comment:11 Changed 7 years ago by zaytsev
Hey Egmont, I'm just back from vacation, is it looking any better for you now and maybe you could have a look into this? --Yury
comment:12 Changed 7 years ago by egmont
Hi guys,
Truly sorry for the delay.
Slightly irrelevant: the opening post says
in mcedit's 'Choose encoding' dialog I select 'UTF-8'
for me that's the default there which I don't think makes sense. If someone fires up mcedit in a Latin2 environment, newly created files should be encoded in Latin2 by default, not UTF-8. What do you think?
Back to our main topic:
lzsiga, you made an excellent investigation, including the table in comment 6 (thanks a lot / köszi szépen! :)) As its last column shows, the current situation is already pretty messed up. Like, char_to_insert is sometimes the Unicode (UCS) value, sometimes each byte of the UTF-8 separately one after the other; it's such a mixture of semantics that shouldn't be done in nice code written from scratch. I know it's lot of legacy code from the 8-bit days with a (relative to that) amazing UTF-8-ification at some point.
I'm inclined to say that this code (and probably its greater area - maybe entire mc? haha) probably deserves a big UTF-8 cleanup. Or rewrite from scratch :P But not now.
As for quick fix, edit2.patch seems to be okay for me and seems to do the best we can do there locally.
One thing though: I'd definitely add a comment there pointing to this ticket.
Thanks a lot to all of you guys for working on this bug!
comment:13 Changed 7 years ago by egmont
Note:
In 8-bit mode (when both the terminal and the file encoding are 8-bit), for the input stream, mc does not follow the "new" philosophy of Unicode that it's the human-facing letter that matters. It still follows the "old" philosophy that just thinks in terms of bytes.
This may or may not be a good thing. Especially that it seems to follow the "new" philosophy for the output.
Example: Set up a Latin2 environment as described in the original post. Edit a file whose encoding you set to Latin1.
Type letters that differ in the two, e.g. ő (o double accent; U+151 aka 0xF5 in Latin2). In the screen ^A appears, for whatever reason this seems to denote characters that are unrepresentable (a 0xF5 byte made it into the file, which is õ (o tilda; U+F5 aka 0xF5 in Latin1) which is not representable in the current Latin2 terminal.
If you switch the encoding to Latin2, ő (o double accent) will appear.
Regardless of whether you switched the encoding or not, upon saving the file 0xF5 will be placed there.
So, when shown on the screen, it's the letter that matters (it could just transfer a 0xF5 as a 0xF5, but no, it's clever to know that it represents a different glyph due to different encoding, and hence it uses a replacement symbol). This logic is not used for the input though, the arriving byte is taken as-is.
comment:14 Changed 7 years ago by egmont
... it does perform the mapping on the input though if the same character is available in the other charset as well, just at a different position (e.g. replace Latin1 by CP852 in the above example).
comment:15 Changed 7 years ago by zaytsev
- Owner set to zaytsev
- Status changed from new to accepted
- Branch state changed from no branch to on review
Branch: 3843_8bit_to_utf8_broken
Initial changeset:5ee452eba93755bc7c2d0fb1e4226e8a2c56d64b
comment:16 follow-up: ↓ 17 Changed 7 years ago by zaytsev
Hey Egmont, thank you very much for looking into this! I agree that the editor deserves a rewrite (hmmm, who did one for the viewer, do you happen to know him ;-) ?), and there is this topic of syntax highlighting (and now yet another crazy idea, maybe offload it to Lua ;-) )... but with my new job I can't even manage to find enough time to review Andrew's changes and help him more with maintenance, not speaking of larger projects :-( In the mean time, I understood you're of the opinion my patch is "safe" to commit, hence the branch.
comment:17 in reply to: ↑ 16 Changed 7 years ago by egmont
Replying to zaytsev:
I agree that the editor deserves a rewrite (hmmm, who did one for the viewer, do you happen to know him ;-) ?)
Don't even think about it :P :D
In the mean time, I understood you're of the opinion my patch is "safe" to commit, hence the branch.
Yup, definitely, thanks!
I'll open a new bug for my additional findings.
comment:18 Changed 7 years ago by zaytsev
- Votes for changeset set to egmont
- Branch state changed from on review to approved
comment:19 Changed 7 years ago by zaytsev
- Status changed from accepted to testing
- Votes for changeset changed from egmont to committed-master
- Resolution set to fixed
- Branch state changed from approved to merged
comment:20 Changed 7 years ago by zaytsev
- Status changed from testing to closed
I'll open a new bug for my additional findings.
See also the multibyte bug I found & described above, but maybe not worth spending the energy on it and I don't have any to spend anyways :(
suggested change in src/editor/edit.c