id summary reporter owner description type status priority milestone component version resolution keywords cc blockedby blocking branch_state votes 3257 Charset handling buggy and extremely complicated egmont "With --enable-charset, set your terminal and locale to some 8-bit non-ISO (e.g. KOI8-* CP* or Windows*), and in mcview set the file's encoding to the same. Chars 128-159 are replaced by dot, even though these are printable characters in these encodings. mcview (plain.c) calls is_printable() which in turn calls is_8bit_printable() which calls (assuming xterm-like emulator) is_iso_printable which hardwires that 128-159 are non-printable. It's a faulty design: the method is_printable() tries to answer the question without having a clue about the character set. --------------------------------------------------------------------------------------------- The underlying cause is a very complicated code where there are micro-branches handling 5 different cases. E.g. from mcview plain.c: {{{ #ifdef HAVE_CHARSET if (mc_global.utf8_display) { if (!view->utf8) c = convert_from_8bit_to_utf_c ((unsigned char) c, view->con if (!g_unichar_isprint (c)) c = '.'; } else if (view->utf8) c = convert_from_utf_to_current_c (c, view->converter); else { c = convert_to_display_c (c); if (!is_printable (c)) c = '.'; } #else /* HAVE_CHARSET */ if (!is_printable (c)) c = '.'; #endif /* HAVE_CHARSET */ }}} It's very complicated and error prone. The 5 totally separate cases are: - mc with charset support: - (1) file is utf8, display is utf8 - (2) file is utf8, display is 8bit - (3) file is 8bit, display is utf8 - (4) file is 8bit, display is 8bit (but they can be different encodings, a 256-byte conversion table is used) - (5) mc without charset support (bytes from the file are directly passed thru to the terminal) This is not the only place where such a terrible #ifdef and if() branching is done, there are plenty, they are hard (practically impossible) to test, and they are apparently buggy. This approach also implies that you cannot use handy methods like g_unichar_whatever anywhere outside of such ifdef/if's because basically whenever you have an ""int c"", you have no clue which character it represents. Even figuring out if it's printable would need to carry its charset as extra context, defeating the purpose of Unicode. (In #3250 I often need to check things like whether the character is CJK or combining, these influence line wrapping and such. I can't do with a single g_unichar_xx() call, I need to write helper methods for each desired g_unichar_xx() with #ifdef/if()s.) My recommendation is to simplify to at most 2 cases (by merging (1)-(4)), but preferably to only 1 case (by removing (5) if feasible), by forcing Unicode internally. This could be done if instead of a 1-phase conversion we went for a 2-phase conversion. Most data has the lifecycle that first we receive it from some source of arbitrary encoding (e.g. file name, file contents), then process it, then finally display it (in some other arbitrary encoding). Doing a one-shot conversion means that no matter where you do it, your data might be in any of multiple encodings while you process it and you'd need to carry that encoding and do micro-branches all the time. Instead, the data should be converted to Unicode as soon as we read it, so that processing always happens in Unicode, and should be converted to the terminal's charset as the last step just when we print it. Of course, in this approach if the terminal encoding and the file encoding are the same, you do a back-n-forth conversion. But, in turn, the whole code becomes cleaner and less buggy. About the 5th case: unless there are systems which don't support --enable-charset and yet we'd still want mc to run on them, I'd vote for ditching this possibility. Nowadays when every new system is set up with default UTF-8 locale, (1) is I assume by far the most frequent use case. The desired change wouldn't introduce any runtime cost with (1)-(2), what happens inside mc would mostly stay the same. With (3)-(4)-(5) we'd perform some operations (e.g. search in file) in Unicode rather than 8-bit. In cases (4)-(5) we'd do an extra back-n-forth conversion, I think this is a price we can pay for having a source that's easier to maintain and test, and has fewer bugs. " defect new major Future Releases mc-core master no branch