Ticket #3257 (new defect)

Opened 10 years ago

Last modified 8 years ago

Charset handling buggy and extremely complicated

Reported by: egmont Owned by:
Priority: major Milestone: Future Releases
Component: mc-core Version: master
Keywords: Cc:
Blocked By: Blocking:
Branch state: no branch Votes for changeset:


With --enable-charset, set your terminal and locale to some 8-bit non-ISO (e.g. KOI8-* CP* or Windows*), and in mcview set the file's encoding to the same.

Chars 128-159 are replaced by dot, even though these are printable characters in these encodings.

mcview (plain.c) calls is_printable() which in turn calls is_8bit_printable() which calls (assuming xterm-like emulator) is_iso_printable which hardwires that 128-159 are non-printable.

It's a faulty design: the method is_printable() tries to answer the question without having a clue about the character set.

The underlying cause is a very complicated code where there are micro-branches handling 5 different cases. E.g. from mcview plain.c:

            if (mc_global.utf8_display)
                if (!view->utf8)
                    c = convert_from_8bit_to_utf_c ((unsigned char) c, view->con
                if (!g_unichar_isprint (c))
                    c = '.';
            else if (view->utf8)
                c = convert_from_utf_to_current_c (c, view->converter);
                c = convert_to_display_c (c);
                if (!is_printable (c))
                    c = '.';
#else /* HAVE_CHARSET */
            if (!is_printable (c))
                c = '.';
#endif /* HAVE_CHARSET */

It's very complicated and error prone.

The 5 totally separate cases are:

  • mc with charset support:
    • (1) file is utf8, display is utf8
    • (2) file is utf8, display is 8bit
    • (3) file is 8bit, display is utf8
    • (4) file is 8bit, display is 8bit (but they can be different encodings, a 256-byte conversion table is used)
  • (5) mc without charset support (bytes from the file are directly passed thru to the terminal)

This is not the only place where such a terrible #ifdef and if() branching is done, there are plenty, they are hard (practically impossible) to test, and they are apparently buggy.

This approach also implies that you cannot use handy methods like g_unichar_whatever anywhere outside of such ifdef/if's because basically whenever you have an "int c", you have no clue which character it represents. Even figuring out if it's printable would need to carry its charset as extra context, defeating the purpose of Unicode. (In #3250 I often need to check things like whether the character is CJK or combining, these influence line wrapping and such. I can't do with a single g_unichar_xx() call, I need to write helper methods for each desired g_unichar_xx() with #ifdef/if()s.)

My recommendation is to simplify to at most 2 cases (by merging (1)-(4)), but preferably to only 1 case (by removing (5) if feasible), by forcing Unicode internally.

This could be done if instead of a 1-phase conversion we went for a 2-phase conversion. Most data has the lifecycle that first we receive it from some source of arbitrary encoding (e.g. file name, file contents), then process it, then finally display it (in some other arbitrary encoding). Doing a one-shot conversion means that no matter where you do it, your data might be in any of multiple encodings while you process it and you'd need to carry that encoding and do micro-branches all the time.

Instead, the data should be converted to Unicode as soon as we read it, so that processing always happens in Unicode, and should be converted to the terminal's charset as the last step just when we print it.

Of course, in this approach if the terminal encoding and the file encoding are the same, you do a back-n-forth conversion. But, in turn, the whole code becomes cleaner and less buggy.

About the 5th case: unless there are systems which don't support --enable-charset and yet we'd still want mc to run on them, I'd vote for ditching this possibility.

Nowadays when every new system is set up with default UTF-8 locale, (1) is I assume by far the most frequent use case. The desired change wouldn't introduce any runtime cost with (1)-(2), what happens inside mc would mostly stay the same. With (3)-(4)-(5) we'd perform some operations (e.g. search in file) in Unicode rather than 8-bit. In cases (4)-(5) we'd do an extra back-n-forth conversion, I think this is a price we can pay for having a source that's easier to maintain and test, and has fewer bugs.

Change History

comment:1 Changed 10 years ago by egmont

There should also be a clear distinction between a character being printable by definition, and a character being representable in the current display charset. There might be cases where they should be treated differently.

E.g. I intentionally decided not to allow nroff formatting for non-printable characters. If 'x' is a non-printable Unicode character, "_\bx" is formatted as "_.." (a replacement character for the backspace and a replacement character for the non-printable one). If 'x' is printable, but can't be converted to the display's charset, the desired rendering is an underlined ".".

comment:2 Changed 8 years ago by andrew_b

Since printable chars are locale-dependent, is_printable should be a member of str_class.

Note: See TracTickets for help on using tickets.