mc uses GLib for PCRE search by default.
That is incorrect, because GLib does not support the PCRE_UTF8 compile flag, so \w and other locale-dependent ("localizable") regex constructs don't work as expected.
mc should use libpcre by default, with the PCRE_UTF8 flag.

Also, mc does not support escape sequences in replacement strings ("\n", "\t", etc.), but it should.

Attaching a patch implementing these features.


How important is it to remove these comments?

That is incorrect, because GLib does not support the PCRE_UTF8 compile flag, so \w and other locale-dependent ("localizable") regex constructs don't work as expected.

What do you mean? GLib's Perl-compatible regular expressions turn on UTF-8 mode unless the G_REGEX_RAW flag is set.

From http://library.gnome.org/devel/glib/stable/glib-Perl-compatible-regular-expressions.html:

Note that, unless you set the G_REGEX_RAW flag, all the strings passed 
to these functions must be encoded in UTF-8. The lengths and the positions
inside the strings are in bytes and not in characters, so, for instance, 
"\xc3\xa0" (i.e. "à") is two bytes long but it is treated as a single 
character. If you set G_REGEX_RAW the strings can be non-valid UTF-8 
strings and a byte is treated as a character, so "\xc3\xa0" is two bytes
and two characters long. 
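
The byte/character distinction in the quoted GLib note can be checked with a few lines of plain C. This is only an illustrative sketch: utf8_char_count is a hypothetical helper (not part of GLib or mc), and it assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration: count the code points in a
   NUL-terminated string, assuming it is valid UTF-8. Continuation bytes
   have the bit pattern 10xxxxxx, so every other byte starts a character. */
size_t
utf8_char_count (const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;

    return n;
}
```

For "\xc3\xa0", strlen() reports two bytes while utf8_char_count() reports one character, which is the UTF-8 behaviour the documentation describes. In raw mode each byte would count as its own character.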

Also, mc does not support escape sequences in replacement strings ("\n", "\t", etc.), but it should.

Please show your mc version:

mc -V

Special sequences work for me. Test case:

  • run mcedit
  • press SPACE key twice, type dddd; press ENTER key
  • press TAB key twice, type dddd; press ENTER key
  • go to the top of the file being edited
  • press F7 key
  • enter '\t\tddd' into the search string
  • select the 'Regular expression' search type
  • press ENTER key

If no matches are found at this point, we can continue the discussion.

I meant escape sequences in the REPLACEMENT string, not in the search string, where they do work.
To test, enter, for example, "\x0d" as the search string and "\n" as the replacement string. Every \r in the text will be replaced with just the character "n", which is not logical.
I have now updated to the Debian package, and the bug is still present.

How important is it to remove these comments?

Why do you need to remove them?
Is documented code worse than undocumented?

Any progress here? The ticket still applies to the latest (git) version, so I have posted an updated patch for it.

Comment 3 in English:
I mean escape sequence support inside _replacement_ strings, not inside the _search_ regex; of course they work inside the regex.
The test is simple: enter, for example, "\x0d" as the search string and "\n" as the replacement string, and every \r will be replaced by just the character "n".
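
To make the requested behaviour concrete, here is a minimal sketch of escape expansion for a replacement string. It is not the code from the attached patch: expand_escapes is a hypothetical name, and the set of supported escapes ("\n", "\t", "\r", "\\", "\xHH") is an assumption chosen for illustration. Unknown escapes such as "\1" are copied unchanged so backreferences could still be processed afterwards. Because the result is a C string, "\x00" is not representable in this sketch.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not mc's actual API: expand a few escape sequences
   in a replacement string. Returns a malloc()ed string that the caller
   must free(), or NULL if allocation fails. */
char *
expand_escapes (const char *src)
{
    /* The expanded text is never longer than the input. */
    char *out = malloc (strlen (src) + 1);
    char *dst = out;

    if (out == NULL)
        return NULL;

    while (*src != '\0')
    {
        if (src[0] != '\\' || src[1] == '\0')
        {
            *dst++ = *src++;    /* ordinary character or trailing backslash */
            continue;
        }

        switch (src[1])
        {
        case 'n':  *dst++ = '\n'; src += 2; break;
        case 't':  *dst++ = '\t'; src += 2; break;
        case 'r':  *dst++ = '\r'; src += 2; break;
        case '\\': *dst++ = '\\'; src += 2; break;
        case 'x':
            if (isxdigit ((unsigned char) src[2]) && isxdigit ((unsigned char) src[3]))
            {
                char hex[3] = { src[2], src[3], '\0' };

                *dst++ = (char) strtol (hex, NULL, 16);
                src += 4;
            }
            else
            {
                /* malformed "\x": keep it as literal text */
                *dst++ = *src++;
                *dst++ = *src++;
            }
            break;
        default:
            /* unknown escape (e.g. "\1"): copy backslash and character as-is */
            *dst++ = *src++;
            *dst++ = *src++;
            break;
        }
    }

    *dst = '\0';
    return out;
}
```

With this sketch, the replacement string "\n" expands to a real newline instead of the letter "n".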

vitalif, can you switch to the branch and look at my changes? Can you run 'make check' in the branch?

I have created some unit tests for your algorithm... but I'm not sure the algorithm is clean... or maybe the tests are wrong :)

Last edited 14 years ago by slavazanko

Despite the fact that the algorithm now ignores a missing closing curly brace '}' (which probably isn't 100% correct), this should be checked in the tests for the replace_handle_esc_seq function, not process_escape_sequence: it is replace_handle_esc_seq that decides whether something is an escape sequence or not.

Also, \x{4344} usually denotes a single wide character (a code point, encoded as UTF-8 on output), not the two bytes "CD". So we can either ignore the higher bits or generate wide character codes. The second option would be convenient, but it would also hard-code the UTF-8 charset.
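
To show what "generating wide character codes" would mean in practice, here is a small sketch that encodes a code point as UTF-8. The function name utf8_encode is hypothetical; GLib's g_unichar_to_utf8() serves the same purpose and would be the natural choice inside mc.

```c
#include <stddef.h>

/* Hypothetical encoder for illustration: write code point cp as UTF-8 into
   out, which must have room for 4 bytes. Returns the number of bytes
   written, or 0 for surrogates and values above U+10FFFF. */
size_t
utf8_encode (unsigned long cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not valid code points */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Under this interpretation \x{4344} becomes the three bytes E4 8D 84, whereas the byte-wise reading gives 0x43 0x44, that is "CD".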

comment:11 Changed 13 years ago by vitalif

I've attached a patch addressing these issues.
It also removes the *next_char parameter from handle_esc_seq; I think it's pointless there and only makes the interface more complicated.

Check closing '}' curly braces, fix tests, add \x{4344} UTF-8 wide chars in UTF-8 match mode

comment:19 Changed 13 years ago by vitalif

Sorry for the silence, and big thanks for approving the patch :)

