Ticket #1882 (closed enhancement: fixed)

Opened 9 years ago

Last modified 7 years ago

PCRE search: escape sequence support in replacements, UTF8 support (just a flag for libPCRE)

Reported by: vitalif Owned by: slavazanko
Priority: minor Milestone: 4.8.0-pre1
Component: mc-search Version: master
Keywords: Cc:
Blocked By: Blocking:
Branch state: merged Votes for changeset: committed-master

Description

mc uses glib for PCRE search by default.
that's incorrect, because glib does not support PCRE_UTF8 compile flag, so \w and other "localizable" regex instructions don't work as expected.
mc should use libpcre by default + PCRE_UTF8 flag.

Also, mc does not support escape sequences in replacement strings ("\n", "\t" and etc), but it should.

Attaching a patch implementing these features.

Attachments

mc-4.7.0pre-searchPCRE-escape-sequences-and-UTF8.patch (9.6 KB) - added by vitalif 9 years ago.
patch for src/search/regex.c
mc-4.7.0.1-regex-replace-escape-sequences.diff (9.3 KB) - added by vitalif 9 years ago.
New version of patch
mc-git20110403-regex-replace-escape-sequences.diff (10.0 KB) - added by vitalif 8 years ago.
This same patch, ported to git version 12eb8b62…
checkcurlybrace-widechars-tests.diff (19.0 KB) - added by vitalif 8 years ago.
Check } curly braces, fix tests, add \x{4344} utf-8 wide chars in utf-8 match mode

Change History

Changed 9 years ago by vitalif

patch for src/search/regex.c

comment:1 follow-up: ↓ 4 Changed 9 years ago by zaytsev

How important is to remove these comments???

comment:2 Changed 9 years ago by slavazanko

that's incorrect, because glib does not support PCRE_UTF8 compile flag, so \w and other "localizable" regex instructions don't work as expected.

What you mean? perl-compatible regexp'es of Glib library always turn on UTF-8 mode.

From http://library.gnome.org/devel/glib/stable/glib-Perl-compatible-regular-expressions.html:

Note that, unless you set the G_REGEX_RAW flag, all the strings passed 
to these functions must be encoded in UTF-8. The lengths and the positions
inside the strings are in bytes and not in characters, so, for instance, 
"\xc3\xa0" (i.e. "à") is two bytes long but it is treated as a single 
character. If you set G_REGEX_RAW the strings can be non-valid UTF-8 
strings and a byte is treated as a character, so "\xc3\xa0" is two bytes
and two characters long. 

Also, mc does not support escape sequences in replacement strings ("\n", "\t" and etc), but it should.

Show version of your mc, please:

mc -V

I have worked special sequences. Test case:

  • run mcedit
  • press SPACE key twice, type dddd; press ENTER key
  • press TAB key twice, type dddd; press ENTER key
  • go at top of editing file
  • press F7 key
  • enter '\t\tddd' into search string
  • select 'Regular expression' search type
  • press ENTER key

If no any matches found at this point - we will continue discuss.

comment:3 Changed 9 years ago by vitalif

Я имел ввиду escape-последовательности в строке ЗАМЕНЫ, а не в строке, которую ищем - там-то они работают.
Тест - ввести в строку поиска например "\x0d", а в строку замены например "\n". Везде в тексте \r будет заменён на просто символ "n", что не есть логично.
Сейчас обновился до 4.7.0.1-1 из Debian'а, баг до сих пор присутствует.

comment:4 in reply to: ↑ 1 Changed 9 years ago by vitalif

Replying to zaytsev:

How important is to remove these comments???

Why do you need to remove them?
Is documented code worse than undocumented?

Changed 9 years ago by vitalif

New version of patch

comment:5 Changed 9 years ago by zaytsev

You seemed to remove them, that's why I asked. But it looks like your previous patch was reversed, so you actually added them. I think that the new patch is OK.

Changed 8 years ago by vitalif

This same patch, ported to git version 12eb8b62...

comment:6 Changed 8 years ago by vitalif

Any progress here? Ticket is actual for the latest (git) version, so I posted a correct patch for it.

Comment 3 in english:
I mean escape sequence support inside _replacement_ strings, not inside the _search_ regex - of course they work inside regex.
Test is simple - enter, for example, "\x0d" as search string and "\n" as the replacement string, and \r will be replaced by just "n" character.

comment:7 Changed 8 years ago by slavazanko

  • Status changed from new to accepted
  • Owner set to slavazanko

comment:8 Changed 8 years ago by slavazanko

  • Version changed from 4.7.0-pre4 to master
  • Type changed from defect to enhancement
  • severity changed from no branch to on review
  • Milestone changed from 4.7 to 4.8.0-pre1
Last edited 8 years ago by angel_il (previous) (diff)

comment:9 Changed 8 years ago by slavazanko

  • severity changed from on review to on rework

vitalif, can you switch to branch and see my changes? Can you run 'make check' in branch?

I have created some unit tests for your algorithm... but I don't sure that algorithm is clean... or may be tests is wrong :)

Last edited 8 years ago by slavazanko (previous) (diff)

comment:10 Changed 8 years ago by vitalif

Despite of the fact that algorithm now ignores the absence of closing curly brace '}' (which probably isn't 100% correct), this should be checked in tests for replace_handle_esc_seq function, not process_escape_sequence - it is the replace_handle_esq_seq who decides whether it is an escape sequence or not.

Also, \x{4344} is usually a code for wide character (UTF-8), and not for "CD". So we can either ignore the higher bits, or generate wide character codes... The second would be convenient, but would also introduce a hard-coded UTF-8 charset.

comment:11 Changed 8 years ago by vitalif

I've attached a patch addressing these issues...
Also it removes *next_char parameter from handle_esc_seq, I think it's pointless there and just adds a non-trivial interface.

Last edited 8 years ago by vitalif (previous) (diff)

Changed 8 years ago by vitalif

Check } curly braces, fix tests, add \x{4344} utf-8 wide chars in utf-8 match mode

comment:12 Changed 8 years ago by slavazanko

  • severity changed from on rework to on review

Review branch, please

comment:13 Changed 8 years ago by andrew_b

  • Votes for changeset set to andrew_b

comment:14 Changed 8 years ago by slavazanko

  • severity changed from on review to no branch
  • Branch state set to on review

comment:15 Changed 8 years ago by slavazanko

Branch was rebased. New initial changeset:aa2d0b07edcfc884f0d778eaa89c201094c43458

Last edited 8 years ago by slavazanko (previous) (diff)

comment:16 Changed 8 years ago by angel_il

  • Votes for changeset changed from andrew_b to andrew_b angel_il
  • Branch state changed from on review to approved

comment:17 Changed 8 years ago by slavazanko

  • Status changed from accepted to testing
  • Votes for changeset changed from andrew_b angel_il to committed-master
  • Resolution set to fixed
  • Branch state changed from approved to merged

Merge changeset:ba09144ce5210c1ae5a07389d4de5085938601c3
Show commits in branch:

git log --pretty=oneline bb12a12..e1edfd7

comment:18 Changed 8 years ago by slavazanko

  • Status changed from testing to closed

comment:19 Changed 7 years ago by vitalif

Sorry for silence, and big thanks for approving the patch :)

Note: See TracTickets for help on using tickets.