Ticket #1882 (closed enhancement: fixed)

Opened 10 years ago

Last modified 8 years ago

PCRE search: escape sequence support in replacements, UTF8 support (just a flag for libPCRE)

Reported by: vitalif Owned by: slavazanko
Priority: minor Milestone: 4.8.0-pre1
Component: mc-search Version: master
Keywords: Cc:
Blocked By: Blocking:
Branch state: merged Votes for changeset: committed-master

Description

mc uses glib for PCRE search by default.
that's incorrect, because glib does not support PCRE_UTF8 compile flag, so \w and other "localizable" regex instructions don't work as expected.
mc should use libpcre by default + PCRE_UTF8 flag.

Also, mc does not support escape sequences in replacement strings ("\n", "\t" and etc), but it should.

Attaching a patch implementing these features.

Attachments

mc-4.7.0pre-searchPCRE-escape-sequences-and-UTF8.patch (9.6 KB) - added by vitalif 10 years ago.
patch for src/search/regex.c
mc-4.7.0.1-regex-replace-escape-sequences.diff (9.3 KB) - added by vitalif 10 years ago.
New version of patch
mc-git20110403-regex-replace-escape-sequences.diff (10.0 KB) - added by vitalif 9 years ago.
This same patch, ported to git version 12eb8b62…
checkcurlybrace-widechars-tests.diff (19.0 KB) - added by vitalif 9 years ago.
Check } curly braces, fix tests, add \x{4344} utf-8 wide chars in utf-8 match mode

Change History

Changed 10 years ago by vitalif

patch for src/search/regex.c

comment:1 follow-up: ↓ 4 Changed 10 years ago by zaytsev

How important is to remove these comments???

comment:2 Changed 10 years ago by slavazanko

that's incorrect, because glib does not support PCRE_UTF8 compile flag, so \w and other "localizable" regex instructions don't work as expected.

What you mean? perl-compatible regexp'es of Glib library always turn on UTF-8 mode.

From http://library.gnome.org/devel/glib/stable/glib-Perl-compatible-regular-expressions.html:

Note that, unless you set the G_REGEX_RAW flag, all the strings passed 
to these functions must be encoded in UTF-8. The lengths and the positions
inside the strings are in bytes and not in characters, so, for instance, 
"\xc3\xa0" (i.e. "à") is two bytes long but it is treated as a single 
character. If you set G_REGEX_RAW the strings can be non-valid UTF-8 
strings and a byte is treated as a character, so "\xc3\xa0" is two bytes
and two characters long. 

Also, mc does not support escape sequences in replacement strings ("\n", "\t" and etc), but it should.

Show version of your mc, please:

mc -V

I have worked special sequences. Test case:

  • run mcedit
  • press SPACE key twice, type dddd; press ENTER key
  • press TAB key twice, type dddd; press ENTER key
  • go at top of editing file
  • press F7 key
  • enter '\t\tddd' into search string
  • select 'Regular expression' search type
  • press ENTER key

If no any matches found at this point - we will continue discuss.

comment:3 Changed 10 years ago by vitalif

Я имел ввиду escape-последовательности в строке ЗАМЕНЫ, а не в строке, которую ищем - там-то они работают.
Тест - ввести в строку поиска например "\x0d", а в строку замены например "\n". Везде в тексте \r будет заменён на просто символ "n", что не есть логично.
Сейчас обновился до 4.7.0.1-1 из Debian'а, баг до сих пор присутствует.

comment:4 in reply to: ↑ 1 Changed 10 years ago by vitalif

Replying to zaytsev:

How important is to remove these comments???

Why do you need to remove them?
Is documented code worse than undocumented?

Changed 10 years ago by vitalif

New version of patch

comment:5 Changed 10 years ago by zaytsev

You seemed to remove them, that's why I asked. But it looks like your previous patch was reversed, so you actually added them. I think that the new patch is OK.

Changed 9 years ago by vitalif

This same patch, ported to git version 12eb8b62...

comment:6 Changed 9 years ago by vitalif

Any progress here? Ticket is actual for the latest (git) version, so I posted a correct patch for it.

Comment 3 in english:
I mean escape sequence support inside _replacement_ strings, not inside the _search_ regex - of course they work inside regex.
Test is simple - enter, for example, "\x0d" as search string and "\n" as the replacement string, and \r will be replaced by just "n" character.

comment:7 Changed 9 years ago by slavazanko

  • Status changed from new to accepted
  • Owner set to slavazanko

comment:8 Changed 9 years ago by slavazanko

  • Version changed from 4.7.0-pre4 to master
  • Type changed from defect to enhancement
  • severity changed from no branch to on review
  • Milestone changed from 4.7 to 4.8.0-pre1
Last edited 8 years ago by angel_il (previous) (diff)

comment:9 Changed 9 years ago by slavazanko

  • severity changed from on review to on rework

vitalif, can you switch to branch and see my changes? Can you run 'make check' in branch?

I have created some unit tests for your algorithm... but I don't sure that algorithm is clean... or may be tests is wrong :)

Last edited 9 years ago by slavazanko (previous) (diff)

comment:10 Changed 9 years ago by vitalif

Despite of the fact that algorithm now ignores the absence of closing curly brace '}' (which probably isn't 100% correct), this should be checked in tests for replace_handle_esc_seq function, not process_escape_sequence - it is the replace_handle_esq_seq who decides whether it is an escape sequence or not.

Also, \x{4344} is usually a code for wide character (UTF-8), and not for "CD". So we can either ignore the higher bits, or generate wide character codes... The second would be convenient, but would also introduce a hard-coded UTF-8 charset.

comment:11 Changed 9 years ago by vitalif

I've attached a patch addressing these issues...
Also it removes *next_char parameter from handle_esc_seq, I think it's pointless there and just adds a non-trivial interface.

Last edited 9 years ago by vitalif (previous) (diff)

Changed 9 years ago by vitalif

Check } curly braces, fix tests, add \x{4344} utf-8 wide chars in utf-8 match mode

comment:12 Changed 9 years ago by slavazanko

  • severity changed from on rework to on review

Review branch, please

comment:13 Changed 9 years ago by andrew_b

  • Votes for changeset set to andrew_b

comment:14 Changed 8 years ago by slavazanko

  • severity changed from on review to no branch
  • Branch state set to on review

comment:15 Changed 8 years ago by slavazanko

Branch was rebased. New initial changeset:49e57e455691b43ccd54e7d343037cac84d57d2c

Version 0, edited 8 years ago by slavazanko (next)

comment:16 Changed 8 years ago by angel_il

  • Votes for changeset changed from andrew_b to andrew_b angel_il
  • Branch state changed from on review to approved

comment:17 Changed 8 years ago by slavazanko

  • Status changed from accepted to testing
  • Votes for changeset changed from andrew_b angel_il to committed-master
  • Resolution set to fixed
  • Branch state changed from approved to merged

Merge changeset:ba09144ce5210c1ae5a07389d4de5085938601c3
Show commits in branch:

git log --pretty=oneline bb12a12..e1edfd7

comment:18 Changed 8 years ago by slavazanko

  • Status changed from testing to closed

comment:19 Changed 8 years ago by vitalif

Sorry for silence, and big thanks for approving the patch :)

Note: See TracTickets for help on using tickets.