Ticket #3831 (new defect)

Opened 7 years ago

Last modified 7 years ago

Dot in regex pattern should match one character even if the texts are multi byte

Reported by: sstbw049040 Owned by:
Priority: major Milestone: Future Releases
Component: mc-search Version: master
Keywords: multi byte regex Cc:
Blocked By: Blocking:
Branch state: no branch Votes for changeset:

Description (last modified by egmont)

Currently, a dot in a regex pattern matches one byte.
But it is too hard for users to know how many bytes each character occupies.
Regex search should always match the pattern character by character, not byte by byte, regardless of the file's encoding.
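
To illustrate the difference between byte-wise and character-wise matching, here is a minimal Python sketch. It does not use mc or GLib; the sample text 一二三 and the UTF-8 encoding are assumptions chosen to fit the pattern 一.三 discussed in this ticket.

```python
import re

text = "一二三"               # three characters (assumed sample text)
data = text.encode("utf-8")   # the same text as nine UTF-8 bytes

first = re.escape("一".encode("utf-8"))
last = re.escape("三".encode("utf-8"))

# Character-level matching: the dot consumes one whole character.
char_match = re.fullmatch("一.三", text)

# Byte-level matching: the dot consumes a single byte, so one dot
# cannot cover the three-byte encoding of the middle character.
byte_match = re.fullmatch(first + b"." + last, data, re.DOTALL)

# At byte level, three dots are needed to cover that one character.
byte_match3 = re.fullmatch(first + b"..." + last, data, re.DOTALL)

print(char_match is not None, byte_match is not None, byte_match3 is not None)
```

With byte-wise matching the user has to know that the middle character takes three bytes in UTF-8, which is the usability problem described above.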

Attachments

files1.txt (1.5 KB) - added by sstbw049040 7 years ago.
files2.PNG (18.6 KB) - added by sstbw049040 7 years ago.
mc1.PNG (26.6 KB) - added by sstbw049040 7 years ago.
mc2.PNG (19.8 KB) - added by sstbw049040 7 years ago.
mc3.PNG (30.3 KB) - added by sstbw049040 7 years ago.
mc4.PNG (20.6 KB) - added by sstbw049040 7 years ago.
mc5.PNG (26.9 KB) - added by sstbw049040 7 years ago.
mc6.PNG (19.7 KB) - added by sstbw049040 7 years ago.
files3.txt (2.7 KB) - added by sstbw049040 7 years ago.
files4.PNG (22.3 KB) - added by sstbw049040 7 years ago.
ucs2le.txt (142 bytes) - added by sstbw049040 7 years ago.
utf8.txt (143 bytes) - added by sstbw049040 7 years ago.
cp932.txt (91 bytes) - added by sstbw049040 7 years ago.
cp936.txt (86 bytes) - added by sstbw049040 7 years ago.
mc7.PNG (31.2 KB) - added by sstbw049040 7 years ago.
mc8.PNG (27.9 KB) - added by sstbw049040 7 years ago.
mc9.PNG (27.6 KB) - added by sstbw049040 7 years ago.
mc9.2.PNG (27.6 KB) - added by sstbw049040 7 years ago.
mc10.PNG (31.6 KB) - added by sstbw049040 7 years ago.
mc11.PNG (21.1 KB) - added by sstbw049040 7 years ago.
mc12.PNG (31.4 KB) - added by sstbw049040 7 years ago.
mc13.PNG (21.4 KB) - added by sstbw049040 7 years ago.
mc14.PNG (31.4 KB) - added by sstbw049040 7 years ago.
mc15.PNG (21.2 KB) - added by sstbw049040 7 years ago.

Change History

Changed 7 years ago by sstbw049040

comment:1 follow-ups: ↓ 4 ↓ 12 Changed 7 years ago by zaytsev

Lol, Ubuntu on Windows, who would have thought we'll ever get that far...

I'm not sure about glib regex (which is most probably what you are using), but in PCRE you can match code points with \X for multibyte encodings, so in your case the pattern should be 一\X三 instead of 一.三. Does this help?

comment:2 Changed 7 years ago by sstbw049040

  • Summary changed from Colon in regex pattern should match one character even if the texts are multi byte to Dot in regex pattern should match one character even if the texts are multi byte

comment:3 in reply to: ↑ description Changed 7 years ago by sstbw049040

Replying to sstbw049040:

Colon in regex pattern matches one byte currently.

Sorry, it's a dot, not a colon.

comment:4 in reply to: ↑ 1 Changed 7 years ago by sstbw049040

Replying to zaytsev:

I'm not sure about glib regex (which is most probably what you are using), but in PCRE you can match code points with \X for multibyte encodings, so in your case the pattern should be 一\X三 instead of 一.三. Does this help?

Thank you for your comment. Let me check it.
Do you know how to check which regex library (PCRE or GLib) my Midnight Commander is using?
I'm using the MSYS2 build of Midnight Commander.

$ pacman -Ss mc | grep \\bmsys/mc\\b
msys/mc 4.8.19-1 [installed]

$ mc -V
GNU Midnight Commander unknown
Built with GLib 2.48.2
Using the ncursesw library
With builtin Editor
With subshell support as default
With support for background operations
With mouse support on xterm
With internationalization support
With multiple codepages support
Virtual File Systems: cpiofs, tarfs, sfs, extfs, ftpfs, sftpfs, fish
Data types: char: 8; int: 32; long: 32; void *: 32; size_t: 32; off_t: 64;

$ mc -V
Built with GLib 2.48.2

Does this mean that GLib is being used for regex search?

Last edited 7 years ago by sstbw049040

Changed 7 years ago by sstbw049040

comment:5 follow-up: ↓ 6 Changed 7 years ago by zaytsev

According to https://github.com/Alexpux/MSYS2-packages/tree/master/mc , it uses glib for regex.

Changed 7 years ago by sstbw049040

comment:6 in reply to: ↑ 5 Changed 7 years ago by sstbw049040

Replying to zaytsev:

According to https://github.com/Alexpux/MSYS2-packages/tree/master/mc , it uses glib for regex.

Thank you for your comment.

comment:7 Changed 7 years ago by zaytsev

So did it work? Are there still any problems? I see you keep attaching screenshots, which is a really suboptimal way of helping us understand what you want from us.

comment:8 Changed 7 years ago by sstbw049040

I tested another case.
When I searched for the regex pattern U.F in the attached files (ucs2le.txt, utf8.txt, cp932.txt, cp936.txt), the pattern matched in all of them, even though the text "UTF" appears only in utf8.txt.
What is the reason?

Attachments:
files3.txt, files4.PNG, ucs2le.txt, utf8.txt, cp932.txt, cp936.txt, mc7.PNG, mc8.PNG, mc9.PNG

Version 0, edited 7 years ago by sstbw049040

comment:9 follow-up: ↓ 10 Changed 7 years ago by zaytsev

I see the following possibilities:

  1. There is a byte pattern 85 xx 70 in every file
  2. There is a bug in the all charsets search function

comment:10 in reply to: ↑ 9 Changed 7 years ago by sstbw049040

Replying to zaytsev:

I see the following possibilities:

  1. There is a byte pattern 85 xx 70 in every file

I did not find the byte pattern 85 xx 70 (0x55 0x?? 0x46) in any file other than utf8.txt.
(I attached the hex values of the test files in files3.txt.)
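
A check like this can also be scripted. The following is a minimal Python sketch that scans raw bytes for 0x55, any one byte, 0x46. The helper name find_u_any_f and the sample strings are hypothetical; the actual contents of the attached files are not reproduced here.

```python
import re

# Byte 0x55 ("U"), any single byte, byte 0x46 ("F"); 85 and 70 in decimal.
U_ANY_F = re.compile(rb"\x55.\x46", re.DOTALL)

def find_u_any_f(data):
    """Return (offset, hex string) for each 0x55 ?? 0x46 triple in data."""
    return [(m.start(), m.group().hex(" ")) for m in U_ANY_F.finditer(data)]

# Hypothetical sample contents, not the bytes of the attached files.
samples = {
    "utf-8": "UTF-8".encode("utf-8"),
    "utf-16-le": "UTF-16".encode("utf-16-le"),
}

hits = {name: find_u_any_f(data) for name, data in samples.items()}
for name, found in hits.items():
    print(name, found)
```

To run it against real files, read each one in binary mode, for example `find_u_any_f(open("utf8.txt", "rb").read())`. In the UTF-16LE sample the letters are separated by zero bytes, so the three-byte pattern is not present there.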

comment:11 Changed 7 years ago by egmont

  • Description modified (diff)

Changed 7 years ago by sstbw049040

comment:12 in reply to: ↑ 1 Changed 7 years ago by sstbw049040

Replying to zaytsev:

I'm not sure about glib regex (which is most probably what you are using), but in PCRE you can match code points with \X for multibyte encodings, so in your case the pattern should be 一\X三 instead of 一.三. Does this help?

I tried the regex pattern 一\X三; it behaves the same way as 一.三.

Attachments:
mc10.PNG, mc11.PNG, mc12.PNG, mc13.PNG, mc14.PNG, mc15.PNG
