Ticket #3831 (new defect)
Dot in regex pattern should match one character even if the texts are multi byte
Reported by: | sstbw049040 | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | Future Releases |
Component: | mc-search | Version: | master |
Keywords: | multi byte regex | Cc: | |
Blocked By: | | Blocking: | |
Branch state: | no branch | Votes for changeset: | |
Description (last modified by egmont)
Currently, a dot in a regex pattern matches one byte.
But it is too hard for a user to know how many bytes each character occupies.
Regex search should always match the pattern character by character, not byte by byte, regardless of the file's encoding.
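The difference between the two behaviours can be sketched in Python. This is only an illustration of byte-wise versus character-wise matching, not mc's or GLib's actual code; the helper names `byte_level_match` and `char_level_match` are hypothetical, and the sample text 一二三 is assumed from the pattern 一.三 discussed in this ticket.

```python
import re

TEXT = "一二三"          # assumed sample text; the ticket's pattern is 一.三
CHAR_PATTERN = "一.三"   # the pattern as a user would type it


def byte_level_match(encoding: str) -> bool:
    """Search the encoded bytes with a dot that consumes exactly one byte."""
    data = TEXT.encode(encoding)
    pattern = (
        re.escape("一".encode(encoding))
        + b"."  # one byte, whatever the encoding
        + re.escape("三".encode(encoding))
    )
    return re.search(pattern, data, re.DOTALL) is not None


def char_level_match(encoding: str) -> bool:
    """Decode first, so that the dot consumes one whole character."""
    text = TEXT.encode(encoding).decode(encoding)
    return re.search(CHAR_PATTERN, text, re.DOTALL) is not None


if __name__ == "__main__":
    for enc in ("utf-8", "cp932", "cp936", "utf-16-le"):
        width = len("二".encode(enc))  # bytes used by the middle character
        print(f"{enc:10} middle char width={width} "
              f"byte-level={byte_level_match(enc)} "
              f"char-level={char_level_match(enc)}")
```

In every encoding listed, the middle character occupies more than one byte, so the byte-level search fails while the character-level search succeeds.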
Attachments
Change History
comment:1 follow-ups: ↓ 4 ↓ 12 Changed 7 years ago by zaytsev
Lol, Ubuntu on Windows, who would have thought we'll ever get that far...
I'm not sure about glib regex (which is most probably what you are using), but in PCRE you can match code points with \X for multibyte encodings, so in your case the pattern should be 一\X三 instead of 一.三. Does this help?
comment:2 Changed 7 years ago by sstbw049040
- Summary changed from Colon in regex pattern should match one character even if the texts are multi byte to Dot in regex pattern should match one character even if the texts are multi byte
comment:3 in reply to: ↑ description Changed 7 years ago by sstbw049040
Replying to sstbw049040:
Colon in regex pattern matches one byte currently.
Sorry, it's a dot, not a colon.
comment:4 in reply to: ↑ 1 Changed 7 years ago by sstbw049040
Replying to zaytsev:
I'm not sure about glib regex (which is most probably what you are using), but in PCRE you can match code points with \X for multibyte encodings, so in your case the pattern should be 一\X三 instead of 一.三. Does this help?
Thank you for your comment; let me check it.
Do you know how to check which regex library (PCRE or GLib) my Midnight Commander is using?
I'm using the MSYS2 build of Midnight Commander.
$ pacman -Ss mc | grep \\bmsys/mc\\b
msys/mc 4.8.19-1 [installed]
$ mc -V
GNU Midnight Commander unknown
Built with GLib 2.48.2
Using the ncursesw library
With builtin Editor
With subshell support as default
With support for background operations
With mouse support on xterm
With internationalization support
With multiple codepages support
Virtual File Systems: cpiofs, tarfs, sfs, extfs, ftpfs, sftpfs, fish
Data types: char: 8; int: 32; long: 32; void *: 32; size_t: 32; off_t: 64;
comment:5 follow-up: ↓ 6 Changed 7 years ago by zaytsev
According to https://github.com/Alexpux/MSYS2-packages/tree/master/mc , it uses glib for regex.
comment:6 in reply to: ↑ 5 Changed 7 years ago by sstbw049040
Replying to zaytsev:
According to https://github.com/Alexpux/MSYS2-packages/tree/master/mc , it uses glib for regex.
Thank you for your comment.
comment:7 Changed 7 years ago by zaytsev
So did it work? Are there still any problems? I see you keep attaching screenshots, which is a really suboptimal way of explaining what you want from us.
comment:8 Changed 7 years ago by sstbw049040
I tested another case.
When I searched for the regex pattern U.F in the attached files (ucs2le.txt, utf8.txt, cp932.txt, cp936.txt), the pattern matched in all of them, even though the text "UTF" appears only in utf8.txt.
Do you know the reason?
Attachments:
files3.txt, files4.PNG, ucs2le.txt, utf8.txt, cp932.txt, cp936.txt, mc7.PNG, mc8.PNG, mc9.PNG
comment:9 follow-up: ↓ 10 Changed 7 years ago by zaytsev
I see the following possibilities:
- There is a byte pattern 85 xx 70 in every file
- There is a bug in the all charsets search function
comment:10 in reply to: ↑ 9 Changed 7 years ago by sstbw049040
Replying to zaytsev:
I see the following possibilities:
- There is a byte pattern 85 xx 70 in every file
I didn't find the byte pattern 85 xx 70 (0x55 0x?? 0x46) in any file other than utf8.txt.
(I attached hex values of test files in files3.txt)
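The manual hex check described above could also be scripted. The following is a minimal sketch, assuming Python is available; `find_u_any_f` is a hypothetical helper, and the sample data below only stands in for the attached files, whose real contents are not reproduced here.

```python
import re

# 0x55 ('U'), any single byte, 0x46 ('F'); DOTALL lets the dot match 0x0A too.
BYTE_PATTERN = re.compile(rb"U.F", re.DOTALL)


def find_u_any_f(data: bytes) -> list[int]:
    """Return the byte offsets at which 0x55 0x?? 0x46 occurs in data."""
    return [m.start() for m in BYTE_PATTERN.finditer(data)]


if __name__ == "__main__":
    # Stand-in samples (assumption): the word "UTF" in two encodings.
    samples = {
        "utf-8": "UTF".encode("utf-8"),
        "utf-16-le": "UTF".encode("utf-16-le"),
    }
    for name, data in samples.items():
        print(f"{name:10} hex={data.hex(' ')} offsets={find_u_any_f(data)}")
```

To check a real file, its bytes can be passed in directly, for example `find_u_any_f(open("cp932.txt", "rb").read())`. An empty list means the byte pattern is absent, which would point away from the first possibility.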
comment:12 in reply to: ↑ 1 Changed 7 years ago by sstbw049040
Replying to zaytsev:
I'm not sure about glib regex (which is most probably what you are using), but in PCRE you can match code points with \X for multibyte encodings, so in your case the pattern should be 一\X三 instead of 一.三. Does this help?
I tried the regex pattern 一\X三; the behaviour is similar to that of 一.三.
Attachments:
mc10.PNG, mc11.PNG, mc12.PNG, mc13.PNG, mc14.PNG, mc15.PNG