Ticket #2743 (closed defect: fixed)
File selection by patterns uses bytes instead of (unicode) characters
Reported by: | onlyjob | Owned by: | slavazanko |
---|---|---|---|
Priority: | major | Milestone: | 4.8.14 |
Component: | mc-search | Version: | 4.8.1 |
Keywords: | | Cc: | egmont@… |
Blocked By: | | Blocking: | #1610 |
Branch state: | merged | Votes for changeset: | committed-master |
Description
Bug-Debian: http://bugs.debian.org/663136
Using the "?" pattern in the file selection dialog brought up with '+',
mc uses the file name length in bytes instead of characters. Example:
$ touch 例 者
$ ls ?
例 者
But trying to select the files with "?" as the pattern in mc doesn't work:
"??" selects 者 and "???" selects 例.
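The gap between byte length and character length described above can be shown with a minimal sketch. The helper name utf8_char_count is hypothetical and is not taken from mc; it assumes its input is valid UTF-8 and does no validation. The file names are written as hex escapes so the example does not depend on the source file's encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of mc): count UTF-8 code points by
   skipping continuation bytes, which always look like 10xxxxxx.
   Assumes s is valid UTF-8; no validation is performed. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Every byte that is NOT a continuation byte starts a character. */
        if (((unsigned char) *s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Usage: for the file name "\xe4\xbe\x8b" (the character 例 in UTF-8),
   strlen() reports the byte length, utf8_char_count() the character count. */
```

A shell glob such as "?" is expected to count characters, so a matcher that works on the strlen() value instead will disagree with ls for any multi-byte name. Exactly how many "?" marks a byte-oriented regex engine ends up needing can also depend on how it treats individual bytes, which this sketch does not model.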
:::::::::::::::::::::
[mc -V]
GNU Midnight Commander 4.8.1
Built with GLib 2.30.2
Using the S-Lang library with terminfo database
With builtin Editor
With subshell support as default
With support for background operations
With mouse support on xterm and Linux console
With support for X11 events
With internationalization support
With multiple codepages support
Virtual File Systems: cpiofs, tarfs, sfs, extfs, ext2undelfs, ftpfs, fish
Data types: char: 8; int: 32; long: 64; void *: 64; size_t: 64; off_t: 64;
[mc --configure-options]
'--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' 'AWK=awk' '--libexecdir=${prefix}/lib' '--with-x' '--with-screen=slang' '--disable-rpath' '--disable-static' '--enable-vfs-undelfs' '--enable-tests' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security' 'LDFLAGS=-fPIE -pie -Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-D_FORTIFY_SOURCE=2'
Attachments
Change History
comment:1 Changed 13 years ago by slavazanko
- Owner set to slavazanko
- Status changed from new to accepted
comment:2 follow-up: ↓ 3 Changed 13 years ago by slavazanko
I need help at this point.
Currently the glob pattern ? is always converted to the regex ^(.)$
I tried to use the regex ^(\X)$ instead, but even simple Unicode characters weren't found. Can anyone help me construct a regexp that matches exactly one character in a line? To test this, you can run mcedit, fill in the lines:
a 例 者 non-searchable
and try to search (F7) as regexp.
comment:3 in reply to: ↑ 2 Changed 10 years ago by fuky
Bugfix: in lib/search/regex.c, replace:
g_regex_new (mc_search_cond->str->str, G_REGEX_OPTIMIZE | G_REGEX_RAW | G_REGEX_DOTALL, 0, &error);
by:
g_regex_new (mc_search_cond->str->str, G_REGEX_OPTIMIZE | G_REGEX_DOTALL, 0, &error);
https://developer.gnome.org/glib/2.37/glib-Perl-compatible-regular-expressions.html
Note that, unless you set the G_REGEX_RAW flag, all the strings passed to these functions must be encoded in UTF-8. The lengths and the positions inside the strings are in bytes and not in characters, so, for instance, "\xc3\xa0" (i.e. "à") is two bytes long but it is treated as a single character. If you set G_REGEX_RAW the strings can be non-valid UTF-8 strings and a byte is treated as a character, so "\xc3\xa0" is two bytes and two characters long.
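The distinction quoted above can be illustrated with a small stand-alone sketch that does not use GLib at all. The functions next_char and matches_n_question_marks are hypothetical names invented for this illustration. The raw flag only mimics the "one byte is one character" view of G_REGEX_RAW; it is not how GLib or PCRE is implemented, and a real engine in raw mode may treat some byte values specially.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: step over one "character" starting at a non-NUL byte.
   raw == true : one byte is one character (the G_REGEX_RAW view).
   raw == false: one UTF-8 sequence is one character (assumes valid UTF-8). */
static const char *next_char(const char *s, bool raw)
{
    s++;
    if (!raw) {
        /* Skip the continuation bytes (10xxxxxx) of a multi-byte sequence. */
        while (((unsigned char) *s & 0xC0) == 0x80)
            s++;
    }
    return s;
}

/* Return true if name is exactly n characters long under the chosen mode,
   i.e. whether a glob made of n '?' marks would select it. */
bool matches_n_question_marks(const char *name, size_t n, bool raw)
{
    assert(name != NULL);
    while (n > 0 && *name != '\0') {
        name = next_char(name, raw);
        n--;
    }
    return n == 0 && *name == '\0';
}
```

With the documentation's example string "\xc3\xa0" ("à"), a single "?" matches only when raw is false; in raw mode the same name needs "??". That is the kind of mismatch the ticket describes for file selection.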
comment:4 Changed 10 years ago by boris
Probably the attached patch is more correct (written in analogy with the code using PCRE).
comment:6 Changed 10 years ago by slavazanko
- Branch state changed from no branch to on review
- Milestone changed from 4.8 to 4.8.14
Created branch 2743_search_unicode
Initial changeset: 994cc7ba5f369d0dfccc20c78b2d9559f58483c7
Review please.
comment:7 Changed 10 years ago by andrew_b
- Votes for changeset set to andrew_b
- Component changed from mc-core to mc-search
- Branch state changed from on review to approved
comment:8 Changed 10 years ago by slavazanko
- Status changed from accepted to testing
- Votes for changeset changed from andrew_b to committed-master
- Resolution set to fixed
- Branch state changed from approved to merged
Merged to master:
git log --pretty=oneline e5ec837...933fd25
comment:10 Changed 10 years ago by egmont
The fix for this bug causes a serious regression: the viewer no longer finds accented letters in case-insensitive mode.
Could you guys (those who worked on this bug) please take a look and help fix this regression? Thanks in advance!
Ticket #3146.
comment:12 Changed 10 years ago by zaytsev
Suspect another regression: #3449 ... could any of you guys please have a look? Thanks!