Ticket #2743 (closed defect: fixed)

Opened 5 years ago

Last modified 2 years ago

File selection by patterns uses bytes instead of (unicode) characters

Reported by: onlyjob
Owned by: slavazanko
Priority: major
Milestone: 4.8.14
Component: mc-search
Version: 4.8.1
Keywords:
Cc: egmont@…
Blocked By:
Blocking: #1610
Branch state: merged
Votes for changeset: committed-master

Description

Bug-Debian: http://bugs.debian.org/663136

When the "?" pattern is used in the file selection dialog (opened with '+'),
mc measures the file name length in bytes instead of characters. Example:

$ touch 例 者
$ ls ?
例 者

But selecting these files with "?" as the pattern in mc doesn't work:
"??" selects 者 and "???" selects 例.
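
To illustrate the difference between byte-wise and character-wise matching, here is a minimal sketch in plain C. It is not mc's code: `utf8_seq_len` and `match_question` are hypothetical helpers, only `?` and literal bytes are handled, and the input is assumed to be valid UTF-8. The file name 例 is written out as its three UTF-8 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the UTF-8 sequence whose lead byte is *s.
 * Assumes valid UTF-8; anything unexpected is treated as a single byte. */
static size_t utf8_seq_len(const unsigned char *s)
{
    if (*s < 0x80) return 1;            /* ASCII */
    if ((*s & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((*s & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((*s & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/* Hypothetical helper: match `name` against a pattern made only of '?' and
 * literal bytes. With per_char == false, '?' consumes one byte (the reported
 * behaviour); with per_char == true, it consumes one UTF-8 character. */
static bool match_question(const char *pattern, const char *name, bool per_char)
{
    assert(pattern != NULL && name != NULL);

    const unsigned char *p = (const unsigned char *) pattern;
    const unsigned char *n = (const unsigned char *) name;

    while (*p != '\0') {
        if (*n == '\0')
            return false;               /* pattern is longer than the name */
        if (*p == '?') {
            size_t step = per_char ? utf8_seq_len(n) : 1;
            /* Never step past the terminator on truncated input. */
            while (step-- > 0 && *n != '\0')
                n++;
        } else {
            if (*p != *n)
                return false;
            n++;
        }
        p++;
    }
    return *n == '\0';                  /* the whole name must be consumed */
}
```

With these helpers, `match_question("?", "\xe4\xbe\x8b", false)` fails while `match_question("???", "\xe4\xbe\x8b", false)` succeeds, which mirrors the "???" selects 例 behaviour described above. With `per_char` set to `true`, a single `?` matches.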

:::::::::::::::::::::

[mc -V]

GNU Midnight Commander 4.8.1
Built with GLib 2.30.2
Using the S-Lang library with terminfo database
With builtin Editor
With subshell support as default
With support for background operations
With mouse support on xterm and Linux console
With support for X11 events
With internationalization support
With multiple codepages support
Virtual File Systems: cpiofs, tarfs, sfs, extfs, ext2undelfs, ftpfs, fish
Data types: char: 8; int: 32; long: 64; void *: 64; size_t: 64; off_t: 64;

[mc --configure-options]

'--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--disable-dependency-tracking' 'AWK=awk' '--libexecdir=${prefix}/lib' '--with-x' '--with-screen=slang' '--disable-rpath' '--disable-static' '--enable-vfs-undelfs' '--enable-tests' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fPIE -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security' 'LDFLAGS=-fPIE -pie -Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-D_FORTIFY_SOURCE=2'

Attachments

utf8-search.patch (645 bytes) - added by boris 3 years ago.

Change History

comment:1 Changed 5 years ago by slavazanko

  • Status changed from new to accepted
  • Owner set to slavazanko

comment:2 follow-up: ↓ 3 Changed 5 years ago by slavazanko

I need help at this point.

Currently, the glob pattern ? is always translated to the regex ^(.)$
I tried the regex ^(\X)$ instead, but even simple Unicode characters weren't found. Can anyone help me construct a regexp that matches exactly one character on a line? To test, run mcedit and enter these lines:

a
例
者
non-searchable

and try to search (F7) as regexp.

comment:3 in reply to: ↑ 2 Changed 3 years ago by fuky

Bugfix: in lib/search/regex.c, replace:

g_regex_new (mc_search_cond->str->str, G_REGEX_OPTIMIZE | G_REGEX_RAW | G_REGEX_DOTALL, 0, &error);

by:

g_regex_new (mc_search_cond->str->str, G_REGEX_OPTIMIZE | G_REGEX_DOTALL, 0, &error);

https://developer.gnome.org/glib/2.37/glib-Perl-compatible-regular-expressions.html
Note that, unless you set the G_REGEX_RAW flag, all the strings passed to these functions must be encoded in UTF-8. The lengths and the positions inside the strings are in bytes and not in characters, so, for instance, "\xc3\xa0" (i.e. "à") is two bytes long but it is treated as a single character. If you set G_REGEX_RAW the strings can be non-valid UTF-8 strings and a byte is treated as a character, so "\xc3\xa0" is two bytes and two characters long.
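
The byte/character distinction in the quoted passage can be checked without GLib. The following is a small sketch in plain C, assuming valid UTF-8 input; `utf8_char_count` is a hypothetical helper, not a GLib or mc function. It counts characters by skipping UTF-8 continuation bytes (those of the form 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of UTF-8 characters in s, assuming valid UTF-8.
 * Every byte that is not a continuation byte (10xxxxxx) starts a character. */
static size_t utf8_char_count(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "\xc3\xa0" from the quote, `strlen` reports 2 bytes, while `utf8_char_count` reports 1 character. A pattern engine that treats each byte as a character, as described for G_REGEX_RAW, sees the former.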

Changed 3 years ago by boris

comment:4 Changed 3 years ago by boris

The attached patch is probably more correct (written by analogy with the code that uses PCRE).

comment:5 Changed 2 years ago by andrew_b

  • Blocking 1610 added

comment:6 Changed 2 years ago by slavazanko

  • Branch state changed from no branch to on review
  • Milestone changed from 4.8 to 4.8.14

Created branch 2743_search_unicode

Initial changeset: 994cc7ba5f369d0dfccc20c78b2d9559f58483c7

Review please.

comment:7 Changed 2 years ago by andrew_b

  • Votes for changeset set to andrew_b
  • Component changed from mc-core to mc-search
  • Branch state changed from on review to approved

comment:8 Changed 2 years ago by slavazanko

  • Status changed from accepted to testing
  • Votes for changeset changed from andrew_b to committed-master
  • Resolution set to fixed
  • Branch state changed from approved to merged

Merged to master:

git log --pretty=oneline e5ec837...933fd25

comment:9 Changed 2 years ago by slavazanko

  • Status changed from testing to closed

comment:10 Changed 2 years ago by egmont

The fix for this bug causes a serious regression: the viewer no longer finds accented letters in case-insensitive mode.

Could you guys (those who worked on this bug) please take a look and help fix this regression? Thanks in advance!

Ticket #3416.

Last edited 2 years ago by andrew_b

comment:11 Changed 2 years ago by egmont

  • Cc egmont@… added

comment:12 Changed 2 years ago by zaytsev

I suspect another regression: #3449. Could any of you please have a look? Thanks!
