Ticket #3616 (closed enhancement: fixed)
Speed up of str_utf8_normalize() for a frequent special case
Reported by: | devzero | Owned by: | andrew_b |
---|---|---|---|
Priority: | major | Milestone: | 4.8.20 |
Component: | mc-core | Version: | master |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Branch state: | merged | Votes for changeset: | committed-master |
Description
When content of a large directory is being sorted by file names, a significant amount of CPU time is spent in str_utf8_normalize() that is called from str_utf8_create_key_gen().
For example, /usr/bin/ contains 5437 files on my Archlinux box. Running mc /usr/bin/ /usr/bin/ takes approx. 75 000 000 CPU instructions to sort file names, or 25% of total program run time. From these 75 000 000 instructions, 42 500 000 instruction are spent in str_utf8_normalize().
str_utf8_normalize() uses g_utf8_normalize() to do the work. g_utf8_normalize() is a heavyweight function, that converts UTF-8 into UCS-4, does the normalization and then converts UCS-4 back into UTF-8.
Since file names are composed of ASCII characters in most cases, we can speed up str_utf8_normalize() by checking if the heavyweight Unicode normalization is actually needed. Normalization of ASCII string is no-op, so it is effectively "normalized" by just strdup(). Here is the patch:
diff --git a/lib/strutil/strutilutf8.c b/lib/strutil/strutilutf8.c index 8ec754d..8c7f909 100644 --- a/lib/strutil/strutilutf8.c +++ b/lib/strutil/strutilutf8.c @@ -1080,6 +1080,17 @@ str_utf8_normalize (const char *text) const char *start; const char *end; + const char *p = text; + while (1) + { + char c = *p; + if (c == 0) + return g_strndup(text, p - text); + else if ((c & 0x80) != 0) + break; + p++; + } + fixed = g_string_sized_new (4); start = text;
With this patch, running mc /usr/bin/ /usr/bin/ requires just 37 000 000 instructions to sort the file names (down from 75 000 000) and 4 500 000 instuctions to do str_utf8_normalize() (down from 42 500 000).
Change History
comment:1 Changed 7 years ago by andrew_b
- Status changed from new to accepted
- Owner set to andrew_b
- Branch state changed from no branch to on review
- Milestone changed from Future Releases to 4.8.20
comment:2 Changed 7 years ago by zaytsev
- Votes for changeset set to zaytsev
- Branch state changed from on review to approved
comment:3 Changed 7 years ago by andrew_b
- Status changed from accepted to testing
- Votes for changeset changed from zaytsev to committed-master
- Resolution set to fixed
- Branch state changed from approved to merged
Merged to master: [37013e7db3f35d0dbdf03a8aa023f3319de68f43].
Thanks for the patch! I rewrote the code but kept the idea itself.
Branch: 3616_utf8_normalize_speedup
changeset:d6ca63fa303a0e82281c1764e2ef78e9fa65d3f9
Please review.