Ticket #3616 (closed enhancement: fixed)

Opened 8 years ago

Last modified 7 years ago

Speed up of str_utf8_normalize() for a frequent special case

Reported by: devzero Owned by: andrew_b
Priority: major Milestone: 4.8.20
Component: mc-core Version: master
Keywords: Cc:
Blocked By: Blocking:
Branch state: merged Votes for changeset: committed-master


When content of a large directory is being sorted by file names, a significant amount of CPU time is spent in str_utf8_normalize() that is called from str_utf8_create_key_gen().

For example, /usr/bin/ contains 5437 files on my Archlinux box. Running mc /usr/bin/ /usr/bin/ takes approx. 75 000 000 CPU instructions to sort file names, or 25% of total program run time. From these 75 000 000 instructions, 42 500 000 instruction are spent in str_utf8_normalize().

str_utf8_normalize() uses g_utf8_normalize() to do the work. g_utf8_normalize() is a heavyweight function, that converts UTF-8 into UCS-4, does the normalization and then converts UCS-4 back into UTF-8.

Since file names are composed of ASCII characters in most cases, we can speed up str_utf8_normalize() by checking if the heavyweight Unicode normalization is actually needed. Normalization of ASCII string is no-op, so it is effectively "normalized" by just strdup(). Here is the patch:

diff --git a/lib/strutil/strutilutf8.c b/lib/strutil/strutilutf8.c
index 8ec754d..8c7f909 100644
--- a/lib/strutil/strutilutf8.c
+++ b/lib/strutil/strutilutf8.c
@@ -1080,6 +1080,17 @@ str_utf8_normalize (const char *text)
     const char *start;
     const char *end;
+    const char *p = text;
+    while (1)
+    {
+        char c = *p;
+        if (c == 0)
+            return g_strndup(text, p - text);
+        else if ((c & 0x80) != 0)
+            break;
+        p++;
+    }
     fixed = g_string_sized_new (4);
     start = text;

With this patch, running mc /usr/bin/ /usr/bin/ requires just 37 000 000 instructions to sort the file names (down from 75 000 000) and 4 500 000 instuctions to do str_utf8_normalize() (down from 42 500 000).

Change History

comment:1 Changed 7 years ago by andrew_b

  • Status changed from new to accepted
  • Owner set to andrew_b
  • Branch state changed from no branch to on review
  • Milestone changed from Future Releases to 4.8.20

Thanks for the patch! I rewrote the code but kept the idea itself.

Branch: 3616_utf8_normalize_speedup

Please review.

comment:2 Changed 7 years ago by zaytsev

  • Votes for changeset set to zaytsev
  • Branch state changed from on review to approved

comment:3 Changed 7 years ago by andrew_b

  • Status changed from accepted to testing
  • Votes for changeset changed from zaytsev to committed-master
  • Resolution set to fixed
  • Branch state changed from approved to merged

comment:4 Changed 7 years ago by andrew_b

  • Status changed from testing to closed
Note: See TracTickets for help on using tickets.