Ticket #3589: 3589-Make-hex-search-work-for-binary-data.patch

File 3589-Make-hex-search-work-for-binary-data.patch, 2.4 KB (added by mooffie, 7 years ago)
  • lib/search/hex.c

    From 8ea66d51ffdb1bc104947d46ad8ebcb62eeaeb44 Mon Sep 17 00:00:00 2001
    From: Mooffie <mooffie@gmail.com>
    Date: Mon, 26 Sep 2016 16:17:08 +0300
    Subject: [PATCH] Ticket #3589: Make hex search work for binary data.
    
    ---
     lib/search/hex.c | 36 ++++++++++++++++++++++++++++++++++++
     1 file changed, 36 insertions(+)
    
    diff --git a/lib/search/hex.c b/lib/search/hex.c
    index 8b5470e..30202b4 100644
    --- a/lib/search/hex.c
    +++ b/lib/search/hex.c
    @@ -137,6 +137,42 @@ mc_search__cond_struct_new_init_hex (const char *charset, mc_search_t * lc_mc_se
         mc_search_hex_parse_error_t error = MC_SEARCH_HEX_E_OK;
         int error_pos = 0;
     
    +    /*
    +     * We may be searching in binary data, which is often invalid UTF-8.
    +     *
    +     * We have to create a non UTF-8 regex (that is, G_REGEX_RAW) or else, as
    +     * the data is invalid UTF-8, both GLib's PCRE and our
    +     * mc_search__g_regex_match_full_safe() are going to fail us. The former by
    +     * not finding all bytes, the latter by overwriting the supposedly invalid
    +     * UTF-8 with NULs.
    +     *
    +     * To do this, we specify "ASCII" as the charset.
    +     *
    +     * In fact, we can specify any charset other than "UTF-8": any such charset
    +     * will trigger G_REGEX_RAW (see [1]). The output of [2] will be the same
    +     * for all charsets because it skips the \xXX symbols
    +     * mc_search__hex_translate_to_regex() outputs.
    +     *
    +     * But "ASCII" is the best choice because a hex pattern may contain a
    +     * quoted string: this way we know [2] will ignore any characters outside
    +     * ASCII letters range (these ignored chars will be copied verbatim to the
    +     * output and will match as-is; in other words, in a case-sensitive manner;
    +     * If the user is interested in case-insensitive searches of international
    +     * text, he shouldn't be using hex search in the first place.)
    +     *
    +     * Switching out of UTF-8 has another advantage:
    +     *
    +     * When doing case-insensitive searches, GLib treats \xXX symbols as normal
    +     * letters and therefore matches both "a" and "A" for the hex pattern
    +     * "0x61". When we switch out of UTF-8, we're switching to using [2], which
    +     * doesn't have this issue.
    +     *
    +     * [1] mc_search__cond_struct_new_init_regex
    +     * [2] mc_search__cond_struct_new_regex_ci_str
    +     */
    +    if (str_isutf8 (charset))
    +        charset = "ASCII";
    +
         tmp = mc_search__hex_translate_to_regex (mc_search_cond->str, &error, &error_pos);
         if (tmp != NULL)
         {
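
The comment in the patch relies on the case-insensitive conversion step leaving \xXX escapes alone while expanding only ASCII letters. The following is a simplified sketch of that idea, not the actual code of mc_search__cond_struct_new_regex_ci_str: the helper name ci_expand_ascii and its buffer contract are assumptions made for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical helper (not part of mc): rewrite a regex source string so
 * that each ASCII letter becomes a two-letter class, e.g. "b" -> "[bB]".
 * \xXX escapes and all other bytes (including bytes >= 0x80) are copied
 * through unchanged, so they keep matching exactly one byte value.
 *
 * Assumption: `out` has room for at least 4 * strlen (pattern) + 1 bytes.
 */
static void
ci_expand_ascii (const char *pattern, char *out)
{
    size_t i = 0, o = 0, len;

    assert (pattern != NULL && out != NULL);
    len = strlen (pattern);

    while (i < len)
    {
        unsigned char c = (unsigned char) pattern[i];

        if (c == '\\' && i + 3 < len && pattern[i + 1] == 'x'
            && isxdigit ((unsigned char) pattern[i + 2])
            && isxdigit ((unsigned char) pattern[i + 3]))
        {
            /* A \xXX escape: copy all four characters verbatim. */
            memcpy (out + o, pattern + i, 4);
            o += 4;
            i += 4;
        }
        else if (c < 0x80 && isalpha (c))
        {
            /* An ASCII letter: match either case. */
            out[o++] = '[';
            out[o++] = (char) tolower (c);
            out[o++] = (char) toupper (c);
            out[o++] = ']';
            i++;
        }
        else
        {
            /* Anything else is copied as-is and stays case-sensitive. */
            out[o++] = (char) c;
            i++;
        }
    }
    out[o] = '\0';
}
```

Under these assumptions, a translated hex pattern such as \x61 followed by the quoted text bc comes out as \x61[bB][cC]: the byte 0x61 still matches only "a", while the quoted letters match in either case.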