| 140 | /* |
| 141 | * We may be searching in binary data, which is often invalid UTF-8. |
| 142 | * |
| 143 | * We have to create a non-UTF-8 regex (that is, G_REGEX_RAW); otherwise, |
| 144 | * because the data is invalid UTF-8, both GLib's PCRE and our |
| 145 | * mc_search__g_regex_match_full_safe() will fail us: the former by not |
| 146 | * finding all bytes, the latter by overwriting the supposedly invalid |
| 147 | * UTF-8 with NULs. |
| 148 | * |
| 149 | * To do this, we specify "ASCII" as the charset. |
| 150 | * |
| 151 | * In fact, we can specify any charset other than "UTF-8": any such charset |
| 152 | * will trigger G_REGEX_RAW (see [1]). The output of [2] will be the same |
| 153 | * for all charsets because it skips the \xXX symbols that |
| 154 | * mc_search__hex_translate_to_regex() outputs. |
| 155 | * |
| 156 | * But "ASCII" is the best choice because a hex pattern may contain a |
| 157 | * quoted string: this way we know [2] will ignore any characters outside |
| 158 | * the ASCII letter range. (These ignored chars will be copied verbatim to |
| 159 | * the output and will match as-is, that is, case-sensitively. If the user |
| 160 | * is interested in case-insensitive searches of international text, they |
| 161 | * shouldn't be using hex search in the first place.) |
| 162 | * |
| 163 | * Switching out of UTF-8 has another advantage: |
| 164 | * |
| 165 | * When doing case-insensitive searches, GLib treats \xXX symbols as normal |
| 166 | * letters and therefore matches both "a" and "A" for the hex pattern |
| 167 | * "0x61". When we switch out of UTF-8, we use [2] instead, which |
| 168 | * doesn't have this issue. |
| 169 | * |
| 170 | * [1] mc_search__cond_struct_new_init_regex |
| 171 | * [2] mc_search__cond_struct_new_regex_ci_str |
| 172 | */ |
| 173 | if (str_isutf8 (charset)) |
| 174 | charset = "ASCII"; |
| 175 | |