Ticket #400 (reopened defect)

Opened 15 years ago

Last modified 21 months ago

Support for multiline search, and for searching for 0x0A bytes.

Reported by: Hubbitus Owned by: andrew_b
Priority: major Milestone: Future Releases
Component: mc-search Version: master
Keywords: Cc:
Blocked By: #3694 Blocking:
Branch state: no branch Votes for changeset:

Description

So, as I known, now search made line by line, and no chance match/replace pattern like "Some text\n+Another text".

This possibilities is highly appreciated.

Attachments

400-001-search-introduce-slurp-mode.patch (7.3 KB) - added by mooffie 8 years ago.
400-002-hex-search-make-it-possible-to-search-for-0x0A-bytes.patch (1.1 KB) - added by mooffie 8 years ago.
400-003-mcedit-make-it-possible-to-search-across-lines.patch (3.8 KB) - added by mooffie 8 years ago.
searchnewline.diff (1.2 KB) - added by sxmboer2 2 years ago.
test-search.diff (4.4 KB) - added by sxmboer 2 years ago.
Adding a test suite for lib/search

Change History

comment:1 Changed 15 years ago by slavazanko

  • Milestone changed from 4.7 to future releases

comment:2 in reply to: ↑ description Changed 15 years ago by sxmboer

  • severity set to no branch

Replying to Hubbitus:

So, as I known, now search made line by line, and no chance match/replace pattern like "Some text\n+Another text".

This possibilities is highly appreciated.

This is already possible by typing CTRL-Q, Enter for the newline
This works for search and search and replace.

comment:3 Changed 15 years ago by Hubbitus

By CTRL+Q pressing "Insert Literal" dialog appeared.
mc-4.6.99.3-0.9.52.gd40065d

comment:4 Changed 9 years ago by mooffie

  • Branch state set to no branch

No, one can't search over multiple lines in mcedit.

The search library searches line by line. See mc_search__run_regex() in lib/search/regex.c. There's a loop there that accumulates a line:

   while (TRUE) {
      ...
      if (current_chr == '\n' || ...)
        break;
   }

(You can find the last \n on a line, but this has no real benefit. BTW, you don't have to use CTRL-Q: if you choose "Regular expression" you can simply type the two characters "\n".)

mc2 comes with a snippet that lets one search over multiple lines.

comment:5 Changed 8 years ago by mooffie

We can solve this by adding to the search dialog a "[x] Search over lines" checkbox what would turn off the "current_chr == '\n'" check I quoted in comment 4. This would make the text accumulate in one giant string instead of line-by-line (but we should set some limit in case one searches in a 20 terabyte file...).

If we detect a newline in the regexp, we'll search-over-lines automatically. But having a checkbox is good for the case one wants to search for things like "struct \{.*?\}", where it's a dot that stands for a newline.

Last edited 8 years ago by mooffie (previous) (diff)

comment:6 Changed 8 years ago by andrew_b

  • Component changed from mcedit to mc-search

comment:7 Changed 8 years ago by andrew_b

Ticket #3454 has been marked as a duplicate of this ticket.

comment:8 Changed 8 years ago by andrew_b

  • Blocked By 3694 added

comment:9 Changed 8 years ago by mooffie

  • Type changed from enhancement to defect
  • Summary changed from Please add multiline search in mcedit to Support for multiline search, and for searching for 0x0A bytes.

Since we can't find 0x10 bytes (in hex search), I'm changing this to 'defect'.

Version 2, edited 8 years ago by mooffie (previous) (next) (diff)

comment:10 Changed 8 years ago by mooffie

Ok, here's a series of three small patches to fix this ticket:

What's in it?

(1) "introduce slurp mode"

This mode disables the "\n" chunking, which, as comment comment:4 explains, is the cause of all the problems. There's very little actual code here.

(2) "hex search: make it possible to search for 0x0A bytes."

This patch uses slurp mode to fix the hex search bug.

(3) "mcedit: make it possible to search across lines."

This patch uses slurp mode to implement a new feature: letting users search across lines. It adds an "Across lines" checkbox to the search and search-replace dialogs:

┌─────────────────────── Search ───────────────────────┐
│ Enter search string:                                 │
│ \{.*?\}                                          [^] │
├──────────────────────────────────────────────────────┤
│ ( ) Normal                [ ] Case sensitive         │
│ (*) Regular expression    [ ] Backwards              │
│ ( ) Hexadecimal           [ ] In selection           │
│ ( ) Wildcard search       [ ] Whole words            │
│                           [x] Across lines           │
│                           [ ] All charsets           │
├──────────────────────────────────────────────────────┤
│           [< OK >] [ Find all ] [ Cancel ]           │
└──────────────────────────────────────────────────────┘

(In comment:5 I also suggested enabling this feature automatically if the pattern has newlines in it, but I no longer think this "extra wisdom" it a good idea; it will just add confusion.)

What's next?

Andrew, if you're fine with this solution, let me know and I'll work on some refinements:

  • Work on the 'todo' mentioned inside patch (1).
  • Write help text for the "Across lines" checkbox. As #1561 points out, there's no help page for the search box, so this will be done in a separate ticket.
  • Add the "Across lines" checkbox to the viewer's search dialog as well.
  • Automatically turn off the "Across lines" checkbox when the user clicks radio other than "Regular expression" (just so the user doesn't forget to turn it off). We may also disable it.

Performance

Initially I thought backwards-search, which is not implemented quite efficiently, would put the kibosh on slurp mode. Surprisingly, however, I didn't notice a substantial difference in performance between slurp and non-slurp mode when doing backwards-search. Neither was the memory strained.

Maintainability

There's very little code added, and the patches essentially modify only two lines --in regex.c. To me this looks like a safe solution.

Changed 8 years ago by mooffie

comment:11 follow-up: ↓ 12 Changed 8 years ago by zaytsev

"Across lines" sounds a bit weird to me, do you know how it is done/called in other editors? I quickly checked gedit and kate, and they don't seem to have such a checkbox, but if the pattern contains \n, they capture multiple lines by default... Is there a downside to this behavior?

comment:12 in reply to: ↑ 11 Changed 8 years ago by mooffie

Replying to zaytsev:

"Across lines" sounds a bit weird to me


I had to pick something that matches the grammatical setting of the surrounding checkboxes: "[search] backwards", "[search] in selection", "[search] whole words", "[search] all charsets", and therefore "[search] across lines". I don't mind changing it once somebody comes up with an alternative.

I quickly checked gedit and kate, and they don't seem to have such a checkbox, but if the pattern contains \n, they capture multiple lines by default...


(These editors evidently aren't searching the text line-by-line like we do, so they don't face the challenge of having to detect when to turn off this behavior.)

if the pattern contains \n, they capture multiple lines by default... Is there a downside to this behavior?


First, let's see if we agree that we have to have a checkbox at all. Consider "struct \{.*?\}". We can't know that the user intends for the "." to match any char. So we have to let her tell us. A checkbox seems a straightforward way. (BTW, the regex world has a somewhat similar virtual checkbox: the //s flag (I say "somewhat" because MC unconditionally turns G_REGEX_DOTALL on).)

Once we agree that a checkbox is necessary, the bonus issue of automatically detecting \n can be blissfully handled independently, in a separate ticket.

if the pattern contains \n, they capture multiple lines by default... Is there a downside to this behavior?


I'm not against the idea. But then there's the question of how to correctly detect \n, and whether we'll surprise the user (i.e., "Begin.*\n.*End" won't match just two lines). We'll have discussions that have the potential to hold up the ticket for months and years.

So I think there are two possibilities:

(A) We can delegate the \n detection to a separate ticket.
(B) We can split this ticket into two: patches (1) and (2) will go back into ticket #3454, and hopefully get committed. Patch (3) will stay here and be discussed further.

comment:13 Changed 8 years ago by ossi

"spanning lines" seems like a reasonable label.

defaulting to this behavior also seems reasonable. the user just needs to use "[^\n]" instead of ".".

Last edited 8 years ago by zaytsev (previous) (diff)

comment:14 Changed 2 years ago by andrew_b

Ticket #4395 has been marked as a duplicate of this ticket.

comment:15 follow-up: ↓ 16 Changed 2 years ago by sxmboer2

I submitted a fix as ticket #4395, but I probably should have added a comment here and uploaded the bugfix by attaching the diff file here.
The diff file I am attaching fixes various search issue related to multi-line searches and replacements; both in normal mode as well as in regex search mode.
I changed the boundary conditions because testing in search and replaces showed that the search consistently found the very end of the file as a hit (when searching for "\n\n" even when only one newline exists at the end of the file buffer.

Changed 2 years ago by sxmboer2

comment:16 in reply to: ↑ 15 Changed 2 years ago by andrew_b

Replying to sxmboer2:

The diff file I am attaching fixes various search issue related to multi-line searches and replacements; both in normal mode as well as in regex search mode.

A tests in tests/lib/search are highly desirable for this patch.

comment:17 Changed 2 years ago by sxmboer

Test suite was a good idea, but it took me a significant amount of time to figure out how to use the unit test system. The added source code includes tests that are somewhat specific to this ticket.
It should be easy to add tests for the other search modes, whenever they are relevant.
Adding patch test-search.diff

Changed 2 years ago by sxmboer

Adding a test suite for lib/search

comment:18 Changed 2 years ago by andrew_b

  • Status changed from new to accepted
  • Owner set to andrew_b
  • Branch state changed from no branch to on review
  • Milestone changed from Future Releases to 4.8.29

Cool. Thanks!

Branch: 400_multiline_search
changeset:89bb5b60104178097b98e7c7fd2b28bfabc4ebbb

comment:19 Changed 2 years ago by andrew_b

  • Votes for changeset set to andrew_b
  • Branch state changed from on review to approved

comment:20 Changed 2 years ago by andrew_b

  • Status changed from accepted to testing
  • Votes for changeset changed from andrew_b to committed-master
  • Resolution set to fixed
  • Branch state changed from approved to merged

comment:21 Changed 2 years ago by andrew_b

  • Status changed from testing to closed

comment:22 Changed 21 months ago by andrew_b

  • Status changed from closed to reopened
  • Votes for changeset committed-master deleted
  • Resolution fixed deleted
  • Branch state changed from merged to no branch
  • Milestone changed from 4.8.29 to Future Releases

Multi-line search has some bugs. See #4429 for details.

Note: See TracTickets for help on using tickets.