Ticket #117 (closed defect: fixed)

Opened 15 years ago

Last modified 4 years ago

savannah: consecutive resize events not handled correctly

Reported by: egmont Owned by: andrew_b
Priority: minor Milestone: 4.8.24
Component: mc-tty Version: master
Keywords: Cc: ossi, zaytsev, egmont@…
Blocked By: Blocking:
Branch state: merged Votes for changeset: andrew_b

Description (last modified by ossi) (diff)

Original: http://savannah.gnu.org/bugs/?17822

Submitted by:Egmont Koblinger <egmont>Submitted on:Fri 22 Sep 2006 01:04:44 PM UTC
Category:Screen outputSeverity:3 - Normal
Status:NonePrivacy:Public
Assigned to:NoneOpen/Closed:Open
Release:4.6.1Operating System:GNU/Linux

Original submission:

Modern desktop environments and window managers usually resize the
windows in an opaque way: plenty of resize events are sent to the
application while the edge of the window is being dragged. This
gives users a much better feedback than the ancient method (only
showing where the edges of the window will be, and resizing when
the mouse button is released), though it definitely requires much
more cpu resources.

Unfortunately mc is unable to properly handle when the terminal is
resized opaquely. It receives tons of resize events (sigwinch),
most likely ignores the new ones while processing an older one.
Quite often when I finish resizing my window, mc draws its panels
with a different size. Hence if I enlarge my terminal, I might get
black columns/rows on its right/bottom part. If I shrink the
window, the whole screen can be garbled.

Then when I press one key, or click somewhere in the terminal with
the mouse, suddenly mc's screen is repainted with the right size,
and everything goes on perfectly.

It would be nice to eliminate these race conditions and make mc
resize correctly to the final terminal size, no matter how many
resize events were sent in a very short time. 

Bonus (reported in #368): if you start mc in an xterm and resize the terminal before mc shows the panels, it will not be adjusted to the terminal's size until you resize again once the panels are there.

Comment 1 by Egmont Koblinger <egmont> at Tue 10 Oct 2006 03:27:46 PM UTC:

Normally mc stays in a select() call in key.c:get_event() which is
called by dialog.c:frontend_run_dlg().

frontend_run_dlg() checks if the window size has changed and calls
the proper functions in this case. Then it does a lot of other
things, then calls get_event() which yet again does a lot of other
things and finally arrives at that select().

A subsequent window resize event (SIGWINCH) causes this select to
return with -1 EINTR which is later handled correctly.

The problem is caused by the lot of code lines executed after
checking for a window size change, but before entering the select
call. If another window size change event occurs while executing
these commands, it will not be handled until select() exits due to
a keypress for example.

In key.c:get_event(), I placed the following line above the
"flag = select (...)" statement:
if (winch_flag) return EV_NONE;
and then my resizing problems were gone. So now I check again for
a resize event right before entering the select function.

Though this modification seems to solve my bug, another bug
appeared. Press F5 on ".." so an error dialog box appears. Resize
the window now. mc enters an infinite loop consuming 100% CPU time
and not reacting on any keypress. I don't yet know how to fix this
new bug. Swapping the new statement and the enable_interrupt_key()
call right before it doesn't help either.

On the other hand, even if we didn't have any side effects, my
patch still doesn't fix my original problem, only decreases the
possibility for this to occur. Still there is a chance that
SIGWINCH arrives right after checking for winch_flag but before
entering the select() call.

I'm just curious how to solve this problem 100% perfectly, without
any race condition, without the slightest possibility to misbehave.
Now I'm interested in the theory first, not in the implementation
details in mc. Let's suppose a single application that only wants
to do two things: process data arriving from several file
descriptors, and always correctly display the terminal size from
its main loop (i.e. not from a signal handler, since doing complex
things from a signal handler is just plain wrong). My only idea is
the following: create a pipe, the sigwinch handler writes a byte
into it, and the select() call checks for its reading end in
addition to the other file descriptors it is interested in. This
way the select() call immediately exits if there's an unprocessed
sigwinch event, no matter if it occured before or during this
select call. And of course we have to set the close-on-exec flag on
these fd's so that subprocesses don't inherit them.

Comment 2 by Egmont Koblinger <egmont> at Tue 10 Oct 2006 03:40:51 PM UTC:

Oh, I just found the pselect() call which is supposed to solve this
problem. However, its manpage says the Linux kernel supports this
only since 2.6.16 which is a quite new piece. It says that glibc
had an emulation which is vulnerable to the race condition I just
mentioned and pselect was just introduced for.
The manpage also mentions and recommends the self-pipe trick which
is more portable.

Comment 3 by Egmont Koblinger <egmont> at Tue 10 Oct 2006 04:22:16 PM UTC:

Added a resize.patch. This does not suffer from the new bug, though
I don't understand what made a difference.

Still a rewrite to using pipes is needed to fill a minor race condition.

Comment 4 by Leonard den Ottolander <leonardjo> at Fri 03 Nov 2006 07:14:55 PM UTC:

Shall I commit this patch or should I wait for the piped version?

Comment 5 by Egmont Koblinger <egmont> at Mon 06 Nov 2006 12:01:58 PM UTC:

Please commit it if it seems to be okay for you (I've been using mc
4.6.1 with this patch for a month, and found no side effects, while
it makes mc much better when resizing the terminal). But please
don't yet close this bugreport to remind us that a proper fix is
still needed. This patch only makes it much better, but still
there's a small chance for the bug to appear. Until we have
a proper fix, it's good to apply a temporary hack to heavily
decrease the chance for the bug.

Comment 6 by Leonard den Ottolander <leonardjo> at Mon 06 Nov 2006 10:24:35 PM UTC:

As I understand your patch/hack, eliminating the timeouts on a
window resize event significantly reduces the chance mc is still
waiting inside the get_event() loop when a new resize event occurs.

Comment 7 by Leonard den Ottolander <leonardjo> at Wed 08 Nov 2006 01:38:24 PM UTC:

Committed. Quite an improvement in behaviour. Thank you.

Comment 8 by Denis Vlasenko <vda> at Mon 01 Oct 2007 03:21:44 PM UTC:

If you run without mouse support (mc -d), mc doesn't detect resizes.

You need to press a key for mc to resize itself to new window after
window is maximized or resized (press up/down arrow, Ctrl-O, almost
anything will work).

Tested on 4.6.2-pre1.

Comment 9 by Oswald Buddenhagen <ossi> at Mon 01 Oct 2007 09:03:01 PM UTC:

when a resize is sent before mc is completely up, it doesn't get
the geometry right, either - and yes, this happens about every time
an xterm -e mc is restored by session management on my system. so
the signal handler should be set up before the initial geometry is
queried.

Comment 10 by Pavel Tsekov <ptsekov> at Tue 02 Oct 2007 09:18:19 AM UTC:

Please, do not steal existing bug reports.

Attachments

10939-resize.patch (1.3 KB) - added by slavazanko 15 years ago.

Change History

comment:1 Changed 54 years ago by andrew_b

  • Blocking 2198 added

(In #2198) Replying to Spinal:

That's probably the bug introduced in new-generation of midnight commander, developed by Slavaz.

Definetely, no.

I don't have the same bug in mc-4.6.2-pre1 version used on my Debian Lenny server.

Strange. I've tried 4.6.1 from repo, and it doesn't work also. It seems, subshell resizing in NCurses-based mc didn't work ever.

Replying to egmont:

I've tested it with the following 4x2 combinations, on Mac OS 10.5:
{ 4.6.1, 4.6.2, 4.7.0.5, 4.7.2 } x { slang, ncurses }

It's always the ncurses build that doesn't resize properly, with any of those mc versions.

Correct, it is an NCurses-related bug only. But implementation of window resize handling in S-Lang-based mc is not quite correct.

In general, the implementation of SIGWINCH signal handling in mc should be rewritten. One of possible way is described in #117.

Changed 15 years ago by slavazanko

comment:1 Changed 15 years ago by styx

  • Milestone set to future releases

comment:2 Changed 15 years ago by andrew_b

  • severity set to no branch
  • Description modified (diff)

comment:3 Changed 14 years ago by ossi

i had a quick look at the code, and it is the very definition of "mess". :)

what should be used here is the so-called signal self-pipe trick which is employed to serialize signals into the regular event loop.
a pretty minimal qt-based example can be seen at http://websvn.kde.org/trunk/KDE/kdebase/workspace/kscreensaver/libkscreensaver/main.cpp?r1=737514&r2=737513&pathrev=737514&view=patch

that would also remove the need for "minimal" resize handling and special-casing of slang vs. ncurses. the current code isn't signal-safe anyway - you may still get a random crash during resizing.

the same also applies to the sigchld handler (or rather, the two of them, which is a problem in itself). so maybe it would be appropriate to re-purpose this bug to a general "clean up signal handling".

comment:4 Changed 14 years ago by ossi

  • Cc ossi@… added

comment:5 Changed 14 years ago by zaytsev

  • Cc zaytsev added
  • Version changed from 4.6.1 to master
  • Description modified (diff)
  • Milestone changed from Future Releases to 4.7

Moved an extra bit from #368 to this report.

comment:6 Changed 12 years ago by andrew_b

  • Branch state set to no branch
  • Milestone changed from 4.7 to Future Releases

comment:7 Changed 11 years ago by andrew_b

  • Blocking 2198 removed

comment:8 follow-up: ↓ 9 Changed 10 years ago by ossi

to fix this bug properly, one needs to make some improvements to the main loop.
what do you guys think of switching to the glib mainloop and using proper socket notifiers and timers instead of the current handcrafted solution?

comment:9 in reply to: ↑ 8 Changed 10 years ago by andrew_b

Replying to ossi:

what do you guys think of switching to the glib mainloop?

Yes, we think about that.

comment:10 Changed 10 years ago by ossi

  • Cc ossi added; ossi@… removed
  • Description modified (diff)
  • Reporter changed from slavazanko to egmont

comment:11 Changed 10 years ago by ossi

  • Description modified (diff)

comment:12 Changed 10 years ago by egmont

  • Cc egmont@… added
  • Priority changed from major to minor

Now that pselect() is widely available, one quick and easy way to properly fix the problem is to change the main select() call to a pselect() which blocks/unblocks sigwinch. Glib mainloop would of course be much nicer, but probably more work. For what it's worth, the bug never occurred to me since it was "almost fixed", the race condition window is so small.

comment:13 Changed 10 years ago by ossi

i don't really like that - the critical sections would have to be quite large, and it's a nightmare to make sure everything is covered.

note that the glib mainloop is no hard precondition. i just wouldn't want to think of hacking the existing mainloop(s) if a cleaner solution is planned anyway.

comment:14 Changed 5 years ago by andrew_b

  • Status changed from new to accepted
  • Owner set to andrew_b
  • Component changed from mc-core to mc-tty
  • Branch state changed from no branch to on review
  • Milestone changed from Future Releases to 4.8.24
Last edited 5 years ago by andrew_b (previous) (diff)

comment:15 Changed 5 years ago by egmont

Is it really worth it? Looks quite complex for a nasty hackish workaround to me, instead of using the proper pselect.

Version 0, edited 5 years ago by egmont (next)

comment:16 Changed 5 years ago by egmont

Also, are these pipe fds closed when launching any external process? I can't test the patch now, sorry, but there's at least no cloexec in it.

comment:17 Changed 5 years ago by ossi

signal serialization via a pipe is the standard way to do things in an event loop based application. indeed, the linux kernel recently added dedicated poll()-able objects to make the workaround with the pipe unnecessary (forkfd, signalfd, eventfd, and some more iirc).

this implementation has the additional feature of signal compression, which is appropriate.

leaked fds should be indeed avoided. see pipe2() and fcntl() with O_CLOEXEC (of course, one can manually track the fds as well, but that's annoying).

comment:18 Changed 5 years ago by ossi

fwiw, the error handling should be a lot less defensive - when pipe() or fcntl() fail, something is seriously wrong, and you should perror() and exit(3) rather than falling back to no action. that also removes the need for conditionals down the path.

comment:19 Changed 5 years ago by egmont

signal serialization via a pipe is the standard way to do things in an event loop based application.

I'd agree that this used to be the standard way (or rather: standard workaround) for this, ages ago, until a cleaner solution was designed.

indeed, the linux kernel recently added dedicated poll()-able objects

According to its manual page, "pselect() is defined in POSIX.1g, and in POSIX.1-2001 and POSIX.1-2008", furthermore, "pselect() was added to Linux in kernel 2.6.16" (which was released in 2006). That's not "recently", that's a long ago.

comment:20 Changed 5 years ago by andrew_b

pselect() has portability problems.

comment:21 Changed 5 years ago by egmont

This function is missing on some platforms:
OpenBSD 3.8

Released in 2005

Minix 3.1.8

2010

AIX 5.1

2001

HP-UX 11.23

2003

IRIX 6.5

2006

OSF/1 5.1

2000

Solaris 9

2002

mingw

last release is 6 years old

MSVC 14

2015 (if I'm looking at the right thing: Visual C++ 2015 a.k.a. C++ 14.0), two new versions since

Interix 3.5

2004

BeOS.

last version in 2000

On some platforms, this function fails to detect invalid fds with EBADF, but only if they lie beyond the current maximum open fd:

FreeBSD 8.2.

2013

Honestly, why do we care about such old systems, or ones that haven't been able to implement a standard for 10+ (but even like 18-ish) years?

Let me put it into a different perspective: Is there any platform where mc currently compiles and runs fine, and pselect would break it?

comment:22 Changed 5 years ago by ossi

how exactly is pselect() a pollable object? i don't think you bothered to research the relevant keywords i posted here ...

even with perfect pselect() support, getting signal handling right in an event-loop based application (and even more a library, like glib and qt) is basically an impossible task. just forget about it.

comment:23 follow-up: ↓ 24 Changed 5 years ago by egmont

how exactly is pselect() a pollable object?

Indeed, my bad, I didn't literally answer to your question, rather meant to answer to the generic concerns about pselect.

getting signal handling right in an event-loop based application (and even more a library, like glib and qt) is basically an impossible task

Could you please elaborate? Especially in the context of mc, which isn't a library, doesn't use glib's main loop, nor qt.

mc spends its idle times in one of its few select() calls. Let's generally block WINCH, and replace these select()s with pselect()s that unblock WINCH for their duration only; let the signal handler flip a bit (as it does now), and let's check for that bit after the pselect() and resize if necessary (as it's pretty much done now). What's the problem?

Given that pselect was specifically designed to solve this very problem we're arguing about, e.g. quoting from its manual page

The reason that pselect() is needed is that if one wants to wait for either a signal or for a file descriptor to become ready, then an atomic test is needed to prevent race conditions.

I'm truly skeptical if anyone says pselect() is a problematic choice.

comment:24 in reply to: ↑ 23 ; follow-up: ↓ 25 Changed 5 years ago by ossi

Replying to egmont:

in the context of mc, which [...] doesn't use glib's main loop

as a matter of fact, it's supposed to start using the glib main loop once someone does the work. the current loops are an utter #ifdef mess.

mc spends its idle times in one of its few select() calls.

"one of its few" is a recipe for "fun".

What's the problem?

none so far, except that you get an extremely rigid structure that gets harder to extend and more fragile each time you (try to) add something to it. sigchld? timers?

comment:25 in reply to: ↑ 24 ; follow-up: ↓ 26 Changed 5 years ago by egmont

as a matter of fact, it's supposed to start using the glib main loop once someone does the work. the current loops are an utter #ifdef mess.

Let's code against current mc, not a future mc that we have in our minds but is unlikely to get implemented in the foreseeable future.

"one of its few" is a recipe for "fun".

The very same kind and amount of "fun" as the current raceable WINCH handling, or the proposed self-pipe solution.

What's the problem?

none so far, except that you get an extremely rigid structure that gets harder to extend and more fragile each time you (try to) add something to it. sigchld? timers?

I truly don't see it justified that replacing the current select() with a pselect() to temporarily unblock WINCH would result in an any more rigid structure than the self-pipe trick, e.g. I don't see it justified that adding a timer to pselect() would be any more complicated than adding it to select().

Mind you, it doesn't hold in the other direction either. It's not that the self-pipe trick is bad per se, nor that it's significantly more complex than pselect. It's just that the self-pipe trick is an ancient workaround for something that has had a properly designed solution for 10+ years. Hence my preference of going with the latter.

comment:26 in reply to: ↑ 25 ; follow-up: ↓ 27 Changed 5 years ago by ossi

Replying to egmont:

Let's code against current mc, not a future mc that we have in our minds

the code is already written and happens to be more compatible with the assumed future direction. it would be plain silly to change the approach now.

comment:27 in reply to: ↑ 26 Changed 5 years ago by egmont

the code is already written

Including the lack of O_CLOEXEC which needs to be fixed before merging to master.

and happens to be more compatible with the assumed future direction.

I still don't see this claim being justified.

it would be plain silly to change the approach now.

Let andrew_b pick whichever of these two approaches, based on the comments so far. Andrew, whichever you pick, thanks for the work!

comment:28 Changed 5 years ago by andrew_b

Sorry, I'm too busy and have no time to fix my code. I'll to that on weekend.

comment:29 Changed 5 years ago by andrew_b

comment:30 follow-up: ↓ 31 Changed 5 years ago by ossi

the pipe handling is now correct afaict.

anyway, i had a somewhat closer look at the rest now, and i'm not sure it makes sense: why is the setup specific to the toolkit? why are you still chaining to slang? isn't it possible to notify it externally, synchronously with the main loop?

the other question i have is how this plays with nested main loops (i.e., modal dialogs). will the whole stack resize, or only the top window?

comment:31 in reply to: ↑ 30 ; follow-up: ↓ 32 Changed 5 years ago by andrew_b

Replying to ossi:

why is the setup specific to the toolkit?

Because we have two different toolkits.

why are you still chaining to slang? isn't it possible to notify it externally, synchronously with the main loop?

What do you mean? I'm afraid I'm not quite understanding you.

the other question i have is how this plays with nested main loops (i.e., modal dialogs). will the whole stack resize, or only the top window?

pipe is just replacement for global flag winch_flag. SIGWINCH handling is unchanged,

comment:32 in reply to: ↑ 31 ; follow-up: ↓ 33 Changed 5 years ago by ossi

Replying to andrew_b:

Because we have two different toolkits.

that doesn't answer the question. if the setup was toolkit-agnostic, there would be no need to treat them separately.

What do you mean?

the slang codepath still passes on sigwinch directly to slang. it is well possible that slang offers an api function to trigger a size adjustment, making the special-cased signal handling unnecessary.

pipe is just replacement for global flag winch_flag. SIGWINCH handling is unchanged,

there is a major difference, which is that now querying the pipe implicitly resets the "flag", while previously it didn't. and the querying is done in several places. i didn't check the implications, just pointing it out.

comment:33 in reply to: ↑ 32 Changed 5 years ago by andrew_b

Replying to ossi:

Replying to andrew_b:
if the setup was toolkit-agnostic, there would be no need to treat them separately.

It wasn't.

it is well possible that slang offers an api function to trigger a size adjustment

Unfortunately, no.

comment:34 Changed 5 years ago by andrew_b

  • Votes for changeset set to andrew_b
  • Branch state changed from on review to approved

comment:35 Changed 5 years ago by andrew_b

  • Status changed from accepted to testing
  • Resolution set to fixed
  • Branch state changed from approved to merged

comment:36 Changed 5 years ago by andrew_b

  • Status changed from testing to closed

comment:37 Changed 5 years ago by andrew_b

Fix is in #4019.

comment:38 Changed 4 years ago by andrew_b

Yet another fix: #4052.

Note: See TracTickets for help on using tickets.