MESSAGE
DATE | 2017-01-20 |
FROM | ruben safir
|
SUBJECT | Subject: [Learn] Fwd: threads and exit() woes
|
From: ruben safir
To: learn-at-nylxs.com
Date: Fri, 20 Jan 2017 15:01:18 -0500
Subject: [Learn] Fwd: threads and exit() woes
Nice thread to share....
see the comment by Lew

From: jt-at-toerring.de (Jens Thoms Toerring)
Newsgroups: comp.unix.programmer
Subject: threads and exit() woes
Date: 12 Dec 2016 23:03:08 GMT
Organization: Freie Universitaet Berlin
Hi,
I've to deal with a multi-threaded program that has, as one of its threads a "watchdog thread" that, when it doesn't notice some variable getting set within a certain time, is supposed to stop the whole program (at any cost, no worries about data lost). It does attempt to shut down the program by calling exit(). Now, the references I have consulted (TLPI, APUE 3rd ed. etc.) all claim that when one of the threads calls exit() the program will be ended. A look at SUSv4 just mentions in addition that the end of the program might be delayed if there are outstanding asynchronous I/O operations that can't be cancelled (nothing I guess I'm having).
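For readers who want to reproduce the arrangement described above, here is a minimal sketch in C. It is not the application's code: the names heartbeat, watchdog, run_watchdog_demo and the exit status 42 are invented for illustration, a C11 atomic stands in for whatever variable the real program sets, and _exit() is used in place of exit() so that the forked test child skips atexit handlers. The demo runs in a forked child so that the caller can inspect the exit status.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define WATCHDOG_EXIT_CODE 42      /* hypothetical status, chosen for the demo */
#define WATCHDOG_TIMEOUT_MS 200L   /* hypothetical timeout */

static atomic_int heartbeat;       /* the worker sets this to 1 to signal "still alive" */

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Watchdog thread: once per timeout period, read and clear the heartbeat.
 * If nobody set it since the last check, end the whole process. */
static void *watchdog(void *arg)
{
    (void)arg;
    for (;;) {
        sleep_ms(WATCHDOG_TIMEOUT_MS);
        if (atomic_exchange(&heartbeat, 0) == 0)
            _exit(WATCHDOG_EXIT_CODE);   /* terminates all threads */
    }
    return NULL;                          /* not reached */
}

/* Runs the scenario in a forked child and returns the child's exit status
 * (or -1 on error). worker_is_alive != 0: the main thread keeps feeding the
 * heartbeat and exits with 0. worker_is_alive == 0: the main thread blocks
 * forever and only the watchdog can end the process. */
int run_watchdog_demo(int worker_is_alive)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_t t;
        atomic_store(&heartbeat, worker_is_alive ? 1 : 0);
        if (pthread_create(&t, NULL, watchdog, NULL) != 0)
            _exit(99);
        if (worker_is_alive) {
            for (int i = 0; i < 50; i++) {   /* about 500 ms of "work" */
                atomic_store(&heartbeat, 1);
                sleep_ms(10);
            }
            _exit(0);
        }
        for (;;)
            pause();                          /* simulated hung worker */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_watchdog_demo(0) models the hung program: on a healthy system the child should go away after roughly one timeout period. Build with -pthread on older glibc versions.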
This did work with a 3.4 Linux kernel. But after switching to a 4.4 kernel it suddenly doesn't work reliably anymore. If it fails one thread seems to run amok, using about 50% of the CPU time, the other 50% being used by ksoftirqd. The whole thing can't be stopped in any way (not even with 'kill -SIGKILL'). I've also tried to replace the exit() call with a kill(getpid(), SIGKILL) but also with no luck. Attaching with gdb fails as well (hangs indefinitely). Looks like a real zombie: dead and very active at the same time:-(
Does that ring a bell with any of you? One of the threads is rather likely to do a lot of epoll() calls.
Please keep in mind that I can't simply change the whole architecture - this is an embedded system already out in the field, and my role in this is to get a new kernel version to work, not upset a more or less working application (unless I can come up with very convincing arguments;-)
Best regards, Jens
--
  \   Jens Thoms Toerring  ___  jt-at-toerring.de
   \__________________________  http://toerring.de

From: william-at-wilbur.25thandclement.com
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Mon, 12 Dec 2016 15:35:59 -0800
Jens Thoms Toerring wrote:
> Hi,
>
> I've to deal with a multi-threaded program that has, as
> one of its threads a "watchdog thread" that, when it doesn't
> notice some variable getting set within a certain time,
> is supposed to stop the whole program (at any cost, no
> worries about data lost). It does attempt to shut down the
> program by calling exit().

> This did work with a 3.4 Linux kernel. But after switching
> to a 4.4 kernel it suddenly doesn't work reliably anymore.
> If it fails one thread seems to run amok, using about 50%
> of the CPU time, the other 50% being used by ksoftirqd. The
> whole thing can't be stopped in any way (not even with 'kill
> -SIGKILL'). I've also tried to replace the exit() call with
> a kill(getpid(), SIGKILL) but also with no luck. Attaching
> with gdb fails as well (hangs indefinitely). Looks like a
> real zombie: dead and very active at the same time:-(
A shot in the dark: is the application using robust mutexes? That's the first thing that comes to mind. Robust mutexes require the kernel, when destroying a thread, to walk a userspace linked-list data structure.
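For anyone who, like the original poster, has not met the term before: a robust mutex is an ordinary pthread mutex created with the PTHREAD_MUTEX_ROBUST attribute. If its owner dies while holding it, the next locker gets EOWNERDEAD instead of blocking forever. The sketch below only shows that user-visible behaviour, not the kernel-side list walk mentioned above; the names die_holding_lock and robust_mutex_demo are invented for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t lock;

/* Takes the mutex and then terminates without unlocking it, standing in
 * for a thread that dies while holding a lock. */
static void *die_holding_lock(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    return NULL;
}

/* Returns 1 if the next locker was told the previous owner died
 * (EOWNERDEAD) and recovered the mutex, 0 if the lock was acquired
 * normally, -1 on any other error. */
int robust_mutex_demo(void)
{
    pthread_mutexattr_t attr;
    pthread_t t;
    int rc, result;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0)
        return -1;
    if (pthread_mutex_init(&lock, &attr) != 0)
        return -1;
    pthread_mutexattr_destroy(&attr);

    if (pthread_create(&t, NULL, die_holding_lock, NULL) != 0)
        return -1;
    pthread_join(t, NULL);               /* the owner is now gone */

    rc = pthread_mutex_lock(&lock);
    if (rc == EOWNERDEAD) {
        /* We hold the lock, but the protected state may be inconsistent.
         * Mark the mutex usable again before unlocking. */
        pthread_mutex_consistent(&lock);
        result = 1;
    } else if (rc == 0) {
        result = 0;
    } else {
        return -1;
    }
    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    return result;
}
```

With glibc on Linux, robust_mutex_demo() should return 1. Searching the application and its libraries for pthread_mutexattr_setrobust would be one way to find out whether robust mutexes are in use at all.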

From: jt-at-toerring.de (Jens Thoms Toerring)
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: 13 Dec 2016 00:16:48 GMT
Organization: Freie Universitaet Berlin
william-at-wilbur.25thandclement.com wrote:
> Jens Thoms Toerring wrote:
> > Hi,
> >
> > I've to deal with a multi-threaded program that has, as
> > one of its threads a "watchdog thread" that, when it doesn't
> > notice some variable getting set within a certain time,
> > is supposed to stop the whole program (at any cost, no
> > worries about data lost). It does attempt to shut down the
> > program by calling exit().
>
> > This did work with a 3.4 Linux kernel. But after switching
> > to a 4.4 kernel it suddenly doesn't work reliably anymore.
> > If it fails one thread seems to run amok, using about 50%
> > of the CPU time, the other 50% being used by ksoftirqd. The
> > whole thing can't be stopped in any way (not even with 'kill
> > -SIGKILL'). I've also tried to replace the exit() call with
> > a kill(getpid(), SIGKILL) but also with no luck. Attaching
> > with gdb fails as well (hangs indefinitely). Looks like a
> > real zombie: dead and very active at the same time:-(

> A shot in the dark: is the application using robust mutexes? That's the
> first thing that comes to mind. Robust mutexes require the kernel, when
> destroying a thread, to walk a userspace linked-list data structure.
Unfortunately, I can't say (and the term "robust mutex" was new to me, admittedly). There are several libraries involved that create their own threads (libevent, libusb etc.) about which I can't say much. The rest of the threads in the application itself usually use pipes for basic communication apart from very simple boolean values, defined as volatile sig_atomic_t for certain state information. But, as far as I can see (but this can change as I get around to delving deeper into the application) there are no mutex locks that might lead to some kind of dead-lock. But then it's 150 kloc of code I'm not too familiar with... I'll definitely look at this aspect!
Could something like that keep a program alive that sends itself a SIGKILL (or does exit() or _exit())? Those are all things I've tried. The only result was that the chance that it got stuck in that strange busy, non-killable state seemed to change (and each test run until the problem appears can take an hour or more, making things somewhat annoying;-)
Thank you and best regards, Jens
--
  \   Jens Thoms Toerring  ___  jt-at-toerring.de
   \__________________________  http://toerring.de

Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Mon, 12 Dec 2016 21:31:02 -0800
Jens Thoms Toerring wrote:
> william-at-wilbur.25thandclement.com wrote:
>> Jens Thoms Toerring wrote:
>> > Hi,
>> >
>> > I've to deal with a multi-threaded program that has, as
>> > one of its threads a "watchdog thread" that, when it doesn't
>> > notice some variable getting set within a certain time,
>> > is supposed to stop the whole program (at any cost, no
>> > worries about data lost). It does attempt to shut down the
>> > program by calling exit().
>>
>> > This did work with a 3.4 Linux kernel. But after switching
>> > to a 4.4 kernel it suddenly doesn't work reliably anymore.
>> > If it fails one thread seems to run amok, using about 50%
>> > of the CPU time, the other 50% being used by ksoftirqd. The
>> > whole thing can't be stopped in any way (not even with 'kill
>> > -SIGKILL'). I've also tried to replace the exit() call with
>> > a kill(getpid(), SIGKILL) but also with no luck. Attaching
>> > with gdb fails as well (hangs indefinitely). Looks like a
>> > real zombie: dead and very active at the same time:-(
>
>> A shot in the dark: is the application using robust mutexes? That's the
>> first thing that comes to mind. Robust mutexes require the kernel, when
>> destroying a thread, to walk a userspace linked-list data structure.
>
> Unfortunately, I can't say (and the term "robust mutex" was new
> to me, admittedly). There are several libraries involved that
> create their own threads (libevent, libusb etc.) about which I
> can't say much.
USB, embedded... I switch my vote to a USB driver issue ;)
> Could something like that keep a program alive that sends it-
> self a SIGKILL (or does exit() or _exit())? That are all things
> I've tried. The only result was that the chance that it got
> stuck in that strange busy, non-killable state seemed to change
> (and each test runs until the problem appears can take an hour
> and more, making things somewhat annoying;-)
Theoretically the kernel shouldn't have a problem if the linked-list is corrupted or if any of the memory it points to has weird permissions. However, the Linux kernel is quite complex and has more than its fair share of bugs.
The ksoftirqd load made me think of some kind of pathological page faulting behavior occurring from kernel context as it tears the process down (see exit_robust_list in kernel/futex.c). But I don't even know if ksoftirqd handles page faults at all.
Don't put much stock in my comments. I haven't personally run into issues with robust mutexes, beyond bugs in glibc[1]. That locking doesn't stand out to you would make me look elsewhere.
[1] https://sourceware.org/bugzilla/show_bug.cgi?id=12683

From: Lew Pitcher
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Mon, 12 Dec 2016 22:42:21 -0500
Organization: The Pitcher Digital Freehold
On Monday December 12 2016 18:35, in comp.unix.programmer, "william-at-wilbur.25thandClement.com" wrote:
> Jens Thoms Toerring wrote:
>> Hi,
>>
>> I've to deal with a multi-threaded program that has, as
>> one of its threads a "watchdog thread" that, when it doesn't
>> notice some variable getting set within a certain time,
>> is supposed to stop the whole program (at any cost, no
>> worries about data lost). It does attempt to shut down the
>> program by calling exit().
>
>> This did work with a 3.4 Linux kernel. But after switching
>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>> If it fails one thread seems to run amok, using about 50%
>> of the CPU time, the other 50% being used by ksoftirqd. The
>> whole thing can't be stopped in any way (not even with 'kill
>> -SIGKILL'). I've also tried to replace the exit() call with
>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>> with gdb fails as well (hangs indefinitely). Looks like a
>> real zombie: dead and very active at the same time:-(
>
> A shot in the dark: is the application using robust mutexes? That's the
> first thing that comes to mind. Robust mutexes require the kernel, when
> destroying a thread, to walk a userspace linked-list data structure.
Another shot in the dark: Did the C runtime library (glibc or local equivalent) change? If so, was it compiled so as to use the exit_group(2) syscall in the exit(3) function?
According to various Linux kernel docs, since the introduction of NPTL, exit(2) only terminates the calling thread, leaving all other threads in the "process" active. To terminate /all/ threads at once, use exit_group(2). Since glibc v2.3, the exit(3) call has invoked exit_group(2) instead of exit(2). Perhaps your newer version of the runtime library has reverted back to calling exit(2).
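The difference described above can be checked directly with raw system calls. The following is a minimal sketch, assuming Linux with glibc; the function names and the exit codes 7, 3 and 99 are invented for the demonstration. Each case runs in a forked child so that the caller can look at the resulting exit status.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Thread body: raw exit_group(2) ends every thread in the process. */
static void *call_exit_group(void *arg)
{
    (void)arg;
    syscall(SYS_exit_group, 7);
    return NULL;   /* not reached */
}

/* Thread body: raw exit(2) ends only the calling thread. */
static void *call_exit(void *arg)
{
    (void)arg;
    syscall(SYS_exit, 7);
    return NULL;   /* not reached */
}

/* Forks a child, lets a secondary thread issue one of the two system
 * calls, and returns the child's exit status (or -1 on error).
 *   use_exit_group != 0: the main thread blocks forever, so the status
 *                        can only come from exit_group(2).
 *   use_exit_group == 0: the main thread survives the other thread's
 *                        exit(2) and ends the process itself with 3. */
int run_exit_experiment(int use_exit_group)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        pthread_t t;
        void *(*body)(void *) = use_exit_group ? call_exit_group : call_exit;

        if (pthread_create(&t, NULL, body, NULL) != 0)
            _exit(99);
        if (use_exit_group) {
            for (;;)
                pause();               /* killed by the other thread */
        } else {
            struct timespec ts = { 0, 100 * 1000000L };
            nanosleep(&ts, NULL);      /* give the thread time to exit */
            _exit(3);                  /* main thread is still running */
        }
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

run_exit_experiment(1) should return 7 and run_exit_experiment(0) should return 3. Running the application under strace -f and looking for exit versus exit_group at the point where the watchdog fires would show which call the installed C library actually makes.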
--
Lew Pitcher
"In Skills, We Trust"
PGP public key available upon request

From: Jorgen Grahn
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: 13 Dec 2016 06:49:29 GMT
On Tue, 2016-12-13, Lew Pitcher wrote:
> On Monday December 12 2016 18:35, in
> comp.unix.programmer, "william-at-wilbur.25thandClement.com"
> wrote:
>
>> Jens Thoms Toerring wrote:
>>> Hi,
>>>
>>> I've to deal with a multi-threaded program that has, as
>>> one of its threads a "watchdog thread" that, when it doesn't
>>> notice some variable getting set within a certain time,
>>> is supposed to stop the whole program (at any cost, no
>>> worries about data lost). It does attempt to shut down the
>>> program by calling exit().
>>
>>> This did work with a 3.4 Linux kernel. But after switching
>>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>>> If it fails one thread seems to run amok, using about 50%
>>> of the CPU time, the other 50% being used by ksoftirqd. The
>>> whole thing can't be stopped in any way (not even with 'kill
>>> -SIGKILL'). I've also tried to replace the exit() call with
>>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>>> with gdb fails as well (hangs indefinitely). Looks like a
>>> real zombie: dead and very active at the same time:-(
>>
>> A shot in the dark: is the application using robust mutexes? That's the
>> first thing that comes to mind. Robust mutexes require the kernel, when
>> destroying a thread, to walk a userspace linked-list data structure.
>
> Another shot in the dark:
> Did the C runtime library (glibc or local equivalent) change? If so, was it
> compiled so as to use the exit_group(2) syscall in the exit(3) function?
>
> According to various Linux kernel docs, since the introduction of NPTL,
> exit(2) only terminates the calling thread, leaving all other threads in
> the "process" active. To terminate /all/ threads at once, use exit_group(2).
> Since glibc v2.3, the exit(3) call has invoked exit_group(2) instead of
> exit(2).
This also seems to be documented in _exit(2). (Note the underscore.)
> Perhaps your newer version of the runtime library has reverted back
> to calling exit(2).
Also, perhaps Jens' team has broken exit() while porting. Since it's embedded I suppose they (or a third party) provide the OS. From your description, this seems easy to get wrong.
/Jorgen
-- // Jorgen Grahn \X/ snipabacken.se> O o .
_______________________________________________
Learn mailing list
Learn-at-nylxs.com
http://lists.mrbrklyn.com/mailman/listinfo/learn
|
|