MESSAGE
DATE | 2017-01-20
FROM | ruben safir
SUBJECT | [Learn] Fwd: Re: threads and exit() woes
ditto
From: Jorgen Grahn
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: 13 Dec 2016 06:44:58 GMT
On Mon, 2016-12-12, Jens Thoms Toerring wrote:
> Hi,
>
> I've to deal with a multi-threaded program that has, as
> one of its threads a "watchdog thread" that, when it doesn't
> notice some variable getting set within a certain time,
> is supposed to stop the whole program (at any cost, no
> worries about data lost). It does attempt to shut down the
> program by calling exit(). Now, all the references I have
> consulted (TLPI, APUE 3rd ed. etc.) all claim that when one
> of the threads calls exit() the program will be ended. A
> look at SUSv4 just mentions in addition that the end of
> the program might be delayed if there are outstanding
> asynchronous I/O operations that can't be cancelled
> (nothing I guess I'm having).
>
> This did work with a 3.4 Linux kernel. But after switching
> to a 4.4 kernel it suddenly doesn't work reliably anymore.
> If it fails one thread seems to run amok, using about 50%
> of the CPU time, the other 50% being used by ksoftirqd. The
> whole thing can't be stopped in any way (not even with 'kill
> -SIGKILL'). I've also tried to replace the exit() call with
> a kill(getpid(), SIGKILL) but also with no luck. Attaching
> with gdb fails as well (hangs indefinitely). Looks like a
> real zombie: dead and very active at the same time:-(
>
> Does that ring a bell with anyone of you? One of the threads
> is rather likely to do a lot of epoll() calls.
>
> Please keep in mind that I can't simply change the whole
> architecture - this is an embedded system already out in
> the field, and my role in this is to get a new kernel
> version to work, not upset a more or less working application
> (unless I can come up with very convincing arguments;-)
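For readers unfamiliar with the pattern in the question, the following is a minimal sketch of a watchdog thread. It is written in Python for brevity; the program under discussion is presumably C, and every name here (heartbeat, watchdog, the 0.2 s timeout, exit code 3) is hypothetical. os._exit() stands in for the hard termination the poster wants; unlike the exit() call in the question, it skips exit handlers.

```python
import subprocess
import sys
import time

# The watched program runs as a child process, because the watchdog
# terminates the whole process it lives in.
CHILD = r"""
import os, threading, time

heartbeat = threading.Event()   # hypothetical stand-in for "some variable getting set"

def watchdog(timeout):
    while True:
        if not heartbeat.wait(timeout):   # no heartbeat within `timeout` seconds
            os._exit(3)                   # end all threads at once, no cleanup
        heartbeat.clear()

threading.Thread(target=watchdog, args=(0.2,), daemon=True).start()
for _ in range(3):
    heartbeat.set()     # healthy phase: the main thread reports in
    time.sleep(0.05)
time.sleep(60)          # simulated hang: no more heartbeats
"""

def run_demo():
    """Run the child and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    proc = subprocess.run([sys.executable, "-c", CHILD], timeout=30)
    return proc.returncode, time.monotonic() - start

if __name__ == "__main__":
    print(run_demo())
```

The child ends with the watchdog's exit code long before its 60 second sleep would finish. This only illustrates the intended behaviour; it does not reproduce the hang described in the post.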
Apart from what the others wrote:
- Can you use strace or pstack or something to find out what that remaining thread is doing? Even looking in /proc can be useful.
- Keep in mind that exit() does things before exiting, e.g. run exit handlers.
Also shots in the dark ...
/Jorgen
-- // Jorgen Grahn \X/ snipabacken.se> O o .
From: Rainer Weikusat
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 16:06:52 +0000
Message-ID: <877f73ke3n.fsf-at-doppelsaurus.mobileactivedefense.com>
jt-at-toerring.de (Jens Thoms Toerring) writes:
[terminate program via exit run by watchdog thread]
> This did work with a 3.4 Linux kernel. But after switching
> to a 4.4 kernel it suddenly doesn't work reliably anymore.
> If it fails one thread seems to run amok, using about 50%
> of the CPU time, the other 50% being used by ksoftirqd. The
> whole thing can't be stopped in any way (not even with 'kill
> -SIGKILL').
This suggests that the thread is in a D state (uninterruptible sleep) which persists for some reason. Trying to determine what it's doing in the kernel (e.g., with strace or /proc/<pid>/wchan) might be useful.
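As a rough illustration of the /proc inspection suggested here, the sketch below lists every thread of a process with its state letter and wait channel. It assumes a Linux /proc layout; thread_states is a hypothetical helper name, and the wchan file may be missing or report 0 depending on kernel configuration and permissions.

```python
import os

def thread_states(pid):
    """Return [(tid, name, state, wchan)] for each thread of `pid` (Linux only)."""
    result = []
    task_dir = f"/proc/{pid}/task"
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                stat = f.read()
        except OSError:
            continue                      # thread vanished while we were looking
        # The name is wrapped in parentheses and may itself contain ')',
        # so split on the last closing parenthesis.
        close = stat.rindex(")")
        name = stat[stat.index("(") + 1 : close]
        state = stat[close + 2]           # R, S, D, Z, T, ...
        try:
            with open(f"{task_dir}/{tid}/wchan") as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"                   # not available on this system
        result.append((int(tid), name, state, wchan))
    return result

if __name__ == "__main__":
    for tid, name, state, wchan in thread_states(os.getpid()):
        print(tid, name, state, wchan)
```

A thread that stays in state D across repeated runs, with the same wait channel each time, would be the one to look at.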
From: Marcel Mueller
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 18:34:23 +0100
On 13.12.16 00.03, Jens Thoms Toerring wrote:
> This did work with a 3.4 Linux kernel. But after switching
> to a 4.4 kernel it suddenly doesn't work reliably anymore.
> If it fails one thread seems to run amok, using about 50%
> of the CPU time, the other 50% being used by ksoftirqd. The
> whole thing can't be stopped in any way (not even with 'kill
> -SIGKILL'). I've also tried to replace the exit() call with
> a kill(getpid(), SIGKILL) but also with no luck. Attaching
> with gdb fails as well (hangs indefinitely). Looks like a
> real zombie: dead and very active at the same time:-(
Probably an exit handler does unexpected things. This could be part of the C runtime as well as part of a used library or even your code.
Maybe shutting down your program this way runs into badly tested code paths with some race conditions.
Try abort(), which does not run the atexit() handlers that exit() does.
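The difference between a normal exit and one that bypasses exit handlers can be seen in a small sketch. It uses Python's atexit module and os._exit() as analogues of C's atexit(3) and _exit(2); abort() in C likewise skips atexit() handlers. The handler text and the mode names are made up for the demonstration.

```python
import subprocess
import sys

# Child program: registers one exit handler, then terminates in one of two ways.
CHILD = """
import atexit, os, sys
atexit.register(lambda: print("exit handler ran", flush=True))
if sys.argv[1] == "exit":
    sys.exit(0)      # normal exit: registered handlers run first
else:
    os._exit(0)      # immediate termination: handlers are skipped
"""

def handler_output(mode):
    """Run the child in the given mode and return what it printed."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True, text=True, timeout=30,
    )
    return done.stdout

if __name__ == "__main__":
    print(repr(handler_output("exit")))
    print(repr(handler_output("hard")))
```

If a handler is what hangs during shutdown, the second mode never reaches it. It would not help if the thread is stuck inside the kernel.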
> Does that ring a bell with anyone of you? One of the threads
> is rather likely to do a lot of epoll() calls.
Definitely I/O. The thread should check for the exit condition before starting another I/O operation. The Linux kernel behaves quite badly when killing processes with outstanding I/O: kill requests like that are simply ignored.
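One way to read "check for the exit condition before invoking another I/O" is sketched below: a worker that waits in epoll with a short timeout and re-checks a stop flag before every wait. This is an assumption about what is meant, not the application's actual design; io_worker, the 0.1 s timeout and the pipe are illustrative, and select.epoll is Linux-only.

```python
import os
import select
import threading
import time

def io_worker(fd, stop, received):
    """Read from `fd`, re-checking `stop` before every bounded epoll wait."""
    ep = select.epoll()
    ep.register(fd, select.EPOLLIN)
    try:
        while not stop.is_set():                 # exit condition checked first
            for _fd, _events in ep.poll(0.1):    # wait at most 0.1 s
                data = os.read(fd, 4096)
                if not data:                     # writer closed the pipe
                    return
                received.append(data)
    finally:
        ep.close()

def demo():
    r, w = os.pipe()
    stop, received = threading.Event(), []
    worker = threading.Thread(target=io_worker, args=(r, stop, received))
    worker.start()
    os.write(w, b"ping")
    deadline = time.monotonic() + 5
    while not received and time.monotonic() < deadline:
        time.sleep(0.01)                         # let the worker consume the data
    stop.set()                                   # request shutdown
    worker.join(timeout=5)
    os.close(r)
    os.close(w)
    return b"".join(received), worker.is_alive()

if __name__ == "__main__":
    print(demo())
```

The worker notices the stop request within one timeout period. A thread already blocked uninterruptibly in the kernel would never get back to the check, so this is a preventive measure only.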
Marcel
From: Scott Lurndal
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 18:13:40 GMT
Marcel Mueller writes:
>On 13.12.16 00.03, Jens Thoms Toerring wrote:
>> This did work with a 3.4 Linux kernel. But after switching
>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>> If it fails one thread seems to run amok, using about 50%
>> of the CPU time, the other 50% being used by ksoftirqd. The
>> whole thing can't be stopped in any way (not even with 'kill
>> -SIGKILL'). I've also tried to replace the exit() call with
>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>> with gdb fails as well (hangs indefinitely). Looks like a
>> real zombie: dead and very active at the same time:-(
>
>Probably an exit handler does unexpected things. This could be part of
>the C runtime as well as part of a used library or even your code.
>
>Maybe shutting down your program this way runs into badly tested code
>paths with some race conditions.
>
>Try abort() which does not invoke that much exit handlers.
>
>> Does that ring a bell with anyone of you? One of the threads
>> is rather likely to do a lot of epoll() calls.
>
>Definitely I/O. It should check for the exit condition before invoking
>another I/O. The Linux kernel behaves quite bad when killing processes
>with outstanding I/O. Request like that are simply ignored.
>
If SIGKILL doesn't kill the process, you've a kernel bug.
From: Lew Pitcher
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 13:27:48 -0500
On Tuesday December 13 2016 13:13, in comp.unix.programmer, "Scott Lurndal" wrote:
> Marcel Mueller writes:
>>On 13.12.16 00.03, Jens Thoms Toerring wrote:
>>> This did work with a 3.4 Linux kernel. But after switching
>>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>>> If it fails one thread seems to run amok, using about 50%
>>> of the CPU time, the other 50% being used by ksoftirqd. The
>>> whole thing can't be stopped in any way (not even with 'kill
>>> -SIGKILL'). I've also tried to replace the exit() call with
>>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>>> with gdb fails as well (hangs indefinitely). Looks like a
>>> real zombie: dead and very active at the same time:-(
>>
>>Probably an exit handler does unexpected things. This could be part of
>>the C runtime as well as part of a used library or even your code.
>>
>>Maybe shutting down your program this way runs into badly tested code
>>paths with some race conditions.
>>
>>Try abort() which does not invoke that much exit handlers.
>>
>>> Does that ring a bell with anyone of you? One of the threads
>>> is rather likely to do a lot of epoll() calls.
>>
>>Definitely I/O. It should check for the exit condition before invoking
>>another I/O. The Linux kernel behaves quite bad when killing processes
>>with outstanding I/O. Request like that are simply ignored.
>>
>
> If SIGKILL doesn't kill the process, you've a kernel bug.
Even with a non-buggy kernel, SIGKILL won't terminate a zombie process, nor a process stuck in "uninterruptible sleep" state.
It would be helpful to see the state of the hung thread, as reported by ps or some other tool.
-- Lew Pitcher "In Skills, We Trust" PGP public key available upon request
From: Scott Lurndal
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 18:44:59 GMT
Lew Pitcher writes:
>On Tuesday December 13 2016 13:13, in comp.unix.programmer, "Scott Lurndal"
> wrote:
>
>> Marcel Mueller writes:
>>>On 13.12.16 00.03, Jens Thoms Toerring wrote:
>>>> This did work with a 3.4 Linux kernel. But after switching
>>>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>>>> If it fails one thread seems to run amok, using about 50%
>>>> of the CPU time, the other 50% being used by ksoftirqd. The
>>>> whole thing can't be stopped in any way (not even with 'kill
>>>> -SIGKILL'). I've also tried to replace the exit() call with
>>>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>>>> with gdb fails as well (hangs indefinitely). Looks like a
>>>> real zombie: dead and very active at the same time:-(
>>>
>>>Probably an exit handler does unexpected things. This could be part of
>>>the C runtime as well as part of a used library or even your code.
>>>
>>>Maybe shutting down your program this way runs into badly tested code
>>>paths with some race conditions.
>>>
>>>Try abort() which does not invoke that much exit handlers.
>>>
>>>> Does that ring a bell with anyone of you? One of the threads
>>>> is rather likely to do a lot of epoll() calls.
>>>
>>>Definitely I/O. It should check for the exit condition before invoking
>>>another I/O. The Linux kernel behaves quite bad when killing processes
>>>with outstanding I/O. Request like that are simply ignored.
>>>
>>
>> If SIGKILL doesn't kill the process, you've a kernel bug.
>
>Even with a non-buggy kernel, SIGKILL won't terminate a zombie process, nor a
>process stuck in "uninterruptable sleep" state.
A zombie no longer holds resources, with the exception of the exit status (say 32 bits) and the pid.
It's the parent's responsibility to reap the status.
An operating system that allows an application to enter an "uninterruptible sleep" state is broken.
It used to be in SVR3 that one could end up in an uninterruptible sleep state during close(2) when the file descriptor referenced a character special device for a parallel port (e.g. printer) and the printer was off-line. Bugs like that were mainly fixed a quarter century ago.
From: Lew Pitcher
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 14:18:22 -0500
On Tuesday December 13 2016 13:44, in comp.unix.programmer, "Scott Lurndal" wrote:
> Lew Pitcher writes:
>>On Tuesday December 13 2016 13:13, in comp.unix.programmer, "Scott Lurndal"
>> wrote:
>>
>>> Marcel Mueller writes:
>>>>On 13.12.16 00.03, Jens Thoms Toerring wrote:
>>>>> This did work with a 3.4 Linux kernel. But after switching
>>>>> to a 4.4 kernel it suddenly doesn't work reliably anymore.
>>>>> If it fails one thread seems to run amok, using about 50%
>>>>> of the CPU time, the other 50% being used by ksoftirqd. The
>>>>> whole thing can't be stopped in any way (not even with 'kill
>>>>> -SIGKILL'). I've also tried to replace the exit() call with
>>>>> a kill(getpid(), SIGKILL) but also with no luck. Attaching
>>>>> with gdb fails as well (hangs indefinitely). Looks like a
>>>>> real zombie: dead and very active at the same time:-(
>>>>
>>>>Probably an exit handler does unexpected things. This could be part of
>>>>the C runtime as well as part of a used library or even your code.
>>>>
>>>>Maybe shutting down your program this way runs into badly tested code
>>>>paths with some race conditions.
>>>>
>>>>Try abort() which does not invoke that much exit handlers.
>>>>
>>>>> Does that ring a bell with anyone of you? One of the threads
>>>>> is rather likely to do a lot of epoll() calls.
>>>>
>>>>Definitely I/O. It should check for the exit condition before invoking
>>>>another I/O. The Linux kernel behaves quite bad when killing processes
>>>>with outstanding I/O. Request like that are simply ignored.
>>>>
>>>
>>> If SIGKILL doesn't kill the process, you've a kernel bug.
>>
>>Even with a non-buggy kernel, SIGKILL won't terminate a zombie process, nor
>>a process stuck in "uninterruptable sleep" state.
>
> A zombie no longer holds resources, with the exception of the exit status
> (say 32-bits) and the pid.
>
> It's the parent responsibility to reap the status.
True. It remains in the process table (and visible through ps(1)) until the parent reaps the status, or permits init(8) to reap the status. Since the process is already dead, it CANNOT be "killed" (terminated and removed from the process table) by SIGKILL.
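The zombie behaviour described here can be observed directly. The sketch below (Linux only, reading the state letter from /proc; proc_state and zombie_demo are hypothetical helper names) forks a child that exits at once, shows that it stays in state Z even after SIGKILL, and that it disappears only when the parent reaps it.

```python
import os
import signal
import time

def proc_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    return stat[stat.rindex(")") + 2]

def zombie_demo():
    pid = os.fork()
    if pid == 0:
        os._exit(7)                       # child: terminate immediately
    deadline = time.monotonic() + 5
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)                  # wait for the child to become a zombie
    before = proc_state(pid)
    os.kill(pid, signal.SIGKILL)          # accepted, but nothing is left to kill
    time.sleep(0.05)
    after = proc_state(pid)
    _, status = os.waitpid(pid, 0)        # parent reaps the exit status
    gone = not os.path.exists(f"/proc/{pid}")
    return before, after, os.WEXITSTATUS(status), gone

if __name__ == "__main__":
    print(zombie_demo())
```

The entry is removed by waitpid(), not by the signal, which is the point made above.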
> An operating system that allows an application to enter an
> "uninterruptable sleep" state is broken.
OK. Thanks for the opinion.
However, whether or not the OS is, in your opinion, "broken", "uninterruptible sleep" is still a permitted state. And, because the process cannot be scheduled, it cannot receive /any/ signal, let alone SIGKILL.
> It used to be in SVR3, that one could end up in an uninterruptable
> sleep state during close(2) when the file descriptor referenced a
> character special device for a parallel port (e.g. printer) and the
> printer was off-line. Bugs like that were mainly fixed a quarter
> century ago.
-- Lew Pitcher "In Skills, We Trust" PGP public key available upon request
From: Marcel Mueller
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 21:01:54 +0100
On 13.12.16 19.13, Scott Lurndal wrote:
>> Definitely I/O. It should check for the exit condition before invoking
>> another I/O. The Linux kernel behaves quite bad when killing processes
>> with outstanding I/O. Request like that are simply ignored.
>
> If SIGKILL doesn't kill the process, you've a kernel bug.
Well, welcome to the real world. A process hanging in state D is one of the most frequent causes of system reboots. This has not changed significantly over the last 15 years, from Debian Woody to recent Raspbian with kernel 4.4. Of course, it is not that often that I have serious trouble: once or twice per year or so. AFAIK there is absolutely no recovery from a process blocked in state D. This seems to be a Linux-specific "feature".
Marcel
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 16:23:08 -0800
Marcel Mueller wrote:
> On 13.12.16 19.13, Scott Lurndal wrote:
>>> Definitely I/O. It should check for the exit condition before invoking
>>> another I/O. The Linux kernel behaves quite bad when killing processes
>>> with outstanding I/O. Request like that are simply ignored.
>>
>> If SIGKILL doesn't kill the process, you've a kernel bug.
>
> Well, welcome to real word.
> A process hanging in state D is one of the most often causes of system
> reboots. This did not change significantly over the last 15 years from
> Debian Woody to recent Raspbian with kernel 4.4. Of course, it is not
> that often that I have serious trouble. Once or twice per year or
> something like that.
> AFAIK there is absolutely no recovery from a process blocked in state D.
> This seems to be a Linux specific "feature".
The classic stumbling block is that the block device subsystems in Linux as well as the *BSDs are fundamentally synchronous. This is related historically to why polling I/O on regular (block device) files is defined by POSIX to always immediately return ready. Given the expectations engendered by the history, it was apparently too convenient for implementations to bake synchronous interfaces into their block device and driver models.
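The "always ready" behaviour of regular files is easy to check. The sketch below, assuming a Linux or other POSIX system, polls a temporary regular file with a zero timeout and returns the reported event mask; regular_file_poll_events is a hypothetical helper name.

```python
import select
import tempfile

def regular_file_poll_events():
    """Poll a regular file without waiting and return the event bitmask."""
    with tempfile.TemporaryFile() as f:       # opened read/write on a regular file
        f.write(b"x")
        f.flush()
        poller = select.poll()
        poller.register(f.fileno(), select.POLLIN | select.POLLOUT)
        events = poller.poll(0)               # timeout 0: return immediately
        return events[0][1] if events else 0

if __name__ == "__main__":
    mask = regular_file_poll_events()
    print(bool(mask & select.POLLIN), bool(mask & select.POLLOUT))
```

Readiness is reported whether or not the data is cached, so poll() gives no way to avoid blocking on disk or network file reads.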
NFS implementations on Linux (and I assume other Unix systems) were especially notorious in this regard, because the kernel implementations adopted the same synchronous interface model, but for obvious reasons were much more prone to putting processes into a prolonged, uninterruptible state.
AFAIU, making block device I/O asynchronous (and thus interruptible) requires extensive refactoring of the driver model as well as the individual drivers for those operating systems.
POSIX AIO on those systems simply uses kernel threads to do the synchronous calls, which only hides the issue. The kernel thread could still block, consuming system resources indefinitely even after the requesting process has long exited. You get a slightly cleaner user process tree, yes, but requests still linger behind the scenes, and resource accounting can no longer be kept deterministic without some ugly compromises.
Given the pedigree of Solaris, AIX, and HP-UX, I'm curious what those systems did. Did they refactor their driver model? Officially commit to the kernel thread hack? Or find some sort of compromise, e.g. a quasi-synchronous interface where updated drivers could bubble up through the call stack an interrupt or timeout?
There have been several attempts over the years to systematize the kernel thread hack in Linux. See, e.g., these articles from 2007 and 2009:
"Fibrils and asynchronous system calls", https://lwn.net/Articles/219954/
"LCA: A new approach to asynchronous I/O", https://lwn.net/Articles/316806/
and most recently from 2016
"Fixing asynchronous I/O, again" https://lwn.net/Articles/671649/
I like to think they always fail because at the end of the day using slave threads can be easily done in userspace. And interfaces like splice(2), sendfile(2), eventfd(2), etc that can allow the userspace solution to match or even exceed the kernel-space solution are useful in their own right. That reality makes it difficult to accept the maintenance burden of an in-kernel overlay solution that doesn't address the underlying issues. But maybe that's just wishful thinking.
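As a rough idea of the userspace alternative mentioned here, the sketch below hands a blocking file read to a worker thread and collects the result later. It uses Python's concurrent.futures rather than any of the kernel interfaces named above; read_file, the temporary file and the 5 second timeout are illustrative assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_file(path):
    """Plain blocking read; it runs on a worker thread."""
    with open(path, "rb") as f:
        return f.read()

def demo():
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"payload")
    try:
        with ThreadPoolExecutor(max_workers=2) as pool:
            future = pool.submit(read_file, tmp.name)   # blocking call moves off-thread
            # The caller could do other work here while the read is in flight.
            return future.result(timeout=5)             # bounded wait for the result
    finally:
        os.unlink(tmp.name)

if __name__ == "__main__":
    print(demo())
```

The timeout only limits how long the caller waits. A worker stuck in an uninterruptible read would remain stuck, much like the kernel threads behind POSIX AIO described earlier in this message.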
From: "James K. Lowden"
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Tue, 13 Dec 2016 23:30:15 -0500
Message-ID: <20161213233015.ccfd8a0248833f37069ca9c6-at-speakeasy.net>
On Tue, 13 Dec 2016 16:23:08 -0800 wrote:
> The classic stumbling block is that the block device subsystems in
> Linux as well the *BSDs are fundamentally synchronous.
It's not clear to me why they should be anything other than synchronous. The devices themselves might in some cases support a queued command interface (e.g. SCSI) but that view of the device is very different from a linear-store-of-bytes abstraction.
The kernel provides applications with a perfectly good asynchronous interface: the timeslice. If the application has something better to do while it's blocked against I/O, it can put that processing on another pid. In the typical case, the application blocks against needed input, and the kernel can schedule CPU time for something else.
> I like to think they always fail because at the end of the day using
> slave threads can be easily done in userspace.
Exactly.
--jkl
From: Scott Lurndal
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Wed, 14 Dec 2016 13:39:18 GMT
writes:
>Given the pedigree of Solaris, AIX, and HP-UX, I'm curious what those
>systems did. Did they refactor their driver model? Officially commit to the
>kernel thread hack? Or find some sort of compromise, e.g. a
>quasi-synchronous interface where updated drivers could bubble up through
>the call stack an interrupt or timeout?
SVR4.2 ES/MP completely redesigned the I/O system to handle asynchronicity natively (along with eliminating the BFKL[*]). The POSIX asynchronous I/O APIs were implemented naturally throughout the I/O stack.
Our Chorus microkernel-based port of SVR4.2 ES/MP (called SVR4/MK, or project Amadeus in Europe) also supported the asynchronous interfaces internally, and they were heavily used by Oracle for performance.
[*] Big F'ing Kernel Lock
From: Gordon Burditt
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Thu, 15 Dec 2016 00:35:59 -0600
> Well, welcome to real word.
> A process hanging in state D is one of the most often causes of system
> reboots. This did not change significantly over the last 15 years from
> Debian Woody to recent Raspbian with kernel 4.4. Of course, it is not
> that often that I have serious trouble. Once or twice per year or
> something like that.
> AFAIK there is absolutely no recovery from a process blocked in state D.
> This seems to be a Linux specific "feature".
I'm not sure I agree with that. Hanging device drivers (in state "D"), specifically due to USB devices being disconnected at inconvenient times, seems to be a bigger problem than just Linux. I've observed it occasionally on the *BSDs. Usually it's quite obvious that the device shouldn't have been intentionally disconnected, but that the cable/connector was a little loose and someone wiggled it.
From: Marcel Mueller
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Thu, 15 Dec 2016 19:32:10 +0100
On 15.12.16 07.35, Gordon Burditt wrote:
>> AFAIK there is absolutely no recovery from a process blocked in state D.
>> This seems to be a Linux specific "feature".
>
> I'm not sure I agree with that. Hanging device drivers (in state
> "D"), specifically due to USB devices being disconnected at
> inconvenient times, seems to be a bigger problem than just Linux.
> I've observed it occasionally on the *BSDs. Usually it's quite
> obvious that the device shouldn't have been intentionally disconnected,
> but that the cable/connector was a little loose and someone wiggled
> it.
Bugs and I/O errors can occur everywhere. Not that nice, but that's life. The only problem is that the kernel is unable to recover from these errors without a reboot. This is not contemporary.
Marcel
From: Rainer Weikusat
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Fri, 16 Dec 2016 17:38:09 +0000
Message-ID: <87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
Marcel Mueller writes:
> On 15.12.16 07.35, Gordon Burditt wrote:
>>> AFAIK there is absolutely no recovery from a process blocked in state D.
>>> This seems to be a Linux specific "feature".
>>
>> I'm not sure I agree with that. Hanging device drivers (in state
>> "D"), specifically due to USB devices being disconnected at
>> inconvenient times, seems to be a bigger problem than just Linux.
>> I've observed it occasionally on the *BSDs. Usually it's quite
>> obvious that the device shouldn't have been intentionally disconnected,
>> but that the cable/connector was a little loose and someone wiggled
>> it.
>
> Bugs and I/O errors can occur everywhere. Not that nice, but that's life.
> The only problem is the kernel is unable to recover from this errors
> without reboot. This is not contemporary.
It is contemporary because it's happening now.
'Uninterruptible sleep' state usually means 'the operation being waited for is always expected to complete' as it's entirely within the domain of the local system. Insofar as the state persists when talking to a device, that's usually a hardware failure. Another possible cause would be a kernel mutex deadlock.
Interruptible sleeping needs correct support code for every instance of a sleep. That's a whole load of opportunities for additional bugs, as this will usually need 'resource allocation unwinding' back up the complete call stack. It also needs to be handled correctly in all applications. IMHO, it is very questionable whether this is really a good idea "just in case there's a kernel bug".
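On the application side, "handled correctly" usually means retrying a call that fails with EINTR. The following is a minimal sketch of that pattern, not anything taken from the thread: `retry_on_eintr` and its `max_retries` bound are hypothetical names made up for illustration. Since Python 3.5 (PEP 475) the interpreter already retries most system calls itself, so the wrapper mainly shows what a C program has to spell out by hand around each call.

```python
import errno


def retry_on_eintr(call, *args, max_retries=100):
    """Call `call(*args)`, retrying while it fails with EINTR.

    Hypothetical helper for illustration only. `max_retries` is an
    assumed safety bound so a misbehaving call cannot loop forever.
    """
    for _ in range(max_retries):
        try:
            return call(*args)
        except InterruptedError:
            # errno EINTR: a signal arrived while the call was sleeping
            # interruptibly. For read() or write() this means nothing was
            # transferred, so retrying is safe. close() on Linux is a
            # known exception and should not be retried.
            continue
    raise OSError(errno.EINTR, "still interrupted after retries")


if __name__ == "__main__":
    import os

    r, w = os.pipe()
    os.write(w, b"ok")
    print(retry_on_eintr(os.read, r, 2))
    os.close(r)
    os.close(w)
```

Every blocking call in a program needs this kind of treatment (or a deliberate decision not to retry), which is the application-side cost referred to above.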
It's entirely unclear what "recovery in case of hardware errors" should look like. If a mass storage device fails, the result is going to be "unpleasant" regardless of requiring a reboot to paper over the issue for some time.
The idea to use 'D' state for network filesystems is obviously moronic and there should be some kind of 'emergency abort' for removable storage devices, too.
From: spud-at-potato.field
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Mon, 19 Dec 2016 09:38:04 +0000 (UTC)
References: <87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
On Fri, 16 Dec 2016 17:38:09 +0000 Rainer Weikusat wrote:
>It's entirely unclear how "recovery in case of hardware errors" should
>look like. If a mass storage device fails, the result is going to be
>"unpleasant" regardless of requiring a reboot to paper over the issue
>for some time.
Unless the device is the drive the OS system files are hosted on or some other critical main board component, then any hardware failure should be dealt with gracefully. Period. Hardware failures should be expected and the OS should help the admins diagnose the problem, not just give up and die.
>The idea to use 'D' state for network filesystems is obviously moronic
>and there should be some kind of 'emergency abort' for removable storage
>devices, too.
FreeBSD had a nice bug back in the day (maybe still does) whereby if you mounted a floppy disk as a filesystem then removed the disk the kernel would crash. Despite numerous people including myself pointing this out they still hadn't fixed it by 6.0, at which point I switched to linux for other reasons.
-- Spud
From: gordonb.b6e0s-at-burditt.org (Gordon Burditt)
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Mon, 19 Dec 2016 21:04:22 -0600
Message-ID: <686dnZXvcrwrAsXFnZ2dnUU7-X3NnZ2d-at-posted.internetamerica>
References: <87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
>>The idea to use 'D' state for network filesystems is obviously moronic
>>and there should be some kind of 'emergency abort' for removable storage
>>devices, too.
>
> FreeBSD had a nice bug back in the day (maybe still does) whereby if you
> mounted a floppy disk as a filesystem then removed the disk the kernel would
> crash. Despite numerous people including myself pointing this out they still
> hadn't fixed it by 6.0, at which point I switched to linux for other reasons.
I expect that you would have the same problem for *ANY* removable device with a UFS filesystem with soft updates enabled (on FreeBSD 10.1, and I think on 11.0). I've managed to trigger some kind of panic related to soft updates by accidental removal of a mounted filesystem (as in "accidentally yanked the cable out"). Floppies using a FAT-16 filesystem probably won't have this issue. Neither, it seems, will UFS filesystems with soft updates turned off. The data is inconsistent, but the system doesn't panic. Sometimes, the panic was triggered after the program that wrote the data had already terminated (but before all data had been flushed to disk). Soft updates does seem to work well for actually non-removable drives. The problem of panics doesn't exist when non-removable drives are removed from the system by a power failure.
I'm not sure about journaling on UFS, but journaling is usually unsuitable for my application for removable media: large copy to the drive, followed by the data being read-only for a long time (maybe months), or else read a few times (usually by different systems) and then deleted. Journaling increases the number of writes (possibly wearing out flash drives earlier), and I don't really care about the integrity of the data *between* the time the copy starts and everything gets written. I do care about data integrity after it's unmounted and re-mounted.
No, this wasn't any essential filesystem like /, swap, /usr, or /var. Most of the time it was /mnt or /mnt2, filesystems used for data transfer or archive using USB memory sticks, or a USB hard drive. I suppose it would also happen with a USB or normal floppy drive. Nothing is permanently mounted on /mnt. In case of accidental disconnection, I'd expect the data in the process of being transferred to be toast, and I really don't care much about that. I can't trust the copy anyway.
From: Marcel Mueller
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: Mon, 19 Dec 2016 20:29:59 +0100
In-Reply-To: <87eg176agu.fsf-at-doppelsaurus.mobileactivedefense.com>
On 16.12.16 18.38, Rainer Weikusat wrote:
>> Bugs and I/O errors can occur everywhere. Not that nice, but that's life.
>> The only problem is the kernel is unable to recover from this errors
>> without reboot. This is not contemporary.
>
> It is contemporary because it's happening now.
>
> 'Uninterruptible sleep' state usually means 'the operation being waited
> for is always expected to complete' as it's entirely within the domain
> of the local system. Insofar the state persists when talking to a
> device, that's usually a hardware failure. Another possible cause would
> be a kernel mutex deadlock.
Even if DMA is involved it should be possible to cancel this operation. And well, if a hardware DMA does not complete within a few minutes it will likely never complete. So unloading the driver is just fine in 99.9% of the cases.
> Interruptible sleeping needs correct support code for every instance of
> a sleep. That's a whole load of opportunities for additional bugs as
> this will usually need 'resource allocation unwinding' back up the
> complete callstack.
Agreed. But I am not talking about a graceful exit: just cancel all related threads. Of course, this might leave the driver in an inconsistent state, which is not too surprising since there is a bug in it. So the next action is to forcibly unload the driver. Since most drivers reset their device when loaded (again), it is likely that the hardware could recover from this error.
> It also needs to be handled correctly in all
> applications. IMHO, is very questionable if this is really a good idea
> "just in case there's a kernel bug".
I do not see any action other than "kill" that could be executed in this state. So I see no need for any action in userspace.
> It's entirely unclear how "recovery in case of hardware errors" should
> look like. If a mass storage device fails, the result is going to be
> "unpleasant" regardless of requiring a reboot to paper over the issue
> for some time.
If it is the root filesystem or swap, yes. There is no reasonable recovery. But most of the time state D is not related to the system disk. More likely it is a WLAN device (this kind of hardware is amazingly unreliable), a USB stick, or some other less important device.
> The idea to use 'D' state for network filesystems is obviously moronic
> and there should be some kind of 'emergency abort' for removable storage
> devices, too.
Indeed. NFS is really annoying if the network is not 100% solid.
Marcel
From: jt-at-toerring.de (Jens Thoms Toerring)
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: 13 Dec 2016 22:32:32 GMT
Organization: Freie Universitaet Berlin
Hi,
thank you all - I'm quite overwhelmed by the number and quality of responses! So please don't be annoyed if I don't respond to each post in detail.
As usual I guess I've looked too much at "red herrings". It doesn't seem to have been something really related to threads. After a lot more looking at the rather longish output of strace I started to notice a pattern, i.e. that one of the threads got interrupted in a call of close(). This often happened a long (relatively speaking) time before the software watchdog tried to stop the program - and that thread never got re-scheduled.
So I switched my attention to the serial driver (that close() call was for a device file for one of the serial ports of the processor) and found a different version of it. And, lo and behold, with that updated driver I haven't seen any of that strange behaviour anymore for about 400 test runs. While that is, of course, no proof that everything is well, it is at least encouraging ;)
Unfortunately, the somewhat restricted tools I have at my disposal don't tell me too much about what state a process is in. 'ps' is rather terse in what it tells you (no D/S/R etc., i.e. no STAT field at all) compared to what one is used to from a PC. But the process/thread was definitely not sleeping nor a zombie - it was so active that it used up about 50% of the CPU time, and obviously somehow kept [ksoftirqd] busy as well ;-)
So from what I can say at the moment it was a slightly buggy driver that, in what manner I can't tell yet, didn't close the device file as requested and thus kept the program from exiting. At least my belief in TLPI/APUE has been restored in that it most likely was a situation where an exit() would have killed all threads had a buggy driver not intervened ;-)
Thank you all and best regards, Jens
-- 
  \   Jens Thoms Toerring  ___      jt-at-toerring.de
   \__________________________      http://toerring.de
From: Jorgen Grahn
Newsgroups: comp.unix.programmer
Subject: Re: threads and exit() woes
Date: 14 Dec 2016 00:10:17 GMT
On Tue, 2016-12-13, Jens Thoms Toerring wrote:
> Hi,
>
> thank you all - I'm quite overwhelmed by the number and
> quality of responses! So please don't be annoyed if I don't
> respond to each post in detail.
>
> As usual I guess I've looked too much at "red herrings".
> It doesn't seem to have been something really related to
> threads. After a lot more of looking at the rather longish
> output of strace I started to notice a pattern, i.e. that
> one of the threads got interrupted in a call of close().
> This often happend a long (relatively speaking) time be-
> fore the software watchdog tried to stop the program - and
> that thread never got re-scheduled.
>
> So I switched my attention to the serial driver (that close()
> call was for a device file for one of the serial ports of the
> processor)
Seems that was the turning point. Nice!
> and found a different version of it. And, lo and
> behold, with that updated driver I haven't seen any of that
> strange behaviour anymore for about 400 test runs. While
> that is, of course, no proof that everything is well, it at
> least encouraging;)
>
> Unfortunately, the somewhat restricted tools I have at my
> disposal don't tell me too much what state a process is in.
> 'ps' is rather terse in what it tells you (no D/S/R etc., i.
> e. no STAT field at all) one is used from a PC.
One useful trick is to look in the Linux /proc file system. I think that's where ps gets its information anyway, and there's more useful information in there too. The proc(5) man page et cetera may be needed to interpret it.
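As a rough sketch of that trick, the state letter that ps would show in its STAT column is the third field of `/proc/<pid>/stat`, and each thread has its own copy under `/proc/<pid>/task/<tid>/stat`. The function names below are made up for this illustration; where no Python interpreter is available on the target, the same files can simply be read with cat.

```python
import os


def parse_stat_state(stat_line):
    """Return (comm, state) from the contents of a /proc/.../stat file.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')' characters, so split at the LAST ')' in the line.
    """
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return comm, state


def thread_states(pid):
    """Map each thread ID of `pid` to its (comm, state) pair (Linux only)."""
    states = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                states[int(tid)] = parse_stat_state(f.read())
        except OSError:
            # The thread exited between listdir() and open().
            continue
    return states


if __name__ == "__main__":
    for tid, (comm, state) in sorted(thread_states(os.getpid()).items()):
        print(tid, comm, state)
```

A thread stuck in uninterruptible sleep would show up with state D, a runnable one spinning on the CPU with state R.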
> But the pro-
> cess/thread was definitely not sleeping nor a zombie - it was
> so active that it used up about 50% of the CPU time, and ob-
> viously somehow kept [ksoftirqd] busy as well;-)
>
> So from what I can say at the moment it was a slightly buggy
> driver that, in what manner I can't tell yet, didn't close
> the device file as requested and thus kept the program from
> exiting.
A guess: the buggy serial driver sometimes couldn't deal with the resource cleanup caused by the file descriptor closing. close() never returned but initiated some work: partly attributed to the process, and partly to the kernel itself. Maybe the work was actual I/O.
Probably you'd have triggered the same thing with a 'kill -9' or an abort() as with exit(). In both cases there's a freeing of kernel resources associated with that file descriptor.
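That the kernel releases a process's file descriptors however the process ends can be checked with a pipe. This is a minimal sketch, assuming a Unix-like system with fork(); the function name is invented for the illustration. It only demonstrates the descriptor cleanup on 'kill -9' and says nothing about the serial driver in question.

```python
import os
import signal
import time


def pipe_closed_after_sigkill():
    """Kill a child holding the write end of a pipe; report what the parent sees.

    Returns (killed_by_sigkill, saw_eof).
    """
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: keep the write end open and never close it explicitly.
        os.close(read_end)
        time.sleep(60)
        os._exit(0)

    # Parent: drop its own copy so the child is the only writer left.
    os.close(write_end)
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)

    # With the child gone the kernel has closed its descriptors, so the
    # read returns end-of-file (an empty bytes object) instead of blocking.
    data = os.read(read_end, 1)
    os.close(read_end)

    killed = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
    return killed, data == b""


if __name__ == "__main__":
    print(pipe_closed_after_sigkill())
```

The child never calls close() itself, yet the parent sees end-of-file once the child has been reaped, so the same per-descriptor release code in the kernel (and in a driver, for a device file) runs on this path too.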
> At least my believe in TLPI/APUE has been restored
> in that it most likely was a situation where an exit() would
> have killed all threads if not a buggy driver had intervened;-)
>
> Thank you all and best regards, Jens
/Jorgen
-- // Jorgen Grahn \X/ snipabacken.se> O o .
_______________________________________________ Learn mailing list Learn-at-nylxs.com http://lists.mrbrklyn.com/mailman/listinfo/learn