TCP-group 1992

Copy of personal note to Gerard

From:   IDS::MIKEBW       "Mike Bilow, mikebw@idsvax.ids.com"  8-JAN-1992 09:52:
To:     MX%"gvdg@tophat.cdh.cdc.com"
Subj:   RE: New NOS upload

Most of the memory errors seemed originally to come from stack overflow.
The old garbage collector made use of a lot of stack, and I am not sure
why.  It may just have been holding off on some activity and letting the
stack grow from failure to service processes and queues, but I don't
really know.  Making the stacks bigger, which was done a few months ago
in CONFIG.C, made the garbage collector at least work.  Before, even one
garbage collect would blow up the whole system within a few minutes.
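
One common way to confirm that a process is outgrowing its stack is to
fill the stack with a known pattern when it is created and later count
how much of the pattern survives.  The following is only a minimal
sketch of that idea, not the NOS code: the names stack_paint,
stack_unused and STACK_PATTERN are hypothetical, and it assumes a stack
that grows downward from the highest index toward index 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fill value; any pattern unlikely to occur by chance will do. */
#define STACK_PATTERN 0x55AAu

/* Fill a freshly allocated stack with the known pattern. */
void stack_paint(uint16_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_PATTERN;
}

/*
 * Count the words that still hold the pattern, scanning from the far
 * end of the stack (index 0).  Assumes the stack grows downward from
 * stack[words - 1], so the first overwritten word marks the deepest
 * point the stack has reached.  A result of 0 means the stack has been
 * used to its limit and has probably overflowed.
 */
size_t stack_unused(const uint16_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == STACK_PATTERN)
        n++;
    return n;
}
```

Checking stack_unused for each process before and after a garbage
collect would show whether the collector really is the heavy stack user.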

There seem to be three major causes of memory failure now:

1.  The mailbox code.  This is usually responsible for invalid frees in
the smtp_send process.  The inference is that sending mail when logged
into the mailbox somehow acts very differently from regular smtp mail,
since the ordinary smtp server and client are, as far as I know,
perfectly stable.  This may have something to do with the domain name
resolver, since I have had a couple of reports about users fixing
problems by doing such things as specifying raw IP addresses in their
AUTOEXEC.NOS files instead of domain names, but I cannot confirm this,
and it seems unlikely since the resolver is the same for smtp mail and for
the mailbox.

2.  RSPF.  There have been lots of problems in the RSPF code, as is well
known.  The most striking results have been from Walt, waltcy@ids.jvnc.net,
who has been porting my releases of PA0GRI to OS/2.  For a long time, he
would run fine as long as RSPF was not enabled, but would crash very
quickly when it was.  Since Walt, like me, is in Rhode Island, where
there are about 10 full-time or nearly full-time hosts running RSPF as
a test across Rhode Island and southeastern Massachusetts, his RSPF is
taxed far more heavily than it would be anywhere else.
I would not actually condemn the RSPF code itself, but would rather
suspect that there are problems in the coding of the timer process
which manifest themselves only because of the use RSPF makes of them.
Walt has stabilized his release for OS/2 by making some patches, but
whether he has found the real problem -- and whether it will crop up
again from something other than RSPF -- is not known.  RSPF makes use
of the timers for things other than timing (it uses timers as flags,
actually), so this odd characteristic may never turn up again.

3.  Synchronization.  Once every month or two, I get a report and an
error log that shows that NOS was running perfectly fine, often for
several days or even weeks, and suddenly ran itself out of memory
within a few minutes.  When this happens, coreleft drops from
60-80K down to a few hundred bytes within 5 minutes or so, and NOS
eventually crashes, locking up or rebooting depending upon
the state of the watchdog flag.

I spent a while looking for problem 3, which looks like a classic
case of deadlock in a multitasking system.  Because it happens so
rarely, there are few clues, and I have very little idea where to
look.  It may get fixed by accident when changing something else,
for all I know.
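
Since problem 3 leaves so few clues, one option is to sample coreleft
periodically and flag the moment it starts to collapse, so that a log
can be taken while the system is still alive.  This is a sketch only:
struct mem_sample and rapid_drop are hypothetical names, the samples
are assumed to be sorted by time, and the window and drop sizes would
have to be chosen by experiment.

```c
#include <stddef.h>

/* Hypothetical sample record: time in seconds and free core in bytes. */
struct mem_sample {
    long t_sec;
    long coreleft;
};

/*
 * Return 1 if free memory fell by more than drop_bytes between any two
 * samples taken at most window_sec apart, otherwise 0.
 * Assumes the samples are in increasing time order.
 */
int rapid_drop(const struct mem_sample *s, size_t n,
               long window_sec, long drop_bytes)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (s[j].t_sec - s[i].t_sec > window_sec)
                break;              /* later samples are even further away */
            if (s[i].coreleft - s[j].coreleft > drop_bytes)
                return 1;
        }
    }
    return 0;
}
```

With a 300 second window and a 50,000 byte drop, a fall from 70K to a
few hundred bytes in 5 minutes would be flagged, while slow drift over
hours would not.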

I have spent unfathomably huge amounts of time with the RSPF code,
which I feel is fundamentally flawed.  Based on the experience I have
developed with the several months of testing in the Rhode Island area,
I believe that most of the problems result from the somewhat unnatural
data structures used to represent routing information.  This, in turn,
constrains the IP router from doing things that would be nice, such as
not trashing good routes in order to test bad ones.  Changing these
data structures, for all practical purposes, is going to force RSPF
to be recoded from scratch.  I think I have figured out what the data
structures ought to be, and the coding looks like a major project, at
least a month or two more on and off.

I have spent almost no time looking into the mailbox problem.  It is
a serious problem, since it directly affects the reliability of NOS
when used as a forwarding gateway between AX.25 PBBSs and TCP/IP.  A
lot of other people have hunted for this, and I have been happy
wasting oceans of time on RSPF, so I have not gone after this myself.

Note that all of these problems are unrelated to the garbage collector.
As a practical matter, almost all NOS installations run with a fair
amount of memory free, and garbage collection generally will not occur
unless coreleft drops below mem thresh, which defaults to 8K.  If the
memory is that low, something is usually putting pressure on NOS which
is not going to be satisfied with the last 8K anyway.  Running with
mem efficient on seems to eliminate most of the motivation for doing
garbage collection in any case.  I can make about a 653K window under
DESQview on my 386, and I actually allocate about 610K for NOS.  Right
now, in that window, I have a 134,816 byte heap, of which 57,200 bytes
(42%) are free, and 41,440 bytes coreleft.  My system is very active
handling about 300-400 pieces of mail per month on Amprnet, but is also
very stable -- I usually leave my computer on 24 hours, and NOS is
generally running in background.
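
The threshold test and the free-heap percentage quoted above can be
written out as a small sketch.  The names gc_needed, heap_free_percent
and MEM_THRESH_DEFAULT are hypothetical and are not taken from the NOS
source; the 8K default is the figure given in this note.

```c
/* Assumed default for "mem thresh": 8K, as stated in the text. */
#define MEM_THRESH_DEFAULT 8192L

/* Garbage collection is only considered once coreleft is below the threshold. */
int gc_needed(long coreleft, long thresh)
{
    return coreleft < thresh;
}

/* Free share of the heap, truncated to a whole percent (0 for an empty heap). */
int heap_free_percent(long free_bytes, long heap_bytes)
{
    if (heap_bytes <= 0)
        return 0;
    return (int)(free_bytes * 100 / heap_bytes);
}
```

With the figures above, heap_free_percent(57200, 134816) gives 42, and
gc_needed(41440, MEM_THRESH_DEFAULT) is false, so no collection would
be attempted on this system.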

I am not sure I agree with Phil's daemon idea about garbage collection.
My view is that a normal system should never have to do any garbage
collects, and even one garbage collect will usually result in a crash,
sometimes immediately and sometimes several hours later.  At the very
least, I think there is a lot to worry about in the memory allocator
itself before garbage collection.
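
As an illustration of the kind of allocator checking I mean, an
invalid free can be caught by tagging every block handed out and
refusing to release anything that does not carry the tag.  This is a
sketch of a wrapper around the standard malloc and free, assuming a
C11 compiler; it is not the NOS allocator, and dbg_malloc, dbg_free
and BLOCK_LIVE are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tag written in front of every live block. */
#define BLOCK_LIVE 0xA110CA7Eu

/* Header placed before the user data; the union keeps the data aligned. */
union blk_hdr {
    uint32_t    magic;
    max_align_t align;
};

/* Allocate n bytes plus a tagged header; returns NULL on failure. */
void *dbg_malloc(size_t n)
{
    union blk_hdr *h = malloc(sizeof *h + n);
    if (h == NULL)
        return NULL;
    h->magic = BLOCK_LIVE;
    return h + 1;
}

/*
 * Release a block obtained from dbg_malloc.  Returns 0 on success and
 * -1 if the pointer is NULL or the header in front of it does not
 * carry the live tag, in which case nothing is freed.
 */
int dbg_free(void *p)
{
    if (p == NULL)
        return -1;
    union blk_hdr *h = (union blk_hdr *)p - 1;
    if (h->magic != BLOCK_LIVE)
        return -1;
    h->magic = 0;   /* clear the tag so a stale header is less likely to look live */
    free(h);
    return 0;
}
```

A caller such as smtp_send could log the -1 result along with its own
name, which would at least show where the bad pointer came from.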

-- Mike

Document URL : http://www.a00.de/tcpgroup/1992/msg00043.php