SMP meeting summary

From: jas...@canonware.com (Jason Evans)
Subject: SMP meeting summary
Date: 2000/06/25
Message-ID: <20000624235605.D8965@blitz.canonware.com>
X-Deja-AN: 638673074
Approved: n...@camelot.de
Content-Type: text/plain; charset=us-ascii
X-Complaints-To: abuse@camelot.de
X-Trace: lancelot.camelot.de 961916330 75356 195.30.224.3 (25 Jun 2000 06:58:50 GMT)
Organization: Mail2News Gateway at CameloT Online Services
Mime-Version: 1.0
NNTP-Posting-Date: 25 Jun 2000 06:58:50 GMT
Newsgroups: muc.lists.freebsd.smp,mpc.lists.freebsd.smp

[The following meeting summary was originally written by Greg Lehey, and he
later revised it to include various points from the notes that I took
during the meeting.  Finally, I edited (added some, changed some, removed
some) Greg's summary.  Thanks go to Greg for doing the majority of the
writing work!]

On the 15th and 16th of June we had a seminar at Yahoo! in Sunnyvale about
the recent changes to the BSD/OS kernel designed to improve SMP
performance.

Participants were, in seating order:

  Don Brady	   Apple Computer	       File systems
  Ramesh ?	   Apple Computer
  Ted Walker	   Apple Computer              network drivers
  Jeffrey Hsu	   FreeBSD project
  Chuck Paterson   BSDi			       Chief developer
  Jonathan Lemon   Cisco, FreeBSD project
  Matt Dillon	   FreeBSD project             VM, NFS
  Paul Saab	   Yahoo!
  Kirk McKusick
  Peter Wemm	   Yahoo!
  Jayanth ?	   Yahoo!
  Doug Rabson	   FreeBSD project	       Alpha port
  Jason Evans	   FreeBSD project	       kernel threads
  David Greenman   FreeBSD project	       chief architect
  Justin Gibbs	   Adaptec, FreeBSD project    SCSI, 0 copy TCP
  Greg Lehey	   Linuxcare, FreeBSD project  storage management
  Mike Smith	   BSDi, FreeBSD project       hardware, iA64 port
  Alfred Perlstein Wintel, FreeBSD project
  David O'Brien	   BSDi, FreeBSD project       compilers, binutils
  Ceren Ercen	   Linuxcare                   Daemon babe

We met for approximately 8 hours on Thursday and 4 hours on Friday.

Chuck Paterson spent Thursday presenting how BSDi implemented SMP in
BSD/OS 5.0 (as of yet unreleased).  Chuck also briefly explained BSD/OS
4.x's SMP implementation.

The BSD/OS 4.x SMP implementation consists mainly of a giant lock, but
with a twist.  Whenever a processor tries to acquire the giant lock it can
either succeed or fail.  If it succeeds, then it's business as usual.
However, if the acquisition fails, the processor does not spin on the giant
lock.  Instead, it acquires the schedlock (which protects scheduler-related
portions of the kernel) and schedules another runnable process, if any.
This technique works extremely well for heavy work loads that have less
than one CPU worth of system (kernel processing) load.  It is very simple,
and it achieves optimal throughput.
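
The try-then-reschedule idea can be sketched in user-space C as follows.
This is only an illustration under stated assumptions: the names
(giant_lock, sched_lock, enter_kernel, runnable) are hypothetical, the
scheduler is reduced to a counter of runnable processes, and spinning
briefly on the schedlock is an assumption, not something BSD/OS is known
to do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names throughout; this is not the BSD/OS API. */
static atomic_flag giant_lock = ATOMIC_FLAG_INIT;
static atomic_flag sched_lock = ATOMIC_FLAG_INIT;

/* Stand-in for the run queue: number of runnable user-mode processes. */
static int runnable = 0;

enum kernel_entry { ENTERED_KERNEL, RAN_OTHER_PROCESS, IDLED };

static bool giant_try_acquire(void)
{
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set_explicit(&giant_lock,
                                              memory_order_acquire);
}

static void giant_release(void)
{
    atomic_flag_clear_explicit(&giant_lock, memory_order_release);
}

static enum kernel_entry enter_kernel(void)
{
    if (giant_try_acquire())
        return ENTERED_KERNEL;          /* business as usual */

    /* Acquisition failed: do not spin on the giant lock.  Take the
     * scheduler lock (assumed to be held only briefly) and look for
     * another runnable process instead. */
    while (atomic_flag_test_and_set_explicit(&sched_lock,
                                             memory_order_acquire))
        ;
    enum kernel_entry result = IDLED;
    if (runnable > 0) {
        runnable--;                     /* "switch" to that process */
        result = RAN_OTHER_PROCESS;
    }
    atomic_flag_clear_explicit(&sched_lock, memory_order_release);
    return result;
}
```

The point of the design is that a processor that loses the race for the
giant lock still does useful user-mode work, which is why it suits loads
with less than one CPU worth of kernel time.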

The BSD/OS 5.0 SMP implementation is more complex, and is what most of the
meeting time was spent discussing.  From here on, all discussion of BSD/OS
is with regard to 5.0.

1.  Source code access.

    BSD/OS is a proprietary operating system, for which binary-only and
    source code licenses are available.  BSD/OS is based on the same free
    sources (4.4BSD) as the free BSD operating systems.  It is similar to
    FreeBSD, though the two have diverged significantly enough to cause
    serious pains when moving kernel code between them.

    A few weeks back, BSDi made the source code of BSD/OS available to all
    FreeBSD committers.  During the meeting we discussed what this really
    means, and Kirk McKusick (amongst other things chairman of the board of
    BSDi) said, "Well, we're quite happy for you to take generous chunks,
    but if you ended up taking it all, people might get a little uneasy".
    Basically, anything short of simple repackaging of BSD/OS is not an
    issue.

2.  The current problems.

    UNIX was written for single processor machines, and many of the design
    choices are not just suboptimal for SMP, they're just plain ugly.  In
    particular the synchronization mechanisms don't work well with more
    than one processor.  Briefly:

    - The process context, including the upper half of device drivers,
      doesn't need to protect itself.  The kernel is non-preemptive, so as
      long as a process is executing in the kernel, no other process can
      execute in the kernel.  If another process, even with higher
      priority, becomes runnable while a process is executing kernel code,
      it will have to wait until the active process leaves the kernel or
      sleeps.

    - Processes protect themselves against the interrupt context, primarily
      the bottom half of device drivers, by masking interrupts.  The
      original PDP-11 UNIX used the hardware priority levels (numbered 4 to
      7), and even today you'll find function calls like spl4() and spl7()
      in System V code.  BSD changed the names to more descriptive terms
      like splbio(), splnet() and splhigh(), and also replaced the fixed
      priorities by interrupt masks, but the principle remains the same.
      It's not always easy to solve the question of which interrupts need
      to be masked in which context, and one of the interesting
      observations at this meeting was that as time goes on, the interrupt
      masks are getting blacker.  In other words each spl() is masking off
      more and more bits in the interrupt mask register.  This is not good
      for performance.

    - Processes synchronize with each other using the sleep() or tsleep()
      calls.  Traditional UNIX, including System V, uses sleep(), and BSD
      prefers tsleep(), which provides nice strings which ps(1) displays to
      show what the process is waiting for.  FreeBSD no longer has a
      sleep() call, while BSD/OS has both, but sleep() is deprecated.
      tsleep() is used both for voluntary process synchronization
      (e.g. send a request to another process and wait until it is
      finished), and for involuntary synchronization (e.g. wait for a
      shared resource to become available).

      Processes sleep on a specific address.  In many cases, the address in
      itself has no meaning, and it's probably easier to think of it as a
      number.  When a process sleeps, it is put on a sleep queue.  The
      wakeup() function takes the sleep address, walks through the sleep
      queue, and wakes *every* process which is sleeping on this address.
      This can cause massive problems even on single processor machines;
      UNIX was never really intended to have hundreds of processes waiting
      on the same resource, and a number of Apache performance problems
      center around this behavior.  As a partial solution, FreeBSD also has
      an additional function, wakeup_one(), which only wakes one process.
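
The difference between waking every sleeper and waking one can be shown
with a small simulation.  This is a sketch, not kernel code: the sim_*
functions and the fixed-size array standing in for the sleep queue are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAXPROC 8

struct proc {
    const void *wchan;      /* address slept on; NULL when runnable */
};

/* One global "sleep queue"; zero-initialised, so everyone is runnable. */
static struct proc sleepq[MAXPROC];

/* Put process slot i to sleep on the given address. */
static void sim_sleep(int i, const void *chan)
{
    assert(chan != NULL);
    sleepq[i].wchan = chan;
}

/* Wake every sleeper on chan; returns how many became runnable. */
static int sim_wakeup(const void *chan)
{
    int woken = 0;
    assert(chan != NULL);
    for (int i = 0; i < MAXPROC; i++) {     /* walks the whole queue */
        if (sleepq[i].wchan == chan) {
            sleepq[i].wchan = NULL;
            woken++;
        }
    }
    return woken;
}

/* Wake only the first sleeper found on chan; returns 0 or 1. */
static int sim_wakeup_one(const void *chan)
{
    assert(chan != NULL);
    for (int i = 0; i < MAXPROC; i++) {
        if (sleepq[i].wchan == chan) {
            sleepq[i].wchan = NULL;
            return 1;
        }
    }
    return 0;
}
```

With hundreds of sleepers on one address, sim_wakeup() makes all of them
runnable even though only one can take the resource; the rest run just
long enough to go back to sleep.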

   There are a number of reasons why this concept is not a good solution
   for SMP.  Firstly, the simplistic assumption "nothing else can be
   executing in the kernel while I am" falls flat.  We currently haven't
   implemented a solution for this.  Instead, we found a way of enforcing
   this illogical state, the Big Giant Lock (BGL).  Any process entering
   the kernel must first obtain the BGL; if a process executing on another
   processor has the lock, then the current processor spins; it can't even
   schedule another process to run, because that requires entering the
   kernel.

   The other issue is with masking interrupts.  This is also quite a
   problem for SMP machines, since it requires masking the interrupts on
   all processors, again requiring an expensive synchronization.

3. The BSD/OS solution.

   - The BGL remains, but becomes increasingly meaningless.  In particular,
     it is not always necessary to obtain it in order to enter the kernel.

   - Instead the system protects shared data structures with mutexes.
     These mutexes replace calls to tsleep() when waiting on shared
     resources (the involuntary process synchronization mentioned above).
     In contrast to traditional UNIX, mutexes will be used much more
     frequently in order to protect data structures which were previously
     implicitly protected by the non-preemptive nature of the kernel.  This
     mechanism will replace calls to tsleep() for involuntary context
     switches.

     Compared with the use of tsleep(), mutexes have a number of
     advantages:

     - Each mutex has its own wait (sleep) queue.  When a process releases
       a mutex, it automatically schedules the next process waiting on the
       queue.  This is more efficient than searching a possibly very long,
       linear sleep queue.  It also avoids the flooding when multiple
       processes get scheduled, and most of them have to go back to sleep
       again.

     - Mutexes can be a combination of spin and sleep mutexes: for a
       resource which may be held only for a very short period of time,
       even the overhead of sleeping and rescheduling may be higher than
       waiting in a tight loop.  A spin/sleep lock might first wait in a
       tight loop for 2 microseconds and then sleep if the lock is still
       not available at that time.  This is an issue which Sun has
       investigated in great detail with Solaris.  BSDi has not pursued
       this yet, though the BSD/OS threading primitives make this an easy
       extension to add.  It's possibly an area for us to investigate once
       the system is up and limping again.
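
     A combined spin/sleep mutex might look roughly like the following
     user-space sketch built on POSIX threads.  The ssm_* names and the
     SPIN_TRIES budget (standing in for the "2 microseconds" above) are
     assumptions; the BSD/OS primitives are not shown here.

```c
#include <assert.h>
#include <pthread.h>

#define SPIN_TRIES 1000     /* assumed spin budget before sleeping */

struct spinsleep_mtx {
    pthread_mutex_t m;
    unsigned long spins;    /* acquisitions won while spinning */
    unsigned long sleeps;   /* acquisitions that fell back to blocking */
};

static void ssm_init(struct spinsleep_mtx *l)
{
    pthread_mutex_init(&l->m, NULL);
    l->spins = 0;
    l->sleeps = 0;
}

static void ssm_enter(struct spinsleep_mtx *l)
{
    /* Phase 1: poll in a tight loop, hoping the holder is nearly done. */
    for (int i = 0; i < SPIN_TRIES; i++) {
        if (pthread_mutex_trylock(&l->m) == 0) {
            l->spins++;             /* counters are updated lock held */
            return;
        }
    }
    /* Phase 2: give up the CPU and sleep until the lock is handed over. */
    pthread_mutex_lock(&l->m);
    l->sleeps++;
}

static void ssm_exit(struct spinsleep_mtx *l)
{
    pthread_mutex_unlock(&l->m);
}
```

     The two counters make it easy to see, for a given lock, whether the
     spin phase is paying for itself or whether most acquisitions end up
     sleeping anyway.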

   - Interrupt lockouts (spl()s) go away completely.  Instead, interrupt
     functions use mutexes for synchronization.  This means that an
     interrupt function must be capable of blocking, which is currently
     impossible.  In order to block, the function must have a "process"
     context (a stack and a process control block).  In particular, this
     could include kernel threads.

     BSD/OS on Intel currently uses light-weight interrupt threads to
     process interrupts, while on SPARC it uses normal ("heavyweight")
     processes.  Chuck indicated that the decision to implement
     light-weight threads initially was probably the wrong one, since it
     gave rise to a large number of problems, and although the heavyweight
     process model would give lousy performance, it would probably make it
     easier to develop the kernel while the light-weight processes were
     being debugged.  There is also the possibility of building a kernel
     with one or the other support, so that in case of problems during
     development it would be possible to revert to the heavy-weight
     processes while searching for the bug.

   Other details we discussed included:

   - BSD/OS has not implemented condition variables.  We didn't go into
     details.  The opinion was expressed that they would be very useful for
     synchronization, but that they require coding discipline somewhat
     different than the tsleep() mechanism.  Converting all use of tsleep()
     is a lot of work, and of dubious value.  However, condition variables
     can live along with tsleep(), so a full changeover is not necessary.

   - BSD/OS also does not implement read/write locks, a thing that Linux
     has recently introduced.  We didn't discuss this further, but the
     concepts are well understood, and it should be relatively simple to
     implement them if we find a need.

   - Netgraph poses locking performance problems, since locks have to be
     released at multiple potential transfer points, regardless of whether
     Netgraph is in use.  This problem also exists with System V STREAMS.
     During the meeting we didn't come to a clear consensus on how much of
     a problem this really is.

   - Interrupts can have priority inversion problems on MP machines in
     combination with lazy context switching (aka context stealing).
     However, it's a temporary inversion that just causes latency.  The
     reason that deadlock never occurs is that as soon as a lock is missed,
     the interrupt stack stealing is unwound, so there is never a situation
     where a lock is held that can cause deadlock.  When a high-priority
     process waits for a lower priority process, the blocked process
     temporarily lends its priority to the process holding the lock in
     order to ensure that it finishes quickly.  This technique is
     interchangeably called priority inheritance, priority lending,
     deadlock avoidance, and probably other names, just to make things
     confusing.

   - NFS leasing causes big problems.  Samba will have similar problems,
     potentially.

   - Message queues are probably worthwhile, but they're currently not a
     high priority.

   - There are a number of global variable updates that are not locked, and
     can thus result in partially updated variables (i.e. reads can get
     corrupt values).  This requires either using a locked instruction, or
     using a mutex.  A mutex isn't much more expensive, and is probably
     easier.

   - We should split part of struct proc out into a fixed-size kproc for ps
     use.  This isn't really related to the SMP work, but it would be nice
     to get rid of the dreaded "proc size mismatch" error message that
     people get when their kernel is out of sync with userland.

   - We spoke about naming conventions.  Some people weren't too happy with
     BSD/OS's macro names.  Chuck agreed and said that he would adopt our
     naming convention if we chose a better one.

   - Per-CPU variables need GET_*() and SET_*() routines to lock.
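
   On the unlocked global variable updates mentioned in the list above,
   the two options (a locked instruction or a mutex) can be sketched in
   user-space C.  The counter names are hypothetical, and C11 atomics
   plus POSIX mutexes stand in for whatever the kernel would use.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Option 1: atomic read-modify-write.  On x86 compilers typically emit
 * a LOCK-prefixed instruction for this. */
static _Atomic uint64_t pkts_atomic;

static void count_packet_atomic(void)
{
    atomic_fetch_add_explicit(&pkts_atomic, 1, memory_order_relaxed);
}

static uint64_t read_packets_atomic(void)
{
    return atomic_load_explicit(&pkts_atomic, memory_order_relaxed);
}

/* Option 2: a plain variable that is only touched with a mutex held,
 * so a reader can never observe a half-written value. */
static uint64_t pkts_locked;
static pthread_mutex_t pkts_mtx = PTHREAD_MUTEX_INITIALIZER;

static void count_packet_locked(void)
{
    pthread_mutex_lock(&pkts_mtx);
    pkts_locked++;
    pthread_mutex_unlock(&pkts_mtx);
}

static uint64_t read_packets_locked(void)
{
    pthread_mutex_lock(&pkts_mtx);
    uint64_t v = pkts_locked;
    pthread_mutex_unlock(&pkts_mtx);
    return v;
}
```

   The mutex version is easier to extend when several related variables
   must change together, which fits the remark that a mutex is probably
   the easier choice.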

4. Things we need to do.

   There are a number of things we need to do.  During the meeting we
   didn't get beyond deciding the first couple of things:

    - First remove the BGL (currently a spinlock) and replace it with two,
      maybe three mutexes, one for the scheduler (schedlock), and a
      blocking mutex for the kernel in place of the BGL.  BSD/OS also has
      an ipending lock for posting interrupts, which we should probably
      implement in the short term, though it's possible that it might go
      away again.

    - In addition, implement the heavy-weight interrupt processes.  These
      would remain in place while the light-weight threads were being
      debugged.

5. Who does what?

   A number of people will work on the SMP project.  During the first stage:

  - Matt Dillon will put in locking primitives and schedlock.  This
    includes resurrecting our long-dead idle process to scan the run queue
    for interrupt threads.  He won't have time for NFS.

  - Doug Rabson will work on the alpha bits, so that it doesn't get left in
    the dust.

  - Greg Lehey will implement the heavyweight interrupt processes and
    lightweight interrupt threads.

  - Jason Evans will be the project manager.

6. Timing.

  We have a general agreement that it's better to do it right than do it
  quickly.  Thus far, Matt has implemented much of his part and is now
  waiting on Greg to do the interrupt processes.  When they've done that,
  they'll do their own tests, and others will do additional testing.  All
  commits will be dependent on approval from Jason, and the first can be
  expected within two months (probably sooner).

  The SMP changes will be maintained as patches against -current until the
  following milestones have been met:

   - Port the BSD/OS locking primitives to the i386 port (Matt) and the
     alpha port (Doug Rabson).

   - Convert the BGL to a blocking lock, add the schedlock, add per-CPU
     idle processes (Matt).

   - Implement heavy-weight interrupt threads (Greg).  Light-weight
     interrupt thread context switching may be working by the time the
     first commit is made, but this is not a requirement.

   - Stub out (basically disable) spl()s.

   - Demonstrated successful compilation and reasonable stability
     (self-hosted kernel build) on both i386 (UP and SMP) and alpha.

  The maintenance of the patches is expected to be a bit of a pain, but we
  have decided not to branch due to technical issues with maintaining
  branches in CVS.  The patches are expected to exist only until the first
  commit is made.  At that point, all further development will be done
  directly on HEAD in CVS.

On the light side, we had a rather amusing experience on Friday.  We wanted
to order some sandwiches, but something went wrong with the order, so Paul
ordered pizza instead.  A bit later, the pizza boy came in and deposited
the pizzas on the conference table and was about to leave when Paul
introduced him.  His name is David Filo.  Thanks for the pizza!


To Unsubscribe: send mail to majord...@FreeBSD.org
with "unsubscribe freebsd-smp" in the body of the message

From: Daniel Eischen <eisc...@vigrid.com>
Subject: Re: SMP meeting summary
Date: 2000/06/25
Message-ID: <Pine.SUN.3.91.1000625091445.2784A-100000@pcnet1.pcnet.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <20000624235605.D8965@blitz.canonware.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

On 24 Jun 2000, Jason Evans wrote:
> 3. The BSD/OS solution.
> 
>    - The BGL remains, but becomes increasingly meaningless.  In particular,
>      it is not always necessary to obtain it in order to enter the kernel.
> 
>    - Instead the system protects shared data structures with mutexes.
>      These mutexes replace calls to tsleep() when waiting on shared
>      resources (the involuntary process synchronization mentioned above).
>      In contrast to traditional UNIX, mutexes will be used much more
>      frequently in order to protect data structures which were previously
>      implicitly protected by the non-preemptive nature of the kernel.  This
>      mechanism will replace calls to tsleep() for involuntary context
>      switches.
> 
>      Compared with the use of tsleep(), mutexes have a number of
>      advantages:
> 
>      - Each mutex has its own wait (sleep) queue.  When a process releases
>        a mutex, it automatically schedules the next process waiting on the
>        queue.  This is more efficient than searching a possibly very long,
>        linear sleep queue.  It also avoids the flooding when multiple
>        processes get scheduled, and most of them have to go back to sleep
>        again.
> 
>      - Mutexes can be a combination of spin and sleep mutexes: for a
>        resource which may be held only for a very short period of time,
>        even the overhead of sleeping and rescheduling may be higher than
>        waiting in a tight loop.  A spin/sleep lock might first wait in a
>        tight loop for 2 microseconds and then sleep if the lock is still
>        not available at that time.  This is an issue which Sun has
>        investigated in great detail with Solaris.  BSDi has not pursued
>        this yet, though the BSD/OS threading primitives make this an easy
>        extension to add.  It's possibly an area for us to investigate once
>        the system is up and limping again.

If anyone is interested...

All high-level interrupts (levels 11-15, mostly PIO serial interrupts)
in Solaris use spin mutexes and don't use an interrupt thread.  They
execute in the context of the thread that was currently running.  All
other interrupts below level 11 (clock, network, disk, etc) use interrupt
threads.

A Solaris (non-spinning) mutex will only spin while the owning thread is 
running.  Since BSDi mutexes have owners (correct me if I'm wrong), this
seems to be better than arbitrarily spinning.
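
The "spin only while the owner is running" rule can be sketched like
this.  It is an illustration only: struct thread, the on_cpu flag and
the adaptive_* names are hypothetical, not the Solaris or BSDi
interfaces, and the sketch ignores the question of keeping the owner
structure alive while it is being inspected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct thread {
    atomic_bool on_cpu;     /* kept up to date by a (hypothetical) scheduler */
};

struct adaptive_mtx {
    _Atomic(struct thread *) owner;     /* NULL when the mutex is free */
};

enum mtx_action { MTX_ACQUIRED, MTX_SPIN, MTX_SLEEP };

/* One step of the acquire loop for thread `self`: either take the
 * mutex, or say whether the caller should keep spinning or go to sleep. */
static enum mtx_action adaptive_try(struct adaptive_mtx *m,
                                    struct thread *self)
{
    struct thread *expected = NULL;

    if (atomic_compare_exchange_strong(&m->owner, &expected, self))
        return MTX_ACQUIRED;

    /* The failed exchange left the current owner in `expected`.  Spin
     * only if that owner is on a CPU and so can release the lock soon. */
    return atomic_load(&expected->on_cpu) ? MTX_SPIN : MTX_SLEEP;
}

static void adaptive_exit(struct adaptive_mtx *m, struct thread *self)
{
    struct thread *prev = atomic_exchange(&m->owner, NULL);

    assert(prev == self);   /* only the owner may release */
    (void)prev;
    (void)self;
}
```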

> 
>    - Interrupt lockout (spl()s) go away completely.  Instead, interrupt
>      functions use mutexes for synchronization.  This means that an
>      interrupt function must be capable of blocking, which is currently
>      impossible.  In order to block, the function must have a "process"
>      context (a stack and a process control block).  In particular, this
>      could include kernel threads.
> 
>      BSD/OS on Intel currently uses light-weight interrupt threads to
>      process interrupts, while on SPARC uses normal ("heavyweight")
>      processes.  Chuck indicated that the decision to implement
>      light-weight threads initially was probably the wrong one, since it
>      gave rise to a large number of problems, and although the heavyweight
>      process model would give lousy performance, it would probably make it
>      easier to develop the kernel while the light-weight processes were
>      being debugged.  There is also the possibility of building a kernel
>      with one or the other support, so that in case of problems during
>      development it would be possible to revert to the heavy-weight
>      processes while searching for the bug.
> 
>    Other details we discussed included:
> 
>    - BSD/OS has not implemented condition variables.  We didn't go into
>      details.  The opinion was expressed that they would be very useful for
>      synchronization, but that they require coding discipline somewhat
>      different than the tsleep() mechanism.  Converting all use of tsleep()
>      is a lot of work, and of dubious value.  However, condition variables
>      can live along with tsleep(), so a full changeover is not necessary.

For a lot of drivers, it seems pretty straightforward to convert
splXXX() ... tsleep() ... splx() to mtx_enter() ... cv_wait()/cv_wait_sig()
... mtx_exit().
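
As a rough user-space analogy, with POSIX threads standing in for the
kernel primitives (pthread_mutex_lock for mtx_enter(), pthread_cond_wait
for cv_wait(), pthread_mutex_unlock for mtx_exit()), the converted
pattern might look like this.  struct softc and its fields are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct softc {                      /* hypothetical per-device state */
    pthread_mutex_t mtx;            /* replaces the splXXX()/splx() pair */
    pthread_cond_t  done_cv;        /* replaces the tsleep() address */
    bool            io_done;
};

static void softc_init(struct softc *sc)
{
    pthread_mutex_init(&sc->mtx, NULL);
    pthread_cond_init(&sc->done_cv, NULL);
    sc->io_done = false;
}

/* Upper half: wait until the transfer has finished. */
static void wait_for_io(struct softc *sc)
{
    pthread_mutex_lock(&sc->mtx);                   /* mtx_enter() */
    while (!sc->io_done)                            /* recheck after wakeup */
        pthread_cond_wait(&sc->done_cv, &sc->mtx);  /* cv_wait() */
    sc->io_done = false;                            /* consume the event */
    pthread_mutex_unlock(&sc->mtx);                 /* mtx_exit() */
}

/* Interrupt side: record completion and wake one waiter. */
static void io_complete(struct softc *sc)
{
    pthread_mutex_lock(&sc->mtx);
    sc->io_done = true;
    pthread_cond_signal(&sc->done_cv);
    pthread_mutex_unlock(&sc->mtx);
}
```

The explicit predicate (io_done) is the main change in coding discipline
compared with tsleep(): the waiter rechecks a condition under the mutex
instead of relying on the wakeup alone.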

> 
>    - BSD/OS also does not implement read/write locks, a thing that Linux
>      has recently introduced.  We didn't discuss this further, but the
>      concepts are well understood, and it should be relatively simple to
>      implement them if we find a need.

Mutexes are only used in Solaris when they will be held for very small
amounts of time.  Read/write locks and semaphores are used for all
other instances.  While we are modifying the kernel to add mutexes,
it would probably be worthwhile to comment those sections of code
that could hold mutexes for something other than a very short period
of time.  Or even use a different naming convention for those mutexes.

-- 
Dan Eischen



From: Terry Lambert <tlamb...@primenet.com>
Subject: Re: SMP meeting summary
Date: 2000/06/25
Message-ID: <200006251736.KAA09884@usr02.primenet.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Content-Transfer-Encoding: 7bit
References: <Pine.SUN.3.91.1000625091445.2784A-100000@pcnet1.pcnet.com>
Content-Type: text/plain; charset=us-ascii
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

> All high-level interrupts (levels 11-15, mostly PIO serial interrupts)
> in Solaris use spin mutexes and don't use an interrupt thread.  They
> execute in the context of the thread that was currently running.  All
> other interrupts below level 11 (clock, network, disk, etc) use interrupt
> threads.
> 
> A Solaris (non-spinning) mutex will only spin while the owning thread is 
> running.  Since BSDi mutexes have owners (correct me if I'm wrong), this
> seems to be better than arbitrarily spinning.

We need to learn from Dynix (Sequent's UNIX).

The main issue that blocks concurrency is access to shared resources.

Critical sectioning is actually better than mutex protection of
structures for maximizing concurrency, but few people appear to be
willing to go down this road, since it requires flattening the call
graph for much of the kernel to ensure that locks are held and
released at the same call level, so that stack unwinding is not
needed to permit preemption.

Dynix had no problem with 32 processors.  Most SVR4 variants, and
I will include Solaris in this, use mutex protection of structures,
and start to fall down drastically over 4 processors.

The main reason Dynix did not have this scaling issue is that it
dealt with the shared resource issue by placing most objects into
per-processor allocation/deallocation pools.  These pools were
filled/drained from/to system pools.  Lock contention was only
necessary when the pools needed filling/draining, or when an object
was being migrated between CPUs.
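
A minimal sketch of the per-processor pool idea follows.  It is not
Dynix code: the names, the batch size and the pool limit are made up,
the lock on the system pool is represented only by a counter of how
often it would have been taken, and each per-CPU pool is assumed to be
touched only by its own CPU.

```c
#include <assert.h>
#include <stddef.h>

#define NCPU        2
#define NOBJ        32
#define BATCH       4       /* objects moved per refill or drain */
#define PERCPU_MAX  8       /* drain when a CPU pool grows past this */

struct obj { struct obj *next; };

static struct obj storage[NOBJ];
static struct obj *global_pool;         /* needs a lock in a real kernel */
static unsigned long global_lock_taken; /* trips to the shared pool */

struct cpu_pool { struct obj *head; int count; };
static struct cpu_pool pools[NCPU];

static void pools_init(void)
{
    global_pool = NULL;
    global_lock_taken = 0;
    for (int i = 0; i < NOBJ; i++) {
        storage[i].next = global_pool;
        global_pool = &storage[i];
    }
    for (int c = 0; c < NCPU; c++) {
        pools[c].head = NULL;
        pools[c].count = 0;
    }
}

static struct obj *obj_alloc(int cpu)
{
    struct cpu_pool *p = &pools[cpu];

    if (p->head == NULL) {              /* slow path: refill a batch */
        global_lock_taken++;            /* lock(global) would go here */
        for (int i = 0; i < BATCH && global_pool != NULL; i++) {
            struct obj *o = global_pool;
            global_pool = o->next;
            o->next = p->head;
            p->head = o;
            p->count++;
        }
    }
    struct obj *o = p->head;            /* fast path: no shared lock */
    if (o != NULL) {
        p->head = o->next;
        p->count--;
    }
    return o;
}

static void obj_free(int cpu, struct obj *o)
{
    struct cpu_pool *p = &pools[cpu];

    assert(o != NULL);
    o->next = p->head;                  /* fast path: no shared lock */
    p->head = o;
    p->count++;
    if (p->count > PERCPU_MAX) {        /* slow path: drain a batch */
        global_lock_taken++;
        for (int i = 0; i < BATCH; i++) {
            struct obj *d = p->head;
            p->head = d->next;
            p->count--;
            d->next = global_pool;
            global_pool = d;
        }
    }
}
```

In this sketch eight allocations on one CPU touch the shared pool only
twice, and freeing them again does not touch it at all.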

Similarly, one can consider that the idea of CPU reentrancy into
the kernel is identical in all but inter-CPU synchronization to
the idea of kernel preemption.

It would perhaps be a good idea from this standpoint to adopt the
realtime code recently donated to the OpenBSD project, since the
issues involved in making a kernel RT are similar to those of
ensuring SMP kernel reentrancy without blocking on resource
contention.

> Mutexes are only used in Solaris when they will be held for very small
> amounts of time.  Read/write locks and semaphores are used for all
> other instances.  While we are modifying the kernel to add mutexes,
> it would probably be worthwhile to comment those sections of code
> that could hold mutexes for something other than a very short period
> of time.  Or even use a different naming convention for those mutexes.

Anything that can hold a mutex for other than a very short time will
need to go away.  This is one of the problems with data protection
rather than critical sectioning.

Reader/writer locks are an obvious optimization, if one is to use
mutex protection of data.  Another similar optimization is intention
mode locking.  The Soft Updates dependency flooding problem that is
associated with an update being committed to the update clock list,
and someone else needing to access it (the poor ZD Labs benchmark
results were in part traced to this), is one place where intention
mode locks would be useful in increasing concurrency.

Search AltaVista for "+intention +lock +SIX" to find the relevant
literature.
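
For reference, the usual lock-mode compatibility matrix from that
literature (intention shared, intention exclusive, shared, shared with
intention exclusive, exclusive) can be written down as a small table.
The identifiers below are illustrative, and how such modes would map
onto the Soft Updates structures is not worked out here.

```c
#include <assert.h>
#include <stdbool.h>

enum lock_mode { LM_IS, LM_IX, LM_S, LM_SIX, LM_X, LM_NMODES };

/* compatible[held][wanted]: may `wanted` be granted while `held` is? */
static const bool compatible[LM_NMODES][LM_NMODES] = {
    /*             IS     IX     S      SIX    X     */
    /* IS  */ { true,  true,  true,  true,  false },
    /* IX  */ { true,  true,  false, false, false },
    /* S   */ { true,  false, true,  false, false },
    /* SIX */ { true,  false, false, false, false },
    /* X   */ { false, false, false, false, false },
};

/* held[m] is the number of current holders in mode m. */
static bool can_grant(const int held[LM_NMODES], enum lock_mode want)
{
    for (int m = 0; m < LM_NMODES; m++) {
        if (held[m] > 0 && !compatible[m][want])
            return false;
    }
    return true;
}
```

The gain in concurrency comes from the intention modes: two updaters
holding IX on a list do not block each other, and only conflict further
down, on the individual entries they actually lock.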

					Terry Lambert
					te...@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


From: Nate Williams <n...@yogotech.com>
Subject: Re: SMP meeting summary
Date: 2000/06/25
Message-ID: <200006260442.WAA15731@nomad.yogotech.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Content-Transfer-Encoding: 7bit
References: <Pine.SUN.3.91.1000625091445.2784A-100000@pcnet1.pcnet.com> <200006251736.KAA09884@usr02.primenet.com>
Content-Type: text/plain; charset=us-ascii
MIME-Version: 1.0
Reply-To: n...@yogotech.com (Nate Williams)
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

> Dynix had no problem with 32 processors.  Most SVR4 variants, and
> I will include Solaris in this, use mutex protection of structures,
> and start to fall down drastically over 4 processors.

Amazing that you say this, yet I see extremely good results on Solaris
boxes up to 64 processors.

Suffice it to say that I'm not convinced, nor am I convinced that
mutexes around data structures are any different than critical
sectioning.

They are essentially the same thing, in that the critical section is
almost always the code that deals with a particular (shared) data
structure.

Nate


From: Frank Mayhar <fr...@exit.com>
Subject: Re: SMP meeting summary
Date: 2000/06/25
Message-ID: <200006260632.XAA43962@realtime.exit.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Content-Transfer-Encoding: 7bit
References: <200006260442.WAA15731@nomad.yogotech.com>
Content-Type: text/plain; charset=US-ASCII
Organization: Exit Consulting
MIME-Version: 1.0
Reply-To: fr...@exit.com
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

Nate Williams wrote:
> > Dynix had no problem with 32 processors.  Most SVR4 variants, and
> > I will include Solaris in this, use mutex protection of structures,
> > and start to fall down drastically over 4 processors.
> Amazing that you say this, yet I see extremely good results on Solaris
> boxes up to 64 processors.

Hmm.  Do you have numbers?

> Suffice it to say that I'm not convinced, nor am I convinced that
> mutexes around data structures are any different than critical
> sectioning.
> 
> They are essentially the same thing, in that the critical section is
> almost always the code that deals with a particular (shared) data
> structure.

I agree with this, but I can state that Unixware doesn't scale well (i.e.
linearly) over roughly four (or possibly eight, it's been a while since I
looked at the numbers) processors.  This has been clearly shown by various
and sundry benchmarks at Compaq and elsewhere.

I do like the "per-cpu pool" idea, though (although I haven't yet thought
it completely through).  This would have to, I think, go along with a much
stronger CPU affinity for threads and interrupts.  I clearly see that, under
4.0, processes pretty much freely migrate from CPU to CPU.  This is bad for
SMP performance (kills the cache) and would also mean that if a process
using a particular structure were on a different CPU, it would have to
either move to the proper CPU or move the structure into the per-CPU pool
of the CPU it's using.  Or use a non-CPU-pool structure.

I'll have to think about this some more.  It's an interesting idea, though.
-- 
Frank Mayhar fr...@exit.com	http://www.exit.com/
Exit Consulting                 http://store.exit.com/


From: Luoqi Chen <lu...@watermarkgroup.com>
Subject: Re:  SMP meeting summary
Date: 2000/06/26
Message-ID: <200006261646.e5QGkUS06290@lor.watermarkgroup.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

>      Compared with the use of tsleep(), mutexes have a number of
>      advantages:
> 
>      - Each mutex has its own wait (sleep) queue.  When a process releases
>        a mutex, it automatically schedules the next process waiting on the
>        queue.  This is more efficient than searching a possibly very long,
>        linear sleep queue.  It also avoids the flooding when multiple
>        processes get scheduled, and most of them have to go back to sleep
>        again.
> 
What about processes of different priorities blocking for the same mutex?
Would you do a linear search on the queue, or keep the queue sorted by
priority?  Or is a FIFO queue good enough?

-lq



From: jas...@canonware.com (Jason Evans)
Subject: Re: SMP meeting summary
Date: 2000/06/26
Message-ID: <20000626110633.F8965@blitz.canonware.com>#1/1
X-Deja-AN: 639177728
Approved: n...@camelot.de
References: <200006261646.e5QGkUS06290@lor.watermarkgroup.com>
Content-Type: text/plain; charset=us-ascii
X-Complaints-To: abuse@camelot.de
X-Trace: lancelot.camelot.de 962042970 76295 195.30.224.3 (26 Jun 2000 18:09:30 GMT)
Organization: Mail2News Gateway at CameloT Online Services
Mime-Version: 1.0
NNTP-Posting-Date: 26 Jun 2000 18:09:30 GMT
Newsgroups: muc.lists.freebsd.smp,mpc.lists.freebsd.smp

On Mon, Jun 26, 2000 at 12:46:30PM -0400, Luoqi Chen wrote:
> >      Compared with the use of tsleep(), mutexes have a number of
> >      advantages:
> > 
> >      - Each mutex has its own wait (sleep) queue.  When a process releases
> >        a mutex, it automatically schedules the next process waiting on the
> >        queue.  This is more efficient than searching a possibly very long,
> >        linear sleep queue.  It also avoids the flooding when multiple
> >        processes get scheduled, and most of them have to go back to sleep
> >        again.
> > 
> What about processes of different priorities blocking for the same mutex?
> Would you do a linear search on the queue? or have the queue sorted by
> priority? or a FIFO queue is good enough?

Processes that block on a mutex are granted the lock in FIFO order, rather
than priority order.  In order to avoid priority inversion, the mutex wait
queue implements priority lending.
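
A minimal sketch of FIFO granting combined with priority lending follows.
It is a single-threaded model, not the BSD/OS code: the names (struct proc,
struct mutex, mtx_block, mtx_exit) are hypothetical, larger numbers are
assumed to mean more urgent, and resetting the old owner to its base
priority ignores the case where it still holds other contended mutexes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the BSD/OS structures. */
struct proc {
    int base_pri;        /* priority the process normally runs at */
    int cur_pri;         /* effective priority, possibly lent by a waiter */
    struct proc *next;   /* link in a mutex wait queue */
};

struct mutex {
    struct proc *owner;  /* NULL when the mutex is free */
    struct proc *head;   /* oldest waiter */
    struct proc *tail;   /* newest waiter */
};

/* Take the mutex if it is free; returns 1 on success, 0 if it is held. */
static int mtx_try_enter(struct mutex *m, struct proc *p)
{
    if (m->owner != NULL)
        return 0;
    m->owner = p;
    return 1;
}

/* Queue p behind the existing waiters and lend its priority to the owner. */
static void mtx_block(struct mutex *m, struct proc *p)
{
    assert(m->owner != NULL && m->owner != p);
    p->next = NULL;
    if (m->tail != NULL)
        m->tail->next = p;
    else
        m->head = p;
    m->tail = p;
    /* Priority lending: the owner runs at least as urgently as p. */
    if (p->cur_pri > m->owner->cur_pri)
        m->owner->cur_pri = p->cur_pri;
}

/* Release the mutex and hand it to the oldest waiter (FIFO order).
 * Returns the new owner, or NULL if nobody was waiting. */
static struct proc *mtx_exit(struct mutex *m)
{
    struct proc *old = m->owner;
    struct proc *next = m->head;

    assert(old != NULL);
    old->cur_pri = old->base_pri;      /* give back any lent priority */
    if (next != NULL) {
        m->head = next->next;
        if (m->head == NULL)
            m->tail = NULL;
        next->next = NULL;
    }
    m->owner = next;
    /* The new owner borrows from whoever is still waiting behind it. */
    for (struct proc *w = m->head; next != NULL && w != NULL; w = w->next)
        if (w->cur_pri > next->cur_pri)
            next->cur_pri = w->cur_pri;
    return next;
}
```

The point of the model is that the grant order depends only on arrival
order, while the lending step keeps a low-priority owner from being starved
by medium-priority processes while a more urgent process waits behind it.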

Jason



From: Luoqi Chen <lu...@watermarkgroup.com>
Subject: Re: SMP meeting summary
Date: 2000/06/26
Message-ID: <200006262013.e5QKDOP09679@lor.watermarkgroup.com>#1/1
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp

> On Mon, Jun 26, 2000 at 12:46:30PM -0400, Luoqi Chen wrote:
> > >      Compared with the use of tsleep(), mutexes have a number of
> > >      advantages:
> > > 
> > >      - Each mutex has its own wait (sleep) queue.  When a process releases
> > >        a mutex, it automatically schedules the next process waiting on the
> > >        queue.  This is more efficient than searching a possibly very long,
> > >        linear sleep queue.  It also avoids the flooding when multiple
> > >        processes get scheduled, and most of them have to go back to sleep
> > >        again.
> > > 
> > What about processes of different priorities blocking on the same mutex?
> > Would you do a linear search of the queue, keep the queue sorted by
> > priority, or is a FIFO queue good enough?
> 
> Processes that block on a mutex are granted the lock in FIFO order, rather
> than priority order.  In order to avoid priority inversion, the mutex wait
> queue implements priority lending.
> 
> Jason
> 
OK.  I remember reading somewhere that Solaris 7 gave up waking only one
thread when a mutex is released; it now wakes all the blocked threads.  It
seems that the "thundering herd" problem is not serious after all if the
lock granularity is fine enough.

-lq



From: jas...@canonware.com (Jason Evans)
Subject: Re: SMP meeting summary
Date: 2000/06/27
Message-ID: <20000626144957.J8965@blitz.canonware.com>#1/1
X-Deja-AN: 639304259
Approved: n...@camelot.de
References: <200006262013.e5QKDOP09679@lor.watermarkgroup.com>
Content-Type: text/plain; charset=us-ascii
X-Complaints-To: abuse@camelot.de
X-Trace: lancelot.camelot.de 962057006 6691 195.30.224.3 (26 Jun 2000 22:03:26 GMT)
Organization: Mail2News Gateway at CameloT Online Services
Mime-Version: 1.0
NNTP-Posting-Date: 26 Jun 2000 22:03:26 GMT
Newsgroups: muc.lists.freebsd.smp,mpc.lists.freebsd.smp

On Mon, Jun 26, 2000 at 04:13:24PM -0400, Luoqi Chen wrote:
> > Processes that block on a mutex are granted the lock in FIFO order, rather
> > than priority order.  In order to avoid priority inversion, the mutex wait
> > queue implements priority lending.
> >
> OK.  I remember reading somewhere that Solaris 7 gave up waking only one
> thread when a mutex is released; it now wakes all the blocked threads.  It
> seems that the "thundering herd" problem is not serious after all if the
> lock granularity is fine enough.

I don't think this is the case.  Solaris uses what are called turnstiles to
implement priority lending.  For a reasonably detailed explanation, see:

  http://www.sunworld.com/sunworldonline/swol-08-1999/swol-08-insidesolaris.html

My reading of this article is that turnstiles use priority lending to boost
the current owner(s) of a lock, but that subsequent lock granting is done
in priority order.

This lock granting behavior isn't strictly necessary, but it may have desirable
characteristics.  I haven't looked at the BSD/OS code in detail yet, but
according to Doug Rabson, it behaves in basically the same way.

Also, there is a book due out within the next several weeks that contains a
lot of good information about the Solaris kernel:

  Solaris Internals: Architecture and Techniques Vol. 1 Core Kernel Components
  by Jim Mauro, Richard McDougall
  ISBN: 0-13-022496-0

Jason


From: jas...@canonware.com (Jason Evans)
Subject: Re: SMP meeting summary
Date: 2000/06/27
Message-ID: <20000626151441.L8965@blitz.canonware.com>#1/1
X-Deja-AN: 639309467
Approved: n...@camelot.de
References: <200006262013.e5QKDOP09679@lor.watermarkgroup.com>
Content-Type: text/plain; charset=us-ascii
X-Complaints-To: abuse@camelot.de
X-Trace: lancelot.camelot.de 962057859 11300 195.30.224.3 (26 Jun 2000 22:17:39 GMT)
Organization: Mail2News Gateway at CameloT Online Services
Mime-Version: 1.0
NNTP-Posting-Date: 26 Jun 2000 22:17:39 GMT
Newsgroups: muc.lists.freebsd.smp,mpc.lists.freebsd.smp

On Mon, Jun 26, 2000 at 02:49:57PM -0700, Jason Evans wrote:
> On Mon, Jun 26, 2000 at 04:13:24PM -0400, Luoqi Chen wrote:
> > > Processes that block on a mutex are granted the lock in FIFO order, rather
> > > than priority order.  In order to avoid priority inversion, the mutex wait
> > > queue implements priority lending.
> > >
> > OK.  I remember reading somewhere that Solaris 7 gave up waking only one
> > thread when a mutex is released; it now wakes all the blocked threads.  It
> > seems that the "thundering herd" problem is not serious after all if the
> > lock granularity is fine enough.
> 
> I don't think this is the case.

Whoops.  The article is broken into two web pages, and the second page
states exactly what you said: as of Solaris 7, all waiting threads are
woken up.

Jason



From: Daniel Eischen <eisc...@vigrid.com>
Subject: Re: SMP meeting summary
Date: 2000/06/26
Message-ID: <Pine.SUN.3.91.1000626193709.15096A-100000@pcnet1.pcnet.com>#1/1
X-Deja-AN: 639344369
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <20000626151441.L8965@blitz.canonware.com>
Delivered-To: freebsd-...@freebsd.org
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Gateway: Unidirectional mail2news gateway at MPCS
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org

On 26 Jun 2000, Jason Evans wrote:

> On Mon, Jun 26, 2000 at 02:49:57PM -0700, Jason Evans wrote:
> > On Mon, Jun 26, 2000 at 04:13:24PM -0400, Luoqi Chen wrote:
> > > > Processes that block on a mutex are granted the lock in FIFO order, rather
> > > > than priority order.  In order to avoid priority inversion, the mutex wait
> > > > queue implements priority lending.
> > > >
> > > OK.  I remember reading somewhere that Solaris 7 gave up waking only one
> > > thread when a mutex is released; it now wakes all the blocked threads.  It
> > > seems that the "thundering herd" problem is not serious after all if the
> > > lock granularity is fine enough.
> > 
> > I don't think this is the case.
> 
> Whoops.  The article is broken into two web pages, and the second page
> states exactly what you said: as of Solaris 7, all waiting threads are
> woken up.

Yes, this confirms what Jim Mauro said in the Solaris Internals course
at USENIX.  Since mutexes are held only for very small amounts of time
and the kernel is sufficiently fine-grained, there was no advantage
to calling wake_one() as opposed to wake_all().  Obviously with these
semantics, the waiter with the highest priority should obtain the
mutex.  At least that was my recollection...

With regard to turnstiles, each kernel thread is born with its own
turnstile.  When it blocks on a mutex that doesn't have any waiters
(no turnstile allocated to it), it uses the thread's turnstile.  If
the mutex already has a turnstile (there are other waiters), then
the thread's turnstile is added to the system (per-CPU?) pool of
turnstiles.  When the thread wakes up and acquires the mutex, it
takes a turnstile back from the turnstile pool.  Turnstiles are
also used for read/write locks.
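
The turnstile bookkeeping described above can be sketched roughly as
follows.  This is a toy model, not Solaris code: all names are hypothetical,
a single global free list stands in for the pool, and letting the last
waiter walk away with the lock's own turnstile is an assumption made so that
every thread ends up holding exactly one turnstile again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structures; the real Solaris ones differ. */
struct turnstile {
    struct turnstile *next_free;  /* link while parked in the pool */
    int nwaiters;                 /* threads currently blocked here */
};

struct kthread { struct turnstile *ts; };   /* turnstile the thread owns */
struct lock    { struct turnstile *ts; };   /* NULL when nobody waits */

static struct turnstile *free_pool;         /* spare turnstiles */

/* Thread t is about to block on lock l. */
static void ts_block(struct lock *l, struct kthread *t)
{
    assert(t->ts != NULL);
    if (l->ts == NULL) {
        l->ts = t->ts;                /* first waiter lends its turnstile */
    } else {
        t->ts->next_free = free_pool; /* later waiters park theirs */
        free_pool = t->ts;
    }
    t->ts = NULL;
    l->ts->nwaiters++;
}

/* Thread t has been woken and leaves lock l's turnstile. */
static void ts_wakeup(struct lock *l, struct kthread *t)
{
    assert(t->ts == NULL && l->ts != NULL && l->ts->nwaiters > 0);
    if (--l->ts->nwaiters == 0) {
        t->ts = l->ts;                /* last waiter takes the lock's one */
        l->ts = NULL;
    } else {
        t->ts = free_pool;            /* any spare turnstile will do */
        free_pool = free_pool->next_free;
        t->ts->next_free = NULL;
    }
}
```

Turnstiles are interchangeable in this model, so a thread may leave with a
different turnstile than the one it was born with; only the count matters.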

-- 
Dan Eischen


From: Greg Lehey <g...@lemis.com>
Subject: Re: SMP meeting summary
Date: 2000/06/28
Message-ID: <20000628130031.B1760@sydney.worldwide.lemis.com>#1/1
X-Deja-AN: 639828639
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
References: <20000624235605.D8965@blitz.canonware.com> <Pine.SUN.3.91.1000625091445.2784A-100000@pcnet1.pcnet.com>
Delivered-To: freebsd-...@freebsd.org
WWW-Home-Page: http://www.lemis.com/~grog
X-Authentication-Warning: front.linuxcare.com.au: Host [203.17.0.42] claimed to be sydney.worldwide.lemis.com
Content-Type: text/plain; charset=us-ascii
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
X-Gateway: Unidirectional mail2news gateway at MPCS
Mobile: +61-418-838-708
Phone: +61-8-8388-8286
Mime-Version: 1.0
User-Agent: Mutt/1.2i
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
Content-Disposition: inline
X-Loop: FreeBSD.org
Fax: +61-8-8388-8725

On Sunday, 25 June 2000 at  9:58:27 -0400, Daniel Eischen wrote:
> On 24 Jun 2000, Jason Evans wrote:
>> 3. The BSD/OS solution.
>>
>>    - The BGL remains, but becomes increasingly meaningless.  In particular,
>>      it is not always necessary to obtain it in order to enter the kernel.
>>
>>    - Instead the system protects shared data structures with mutexes.
>>      These mutexes replace calls to tsleep() when waiting on shared
>>      resources (the involuntary process synchronization mentioned above).
>>      In contrast to traditional UNIX, mutexes will be used much more
>>      frequently in order to protect data structures which were previously
>>      implicitly protected by the non-preemptive nature of the kernel.  This
>>      mechanism will replace calls to tsleep() for involuntary context
>>      switches.
>>
>>      Compared with the use of tsleep(), mutexes have a number of
>>      advantages:
>>
>>      - Each mutex has its own wait (sleep) queue.  When a process releases
>>        a mutex, it automatically schedules the next process waiting on the
>>        queue.  This is more efficient than searching a possibly very long,
>>        linear sleep queue.  It also avoids the flooding when multiple
>>        processes get scheduled, and most of them have to go back to sleep
>>        again.
>>
>>      - Mutexes can be a combination of spin and sleep mutexes: for a
>>        resource which may be held only for a very short period of time,
>>        even the overhead of sleeping and rescheduling may be higher than
>>        waiting in a tight loop.  A spin/sleep lock might first wait in a
>>        tight loop for 2 microseconds and then sleep if the lock is still
>>        not available at that time.  This is an issue which Sun has
>>        investigated in great detail with Solaris.  BSDi has not pursued
>>        this yet, though the BSD/OS threading primitives make this an easy
>>        extension to add.  It's possibly an area for us to investigate once
>>        the system is up and limping again.
>
> If anyone is interested...
>
> All high-level interrupts (levels 11-15, mostly PIO serial interrupts)
> in Solaris use spin mutexes and don't use an interrupt thread.  They
> execute in the context of the thread that was currently running.  All
> other interrupts below level 11 (clock, network, disk, etc) use interrupt
> threads.
>
> A Solaris (non-spinning) mutex will only spin while the owning thread is
> running.  Since BSDi mutexes have owners (correct me if I'm wrong), this
> seems to be better than arbitrarily spinning.

Mutexes only have owners when they're being held.  But we won't spin
for any length of time on a mutex; that's why we have a thread context
for the interrupts.

>>    - BSD/OS also does not implement read/write locks, a thing that Linux
>>      has recently introduced.  We didn't discuss this further, but the
>>      concepts are well understood, and it should be relatively simple to
>>      implement them if we find a need.
>
> Mutexes are only used in Solaris when they will be held for very small
> amounts of time.  Read/write locks and semaphores are used for all
> other instances.  While we are modifying the kernel to add mutexes,
> it would probably be worthwhile to comment those sections of code
> that could hold mutexes for something other than a very short period
> of time.  Or even use a different naming convention for those mutexes.

Agreed, I don't like the terminology we seem to have settled on.  From
my way of thinking, a mutex is a spin lock, and a semaphore is a
blocking lock.  What we're talking about here are really semaphores,
though it makes sense to spin a bit first before blocking in the case
that the lock may be released quickly: it takes a fair amount of
overhead to schedule, and if there's a good chance the lock will be
available by the time we've scheduled, there's no point in blocking
immediately.  One of the things I want to do further down the line is
to instrument some statistics on the semaphores^H^H^Hnmutexes so we
can decide what kind we need where (and when).

Greg
--
Finger g...@lemis.com for PGP public key
See complete headers for address and phone numbers



From: jas...@canonware.com (Jason Evans)
Subject: Re: SMP meeting summary
Date: 2000/06/28
Message-ID: <20000627202557.S15267@blitz.canonware.com>#1/1
X-Deja-AN: 639836730
Approved: n...@camelot.de
References: <20000624235605.D8965@blitz.canonware.com>
Content-Type: text/plain; charset=us-ascii
X-Complaints-To: abuse@camelot.de
X-Trace: lancelot.camelot.de 962162905 18882 195.30.224.3 (28 Jun 2000 03:28:25 GMT)
Organization: Mail2News Gateway at CameloT Online Services
Mime-Version: 1.0
NNTP-Posting-Date: 28 Jun 2000 03:28:25 GMT
Newsgroups: muc.lists.freebsd.smp,mpc.lists.freebsd.smp

On Wed, Jun 28, 2000 at 01:00:31PM +1000, Greg Lehey wrote:
> On Sunday, 25 June 2000 at  9:58:27 -0400, Daniel Eischen wrote:
> > On 24 Jun 2000, Jason Evans wrote:
> >>    - BSD/OS also does not implement read/write locks, a thing that Linux
> >>      has recently introduced.  We didn't discuss this further, but the
> >>      concepts are well understood, and it should be relatively simple to
> >>      implement them if we find a need.
> >
> > Mutexes are only used in Solaris when they will be held for very small
> > amounts of time.  Read/write locks and semaphores are used for all
> > other instances.  While we are modifying the kernel to add mutexes,
> > it would probably be worthwhile to comment those sections of code
> > that could hold mutexes for something other than a very short period
> > of time.  Or even use a different naming convention for those mutexes.
> 
> Agreed, I don't like the terminology we seem to have settled on.  From
> my way of thinking, a mutex is a spin lock, and a semaphore is a
> blocking lock.  What we're talking about here are really semaphores,
> though it makes sense to spin a bit first before blocking in the case
> that the lock may be released quickly: it takes a fair amount of
> overhead to schedule, and if there's a good chance the lock will be
> available by the time we've scheduled, there's no point in blocking
> immediately.  One of the things I want to do further down the line is
> to instrument some statistics on the semaphores^H^H^Hnmutexes so we
> can decide what kind we need where (and when).

Mutexes come in different flavors.  From an API perspective, whether the
mutex spins, blocks, or is adaptive isn't visible.

A semaphore is significantly different.  It can be used in place of a
mutex, but it has additional functionality.  A POSIX semaphore has a count
associated with it.  Other definitions of semaphores generally have a count
associated with the semaphore as well, though there may be more
functionality than POSIX semaphores provide.  Posting (incrementing) the
semaphore always succeeds, but waiting on (decrementing) the semaphore will
spin/block until the decrement operation can be completed without the
semaphore value becoming negative.

So, both mutexes and semaphores can be implemented as
spinning/blocking/adaptive.
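
As a rough illustration of the counting semantics, here is a sketch of a
purely spinning semaphore built on C11 atomics.  The names (struct spin_sem
and the ssem_* functions) are made up for this example and are not the
POSIX API; a blocking or adaptive variant would keep the same interface and
replace only the busy loop in ssem_wait().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinning semaphore; the count never goes negative. */
struct spin_sem { atomic_int count; };

static void ssem_init(struct spin_sem *s, int initial)
{
    assert(initial >= 0);
    atomic_init(&s->count, initial);
}

/* Posting (incrementing) always succeeds. */
static void ssem_post(struct spin_sem *s)
{
    atomic_fetch_add(&s->count, 1);
}

/* One attempt to decrement; fails rather than letting the count go < 0. */
static bool ssem_trywait(struct spin_sem *s)
{
    int c = atomic_load(&s->count);
    while (c > 0) {
        /* On failure the exchange reloads c with the current value. */
        if (atomic_compare_exchange_weak(&s->count, &c, c - 1))
            return true;
    }
    return false;
}

/* Waiting spins until the decrement can be completed. */
static void ssem_wait(struct spin_sem *s)
{
    while (!ssem_trywait(s))
        ;   /* busy-wait; a blocking version would sleep here instead */
}
```

Initialized with a count of 1, such a semaphore can stand in for a mutex;
larger initial counts admit that many holders at once.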

Jason



From: eisc...@vigrid.com (Daniel Eischen)
Subject: Re: SMP meeting summary
Date: 2000/06/28
Message-ID: <Pine.SUN.3.91.1000627230450.18557A-100000@pcnet1.pcnet.com>#1/1
X-Deja-AN: 639836732
Approved: n...@camelot.de
References: <20000628130031.B1760@sydney.worldwide.lemis.com>
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Complaints-To: abuse@camelot.de
X-Trace: lancelot.camelot.de 962163069 19040 195.30.224.3 (28 Jun 2000 03:31:09 GMT)
Organization: Mail2News Gateway at CameloT Online Services
Mime-Version: 1.0
NNTP-Posting-Date: 28 Jun 2000 03:31:09 GMT
Newsgroups: muc.lists.freebsd.smp,mpc.lists.freebsd.smp

On Wed, 28 Jun 2000, Greg Lehey wrote:
> On Sunday, 25 June 2000 at  9:58:27 -0400, Daniel Eischen wrote:
> > A Solaris (non-spinning) mutex will only spin while the owning thread is
> > running.  Since BSDi mutexes have owners (correct me if I'm wrong), this
> > seems to be better than arbitrarily spinning.
> 
> Mutexes only have owners when they're being held.  But we won't spin
> for any length of time on a mutex; that's why we have a thread context
> for the interrupts.

Right, I didn't think mutexes would have owners if they were not
locked ;-)

> 
> >>    - BSD/OS also does not implement read/write locks, a thing that Linux
> >>      has recently introduced.  We didn't discuss this further, but the
> >>      concepts are well understood, and it should be relatively simple to
> >>      implement them if we find a need.
> >
> > Mutexes are only used in Solaris when they will be held for very small
> > amounts of time.  Read/write locks and semaphores are used for all
> > other instances.  While we are modifying the kernel to add mutexes,
> > it would probably be worthwhile to comment those sections of code
> > that could hold mutexes for something other than a very short period
> > of time.  Or even use a different naming convention for those mutexes.
> 
> Agreed, I don't like the terminology we seem to have settled on.  From
> my way of thinking, a mutex is a spin lock, and a semaphore is a
> blocking lock.  What we're talking about here are really semaphores,
> though it makes sense to spin a bit first before blocking in the case
> that the lock may be released quickly: it takes a fair amount of
> overhead to schedule, and if there's a good chance the lock will be
> available by the time we've scheduled, there's no point in blocking
> immediately.

It doesn't make sense to spin if the lock holder is not runnable,
especially on a single CPU system.  In order to make the owning
thread runnable, you've got to take the scheduling queue lock
and there has to be a context switch anyways.  You might as well
get ready to place the blocking thread on the sleep queue.  If
after acquiring (or while spinning on) the sleep queue lock, 
the owning thread becomes runnable, you can back out of the sleep
queue insertion.
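
The decision being argued for can be written down as a small function.
This is only a sketch under assumed names (struct thread, struct amutex,
adaptive_decide); it models the policy, not the locking needed to read the
owner's state safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a thread that failed to get the mutex should do next. */
enum wait_action { WAIT_RETRY, WAIT_SPIN, WAIT_SLEEP };

/* Hypothetical, minimal views of a thread and an adaptive mutex. */
struct thread { bool on_cpu; };          /* true while running on a CPU */
struct amutex { struct thread *owner; }; /* NULL when the mutex is free */

static enum wait_action adaptive_decide(const struct amutex *m, int ncpus)
{
    assert(ncpus >= 1);
    if (m->owner == NULL)
        return WAIT_RETRY;    /* released in the meantime: try again */
    if (ncpus > 1 && m->owner->on_cpu)
        return WAIT_SPIN;     /* owner is making progress on another CPU */
    return WAIT_SLEEP;        /* owner cannot release until it is scheduled */
}
```

A caller would re-evaluate this after taking the sleep queue lock, so that
it can back out of the insertion if the answer has changed from WAIT_SLEEP.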

> One of the things I want to do further down the line is
> to instrument some statistics on the semaphores^H^H^Hnmutexes so we
> can decide what kind we need where (and when).

Great!  Sounds like Solaris lockstat(1).

> Greg
> --
> Finger g...@lemis.com for PGP public key
> See complete headers for address and phone numbers

-- 
Dan Eischen



From: Greg Lehey <g...@lemis.com>
Subject: Re: SMP meeting summary
Date: 2000/06/28
Message-ID: <20000628145955.A2209@sydney.worldwide.lemis.com>#1/1
X-Deja-AN: 639856071
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
References: <20000624235605.D8965@blitz.canonware.com> <Pine.SUN.3.91.1000625091445.2784A-100000@pcnet1.pcnet.com> <20000628130031.B1760@sydney.worldwide.lemis.com> <20000627202557.S15267@blitz.canonware.com>
Delivered-To: freebsd-...@freebsd.org
WWW-Home-Page: http://www.lemis.com/~grog
X-Authentication-Warning: front.linuxcare.com.au: Host [203.17.0.42] claimed to be sydney.worldwide.lemis.com
Content-Type: text/plain; charset=us-ascii
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
X-Gateway: Unidirectional mail2news gateway at MPCS
Mobile: +61-418-838-708
Phone: +61-8-8388-8286
Mime-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org
Fax: +61-8-8388-8725

On Tuesday, 27 June 2000 at 20:25:57 -0700, Jason Evans wrote:
> On Wed, Jun 28, 2000 at 01:00:31PM +1000, Greg Lehey wrote:
>> On Sunday, 25 June 2000 at  9:58:27 -0400, Daniel Eischen wrote:
>>> On 24 Jun 2000, Jason Evans wrote:
>>>>    - BSD/OS also does not implement read/write locks, a thing that Linux
>>>>      has recently introduced.  We didn't discuss this further, but the
>>>>      concepts are well understood, and it should be relatively simple to
>>>>      implement them if we find a need.
>>>
>>> Mutexes are only used in Solaris when they will be held for very small
>>> amounts of time.  Read/write locks and semaphores are used for all
>>> other instances.  While we are modifying the kernel to add mutexes,
>>> it would probably be worthwhile to comment those sections of code
>>> that could hold mutexes for something other than a very short period
>>> of time.  Or even use a different naming convention for those mutexes.
>>
>> Agreed, I don't like the terminology we seem to have settled on.  From
>> my way of thinking, a mutex is a spin lock, and a semaphore is a
>> blocking lock.  What we're talking about here are really semaphores,
>> though it makes sense to spin a bit first before blocking in the case
>> that the lock may be released quickly: it takes a fair amount of
>> overhead to schedule, and if there's a good chance the lock will be
>> available by the time we've scheduled, there's no point in blocking
>> immediately.  One of the things I want to do further down the line is
>> to instrument some statistics on the semaphores^H^H^Hnmutexes so we
>> can decide what kind we need where (and when).
>
> Mutexes come in different flavors.  From an API perspective, whether the
> mutex spins, blocks, or is adaptive isn't visible.
>
> A semaphore is significantly different.  It can be used in place of a
> mutex, but it has additional functionality.  A POSIX semaphore has a count
> associated with it.  Other definitions of semaphores generally have a count
> associated with the semaphore as well, though there may be more
> functionality than POSIX semaphores provide.  Posting (incrementing) the
> semaphore always succeeds, but waiting on (decrementing) the semaphore will
> spin/block until the decrement operation can be completed without the
> semaphore value becoming negative.

Hmm.  I haven't seen the POSIX definition, but Dijkstra's semaphores
had a count and two main operations, P and V.  P decrements, and if
the semaphore counter goes negative, the process is placed on the
semaphore sleep queue.  V increments, and if the counter is still not
positive, the first process on the sleep queue is scheduled.

The mutexes we're looking at here are a degenerate case of semaphores:
instead of a count, we have a flag (in fact, a predicate) that says
whether the semaphore is held or not.  That's effectively a semaphore
with an initial counter value of 1.
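
For reference, a small single-threaded model of the textbook P and V
operations is sketched below.  The names (struct dsem, dsem_P, dsem_V) and
the fixed-size queue of process IDs are assumptions for illustration.  In
this formulation a negative count records how many processes are queued, V
hands off to the oldest waiter whenever the incremented count is still not
positive, and a semaphore used as a mutex starts with a count of 1.

```c
#include <assert.h>
#include <stdbool.h>

#define DSEM_MAXWAIT 8

/* Hypothetical model: "processes" are just integer IDs. */
struct dsem {
    int count;                  /* < 0 means -count processes are queued */
    unsigned head, tail;        /* FIFO indices into queue[] */
    int queue[DSEM_MAXWAIT];    /* sleeping process IDs, oldest first */
};

/* P: decrement; returns true if the caller may proceed, false if queued. */
static bool dsem_P(struct dsem *s, int pid)
{
    s->count--;
    if (s->count < 0) {
        assert(s->tail - s->head < DSEM_MAXWAIT);
        s->queue[s->tail++ % DSEM_MAXWAIT] = pid;
        return false;
    }
    return true;
}

/* V: increment; returns the ID of the process to schedule, or -1. */
static int dsem_V(struct dsem *s)
{
    s->count++;
    if (s->count <= 0)
        return s->queue[s->head++ % DSEM_MAXWAIT];
    return -1;
}
```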

> So, both mutexes and semaphores can be implemented as
> spinning/blocking/adaptive.

I was pretty sure that a semaphore was always blocking.

Greg
--
Finger g...@lemis.com for PGP public key
See complete headers for address and phone numbers



From: Greg Lehey <g...@lemis.com>
Subject: Re: SMP meeting summary
Date: 2000/06/28
Message-ID: <20000628151149.B2209@sydney.worldwide.lemis.com>#1/1
X-Deja-AN: 639858569
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
References: <Pine.SUN.3.91.1000625091445.2784A-100000@pcnet1.pcnet.com> <200006251736.KAA09884@usr02.primenet.com> <200006260442.WAA15731@nomad.yogotech.com>
Delivered-To: freebsd-...@freebsd.org
WWW-Home-Page: http://www.lemis.com/~grog
X-Authentication-Warning: front.linuxcare.com.au: Host [203.17.0.42] claimed to be sydney.worldwide.lemis.com
Content-Type: text/plain; charset=us-ascii
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
X-Gateway: Unidirectional mail2news gateway at MPCS
Mobile: +61-418-838-708
Phone: +61-8-8388-8286
Mime-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org
Fax: +61-8-8388-8725

On Sunday, 25 June 2000 at 22:42:02 -0600, Nate Williams wrote:
>> Dynix had no problem with 32 processors.  Most SVR4 variants, and
>> I will include Solaris in this, use mutex protection of structures,
>> and start to fall down drastically over 4 processors.
>
> Amazing that you say this, yet I see extremely good results on
> Solaris boxes up to 64 processors.

Yes, I was wondering about this statement too.  As usual, it probably
depends on what you're doing.  Terry seems to know Dynix pretty well,
so I wouldn't be surprised to hear that this statement originated
there.

> Suffice it to say that I'm not convinced, nor am I convinced that
> mutexes around data structures are any different from critical
> sectioning.

I'm convinced that they're different.  The real issue is which is
better, and I tend towards locking data structures.  But Terry, go
ahead and prove us wrong if you want.  I won't mind.

> They are essentially the same thing, in that the critical section is
> almost always the code that deals with a particular (shared) data
> structure.

That's a degenerate case, of course.

Greg
--
Finger g...@lemis.com for PGP public key
See complete headers for address and phone numbers



From: Terry Lambert <tlamb...@primenet.com>
Subject: Re: SMP meeting summary
Date: 2000/06/28
Message-ID: <200006282315.QAA03731@usr08.primenet.com>#1/1
X-Deja-AN: 640187784
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Content-Transfer-Encoding: 7bit
References: <200006260442.WAA15731@nomad.yogotech.com>
Delivered-To: freebsd-...@freebsd.org
Content-Type: text/plain; charset=us-ascii
X-Gateway: Unidirectional mail2news gateway at MPCS
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org

> > Dynix had no problem with 32 processors.  Most SVR4 variants, and
> > I will include Solaris in this, use mutex protection of structures,
> > and start to fall down drastically over 4 processors.
> 
> Amazing that you say this, yet I see extremely good results on Solaris
> boxes up to 64 processors.

Boxes or clusters?  NUMA or non-NUMA?  MESI or MEI cache coherency?


> Suffice it to say that I'm not convinced, nor am I convinced that
> mutexes around data structures are any different from critical
> sectioning.
> 
> They are essentially the same thing, in that the critical section is
> almost always the code that deals with a particular (shared) data
> structure.

I can put you in touch with Sabsovitch or Leventhal if you need the
people who actually wrote the code, rather than someone who has only
read and modified the code, if you think that authority lends
credibility.

A SPARCCenter is not the same thing as an Intel box running SMP,
and is not even the same as a SPARC-20 running multiple processors
or a Sparc 5 with a Weitek processor add-in.


					Terry Lambert
					te...@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.



From: Greg Lehey <g...@lemis.com>
Subject: Re: SMP meeting summary
Date: 2000/07/03
Message-ID: <20000703114535.T39024@wantadilla.lemis.com>#1/1
X-Deja-AN: 641671101
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
References: <20000626151441.L8965@blitz.canonware.com> <Pine.SUN.3.91.1000626193709.15096A-100000@pcnet1.pcnet.com>
Delivered-To: freebsd-...@freebsd.org
WWW-Home-Page: http://www.lemis.com/~grog
Content-Type: text/plain; charset=us-ascii
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
X-Gateway: Unidirectional mail2news gateway at MPCS
Mobile: +61-418-838-708
Phone: +61-8-8388-8286
Mime-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org
Fax: +61-8-8388-8725

On Monday, 26 June 2000 at 20:00:09 -0400, Daniel Eischen wrote:
> On 26 Jun 2000, Jason Evans wrote:
>
>> On Mon, Jun 26, 2000 at 02:49:57PM -0700, Jason Evans wrote:
>>> On Mon, Jun 26, 2000 at 04:13:24PM -0400, Luoqi Chen wrote:
>>>>> Processes that block on a mutex are granted the lock in FIFO order, rather
>>>>> than priority order.  In order to avoid priority inversion, the mutex wait
>>>>> queue implements priority lending.
>>>>>
>>>> OK.  I remember reading somewhere that Solaris 7 gave up waking only one
>>>> thread when a mutex is released; it now wakes all the blocked threads.  It
>>>> seems that the "thundering herd" problem is not serious after all if the
>>>> lock granularity is fine enough.
>>>
>>> I don't think this is the case.
>>
>> Whoops.  The article is broken into two web pages, and the second page
>> states exactly what you said: as of Solaris 7, all waiting threads are
>> woken up.
>
> Yes, this confirms what Jim Mauro said in the Solaris Internals course
> at USENIX.  Since mutexes are held only for very small amounts of time
> and the kernel is sufficiently fine-grained, there was no advantage
> to calling wake_one() as opposed to wake_all().  Obviously with these
> semantics, the waiter with the highest priority should obtain the
> mutex.  At least that was my recollection...

I find this rather strange.  There can be many reasons to take a
mutex, and not all of them have to be fast.  Even in the case where
they are, it doesn't seem to be of any value to wake more processes
than can take the mutex.  From
http://www.sunworld.com/sunworldonline/swol-08-1999/swol-08-insidesolaris-2.html:

   Sun engineering coded the turnstile_wakeup() in Solaris 7 in a
   generic enough way so that a single thread wakeup could be
   executed, instead of all threads inevitably waking up
   together. Exhaustive testing under a variety of different loads has
   shown that, in practice, we very rarely end up with a large
   blocking chain of threads, and thus almost never run into the
   thundering herd problem. The wakeup-all implementation also solves
   some bit synchronization issues that make a wakeup-one scenario
   tricky.

This seems like a less honest way of saying "We couldn't figure out
how to avoid race conditions on wakeup, and so far nobody has been
able to point to a thundering herd".  I'd need some convincing.

Greg
--
Finger g...@lemis.com for PGP public key
See complete headers for address and phone numbers


From: Daniel Eischen <eisc...@vigrid.com>
Subject: Re: SMP meeting summary
Date: 2000/07/03
Message-ID: <Pine.SUN.3.91.1000703060948.5216A-100000@pcnet1.pcnet.com>#1/1
X-Deja-AN: 641769245
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <20000703114535.T39024@wantadilla.lemis.com>
Delivered-To: freebsd-...@freebsd.org
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Gateway: Unidirectional mail2news gateway at MPCS
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org

On Mon, 3 Jul 2000, Greg Lehey wrote:
> On Monday, 26 June 2000 at 20:00:09 -0400, Daniel Eischen wrote:
> > Yes, this confirms what Jim Mauro said in the Solaris Internals course
> > at USENIX.  Since mutexes are held only for very small amounts of time
> > and the kernel is sufficiently fine-grained, there was no advantage
> > to calling wake_one() as opposed to wake_all().  Obviously with these
> > semantics, the waiter with the highest priority should obtain the
> > mutex.  At least that was my recollection...
> 
> I find this rather strange.  There can be many reasons to take a
> mutex, and not all of them have to be fast.  Even in the case where
> they are, it doesn't seem to be of any value to wake more processes
> than can take the mutex.  From
> http://www.sunworld.com/sunworldonline/swol-08-1999/swol-08-insidesolaris-2.html:
> 
>    Sun engineering coded the turnstile_wakeup() in Solaris 7 in a
>    generic enough way so that a single thread wakeup could be
>    executed, instead of all threads inevitably waking up
>    together. Exhaustive testing under a variety of different loads has
>    shown that, in practice, we very rarely end up with a large
>    blocking chain of threads, and thus almost never run into the
>    thundering herd problem. The wakeup-all implementation also solves
>    some bit synchronization issues that make a wakeup-one scenario
>    tricky.
> 
> This seems like a less honest way of saying "We couldn't figure out
> how to avoid race conditions on wakeup, and so far nobody has been
> able to point to a thundering herd".  I'd need some convincing.

Well if you are considering spinning for a bit of time on a held
mutex (which you seem to advocate?), then why not wake everyone?
If mutexes are held for very short periods of time and you don't
often have a thundering herd problem, then waking everyone is
an optimization since you only have to take the scheduling lock
once.  If mutexes can be held for long periods of time, then you
probably wouldn't want to wake everyone.

-- 
Dan Eischen


From: Greg Lehey <g...@lemis.com>
Subject: Re: SMP meeting summary
Date: 2000/07/03
Message-ID: <20000703200039.H62680@wantadilla.lemis.com>#1/1
X-Deja-AN: 641770783
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
References: <20000703114535.T39024@wantadilla.lemis.com> <Pine.SUN.3.91.1000703060948.5216A-100000@pcnet1.pcnet.com>
Delivered-To: freebsd-...@freebsd.org
WWW-Home-Page: http://www.lemis.com/~grog
Content-Type: text/plain; charset=us-ascii
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
X-Gateway: Unidirectional mail2news gateway at MPCS
Mobile: +61-418-838-708
Phone: +61-8-8388-8286
Mime-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org
Fax: +61-8-8388-8725

On Monday,  3 July 2000 at  6:23:28 -0400, Daniel Eischen wrote:
> On Mon, 3 Jul 2000, Greg Lehey wrote:
>> On Monday, 26 June 2000 at 20:00:09 -0400, Daniel Eischen wrote:
>>> Yes, this confirms what Jim Mauro said in the Solaris Internals course
>>> at USENIX.  Since mutexes are held only for very small amounts of time
>>> and the kernel is sufficiently fine-grained, there was no advantage
>>> to calling wake_one() as opposed to wake_all().  Obviously with these
>>> semantics, the waiter with the highest priority should obtain the
>>> mutex.  At least that was my recollection...
>>
>> I find this rather strange.  There can be many reasons to take a
>> mutex, and not all of them have to be fast.  Even in the case where
>> they are, it doesn't seem to be of any value to wake more processes
>> than can take the mutex.  From
>> http://www.sunworld.com/sunworldonline/swol-08-1999/swol-08-insidesolaris-2.html:
>>
>>    Sun engineering coded the turnstile_wakeup() in Solaris 7 in a
>>    generic enough way so that a single thread wakeup could be
>>    executed, instead of all threads inevitably waking up
>>    together. Exhaustive testing under a variety of different loads has
>>    shown that, in practice, we very rarely end up with a large
>>    blocking chain of threads, and thus almost never run into the
>>    thundering herd problem. The wakeup-all implementation also solves
>>    some bit synchronization issues that make a wakeup-one scenario
>>    tricky.
>>
>> This seems like a less honest way of saying "We couldn't figure out
>> how to avoid race conditions on wakeup, and so far nobody has been
>> able to point to a thundering herd".  I'd need some convincing.
>
> Well if you are considering spinning for a bit of time on a held
> mutex (which you seem to advocate?), then why not wake everyone?

Because it doesn't buy us anything.

> If mutexes are held for very short periods of time and you don't
> often have a thundering herd problem,

That's an assumption.  So far we have *never* had a thundering herd,
because the code doesn't work yet.

> then waking everyone is an optimization since you only have to take
> the scheduling lock once.

No.  If I understand things correctly, each process would need to get
the schedlock, and only one process can get the mutex.  Why wake the
rest?  What do you want them to do?  This applies even in the case of
a counting semaphore (of which our "mutex" is a special case), since
if any slots are available, the process wouldn't be sleeping.
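
The counting-semaphore point can be illustrated with a small model in
C.  This is only a sketch: struct csem and its two functions are
hypothetical names, not BSD/OS or FreeBSD code, and sleeping is modeled
as a counter rather than a real context switch.  It shows that a
sleeper can only exist while no slots are free, so a release frees
exactly one slot and has work for at most one sleeper.

```c
#include <assert.h>

/* Hypothetical counting semaphore model; a mutex is the slots == 1 case. */
struct csem {
    int slots;      /* free slots */
    int sleepers;   /* processes asleep waiting for a slot */
};

/* Returns 1 if a slot was taken, 0 if the caller had to go to sleep. */
static int
csem_acquire(struct csem *s)
{
    if (s->slots > 0) {
        s->slots--;
        return 1;
    }
    s->sleepers++;
    return 0;
}

/*
 * Release one slot.  Returns the number of sleepers that can make
 * progress as a result, which is never more than one.
 */
static int
csem_release(struct csem *s)
{
    /* Nobody sleeps while a slot is free. */
    assert(s->sleepers == 0 || s->slots == 0);
    if (s->sleepers > 0) {
        s->sleepers--;      /* hand the freed slot to one sleeper */
        return 1;
    }
    s->slots++;
    return 0;
}
```

With two slots and four acquirers, the third and fourth go to sleep,
and each later release can satisfy only one of them.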

Greg
--
Finger g...@lemis.com for PGP public key
See complete headers for address and phone numbers


From: "Jeroen C. van Gelderen" <jer...@vangelderen.org>
Subject: Re: SMP meeting summary
Date: 2000/07/03
Message-ID: <3960A971.982DDF07@vangelderen.org>#1/1
X-Deja-AN: 641852676
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
Content-Transfer-Encoding: 7bit
References: <20000703114535.T39024@wantadilla.lemis.com> <Pine.SUN.3.91.1000703060948.5216A-100000@pcnet1.pcnet.com> <20000703200039.H62680@wantadilla.lemis.com>
Delivered-To: freebsd-...@freebsd.org
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
X-Gateway: Unidirectional mail2news gateway at MPCS
MIME-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org

Greg Lehey wrote:
[...]
> That's an assumption.  So far we have *never* had a thundering herd,
> because the code doesn't work yet.

Your position is an assumption too. The difference is that 
one usually doesn't optimize until one has profiling 
information available. Am I correct in assuming that you
haven't done any profiling yet? Am I correct in assuming
that wake_one is an optimization?

> > then waking everyone is an optimization since you only have to take
> > the scheduling lock once.
> 
> No.  If I understand things correctly, each process would need to get
> the schedlock, and only one process can get the mutex.  Why wake the
> rest?  What do you want them to do?  

If -on average- there is only one process waiting you don't 
want to go through the trouble of implementing a more complex
wake_one. It would only complicate the code with negligible
gain. 

That's my reading of Sun's claims in Solaris and given that 
they have a little more experience with this kind of thing 
I'm inclined to believe them until I see facts stating the 
contrary.

Cheers,
Jeroen
-- 
Jeroen C. van Gelderen          o      _     _         _
jer...@vangelderen.org  _o     /\_   _ \\o  (_)\__/o  (_)
                      _< \_   _>(_) (_)/<_    \_| \   _|/' \/
                     (_)>(_) (_)        (_)   (_)    (_)'  _\o_

From: c...@bsdi.com (Chuck Paterson)
Subject: Re: SMP meeting summary
Date: 2000/07/03
Message-ID: <8jqcmf$2e9p$1@FreeBSD.csie.NCTU.edu.tw>#1/1
X-Deja-AN: 641870330
X-Trace: FreeBSD.csie.NCTU.edu.tw 962639375 80186 140.113.209.200 (3 Jul 2000 15:49:35 GMT)
Organization: NCTU CSIE FreeBSD Server
NNTP-Posting-Date: 3 Jul 2000 15:49:35 GMT
Newsgroups: mailing.freebsd.smp
X-Complaints-To: usenet@FreeBSD.csie.NCTU.edu.tw

}That's my reading of Sun's claims in Solaris and given that 
}they have a little more experience with this kind of thing 
}I'm inclined to believe them until I see facts stating the 
}contrary.

I would caution against using Solaris to draw too detailed conclusions.
The locking in Solaris is finer grained than we are likely to
achieve for some time. Also, having per-processor run queues and
all the associated machinery to support them makes Solaris's
characteristics quite different from what we have today. As time goes
on we will have to make decisions on the number of processors we want
to support most efficiently. The answer for our problem set may be
quite different from what Sun arrived at for their problem set.
While I have no specific knowledge that this is true, I would not
be surprised if the Solaris machine dependent implementation differs
between Sparc and X86 with only Sparc being reported.

Chuck

From: Greg Lehey <g...@lemis.com>
Subject: Re: SMP meeting summary
Date: 2000/07/04
Message-ID: <20000704083822.A65029@wantadilla.lemis.com>#1/1
X-Deja-AN: 642029602
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
References: <20000703114535.T39024@wantadilla.lemis.com> <Pine.SUN.3.91.1000703060948.5216A-100000@pcnet1.pcnet.com> <20000703200039.H62680@wantadilla.lemis.com> <3960A971.982DDF07@vangelderen.org>
Delivered-To: freebsd-...@freebsd.org
WWW-Home-Page: http://www.lemis.com/~grog
Content-Type: text/plain; charset=us-ascii
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
X-Gateway: Unidirectional mail2news gateway at MPCS
Mobile: +61-418-838-708
Phone: +61-8-8388-8286
Mime-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org
Fax: +61-8-8388-8725

On Monday,  3 July 2000 at 10:55:45 -0400, Jeroen C. van Gelderen wrote:
> Greg Lehey wrote:
> [...]
>> That's an assumption.  So far we have *never* had a thundering herd,
>> because the code doesn't work yet.
>
> Your position is an assumption too. The difference is that
> one usually doesn't optimize until one has profiling
> information available. Am I correct in assuming that you
> haven't done any profiling yet? Am I correct in assuming
> that wake_one is an optimization?

You're not correct in your implied assumption that we can see any
potential problems with wake_one.

>>> then waking everyone is an optimization since you only have to take
>>> the scheduling lock once.
>>
>> No.  If I understand things correctly, each process would need to get
>> the schedlock, and only one process can get the mutex.  Why wake the
>> rest?  What do you want them to do?
>
> If -on average- there is only one process waiting you don't
> want to go through the trouble of implementing a more complex
> wake_one. It would only complicate the code with negligible
> gain.

There's nothing to say that wake_one is more complex.  wake_one takes
the first process on the mutex's sleep list and wakes it.  wake_all
(or whatever) would make a loop out of that wake function and wake all
the processes on the list.  All would then be scheduled, try to take
the mutex, and all except one would fail and be put back on the sleep
list.  Does this make sense?
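
A minimal sketch of the two policies in C, to make the comparison
concrete.  The structures and function bodies here are hypothetical
and are not taken from BSD/OS; the sketch models only the worst case
described above, where every woken process except one fails to take
the mutex and is put back on the sleep list.  It counts wakeups and
says nothing about how often that case arises in practice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical waiter record, not a real kernel structure. */
struct waiter {
    struct waiter *next;
    int wakeups;                /* times this waiter was made runnable */
};

struct sleep_list {
    struct waiter *head;
    struct waiter *tail;
};

static void
sl_enqueue(struct sleep_list *sl, struct waiter *w)
{
    w->next = NULL;
    if (sl->tail != NULL)
        sl->tail->next = w;
    else
        sl->head = w;
    sl->tail = w;
}

static struct waiter *
sl_dequeue(struct sleep_list *sl)
{
    struct waiter *w = sl->head;

    if (w != NULL) {
        sl->head = w->next;
        if (sl->head == NULL)
            sl->tail = NULL;
        w->next = NULL;
    }
    return w;
}

/* Wake only the first waiter on the list.  Returns the number woken. */
static int
wake_one(struct sleep_list *sl)
{
    struct waiter *w = sl_dequeue(sl);

    if (w == NULL)
        return 0;
    w->wakeups++;
    return 1;
}

/*
 * The same wake step in a loop.  The first waiter woken wins the
 * mutex; the others fail and go back on the sleep list (worst case).
 */
static int
wake_all(struct sleep_list *sl)
{
    struct sleep_list losers = { NULL, NULL };
    struct waiter *w;
    int woken = 0;

    while ((w = sl_dequeue(sl)) != NULL) {
        w->wakeups++;
        if (woken > 0)
            sl_enqueue(&losers, w);     /* lost the race: sleep again */
        woken++;
    }
    *sl = losers;
    return woken;
}

/* Total wakeups needed to hand the mutex to n waiters in turn. */
static int
drain(int n, int (*wake)(struct sleep_list *))
{
    struct waiter w[16];
    struct sleep_list sl = { NULL, NULL };
    int total = 0, i;

    assert(n >= 0 && n <= 16);
    for (i = 0; i < n; i++) {
        w[i].wakeups = 0;
        sl_enqueue(&sl, &w[i]);
    }
    while (sl.head != NULL)
        total += wake(&sl);
    return total;
}
```

Under these assumptions, handing the mutex to four waiters in turn
costs 4 wakeups with wake_one and 4 + 3 + 2 + 1 = 10 with wake_all.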

> That's my reading of Sun's claims in Solaris and given that they
> have a little more experience with this kind of thing I'm inclined
> to believe them until I see facts stating the contrary.

Sun's problem with Solaris is non-obvious, and may not bite us.

I think we should hold off with this kind of discussion for the while.
Everything I can see suggests that it's crazy to wake all processes.
If we find that we run into race conditions which we can only solve
with wake_all, though, we'll compare the effort in fixing them with
the (undoubted) performance degradation caused by waking them all.

Greg
--
Finger g...@lemis.com for PGP public key
See complete headers for address and phone numbers


From: Alfred Perlstein <bri...@wintelcom.net>
Subject: Re: SMP meeting summary
Date: 2000/07/03
Message-ID: <20000703220823.Z25571@fw.wintelcom.net>#1/1
X-Deja-AN: 642110372
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <20000703114535.T39024@wantadilla.lemis.com> <Pine.SUN.3.91.1000703060948.5216A-100000@pcnet1.pcnet.com> <20000703200039.H62680@wantadilla.lemis.com> <3960A971.982DDF07@vangelderen.org> <20000704083822.A65029@wantadilla.lemis.com>
Delivered-To: freebsd-...@freebsd.org
Content-Type: text/plain; charset=us-ascii
X-Gateway: Unidirectional mail2news gateway at MPCS
Mime-Version: 1.0
User-Agent: Mutt/1.2i
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
Content-Disposition: inline
X-Loop: FreeBSD.org

* Greg Lehey <g...@lemis.com> [000703 16:10] wrote:
> > That's my reading of Sun's claims in Solaris and given that they
> > have a little more experience with this kind of thing I'm inclined
> > to believe them until I see facts stating the contrary.
> 
> Sun's problem with Solaris is non-obvious, and may not bite us.
> 
> I think we should hold off with this kind of discussion for the while.
> Everything I can see suggests that it's crazy to wake all processes.
> If we find that we run into race conditions which we can only solve
> with wake_all, though, we'll compare the effort in fixing them with
> the (undoubted) performance degradation caused by waking them all.

The idea is that for spin or spin-then-sleep mutexes (very short
hold time) you won't have as many processes as CPUs contending (and
when you do, it's ok), and the mutual exclusion is so short-lived
that by the time the next 'thundering' process is actually given the
CPU, other processes have most likely already acquired _and_ released
the spinlock, so the resource is more than likely free.

In other words, a quantum is so long relative to the hold time that
there's little chance of one of the wake_all processes colliding on
the lock.
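
A toy model of that timing argument, in C.  Everything here is an
assumption made for illustration: the function name, the idea that
waiter i first runs at time i * dispatch_gap after a single wake_all,
and the fixed hold_time are not measurements of any kernel.  It only
shows how the number of collisions depends on the ratio of the two
times.

```c
#include <assert.h>

/*
 * Count how many of n waiters, all made runnable by one wake_all,
 * find the mutex still held when they first run.  Times are in
 * arbitrary units; both parameters are hypothetical.
 */
static int
count_collisions(int n, long dispatch_gap, long hold_time)
{
    long free_at = 0;   /* time at which the mutex next becomes free */
    int collisions = 0;
    int i;

    assert(n >= 0 && dispatch_gap >= 0 && hold_time >= 0);
    for (i = 0; i < n; i++) {
        long arrives = (long)i * dispatch_gap;

        if (arrives < free_at) {
            /* Still held: this waiter must spin or sleep again. */
            collisions++;
            free_at += hold_time;
        } else {
            free_at = arrives + hold_time;
        }
    }
    return collisions;
}
```

With a gap of 100 and a hold time of 1, four waiters never collide;
with the ratio reversed, every waiter after the first does.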

By doing this you effectively gain a whole lot, because you avoid
having to grab sched-mutex on each acquire/release, and you also
reduce the cache cost of wakeups, because it's likely that only one
kernel context will wind its way down the sleep queue.

What's sort of interesting is that doing it one way or the other is
so similar that in reality the initial implementation doesn't
matter; switching from one to the other will be trivial at most.
The importance lies in getting one implementation done.

-Alfred

From: Greg Lehey <g...@lemis.com>
Subject: Re: SMP meeting summary
Date: 2000/07/04
Message-ID: <20000704150736.H94351@wantadilla.lemis.com>#1/1
X-Deja-AN: 642118025
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
X-PGP-Fingerprint: 6B 7B C3 8C 61 CD 54 AF  13 24 52 F8 6D A4 95 EF
References: <20000703114535.T39024@wantadilla.lemis.com> <Pine.SUN.3.91.1000703060948.5216A-100000@pcnet1.pcnet.com> <20000703200039.H62680@wantadilla.lemis.com> <3960A971.982DDF07@vangelderen.org> <20000704083822.A65029@wantadilla.lemis.com> <20000703220823.Z25571@fw.wintelcom.net>
Delivered-To: freebsd-...@freebsd.org
WWW-Home-Page: http://www.lemis.com/~grog
Content-Type: text/plain; charset=us-ascii
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
X-Gateway: Unidirectional mail2news gateway at MPCS
Mobile: +61-418-838-708
Phone: +61-8-8388-8286
Mime-Version: 1.0
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org
Fax: +61-8-8388-8725

On Monday,  3 July 2000 at 22:08:24 -0700, Alfred Perlstein wrote:
> What's sort of interesting is that doing it one way or the other is
> so similar that in reality the initial implementation doesn't
> matter; switching from one to the other will be trivial at most.
> The importance lies in getting one implementation done.

There's a big difference in which implementation we do.  The BSD/OS
implementation works, at least in the BSD/OS environment.  Nothing
else has been written.  I think it's very important that we get the
BSD/OS version up and hobbling before we start redesigning things.  By
the time we've done that, we'll understand the material so much better
that we'll have a double win (working code and an understanding of how
to do it better).  I'm currently up to my elbows in dead interrupt
code, and I'm surprised how much I'm learning [wipes mess off arms].

Greg
--
Finger g...@lemis.com for PGP public key
See complete headers for address and phone numbers


From: Peter Wemm <pe...@netplex.com.au>
Subject: Re: SMP meeting summary 
Date: 2000/07/04
Message-ID: <200007042144.OAA54160@netplex.com.au>#1/1
X-Deja-AN: 642408019
Approved: n...@news1.mpcs.com
Sender: n...@news1.mpcs.com
References: <grog@lemis.com>
Delivered-To: freebsd-...@freebsd.org
X-Gateway: Unidirectional mail2news gateway at MPCS
Newsgroups: mpc.lists.freebsd.smp,muc.lists.freebsd.smp
X-Loop: FreeBSD.org

Greg Lehey wrote:
> On Monday,  3 July 2000 at 22:08:24 -0700, Alfred Perlstein wrote:
> > What's sort of interesting is that doing it one way or the other is
> > so similar that in reality the initial implementation doesn't
> > matter; switching from one to the other will be trivial at most.
> > The importance lies in getting one implementation done.
> 
> There's a big difference in which implementation we do.  The BSD/OS
> implementation works, at least in the BSD/OS environment.  Nothing
> else has been written.  I think it's very important that we get the
> BSD/OS version up and hobbling before we start redesigning things.  By
> the time we've done that, we'll understand the material so much better
> that we'll have a double win (working code and an understanding of how
> to do it better).  I'm currently up to my elbows in dead interrupt
> code, and I'm surprised how much I'm learning [wipes mess off arms].

A general comment..  It was made very clear at the SMP meeting that things
would have taken a lot less time if they had the "safe but slower" fallback
code available right from the start.  I feel that it is imperative that we
implement a minimal-but-functional set of code that we can trust first and
*then* take a shot at the lightweight interrupt context, and do it in such
a way that when Weird Shit(TM) starts happening that we can easily fall
back to the conservative code so that we can eliminate the optimized
lightweight interrupt contexts from suspicion.  Having the BSD/OS code
available as a starting point is a huge help.  We should not have to worry
about the mutex or witness code until we are up and running.

There are truckloads of optimizations that can be done afterwards, but we
must walk first, not run.  Doing things conservatively and safely now with
an eye towards later optimization will hopefully save our sanity.  Whatever
we can leverage from BSD/OS as a "known quantity" we should - it will
reduce the amount of green or untried code while we get up to speed.  If
this means that our SMP work looks a lot like BSD/OS, then so what?  It
doesn't have to stay that way forever.  Once we have something that runs
and doesn't panic in 3 seconds, then we have something to tune/optimize/
reimplement/whatever.  If we all dive in and invent our own stuff right
from the start, we will have just as much pain and suffering as the BSDI
folks had and it will take just as long (or longer).

Cheers,
-Peter
--
Peter Wemm - pe...@FreeBSD.org; pe...@yahoo-inc.com; pe...@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

From: g...@lemis.com (Greg Lehey)
Subject: Re: SMP meeting summary
Date: 2000/07/05
Message-ID: <20000705082900.I94351@wantadilla.lemis.com>#1/1
X-Deja-AN: 642439642
Approved: n...@camelot.de
References: <grog@lemis.com>
Content-Type: text/plain; charset=us-ascii
X-Complaints-To: abuse@camelot.de
X-Trace: lancelot.camelot.de 962751830 6775 195.30.224.3 (4 Jul 2000 23:03:50 GMT)
Organization: LEMIS, PO Box 460, Echunga SA 5153, Australia
Mime-Version: 1.0
NNTP-Posting-Date: 4 Jul 2000 23:03:50 GMT
Newsgroups: muc.lists.freebsd.smp,mpc.lists.freebsd.smp

On Tuesday,  4 July 2000 at 14:44:09 -0700, Peter Wemm wrote:
> Greg Lehey wrote:
>> On Monday,  3 July 2000 at 22:08:24 -0700, Alfred Perlstein wrote:
>>> What's sort of interesting is that doing it one way or the other is
>>> so similar that in reality the initial implementation doesn't
>>> matter; switching from one to the other will be trivial at most.
>>> The importance lies in getting one implementation done.
>>
>> There's a big difference in which implementation we do.  The BSD/OS
>> implementation works, at least in the BSD/OS environment.  Nothing
>> else has been written.  I think it's very important that we get the
>> BSD/OS version up and hobbling before we start redesigning things.  By
>> the time we've done that, we'll understand the material so much better
>> that we'll have a double win (working code and an understanding of how
>> to do it better).  I'm currently up to my elbows in dead interrupt
>> code, and I'm surprised how much I'm learning [wipes mess off arms].
>
> A general comment..  It was made very clear at the SMP meeting that
> things would have taken a lot less time if they had the "safe but
> slower" fallback code available right from the start.  I feel that
> it is imperative that we implement a minimal-but-functional set of
> code that we can trust first and *then* take a shot at the
> lightweight interrupt context, and do it in such a way that when
> Weird Shit(TM) starts happening that we can easily fall back to the
> conservative code so that we can eliminate the optimized lightweight
> interrupt contexts from suspicion.

Agreed.  That's the way I'm going.  Is there anything I have said that
gives you reason to think I'm advocating something else?

> Having the BSD/OS code available as a starting point is a huge help.
> We should not have to worry about the mutex or witness code until we
> are up and running.

For some definition of "worry".

> There are truckloads of optimizations that can be done afterwards,
> but we must walk first, not run.  Doing things conservatively and
> safely now with an eye towards later optimization will hopefully
> save our sanity.  Whatever we can leverage from BSD/OS as a "known
> quantity" we should - it will reduce the amount of green or untried
> code while we get up to speed.  If this means that our SMP work
> looks a lot like BSD/OS, then so what?  It doesn't have to stay that
> way forever. 

I'm also not advocating change for change's sake.  If it turns out
that the BSD/OS code is the way to go, then I wouldn't want to change.

Greg
--
Finger g...@lemis.com for PGP public key
See complete headers for address and phone numbers

