linux.kernel - 26 new messages in 12 topics - digest
linux.kernel
http://groups.google.com/group/linux.kernel?hl=en
Today's topics:
* introduce sys_membarrier(): process-wide memory barrier - 5 messages, 4
authors
http://groups.google.com/group/linux.kernel/t/c8972d397ccbdcff?hl=en
* slab: initialize unused alien cache entry as NULL at alloc_alien_cache(). -
2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/225b6b9dddc5ed96?hl=en
* af_packet: Don't use skb after dev_queue_xmit() - 5 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/8e6b471b3077e37d?hl=en
* [PATCH 6/8] mm: handle_speculative_fault() - 3 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/2a5e8285ffb8a998?hl=en
* s390 && user_enable_single_step() (Was: odd utrace testing results on s390x)
- 3 messages, 1 author
http://groups.google.com/group/linux.kernel/t/e13ca0bcc54b2ee7?hl=en
* Linux 2.6.31.10 - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/ea7bb2f1bb30cdb8?hl=en
* NOMMU: Optimise away the {dac_,}mmap_min_addr tests - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/a5885fb2a96e1745?hl=en
* linux-next: Tree for January 7 (pcmcia) - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/a4235efd7c294f08?hl=en
* lib/vsprintf.c: Add %pMF to format FDDI bit reversed MAC addresses - 1
message, 1 author
http://groups.google.com/group/linux.kernel/t/fe6737545216efec?hl=en
* Лидия Громыко has added you as a friend on the website VK.com - 1 message,
1 author
http://groups.google.com/group/linux.kernel/t/db2424c90bbc71ea?hl=en
* cfq-iosched: non-rot devices do not need read queue merging - 1 message, 1
author
http://groups.google.com/group/linux.kernel/t/f5cfad2a8e2aea5f?hl=en
* Input: wacom - Setup features via driver info - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/180a14dbdf464341?hl=en
==============================================================================
TOPIC: introduce sys_membarrier(): process-wide memory barrier
http://groups.google.com/group/linux.kernel/t/c8972d397ccbdcff?hl=en
==============================================================================
== 1 of 5 ==
Date: Thu, Jan 7 2010 9:50 am
From: Mathieu Desnoyers
* Josh Triplett (josh@joshtriplett.org) wrote:
> On Thu, Jan 07, 2010 at 01:04:39AM -0500, Mathieu Desnoyers wrote:
[...]
> > Just tried it with a 10,000,000-iteration loop.
> >
> > The thread doing the system call loop takes 2.0% of user time, 98% of
> > system time. All other cpus are nearly 100.0% idle. Just to give a bit
> > more info about my test setup, I also have a thread sitting on a CPU
> > busy-waiting for the loop to complete. This thread takes 97.7% user
> > time (but it really is just there to make sure we are indeed doing the
> > IPIs, not skipping it through the thread_group_empty(current) test). If
> > I remove this thread, the execution time of the test program shrinks
> > from 32 seconds down to 1.9 seconds. So yes, the IPI is actually
> > executed in the first place, because removing the extra thread
> > accelerates the loop tremendously. I used an 8-core Xeon to test.
>
> Do you know if the kernel properly measures the overhead of IPIs? The
> CPUs might have only looked idle. What about running some kind of
> CPU-bound benchmark on the other CPUs and testing the completion time
> with and without the process running the membarrier loop?
Good point. Just tried with a cache-hot kernel compilation using 6/8 CPUs.
Normally: real 2m41.852s
With the sys_membarrier+1 busy-looping thread running: real 5m41.830s
So... the unrelated processes become 2x slower. That hurts.
So let's try allocating a cpu mask for PeterZ's scheme. I prefer to have a
small allocation overhead and benefit from cpumask broadcast if
possible so we scale better. But that all depends on how big the
allocation overhead is.
Impact of allocating a cpumask (time for 10,000,000 sys_membarrier
calls; one thread doing the sys_membarrier calls, the others busy
looping):
IPI to all: real 0m44.708s
alloc cpumask+local mb()+IPI-many to 1 thread: real 1m2.034s
So, roughly, the cpumask allocation overhead is 17s here, not exactly
cheap. So let's see when it becomes better than single IPIs:
local mb()+single IPI to 1 thread: real 0m29.502s
local mb()+single IPI to 7 threads: real 2m30.971s
So, roughly, the single IPI overhead is 120s here for 6 more threads,
or about 20s per thread.
Here is what we can do: Given that it costs almost half as much to
perform the cpumask allocation as to send a single IPI, as we iterate
on the CPUs, for, say, the first N CPUs (ourself plus the CPUs that
need an IPI), we send single IPIs. This amounts to N-1 IPIs and a
local function call. If we need more than that, then we switch to the
cpumask allocation and send a broadcast IPI to the cpumask we construct
for the rest of the CPUs. Let's call it the "adaptive IPI scheme".
For my Intel Xeon E5405:
Just doing local mb()+single IPI to T other threads:
T=1: 0m29.219s
T=2: 0m46.310s
T=3: 1m10.172s
T=4: 1m24.822s
T=5: 1m43.205s
T=6: 2m15.405s
T=7: 2m31.207s
Just doing cpumask alloc+IPI-many to T other threads:
T=1: 0m39.605s
T=2: 0m48.566s
T=3: 0m50.167s
T=4: 0m57.896s
T=5: 0m56.411s
T=6: 1m0.536s
T=7: 1m12.532s
So I think the right threshold should be around 2 threads (assuming
other architectures behave like mine). So starting at 3 threads, we
allocate the cpumask and send IPIs.
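As a rough sketch of that adaptive scheme (written as if in
kernel/sched.c, where cpu_curr() is visible; the threshold constant and
the membarrier_ipi() helper are made-up names, and the preemption/RCU
protection around the remote ->mm reads is omitted for brevity):

#define MEMBARRIER_SINGLE_IPI_MAX	2	/* hypothetical cutoff, from the numbers above */

static void membarrier_ipi(void *unused)
{
	smp_mb();	/* execute the memory barrier on the target CPU */
}

SYSCALL_DEFINE0(membarrier)
{
	int cpu, nr_targets = 0;
	cpumask_var_t tmpmask;

	smp_mb();	/* order our prior accesses before reading remote ->mm */

	/* First pass: count CPUs currently running a thread of our mm. */
	for_each_cpu(cpu, &current->mm->cpu_vm_mask)
		if (cpu != smp_processor_id() &&
		    cpu_curr(cpu)->mm == current->mm)
			nr_targets++;

	if (nr_targets <= MEMBARRIER_SINGLE_IPI_MAX) {
		/* Few targets: individual IPIs, no allocation. */
		for_each_cpu(cpu, &current->mm->cpu_vm_mask)
			if (cpu != smp_processor_id() &&
			    cpu_curr(cpu)->mm == current->mm)
				smp_call_function_single(cpu, membarrier_ipi,
							 NULL, 1);
	} else if (alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
		/* Many targets: pay the allocation once, broadcast the IPI. */
		cpumask_clear(tmpmask);
		for_each_cpu(cpu, &current->mm->cpu_vm_mask)
			if (cpu != smp_processor_id() &&
			    cpu_curr(cpu)->mm == current->mm)
				cpumask_set_cpu(cpu, tmpmask);
		smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
		free_cpumask_var(tmpmask);
	}
	smp_mb();	/* order the IPIs before our subsequent accesses */
	return 0;
}

(The set of running threads can of course change between the two
passes, but that is fine for the reasons discussed earlier in the
thread: any thread scheduled in or out meanwhile gets its barrier from
the scheduler itself.)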
How does that sound?
[...]
>
> > > - Part of me thinks this ought to become slightly more general, and just
> > > deliver a signal that the receiving thread could handle as it likes.
> > > However, that would certainly prove more expensive than this, and I
> > > don't know that the generality would buy anything.
> >
> > A general scheme would have to call every thread, even those which are
> > not running. In the case of this system call, this is a particular case
> > where we can forget about non-running threads, because the memory
> > barrier is implied by the scheduler activity that brought them offline.
> > So I really don't see how we can use this IPI scheme for anything other
> > than this kind of synchronization.
>
> No, I don't mean non-running threads. If you wanted that, you could do
> what urcu currently does, and send a signal to all threads. I meant
> something like "signal all *running* threads from my process".
Well, if you find me a real-life use-case, then we can surely look into
that ;)
>
> > > - Could you somehow register reader threads with the kernel, in a way
> > > that makes them easy to detect remotely?
> >
> > There are two ways I can see to do this. One would imply adding
> > extra shared data between kernel and userspace (which I'd like to avoid,
> > to keep coupling low). The other alternative would be to add per
> > task_struct information about this, and new system calls. The added per
> > task_struct information would use up cache lines (which are very
> > important, especially in the task_struct) and the added system call at
> > rcu_read_lock/unlock() would simply kill performance.
>
> No, I didn't mean that you would do a syscall in rcu_read_{lock,unlock}.
> I meant that you would do a system call when the reader threads start,
> saying "hey, reader thread here".
Hrm, we need to inform the userspace RCU library that this thread is
present too. So I don't see how going through the kernel helps us there.
Thanks,
Mathieu
>
> - Josh Triplett
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
== 2 of 5 ==
Date: Thu, Jan 7 2010 10:00 am
From: "Paul E. McKenney"
On Thu, Jan 07, 2010 at 12:44:35PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Thu, Jan 07, 2010 at 06:18:36PM +0100, Peter Zijlstra wrote:
> > > On Thu, 2010-01-07 at 08:52 -0800, Paul E. McKenney wrote:
> > > > On Thu, Jan 07, 2010 at 09:44:15AM +0100, Peter Zijlstra wrote:
> > > > > On Wed, 2010-01-06 at 22:35 -0800, Josh Triplett wrote:
> > > > > >
> > > > > > The number of threads doesn't matter nearly as much as the number of
> > > > > > threads typically running at a time compared to the number of
> > > > > > processors. Of course, we can't measure that as easily, but I don't
> > > > > > know that your proposed heuristic would approximate it well.
> > > > >
> > > > > Quite agreed, and not disturbing RT tasks is even more important.
> > > >
> > > > OK, so I stand un-Reviewed-by twice in one morning. ;-)
> > > >
> > > > > A simple:
> > > > >
> > > > > for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > > > > if (cpu_curr(cpu)->mm == current->mm)
> > > > > smp_call_function_single(cpu, func, NULL, 1);
> > > > > }
> > > > >
> > > > > seems far preferable to anything else. If you really want, you can
> > > > > copy cpu_vm_mask into a cpumask, unset bits, and use the mask with
> > > > > smp_call_function_any(), but that includes having to allocate the
> > > > > cpumask, which might or might not be too expensive for Mathieu.
> > > >
> > > > This would be vulnerable to the sys_membarrier() CPU seeing an old value
> > > > of cpu_curr(cpu)->mm, and that other task seeing the old value of the
> > > > pointer we are trying to RCU-destroy, right?
> > >
> > > Right, so I was thinking that this works since you want an mb to be
> > > executed when calling sys_membarrier(). If you observe a matching ->mm
> > > but the cpu has since scheduled, we're good since it scheduled (but
> > > we'll still send the IPI anyway); if we do not observe it because the
> > > task gets scheduled in after we do the iteration, we're still good
> > > because it scheduled.
> >
> > Something like the following for sys_membarrier(), then?
> >
> > smp_mb();
>
> This smp_mb() is redundant, as we issue it through the for_each_cpu loop
> on the local CPU already.
But we need to do the smp_mb() -before- checking the first cpu_curr(cpu)->mm.
> > for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > if (cpu_curr(cpu)->mm == current->mm)
> > smp_call_function_single(cpu, func, NULL, 1);
> > }
> >
> > Then the code changing ->mm on the other CPU also needs to have a
> > full smp_mb() somewhere after the change to ->mm, but before starting
> > user-space execution. Which it might well do just due to overhead, but
> > we need to make sure that someone doesn't optimize us out of existence.
>
> I believe we also need one between execution of the userspace task and
> change to ->mm. If we have these guarantees I think we are fine.
Agreed, in case an outgoing RCU read-side critical section does a store
into an RCU-protected data structure. Unconventional, but definitely
permitted.
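To make the pairing concrete, a schematic sketch (hypothetical, not a
posted patch; func is the IPI handler executing smp_mb() on the target
CPU):

	/* sys_membarrier() side: */
	smp_mb();	/* order prior accesses before the remote ->mm reads below */
	for_each_cpu(cpu, &current->mm->cpu_vm_mask) {
		if (cpu_curr(cpu)->mm == current->mm)
			smp_call_function_single(cpu, func, NULL, 1);
	}

	/*
	 * Scheduler side: a full memory barrier is needed between a task's
	 * last user-space access and the update of ->mm, and another
	 * between the ->mm update and the next task's first user-space
	 * access.  Then either sys_membarrier() observes the old ->mm and
	 * sends the IPI, or it observes the new one, in which case the
	 * context switch itself supplied the needed barriers.
	 */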
Thanx, Paul
> Mathieu
>
> >
> > Thanx, Paul
> >
> > > As to needing to keep rcu_read_lock() around the iteration, for sure we
> > > need that to ensure the remote task_struct reference we take is valid.
> > >
== 3 of 5 ==
Date: Thu, Jan 7 2010 10:00 am
From: "Paul E. McKenney"
On Thu, Jan 07, 2010 at 12:44:37PM -0500, Steven Rostedt wrote:
> On Thu, 2010-01-07 at 09:31 -0800, Paul E. McKenney wrote:
>
> > Something like the following for sys_membarrier(), then?
> >
> > smp_mb();
> > for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > if (cpu_curr(cpu)->mm == current->mm)
> > smp_call_function_single(cpu, func, NULL, 1);
> > }
> >
> > Then the code changing ->mm on the other CPU also needs to have a
> > full smp_mb() somewhere after the change to ->mm, but before starting
> > user-space execution. Which it might well do just due to overhead, but
> > we need to make sure that someone doesn't optimize us out of existence.
>
> To change the mm requires things like flushing the TLB. I'd be surprised
> if the change of the mm does not already do a smp_mb() somewhere.
Agreed, but "somewhere" does not fill me with warm fuzzies. ;-)
Thanx, Paul
== 4 of 5 ==
Date: Thu, Jan 7 2010 10:10 am
From: Steven Rostedt
On Thu, 2010-01-07 at 09:56 -0800, Paul E. McKenney wrote:
> On Thu, Jan 07, 2010 at 12:44:37PM -0500, Steven Rostedt wrote:
> > On Thu, 2010-01-07 at 09:31 -0800, Paul E. McKenney wrote:
> >
> > > Something like the following for sys_membarrier(), then?
> > >
> > > smp_mb();
> > > for_each_cpu(cpu, current->mm->cpu_vm_mask) {
> > > if (cpu_curr(cpu)->mm == current->mm)
> > > smp_call_function_single(cpu, func, NULL, 1);
> > > }
> > >
> > > Then the code changing ->mm on the other CPU also needs to have a
> > > full smp_mb() somewhere after the change to ->mm, but before starting
> > > user-space execution. Which it might well do just due to overhead, but
> > > we need to make sure that someone doesn't optimize us out of existence.
> >
> > To change the mm requires things like flushing the TLB. I'd be surprised
> > if the change of the mm does not already do a smp_mb() somewhere.
>
> Agreed, but "somewhere" does not fill me with warm fuzzies. ;-)
Another question would be, does flushing the TLB imply a mb()?
-- Steve
== 5 of 5 ==
Date: Thu, Jan 7 2010 10:40 am
From: Oleg Nesterov
On 01/07, Peter Zijlstra wrote:
>
> On Wed, 2010-01-06 at 23:40 -0500, Mathieu Desnoyers wrote:
>
> http://marc.info/?t=126283939400002
>
> > Index: linux-2.6-lttng/kernel/sched.c
> > ===================================================================
> > --- linux-2.6-lttng.orig/kernel/sched.c 2010-01-06 22:11:32.000000000 -0500
> > +++ linux-2.6-lttng/kernel/sched.c 2010-01-06 23:20:42.000000000 -0500
> > @@ -10822,6 +10822,36 @@ struct cgroup_subsys cpuacct_subsys = {
> > };
> >