Thursday, February 18, 2010

linux.kernel - 26 new messages in 12 topics - digest

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

linux.kernel@googlegroups.com

Today's topics:

* input/touchscreen: Synaptics Touchscreen Driver - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/35b4266c46f84616?hl=en
* NO_HZ migration of TCP ack timers - 3 messages, 3 authors
http://groups.google.com/group/linux.kernel/t/e86dd5c1a70294be?hl=en
* x86-32: use SSE for atomic64_read/set if available - 10 messages, 4 authors
http://groups.google.com/group/linux.kernel/t/c7fe1bc8eb70e0f9?hl=en
* Input updates for 2.6.33-rc7 - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/e05b88b1fc64a058?hl=en
* Kernel panic due to page migration accessing memory holes - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/aa7ff852c220cdf9?hl=en
* libata: cache device select - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/d741a6a353f7ff1d?hl=en
* Panic at tcp_xmit_retransmit_queue - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/75f63e704505ce49?hl=en
* xtime_lock: Convert to raw_seqlock - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/c5f7346f3aa4e57d?hl=en
* Kernel Bug in ATA or SMART area - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/3c73f50c29815691?hl=en
* x86 rwsem optimization extreme - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/8bd57c643290c7fe?hl=en
* bitops: compile time optimization for hweight_long(CONSTANT) - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/f58ac10e7917a328?hl=en
* tracing: Unify arch_syscall_addr() implementations - 3 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/f25a85a28b5e0740?hl=en

==============================================================================
TOPIC: input/touchscreen: Synaptics Touchscreen Driver
http://groups.google.com/group/linux.kernel/t/35b4266c46f84616?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 2:00 am
From: Dmitry Torokhov


Hi Christopher,

On Wed, Feb 17, 2010 at 02:37:36PM -0800, Christopher Heiny wrote:
> [This is a resend of a patch sent earlier this month - apologies
> for the duplication, but I botched some of the addressing in the
> previous submission (neglecting to include Dmitry), as well as the
> cover note.]
>
> This patch adds an initial driver supporting Synaptics ClearPad
> touchscreens that use the RMI4 protocol, as defined here:
>
> http://www.synaptics.com/sites/default/files/511-000136-01_revA.pdf
>
>
> Differences to the previous RFC PATCH sent include:
> - proper line wrapping (sending with git send-email)
> - extensive changes to make checkpatch.pl happy. Not all errors were
> eliminated, because in a couple of cases we couldn't figure out what the
> problem was.
> - i2c interface updated to reflect recent i2c changes in the kernel.
>


Thank you for making the changes and yes, please CC me and linux-input
mailing list on the subsequent submissions - this way you should
normally get quicker response.

Please give me a couple of days to look over the driver and ping me if you
don't hear from me.

--
Dmitry
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: NO_HZ migration of TCP ack timers
http://groups.google.com/group/linux.kernel/t/e86dd5c1a70294be?hl=en
==============================================================================

== 1 of 3 ==
Date: Thurs, Feb 18 2010 2:00 am
From: Anton Blanchard

Hi Andi,

> If the nohz balancer CPU is otherwise idle, shouldn't it have enough
> cycles to handle acks for everyone? Is the problem the cache line
> transfer time?

Yeah, I think the timer spinlock on the nohz balancer cpu ends up being a
global lock for every other cpu trying to migrate their ack timers to it.

> Sounds like something that should be controlled by the cpufreq governor's
> idle predictor? Only migrate if predicted idle time is long enough.
> It's essentially the same problem as deciding how deeply idle to put
> a CPU. Heavy measures only pay off if the expected time is long enough.

Interesting idea, it seems like we do need a better understanding of
how idle a cpu is, not just that it is idle when mod_timer is called.

Anton


== 2 of 3 ==
Date: Thurs, Feb 18 2010 2:10 am
From: Andi Kleen


On Thu, Feb 18, 2010 at 08:55:30PM +1100, Anton Blanchard wrote:
>
> Hi Andi,
>
> > If the nohz balancer CPU is otherwise idle, shouldn't it have enough
> > cycles to handle acks for everyone? Is the problem the cache line
> > transfer time?
>
> Yeah, I think the timer spinlock on the nohz balancer cpu ends up being a
> global lock for every other cpu trying to migrate their ack timers to it.

And they do that often for short idle periods?

For longer idle periods that should not be too bad.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.


== 3 of 3 ==
Date: Thurs, Feb 18 2010 2:40 am
From: Arun R Bharadwaj


* Andi Kleen <andi@firstfloor.org> [2010-02-18 09:08:35]:

> Anton Blanchard <anton@samba.org> writes:
>
> > echo 0 > /proc/sys/kernel/timer_migration
> >
> > makes the problem go away.
> >
> > I think the problem is the CPU is most likely to be idle when an rx networking
> > interrupt comes in. It seems the wrong thing to do to migrate any ack timers
> > off the current cpu taking the interrupt, and with enough networks we train
> > wreck transferring everyone's ack timers to the nohz load balancer cpu.
>
> If the nohz balancer CPU is otherwise idle, shouldn't it have enough
> cycles to handle acks for everyone? Is the problem the cache line
> transfer time?
>
> But yes if it's non idle the migration might need to spread out
> to more CPUs.
>
> >
> > What should we do? Should we use mod_timer_pinned here? Or is this an issue
>
> Sounds like something that should be controlled by the cpufreq governor's
> idle predictor? Only migrate if predicted idle time is long enough.
> It's essentially the same problem as deciding how deeply idle to put
> a CPU. Heavy measures only pay off if the expected time is long enough.
>

The cpuidle infrastructure has statistics about the idle times for
all the cpus. Maybe we can look to use this infrastructure to decide
whether to migrate timers or not?
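
The idea of gating timer migration on expected idle time can be sketched in plain C. This is only an illustration, not kernel code: the struct, the function names and the 1 ms threshold are all hypothetical, and the "prediction" is simply the average length of past idle periods.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: only migrate if we expect >= 1 ms of idle time. */
#define SKETCH_MIN_IDLE_NS 1000000ULL

/* Hypothetical per-cpu idle statistics (not the real cpuidle structures). */
struct sketch_idle_stats {
	uint64_t total_idle_ns;	/* sum of all past idle period lengths */
	uint64_t idle_periods;	/* number of past idle periods */
};

/* Naive predictor: the average length of the idle periods seen so far. */
static uint64_t sketch_predicted_idle_ns(const struct sketch_idle_stats *s)
{
	if (s->idle_periods == 0)
		return 0;
	return s->total_idle_ns / s->idle_periods;
}

/* Migrate a timer off this cpu only if the predicted idle time is long. */
bool sketch_should_migrate_timer(const struct sketch_idle_stats *s)
{
	return sketch_predicted_idle_ns(s) >= SKETCH_MIN_IDLE_NS;
}
```

With such a check, a cpu that is only briefly idle between rx interrupts would keep its ack timers local instead of pushing them to the nohz balancer cpu.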

arun

> > other areas might see (eg the block layer) and we should instead avoid
> > migrating timers created out of interrupts.
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only.

==============================================================================
TOPIC: x86-32: use SSE for atomic64_read/set if available
http://groups.google.com/group/linux.kernel/t/c7fe1bc8eb70e0f9?hl=en
==============================================================================

== 1 of 10 ==
Date: Thurs, Feb 18 2010 2:00 am
From: Avi Kivity


On 02/18/2010 02:47 AM, H. Peter Anvin wrote:
>
>>> Unless the performance advantage is provably very compelling, I'm
>>> inclined to say that this is not worth it.
>>>
>> There is the advantage of not taking the cacheline for writing in atomic64_read.
>> Also locked cmpxchg8b is slow and if we were to restore the TS flag
>> lazily on userspace return, it would significantly improve the
>> function in all cases (with the current code, it depends on how fast
>> the architecture does clts/stts vs lock cmpxchg8b).
>> Of course the big-picture impact depends on the users of the interface.
>>
> It does, and I would prefer to not take it until there is a user of the
> interface which motivates the performance. Ingo, do you have a feel for
> how performance-critical this actually is?
>

One heavy user is set_64() in the pagetable code. That's already in an
expensive operation due to the page fault so the impact will be quite
low, probably.

--
error compiling committee.c: too many arguments to function



== 2 of 10 ==
Date: Thurs, Feb 18 2010 2:10 am
From: Luca Barbieri


> One heavy user is set_64() in the pagetable code.  That's already in an
> expensive operation due to the page fault so the impact will be quite low,
> probably.
It currently does not use the atomic64_t infrastructure and thus won't
be affected currently, but can very easily be converted to cast the
pointer to atomic64_t* and use atomic64_set.

I think we set ptes in other places than the page fault handler.
Is any of them performance critical?


== 3 of 10 ==
Date: Thurs, Feb 18 2010 2:20 am
From: Andi Kleen


On Thu, Feb 18, 2010 at 10:53:06AM +0100, Luca Barbieri wrote:
> > You seem to have forgotten to add benchmark results that show this is
> > actually worth while? And is there really any user on 32bit
> > that needs 64bit atomic_t?
> perf is currently the main user.
> On Core2, lock cmpxchg8b takes about 24 cycles and writes the
> cacheline, while movlps takes 1 cycle.
> clts/stts probably wipes out the savings if we need to use it, but we
> can keep TS off and restore it lazily on return to userspace.

s/probably/very likely/

CR changes are slow and synchronize the CPU. The latter is always slow.

It sounds like you didn't time it?

> > I'm also suspicious of your use of global register variables.
> > This means they won't be saved on entry/exit of the functions.
> > Does that really work?
> I think it does.
> The functions never change the global register variables, and thus
> they are preserved.

Sounds fragile.

It'll generate worse code because gcc can't use these registers
at all in the C code. Some gcc versions also tend to give up when they run
out of registers too badly.

> Calls are done in inline assembly, which saves the variables if they
> are actually used as parameters (the global register variables are
> only visible in a portion of the C file, of course).

So why don't you simply use normal asm inputs/outputs?

-Andi

--
ak@linux.intel.com -- Speaking for myself only.


== 4 of 10 ==
Date: Thurs, Feb 18 2010 2:30 am
From: Peter Zijlstra


On Wed, 2010-02-17 at 12:42 +0100, Luca Barbieri wrote:
> +DEFINE_PER_CPU_ALIGNED(struct sse_atomic64_percpu, sse_atomic64_percpu);
> +
> +/* using the fpu/mmx looks infeasible due to the need to save the FPU environment, which is very slow
> + * SSE2 is slightly slower on Core 2 and less compatible, so avoid it for now
> + */
> +long long sse_atomic64_read_cx8call(long long dummy, const atomic64_t *v)
> +{
> + long long res;
> + unsigned long cr0 = 0;
> + struct thread_info *me = current_thread_info();
> + preempt_disable();
> + if (!(me->status & TS_USEDFPU)) {
> + cr0 = read_cr0();
> + if (cr0 & X86_CR0_TS)
> + clts();
> + }
> + asm volatile(
> + "movlps %%xmm0, " __percpu_arg(0) "\n\t"
> + "movlps %3, %%xmm0\n\t"
> + "movlps %%xmm0, " __percpu_arg(1) "\n\t"
> + "movlps " __percpu_arg(0) ", %%xmm0\n\t"
> + : "+m" (per_cpu__sse_atomic64_percpu.xmm0_low), "=m" (per_cpu__sse_atomic64_percpu.low), "=m" (per_cpu__sse_atomic64_percpu.high)
> + : "m" (v->counter));
> + if (cr0 & X86_CR0_TS)
> + write_cr0(cr0);
> + res = (long long)(unsigned)percpu_read(sse_atomic64_percpu.low) | ((long long)(unsigned)percpu_read(sse_atomic64_percpu.high) << 32);
> + preempt_enable();
> + return res;
> +}
> +EXPORT_SYMBOL(sse_atomic64_read_cx8call);

Care to explain how this is IRQ and NMI safe?



== 5 of 10 ==
Date: Thurs, Feb 18 2010 2:30 am
From: Peter Zijlstra


On Wed, 2010-02-17 at 12:42 +0100, Luca Barbieri wrote:
> This patch makes atomic64 use either the generic implementation or
> the rewritten cmpxchg8b one just introduced by inserting a "call" to
> either, using the alternatives system to dynamically switch the calls.
>
> This allows using atomic64_t on 386/486, which lack cmpxchg8b

IIRC we dropped <i586 SMP support, and since we don't have a PMU on
those chips atomic64_t doesn't need to be NMI safe, so a simple
UP-IRQ-disable implementation should suffice.



== 6 of 10 ==
Date: Thurs, Feb 18 2010 2:30 am
From: Luca Barbieri


> CR changes are slow and synchronize the CPU. The latter is always slow.
>
> It sounds like you didn't time it?
I didn't, because I think it strongly depends on the microarchitecture
and I don't have a comprehensive set of machines to test on, so it
would just be a single data point.

The lock prefix on cmpxchg8b is also serializing so it might be as bad.
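
For readers following along, the reason a cmpxchg8b-based read "writes the cacheline" can be sketched in portable C11. This is an illustration only, not the code from the patch: the function name is hypothetical, and the C11 compare-exchange stands in for the locked instruction.

```c
#include <stdatomic.h>

/*
 * Read a 64-bit value with a compare-and-swap instead of a plain load.
 * If *v is 0, the CAS "succeeds" and stores 0 back (no visible change).
 * Otherwise it fails and copies the current value into 'expected'.
 * Either way 'expected' ends up holding the current value, but the
 * operation is a read-modify-write, so the cpu needs the cacheline
 * in an exclusive (writable) state just to read it.
 */
long long sketch_read64_via_cas(_Atomic long long *v)
{
	long long expected = 0;

	atomic_compare_exchange_strong(v, &expected, 0);
	return expected;
}
```

A single 64-bit SSE load avoids that read-modify-write, which is the saving being argued about here.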

Anyway, if we use this, we should keep TS cleared in kernel mode and
lazily restore it on return to userspace.
This would make clts/stts performance mostly moot.

I agree that this feature would need to be added too before putting the
SSE atomic64 code in a released kernel.

> It'll generate worse code because gcc can't use these registers
> at all in the C code. Some gcc versions also tend to give up when they run
> out of registers too badly.
Yes, but the C implementations are small and simple, and are only used
on 386/486.
Furthermore, the data in the global register variables is the main
input to the computation.

> So why don't you simply use normal asm inputs/outputs?
I do, on the caller side.

In the callee, I don't see any other robust way to implement parameter
passing in ebx/esi other than global register variables (without
resorting to pure assembly, which would prevent reusing the generic
atomic64 implementation).


== 7 of 10 ==
Date: Thurs, Feb 18 2010 2:30 am
From: Peter Zijlstra


On Thu, 2010-02-18 at 10:53 +0100, Luca Barbieri wrote:
> perf is currently the main user.
> On Core2, lock cmpxchg8b takes about 24 cycles and writes the
> cacheline, while movlps takes 1 cycle.

Then run a 64bit kernel already, then it's a simple 1 cycle read.

The only platform this might possibly be worth the effort for is Atom,
the rest of the world has moved on to 64bit a long time ago.

There might still be a few pentium-m users out there that might
appreciate this too, but still..

That said, _iff_ this can be done nicely there's no objection.



== 8 of 10 ==
Date: Thurs, Feb 18 2010 3:00 am
From: Luca Barbieri


On Thu, Feb 18, 2010 at 11:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-02-17 at 12:42 +0100, Luca Barbieri wrote:
>> +DEFINE_PER_CPU_ALIGNED(struct sse_atomic64_percpu, sse_atomic64_percpu);
>> +
>> +/* using the fpu/mmx looks infeasible due to the need to save the FPU environment, which is very slow
>> + * SSE2 is slightly slower on Core 2 and less compatible, so avoid it for now
>> + */
>> +long long sse_atomic64_read_cx8call(long long dummy, const atomic64_t *v)
>> +{
>> +        long long res;
>> +        unsigned long cr0 = 0;
>> +        struct thread_info *me = current_thread_info();
>> +        preempt_disable();
>> +        if (!(me->status & TS_USEDFPU)) {
>> +                cr0 = read_cr0();
>> +                if (cr0 & X86_CR0_TS)
>> +                        clts();
>> +        }
>> +        asm volatile(
>> +                        "movlps %%xmm0, " __percpu_arg(0) "\n\t"
>> +                        "movlps %3, %%xmm0\n\t"
>> +                        "movlps %%xmm0, " __percpu_arg(1) "\n\t"
>> +                        "movlps " __percpu_arg(0) ", %%xmm0\n\t"
>> +                            : "+m" (per_cpu__sse_atomic64_percpu.xmm0_low), "=m" (per_cpu__sse_atomic64_percpu.low), "=m" (per_cpu__sse_atomic64_percpu.high)
>> +                            : "m" (v->counter));
>> +        if (cr0 & X86_CR0_TS)
>> +                write_cr0(cr0);
>> +        res = (long long)(unsigned)percpu_read(sse_atomic64_percpu.low) | ((long long)(unsigned)percpu_read(sse_atomic64_percpu.high) << 32);
>> +        preempt_enable();
>> +        return res;
>> +}
>> +EXPORT_SYMBOL(sse_atomic64_read_cx8call);
>
> Care to explain how this is IRQ and NMI safe?

Unfortunately it isn't, due to the per-CPU variables, and thus needs
to be fixed to align the stack and use it instead
(__attribute__((force_align_arg_pointer)) should do the job).
Sorry for this, I initially used the stack and later changed it to
guarantee alignment without rechecking the IRQ/NMI safety.

If we use the stack instead of per-CPU variables, all IRQs and NMIs
preserve CR0 and the SSE registers, so this would be safe, right?

kernel_fpu_begin/end cannot be used in interrupts, so that shouldn't
be a concern.


== 9 of 10 ==
Date: Thurs, Feb 18 2010 3:00 am
From: Luca Barbieri


> IIRC we dropped <i586 SMP support, and since we don't have a PMU on
> those chips atomic64_t doesn't need to be NMI safe, so a simple
> UP-IRQ-disable implementation should suffice.

We need the generic version with spinlocks for other architectures,
and reusing it is the cheapest way to support 386/486.

We thus get 386/486 SMP for free, and on UP the spinlocks simplify to
just IRQ disabling.

The only thing we could do is to #ifdef out the hashed spinlock array
in the generic implementation on UP builds, which would save about 1KB
of memory.
That is independent from this patch though.
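
The general shape of a lock-based atomic64 with a hashed lock array can be sketched in userspace C11. This is a rough illustration under stated assumptions, not the kernel's generic implementation: the type and function names, the 16-entry table, the 64-byte cache line shift and the simple test-and-set locks are all hypothetical stand-ins.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_NR_LOCKS 16	/* hypothetical table size */

typedef struct { long long counter; } sketch_atomic64_t;

/* Zero-initialized static atomics are valid; 0 means unlocked. */
static atomic_int sketch_locks[SKETCH_NR_LOCKS];

/* Pick a lock by hashing the variable's address (assumes 64-byte lines). */
static atomic_int *sketch_lock_for(const void *addr)
{
	uintptr_t a = (uintptr_t)addr >> 6;

	return &sketch_locks[a % SKETCH_NR_LOCKS];
}

static void sketch_lock(atomic_int *l)
{
	while (atomic_exchange_explicit(l, 1, memory_order_acquire))
		;	/* spin until the previous value was 0 */
}

static void sketch_unlock(atomic_int *l)
{
	atomic_store_explicit(l, 0, memory_order_release);
}

long long sketch_atomic64_read(const sketch_atomic64_t *v)
{
	atomic_int *l = sketch_lock_for(v);
	long long val;

	sketch_lock(l);
	val = v->counter;
	sketch_unlock(l);
	return val;
}

void sketch_atomic64_set(sketch_atomic64_t *v, long long i)
{
	atomic_int *l = sketch_lock_for(v);

	sketch_lock(l);
	v->counter = i;
	sketch_unlock(l);
}

long long sketch_atomic64_add_return(long long a, sketch_atomic64_t *v)
{
	atomic_int *l = sketch_lock_for(v);
	long long val;

	sketch_lock(l);
	val = v->counter += a;
	sketch_unlock(l);
	return val;
}
```

On a uniprocessor build the lock table would be dead weight, since disabling interrupts already excludes every other accessor; that is the memory saving mentioned above.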


== 10 of 10 ==
Date: Thurs, Feb 18 2010 3:10 am
From: Peter Zijlstra


On Thu, 2010-02-18 at 11:50 +0100, Luca Barbieri wrote:
> If we use the stack instead of per-CPU variables, all IRQs and NMIs
> preserve CR0 and the SSE registers, so this would be safe, right?

You'd have to take special care to deal with nested IRQs I think.


==============================================================================
TOPIC: Input updates for 2.6.33-rc7
http://groups.google.com/group/linux.kernel/t/e05b88b1fc64a058?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 2:10 am
From: Dmitry Torokhov


Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus
or
master.kernel.org:/pub/scm/linux/kernel/git/dtor/input.git for-linus

to receive updates for the input subsystem.

Changelog:
---------

Alan Jenkins (1):
Input: i8042 - fix KBC jam during hibernate

Matthew Garrett (1):
Input: add KEY_RFKILL


Diffstat:
--------

drivers/input/serio/i8042.c | 8 ++++++++
include/linux/input.h | 1 +
2 files changed, 9 insertions(+), 0 deletions(-)

--
Dmitry


==============================================================================
TOPIC: Kernel panic due to page migration accessing memory holes
http://groups.google.com/group/linux.kernel/t/aa7ff852c220cdf9?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 2:10 am
From: Mel Gorman


On Thu, Feb 18, 2010 at 06:36:04PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Feb 2010 00:22:24 -0800
> Michael Bohan <mbohan@codeaurora.org> wrote:
>
> > On 2/17/2010 5:03 PM, KAMEZAWA Hiroyuki wrote:
> > > On Wed, 17 Feb 2010 16:45:54 -0800
> > > Michael Bohan<mbohan@codeaurora.org> wrote:
> > >> As a temporary fix, I added some code to move_freepages_block() that
> > >> inspects whether the range exceeds our first memory bank -- returning 0
> > >> if it does. This is not a clean solution, since it requires exporting
> > >> the ARM specific meminfo structure to extract the bank information.
> > >>
> > >>
> > > Hmm, my first impression is...
> > >
> > > - Using FLATMEM, memmap is created for the number of pages and memmap should
> > > not have aligned size.
> > > - Using SPARSEMEM, memmap is created for aligned number of pages.
> > >
> > > Then, the range [zone->start_pfn ... zone->start_pfn + zone->spanned_pages]
> > > should be checked always.
> > >
> > >
> > > 803 static int move_freepages_block(struct zone *zone, struct page *page,
> > > 804 int migratetype)
> > > 805 {
> > > 816 if (start_pfn< zone->zone_start_pfn)
> > > 817 start_page = page;
> > > 818 if (end_pfn>= zone->zone_start_pfn + zone->spanned_pages)
> > > 819 return 0;
> > > 820
> > > 821 return move_freepages(zone, start_page, end_page, migratetype);
> > > 822 }
> > >
> > > "(end_pfn>= zone->zone_start_pfn + zone->spanned_pages)" is checked.
> > > What zone->spanned_pages is set ? The zone's range is
> > > [zone->start_pfn ... zone->start_pfn+zone->spanned_pages], so this
> > > area should have initialized memmap. I wonder zone->spanned_pages is too big.
> > >
> >
> > In the block of code above running on my target, the zone_start_pfn
> > is 0x200 and the spanned_pages is 0x44100. This is consistent with the
> > values shown from the zoneinfo file below. It is also consistent with
> > my memory map:
> >
> > bank0:
> > start: 0x00200000
> > size: 0x07B00000
> >
> > bank1:
> > start: 0x40000000
> > size: 0x04300000
> >
> > Thus, spanned_pages here is the highest address reached minus the start
> > address of the lowest bank (eg. 0x40000000 + 0x04300000 - 0x00200000).
> >
> > Both of these banks exist in the same zone. This means that the check
> > in move_freepages_block() will never be satisfied for cases that overlap
> > with the prohibited pfns, since the zone spans invalid pfns. Should
> > each bank be associated with its own zone?
> >
>
> Hmm. okay then..(CCing Mel.)
>
> [Fact]
> - There are 2 banks of memory and a memory hole on your machine.
> As
> 0x00200000 - 0x07D00000
> 0x40000000 - 0x44300000
>
> - Both banks are in the same zone.
> - You use FLATMEM.
> - You see panic in move_freepages().
> - Your host's MAX_ORDER=11....buddy allocator's alignment is 0x400000
> Then, it seems 1st bank is not aligned.

It's not and assumptions are made about it being aligned.
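
The numbers in this thread can be checked with a small standalone C sketch. It assumes 4 KB pages (PAGE_SHIFT 12, which matches the reported zone_start_pfn of 0x200 for address 0x00200000) and uses hypothetical names throughout; only the bank addresses and MAX_ORDER=11 come from the report.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KB pages */
#define SKETCH_MAX_ORDER  11	/* from the report */

struct sketch_bank { uint32_t start; uint32_t size; };

/* The two memory banks described in the report. */
static const struct sketch_bank sketch_banks[] = {
	{ 0x00200000u, 0x07B00000u },
	{ 0x40000000u, 0x04300000u },
};

/* Largest buddy block in bytes: 2^(MAX_ORDER - 1) pages. */
uint32_t sketch_max_block_bytes(void)
{
	return (1u << (SKETCH_MAX_ORDER - 1)) << SKETCH_PAGE_SHIFT;
}

/* True if addr sits on a largest-buddy-block boundary. */
bool sketch_is_block_aligned(uint32_t addr)
{
	return (addr & (sketch_max_block_bytes() - 1)) == 0;
}

/* First pfn of the zone: the start of the lowest bank. */
uint32_t sketch_zone_start_pfn(void)
{
	return sketch_banks[0].start >> SKETCH_PAGE_SHIFT;
}

/* Pages spanned from the lowest bank's start to the highest bank's end. */
uint32_t sketch_spanned_pages(void)
{
	uint32_t end = sketch_banks[1].start + sketch_banks[1].size;

	return (end - sketch_banks[0].start) >> SKETCH_PAGE_SHIFT;
}
```

Under these assumptions the largest block is 0x400000 bytes, bank0 starts at 0x00200000 and ends at 0x07D00000 (neither is a multiple of 0x400000), and the zone spans 0x44100 pages starting at pfn 0x200, hole included.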

> - You see panic in move_freepages().
> - When you added special range check for bank0 in move_freepages(), no panic.
> So, it seems the kernel sees something bad when accessing memmap for a memory
> hole between bank0 and bank1.
>
>
> When you use FLATMEM, memmap/migrate-type-bitmap should be allocated for
> the whole range of [start_pfn....max_pfn) regardless of memory holes.
> Then, I think you have memmap even for a memory hole [0x07D00000...0x40000000)
>

It would have at the start but then ....


> Then, the question is why move_freepages() panics when accessing *unused* memmaps
> for the memory hole. All memmap(struct page) are initialized in
> memmap_init()
> -> memmap_init_zone()
> -> ....
> Here, all page structs are initialized (page->flags, page->lru are initialized.)
>

ARM frees unused portions of memmap to save memory. It's why memmap_valid_within()
exists when CONFIG_ARCH_HAS_HOLES_MEMORYMODEL is set, although previously only
reading /proc/pagetypeinfo cared.

In that case, the FLATMEM memory map had unexpected holes which "never"
happens and that was the workaround. The problem here is that there are
unaligned zones but no pfn_valid() implementation that can identify
them as you'd have with SPARSEMEM. My expectation is that you are using
the pfn_valid() implementation from asm-generic

#define pfn_valid(pfn) ((pfn) < max_mapnr)

which is insufficient in your case.

> Then, looking back into move_freepages().
> ==
> 778 for (page = start_page; page <= end_page;) {
> 779 /* Make sure we are not inadvertently changing nodes */
> 780 VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
> 781
> 782 if (!pfn_valid_within(page_to_pfn(page))) {
> 783 page++;
> 784 continue;
> 785 }
> 786
> 787 if (!PageBuddy(page)) {
> 788 page++;
> 789 continue;
> 790 }
> 791
> 792 order = page_order(page);
> 793 list_del(&page->lru);
> 794 list_add(&page->lru,
> 795 &zone->free_area[order].free_list[migratetype]);
> 796 page += 1 << order;
> 797 pages_moved += 1 << order;
> 798 }
> ==
> Assume an access to the page struct itself doesn't cause the panic.
> Touching the page struct's members such as page->lru is what causes the panic,
> so PageBuddy should be set.
>
> Then, there are 2 chances.
> 1. page_to_nid(page) != zone_to_nid(zone).
> 2. PageBuddy() is set by mistake.
> (PG_reserved page never be set PG_buddy.)
>
> For both, something is corrupted in the unused memmap area.
> There are 2 possibilities.
> (1) memmap for the memory hole was not initialized correctly.
> (2) something wrongly corrupted the memmap (by overwriting it).
>
> I suspect (2) rather than (1).
>

I think it's more likely that the memmap he is accessing has been
freed and is effectively random data.

> One of difficulty here is that your kernel is 2.6.29. Can't you try 2.6.32 and
> reproduce trouble ? Or could you check page flags for memory holes ?
> For holes, nid should be zero and PG_buddy shouldn't be set and PG_reserved
> should be set...
>
> And checking memmap initialization of memory holes in memmap_init_zone()
> may be good start point for debug, I guess.
>
> Off topic:
> BTW, memory hole seems huge for your size of memory....using SPARSEMEM
> is a choice.
>

SPARSEMEM would give you an implementation of pfn_valid() that you could
use here. The choices that spring to mind are;

1. reduce MAX_ORDER so they are aligned (easiest)
2. use SPARSEMEM (easy, but not necessarily what you want to do, might
waste memory unless you drop MAX_ORDER as well)
3. implement a pfn_valid() that can handle the holes and set
CONFIG_HOLES_IN_ZONE so it's called in move_freepages() to
deal with the holes (should pass this by someone more familiar
with ARM than I)
4. Call memmap_valid_within in move_freepages (very very ugly, not
suitable for upstream merging)
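
To make option 3 concrete, here is a rough userspace sketch of a bank-aware pfn check next to the simple range check quoted earlier. It is not ARM code: the names are hypothetical, the bank table is copied from the report, and 4 KB pages are assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HOLE_SKETCH_PAGE_SHIFT 12	/* assumption: 4 KB pages */

struct hole_sketch_bank { uint32_t start; uint32_t size; };

/* The two memory banks described in the report. */
static const struct hole_sketch_bank hole_sketch_banks[] = {
	{ 0x00200000u, 0x07B00000u },
	{ 0x40000000u, 0x04300000u },
};

/* Same shape as the simple check: anything below the limit is "valid". */
bool hole_sketch_pfn_valid_flat(uint32_t pfn, uint32_t max_mapnr)
{
	return pfn < max_mapnr;
}

/* Bank-aware check: a pfn is valid only if some bank actually backs it. */
bool hole_sketch_pfn_valid_banks(uint32_t pfn)
{
	size_t n = sizeof(hole_sketch_banks) / sizeof(hole_sketch_banks[0]);

	for (size_t i = 0; i < n; i++) {
		uint32_t first = hole_sketch_banks[i].start >> HOLE_SKETCH_PAGE_SHIFT;
		uint32_t count = hole_sketch_banks[i].size >> HOLE_SKETCH_PAGE_SHIFT;

		if (pfn >= first && pfn - first < count)
			return true;
	}
	return false;
}
```

For pfn 0x7D00, the first page of the hole, the range check still answers "valid" for any limit above it, while the bank-aware check rejects it. That difference is what move_freepages() would rely on through pfn_valid_within().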

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab

==============================================================================
TOPIC: libata: cache device select
http://groups.google.com/group/linux.kernel/t/d741a6a353f7ff1d?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 2:20 am
From: Alan Cox


> I totally agree with this patch, but question the timings used to justify it.
> Surely the overhead is only 1-2usec for the case where the device
> is the one that was already selected (on a "smart" interface) ?

IFF you have a smart interface. A lot of the controllers in the PCI space
don't appear to be that clever.

> And for the case where the currently selected device is different
> than the desired device (the 1msec case), this patch makes little/no difference?

Correct, but even with two devices per cable (which is not the most
common setup) you win. Worst case (which I've never seen) you draw.
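
The win/draw argument can be illustrated with a toy model of a cached device select. This is not the libata patch: the struct, the names and the write counter are hypothetical, and the counter merely stands in for the device-select register write and its settle delay.

```c
#include <stdbool.h>

#define SKETCH_NO_DEVICE (-1)

/* Hypothetical port state: which device was selected last. */
struct sketch_port {
	int selected;			/* last selected device, or none */
	unsigned int select_writes;	/* how many real selects were issued */
};

void sketch_port_init(struct sketch_port *p)
{
	p->selected = SKETCH_NO_DEVICE;
	p->select_writes = 0;
}

/* Returns true if a (slow) hardware select had to be issued. */
bool sketch_select_device(struct sketch_port *p, int dev)
{
	if (p->selected == dev)
		return false;	/* cached: skip the register write and delay */

	p->select_writes++;	/* stands in for the write plus settle delay */
	p->selected = dev;
	return true;
}
```

Back-to-back commands to the same device skip the select entirely. Strictly alternating between two devices on one cable costs the same as before, which is the "draw" case.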

Alan

==============================================================================
TOPIC: Panic at tcp_xmit_retransmit_queue
http://groups.google.com/group/linux.kernel/t/75f63e704505ce49?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 2:40 am
From: Bruno Prémont


On Mon, 15 Feb 2010 15:21:58 "Ilpo Järvinen" wrote:
> On Wed, 3 Feb 2010, Ilpo Järvinen wrote:
>
> > On Mon, 1 Feb 2010, sbs wrote:
> >
> > > actually removing netconsole from kernel didn't help.
> > > i found many guys with the same problem but with different
> > > hardware configurations here:
> > >
> > > freez in TCP stack :
> > > http://bugzilla.kernel.org/show_bug.cgi?id=14470
> > >
> > > is there someone who can investigate it?
> > >
> > >
> > > On Tue, Jan 19, 2010 at 7:13 PM, sbs <gexlie@gmail.com> wrote:
> > > > We are hitting kernel panics on servers with nVidia MCP55 NICs
> > > > once a day; it usually appears under high network traffic
> > > > (around 10000Mbit/s) but that is not a rule, it has happened
> > > > even on low traffic.
> > > >
> > > > Servers are used as nginx+static content
> > > > On 2 equal servers this panic happens approx. 2 times a day
> > > > depending on network load. Machine completely freezes till the
> > > > netconsole reboots.
> > > >
> > > > Kernel: 2.6.32.3
> > > >
> > > > what can it be? whats wrong with tcp_xmit_retransmit_queue()
> > > > function ? can anyone explain or fix?
> >
> > You might want to try with the debug patch below. It might even make
> > the box survive the event (if I got it coded right).
>
> Here should be a better version of the debug patch, hopefully the
> infinite looping is now gone.

I can reproduce the freeze pretty easily, even on an idle server,
all I need is netconsole enabled, an ssh connection to server and
permission to write to /proc/sysrq-trigger.

The following command, executed via SSH, freezes the system:
echo t > /proc/sysrq-trigger
when netconsole is enabled. Doing the same from local console has no
negative effect (idle system).
Unfortunately I can't get any useful information out of the system as
nothing reaches VGA console and interaction with the system is not
possible anymore (cursor is still blinking on VGA console).

Unfortunately I currently have no setup here to analyze dead system via
kexec crash kernel that would be run on watchdog.

System I'm using is HP Proliant DL360 G5 (4 logical CPUs, two sockets),
bnx2 NIC.
Eventually I will try with some other system to reproduce there as
well (to rule out NIC driver).

Any hints on how to get pertinent data out of that system would be
really nice!

Regards,
Bruno

==============================================================================
TOPIC: xtime_lock: Convert to raw_seqlock
http://groups.google.com/group/linux.kernel/t/c5f7346f3aa4e57d?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Feb 18 2010 2:50 am
From: Peter Zijlstra


On Wed, 2010-02-17 at 18:47 +0000, Thomas Gleixner wrote:
>
> xtime_lock needs a raw_spinlock in preempt-rt. Convert it to
> raw_seqlock and fix up all users.
>
s/raw_spinlock/raw_seqlock/ ?

Maybe add an explanation on _why_ -rt needs this for the uninformed
amongst us.

-rt switches to sleeping spinlocks, but since the vdso is basically
userspace it cannot schedule, hence we need to keep using actual
spinlocks (this is also the reason the vdso things must not call into
lockdep)



== 2 of 2 ==
Date: Thurs, Feb 18 2010 3:10 am
From: Thomas Gleixner


On Thu, 18 Feb 2010, Peter Zijlstra wrote:

> On Wed, 2010-02-17 at 18:47 +0000, Thomas Gleixner wrote:
> >
> > xtime_lock needs a raw_spinlock in preempt-rt. Convert it to
> > raw_seqlock and fix up all users.
> >
> s/raw_spinlock/raw_seqlock/ ?
>
> Maybe add an explanation on _why_ -rt needs this for the uninformed
> amongst us.
>
> -rt switches to sleeping spinlocks, but since the vdso is basically
> userspace it cannot schedule, hence we need to keep using actual
> spinlocks (this is also the reason the vdso things must not call into
> lockdep)

No, the read_seq side is not taking the lock. It's just the write side
which is taking the spinlock to serialize against other writers.

xtime_lock is write locked in the timer interrupt context and therefore
cannot take a sleeping spinlock.

Thanks,

tglx
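
To illustrate the point above (readers only sample a sequence counter, while writers alone take a spinning lock), here is a minimal userspace sketch using C11 atomics. It is not the kernel's seqlock implementation; the names demo_seqlock, demo_clock, demo_write and demo_read are hypothetical and chosen only for this example.

```c
#include <stdatomic.h>

/* Hypothetical userspace analogue of a seqlock; not the kernel's seqlock_t. */
struct demo_seqlock {
    atomic_uint sequence;   /* even: data stable; odd: a write is in progress */
    atomic_flag writer;     /* spinning lock, taken by writers only */
};

struct demo_clock {
    struct demo_seqlock lock;
    atomic_long sec;
    atomic_long nsec;
};

/* Write side: serialize against other writers, then bump the counter
 * before and after the update so readers can detect the change. */
static void demo_write(struct demo_clock *c, long sec, long nsec)
{
    /* Writers spin here; they never sleep while holding the lock. */
    while (atomic_flag_test_and_set(&c->lock.writer))
        ;
    atomic_fetch_add(&c->lock.sequence, 1);   /* counter is now odd */
    atomic_store(&c->sec, sec);
    atomic_store(&c->nsec, nsec);
    atomic_fetch_add(&c->lock.sequence, 1);   /* counter is even again */
    atomic_flag_clear(&c->lock.writer);
}

/* Read side: takes no lock at all, just retries if a writer interfered. */
static void demo_read(struct demo_clock *c, long *sec, long *nsec)
{
    unsigned int start;

    do {
        /* Wait for an even count, i.e. no write in progress. */
        while ((start = atomic_load(&c->lock.sequence)) & 1u)
            ;
        *sec = atomic_load(&c->sec);
        *nsec = atomic_load(&c->nsec);
        /* A changed counter means the snapshot may be torn: retry. */
    } while (atomic_load(&c->lock.sequence) != start);
}
```

In this sketch only demo_write() can block other callers, which mirrors the argument that the writer's lock is the part that must not sleep when the write happens in interrupt context.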

==============================================================================
TOPIC: Kernel Bug in ATA or SMART area
http://groups.google.com/group/linux.kernel/t/3c73f50c29815691?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 3:00 am
From: Mikael Pettersson


Axel Uhl writes:
> Here's what /var/log/dmesg contains:
...
> Tejun Heo wrote:
> > Hello,
> >
> > On 02/12/2010 05:46 PM, Axel Uhl wrote:
> >> I don't have a /var/log/boot.msg, only a /var/log/boot. Its contents:
> >
> > Then please attach output of dmesg after boot.

Judging from your initial message and this one, it appears that the
problematic disk is driven by sata_via, but sata_via shares IRQ with
an awful lot of other junk.

Suggestions:
1. Enable IO/APIC support in the kernel. According to your previous lspci
the chipset does have one. This should reduce IRQ sharing and make IRQ
handling generally better. If there's still some sharing going on,
try moving PCI cards to other slots.
2. Move the problematic disk around to e.g. the Promise controller. Do the
stray exceptions persist? If so, then it's the disk that's at fault.
As long as the libata's error handling recovers, things should work anyway.
3. (Unrelated but...) Why use Old IDE to drive the VIA PATA controller?
Just use pata_via for that one, enable SCSI SR+SG support, and disable IDE.

==============================================================================
TOPIC: x86 rwsem optimization extreme
http://groups.google.com/group/linux.kernel/t/8bd57c643290c7fe?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 3:00 am
From: Ingo Molnar

* Zachary Amsden <zamsden@redhat.com> wrote:

> >
> >Zachary Amsden<zamsden@redhat.com> writes
> >>Incidentally, the cost of putting all the rwsem code inline, using the
> >>straightforward approach, for git-tip, using defconfig on x86_64 is
> >>3565 bytes / 20971778 bytes total, or 0.0168%, using gcc 4.4.3.
> >The nice advantage of putting lock code inline is that it gets
> >accounted to the caller in all profilers.
> >
> >-Andi
> >
>
> Unfortunately, only for the uncontended case. The hot case still ends up
> in a call to the lock text section.

Nor is it really true that it's 'a problem for profilers' - call graph
recording works just fine, in fact it can be better for a call-graph record
if the locking sites are not sprinkled around the kernel and inlined.

Ingo

==============================================================================
TOPIC: bitops: compile time optimization for hweight_long(CONSTANT)
http://groups.google.com/group/linux.kernel/t/f58ac10e7917a328?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 3:00 am
From: Peter Zijlstra


On Wed, 2010-02-17 at 14:57 +0100, Michal Marek wrote:
> On 12.2.2010 20:05, H. Peter Anvin wrote:
> > On 02/12/2010 09:47 AM, Borislav Petkov wrote:
> >>
> >> However, this is generic code and for the above to work we have to
> >> enforce x86-specific CFLAGS for it. What is the preferred way to do
> >> that?
> >>
> >
> > That's a question for Michal and the kbuild list. Michal?
>
> (I was offline last week).
>
> The _preferred_ way probably is not to do it :), but otherwise you can
> set CFLAGS_hweight.o depending on CONFIG_X86(_32|_64), just like you do
> in arch/x86/lib/Makefile already.

I guess one way to achieve that is to create an arch/x86/lib/hweight.c
that includes lib/hweight.c and give the x86 one special compile flags
and not build the lib one.
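
For readers who have not followed the thread, the subject line refers to folding hweight of a constant at compile time. The following is a rough sketch of that idea only, not the code under discussion: the demo_ names are hypothetical, and __builtin_constant_p is a GCC/Clang extension assumed to be available.

```c
#include <stdint.h>

/* Runtime fallback: classic parallel (SWAR) population count. */
static unsigned int demo_sw_hweight32(uint32_t w)
{
    w = w - ((w >> 1) & 0x55555555u);
    w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    w = (w + (w >> 4)) & 0x0f0f0f0fu;
    return (w * 0x01010101u) >> 24;
}

/* Constant-expression version: one test per bit, so the compiler can
 * reduce it to a literal when the argument is a compile-time constant. */
#define DEMO_CONST_HWEIGHT8(w)                  \
    ((unsigned int)(!!((w) & (1u << 0)) +       \
                    !!((w) & (1u << 1)) +       \
                    !!((w) & (1u << 2)) +       \
                    !!((w) & (1u << 3)) +       \
                    !!((w) & (1u << 4)) +       \
                    !!((w) & (1u << 5)) +       \
                    !!((w) & (1u << 6)) +       \
                    !!((w) & (1u << 7))))
#define DEMO_CONST_HWEIGHT16(w) \
    (DEMO_CONST_HWEIGHT8(w) + DEMO_CONST_HWEIGHT8((w) >> 8))
#define DEMO_CONST_HWEIGHT32(w) \
    (DEMO_CONST_HWEIGHT16(w) + DEMO_CONST_HWEIGHT16((w) >> 16))

/* Pick the foldable form for constants, the function call otherwise. */
#define demo_hweight32(w)                                   \
    (__builtin_constant_p(w) ? DEMO_CONST_HWEIGHT32(w)      \
                             : demo_sw_hweight32(w))
```

With this shape, only the runtime function would need any arch-specific treatment (such as the special compile flags discussed above); the constant path never reaches it.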


==============================================================================
TOPIC: tracing: Unify arch_syscall_addr() implementations
http://groups.google.com/group/linux.kernel/t/f25a85a28b5e0740?hl=en
==============================================================================

== 1 of 3 ==
Date: Thurs, Feb 18 2010 3:20 am
From: Frederic Weisbecker


From: Mike Frysinger <vapier@gentoo.org>

Most implementations of arch_syscall_addr() are the same, so create a
default version in common code and move the one piece that differs (the
syscall table) to asm/syscall.h. New arch ports don't have to waste
time copying & pasting this simple function.

The s390/sparc versions need to be different, so document why.
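
The arrangement described above (one generic function in common code, with each arch supplying only its syscall table) could be sketched in plain userspace C roughly as follows. This is an illustration, not the patch's code: every demo_ name is hypothetical, and the use of a weak symbol to allow an arch override is an assumption, since the relevant hunk is not shown here.

```c
/* Hypothetical stand-ins for two syscall handlers. */
static long demo_sys_read(void)  { return 0; }
static long demo_sys_write(void) { return 1; }

typedef long (*demo_syscall_fn)(void);

/* In the scheme described above, the table declaration is the one piece
 * that each arch would provide (via its asm/syscall.h). */
static const demo_syscall_fn demo_sys_call_table[] = {
    demo_sys_read,
    demo_sys_write,
};

/* Generic default: resolve a syscall number to a handler address.
 * Marked weak (an assumption of this sketch) so that an arch whose
 * table has a different layout could supply its own definition. */
__attribute__((weak)) unsigned long demo_arch_syscall_addr(int nr)
{
    return (unsigned long)demo_sys_call_table[nr];
}
```

An arch whose table does not hold plain function pointers would be the kind of case that cannot use such a default unchanged.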

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1264498803-17278-1-git-send-email-vapier@gentoo.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 Documentation/trace/ftrace-design.txt |    5 ++---
 arch/s390/include/asm/syscall.h       |    7 +++++++
 arch/s390/kernel/ftrace.c             |   10 ----------
 arch/sh/include/asm/syscall.h         |    2 ++
 arch/sh/kernel/ftrace.c               |    9 ---------
 arch/sparc/include/asm/syscall.h      |    7 +++++++
 arch/sparc/kernel/ftrace.c            |   11 -----------
 arch/x86/include/asm/syscall.h        |    2 ++
 arch/x86/kernel/ftrace.c              |   10 ----------
 include/linux/ftrace.h                |    6 ++++++
 kernel/trace/trace_syscalls.c         |    5 +++++
 11 files changed, 31 insertions(+), 43 deletions(-)

diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt
index 239f14b..99df110 100644
--- a/Documentation/trace/ftrace-design.txt
+++ b/Documentation/trace/ftrace-design.txt
@@ -218,11 +218,10 @@ HAVE_SYSCALL_TRACEPOINTS

 You need very few things to get the syscalls tracing in an arch.

+- Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig).
 - Have a NR_syscalls variable in <asm/unistd.h> that provides the number
   of syscalls supported by the arch.
-- Implement arch_syscall_addr() that resolves a syscall address from a
-  syscall number.
-- Support the TIF_SYSCALL_TRACEPOINT thread flags
+- Support the TIF_SYSCALL_TRACEPOINT thread flags.
 - Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace
   in the ptrace syscalls tracing path.
 - Tag this arch as HAVE_SYSCALL_TRACEPOINTS.
diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
index e0a73d3..8429686 100644
--- a/arch/s390/include/asm/syscall.h
+++ b/arch/s390/include/asm/syscall.h
@@ -15,6 +15,13 @@
 #include <linux/sched.h>
 #include <asm/ptrace.h>

+/*
+ * The syscall table always contains 32 bit pointers since we know that the
+ * address of the function to be called is (way) below 4GB. So the "int"
+ * type here is what we want [need] for both 32 bit and 64 bit systems.
+ */
+extern const unsigned int sys_call_table[];
+
 static inline long syscall_get_nr(struct task_struct *task,
 				  struct pt_regs *regs)
 {
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index 5a82bc6..9e69449 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -200,13 +200,3 @@ out:
return parent;
}
