twitter: linux.kernel - 26 new messages in 16 topics

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

Today's topics:

* cpufreq: Add APERF/MPERF support for AMD processors - 7 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/f0aa58b5e6d920fb?hl=en
* smp_call_function_many SMP race - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/90a370c7d8dd310f?hl=en
* KSM & hugepages - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/940cfb5fd65ac16f?hl=en
* .gitignore: ignore *.lzo files - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/097186c3c77a5979?hl=en
* kconfig: recalc symbol value before showing search results - 1 messages, 1
author
http://groups.google.com/group/linux.kernel/t/589540bdd321a019?hl=en
* Can not boot with CONFIG_NO_BOOTMEM=y on i686 - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/82c8d53557498245?hl=en
* RFC: direct MTD support for SquashFS - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/ffc7606040d00b24?hl=en
* intel-agp.c: Fix crash when accessing nonexistent GTT entries in i915 - 1
messages, 1 author
http://groups.google.com/group/linux.kernel/t/bcbed4037b6559cf?hl=en
* em28xx: "Empia Em28xx Audio" too long - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/2f36a4c45f7a79b4?hl=en
* [PATCH 10/18] gfs2: Provide config option for enabling trace points - 1
messages, 1 author
http://groups.google.com/group/linux.kernel/t/ccd49f7aa77bdcd3?hl=en
* About ACL for IPC Object - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/7b33ed8f36b8a44b?hl=en
* Export fragmentation index via /proc/extfrag_index - 4 messages, 1 author
http://groups.google.com/group/linux.kernel/t/dcbcf3307d0ed626?hl=en
* 2.6.34-rc2 - crash on shutdown - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/afdca5351efc06ff?hl=en
* kbuild: Include gen_initramfs_list.sh and the file list in the .d file - 1
messages, 1 author
http://groups.google.com/group/linux.kernel/t/1e1bf0526ec061ce?hl=en
* data consistency of high page - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/1ce9c73efb251070?hl=en
* netpoll: add generic support for bridge and bonding devices - 1 messages, 1
author
http://groups.google.com/group/linux.kernel/t/716f4d07472adb41?hl=en

==============================================================================
TOPIC: cpufreq: Add APERF/MPERF support for AMD processors
http://groups.google.com/group/linux.kernel/t/f0aa58b5e6d920fb?hl=en
==============================================================================

== 1 of 7 ==
Date: Tues, Mar 23 2010 4:30 am
From: Thomas Renninger

On Monday 22 March 2010 19:38:39 Borislav Petkov wrote:
> From: Mark Langsdorf <mark.langsdorf@amd.com>
>
> Starting with model 10 of Family 0x10, AMD processors may have
> support for APERF/MPERF. Add support for identifying it and using
> it within cpufreq. Move the APERF/MPERF functions out of the
> acpi-cpufreq code and into their own file so they can easily be
> shared.
>
> Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
> ---
> arch/x86/kernel/cpu/amd.c | 6 +++
> arch/x86/kernel/cpu/cpufreq/Makefile | 4 +-
> arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 44 +-----------------------
> arch/x86/kernel/cpu/cpufreq/mperf.c | 50 ++++++++++++++++++++++++++++
> arch/x86/kernel/cpu/cpufreq/mperf.h | 9 +++++
> arch/x86/kernel/cpu/cpufreq/powernow-k8.c | 8 ++++
> 6 files changed, 77 insertions(+), 44 deletions(-)
> create mode 100644 arch/x86/kernel/cpu/cpufreq/mperf.c
> create mode 100644 arch/x86/kernel/cpu/cpufreq/mperf.h
>
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index e485825..796f662 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -537,6 +537,12 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
> set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
> }
>
> + if (c->cpuid_level >= 6) {
> + unsigned ecx = cpuid_ecx(6);
> + if (ecx & 0x01)
> + set_cpu_cap(c, X86_FEATURE_APERFMPERF);
> + }
This is nearly identical (apart from the c->cpuid_level > 6 test) to
the code in arch/x86/kernel/cpu/intel.c:
if (c->cpuid_level > 6) {
unsigned ecx = cpuid_ecx(6);
if (ecx & 0x01)
set_cpu_cap(c, X86_FEATURE_APERFMPERF);
}
I expect you are correct... or could it get moved to general x86 init code?

Thomas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
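
For reference, the CPUID check discussed in this message can be tried from user space. The following is a minimal sketch, assuming GCC or Clang on x86 (for <cpuid.h>); cpu_has_aperfmperf() is a hypothetical helper name, not a kernel function. __get_cpuid() accepts leaf 6 whenever the maximum basic leaf is at least 6, which corresponds to the ">= 6" form of the test.

```c
#include <stdbool.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/*
 * Hypothetical user-space helper (not kernel code): returns true if
 * CPUID leaf 6 reports the APERF/MPERF MSRs via ECX bit 0.
 */
static bool cpu_has_aperfmperf(void)
{
#if defined(__i386__) || defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 when leaf 6 is above the max basic leaf. */
	if (!__get_cpuid(6, &eax, &ebx, &ecx, &edx))
		return false;

	return (ecx & 0x01) != 0;
#else
	/* Not an x86 build: the feature bit does not exist here. */
	return false;
#endif
}
```

Because the test depends only on CPUID output and not on the vendor, the same few lines would serve both the Intel and the AMD init paths.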

== 2 of 7 ==
Date: Tues, Mar 23 2010 4:50 am
From: Borislav Petkov

From: Thomas Renninger <trenn@suse.de>
Date: Tue, Mar 23, 2010 at 12:07:29PM +0100

> > +#define define_one_global_ro(_name) \
> > +static struct global_attr _name = \
> > +__ATTR(_name, 0444, show_##_name, NULL)
> > +
> > +#define define_one_global_rw(_name) \
> > +static struct global_attr _name = \
> > +__ATTR(_name, 0644, show_##_name, store_##_name)
>
> These sound like too general names in global space.
> And are unrelated to cpufreq(.h).

maybe call them cpufreq_define_(global|freq)_* then?

> Eventually you get them into sysfs.h with another name
> or just duplicate them?

Well, struct freq_attr for example is cpufreq-specific attribute,
AFAICT. So, keeping them in cpufreq.h should be fine, no?

--
Regards/Gruss,
Boris.

--
Advanced Micro Devices, Inc.
Operating Systems Research Center
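
To make the naming question concrete, here is a user-space sketch of what such a helper macro expands to. Everything prefixed demo_ or DEMO_ is a simplified stand-in invented for illustration (the kernel's __ATTR and struct global_attr have a different layout); only the token-pasting pattern is taken from the quoted patch.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for a sysfs-style attribute (assumption). */
struct demo_attr {
	const char *name;
	unsigned int mode;
	int (*show)(char *buf, size_t len);
	int (*store)(const char *buf, size_t len);
};

/* Stand-in for __ATTR(): stringify the name, record mode and callbacks. */
#define DEMO_ATTR(_name, _mode, _show, _store) \
	{ .name = #_name, .mode = (_mode), .show = (_show), .store = (_store) }

/* Same shape as define_one_global_ro(), but with a subsystem prefix. */
#define demo_define_one_global_ro(_name) \
static struct demo_attr _name = \
DEMO_ATTR(_name, 0444, show_##_name, NULL)

/* The macro expects a show_<name>() function to exist already. */
static int show_boost(char *buf, size_t len)
{
	return snprintf(buf, len, "%d\n", 1);
}

/* Expands to: static struct demo_attr boost = { .name = "boost", ... }; */
demo_define_one_global_ro(boost);
```

The macro introduces a file-scope object called boost and silently requires show_boost(), which is why an unprefixed macro name in a widely included header can be surprising.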

== 3 of 7 ==
Date: Tues, Mar 23 2010 5:00 am
From: Thomas Renninger

On Tuesday 23 March 2010 12:44:09 Borislav Petkov wrote:
> From: Thomas Renninger <trenn@suse.de>
> Date: Tue, Mar 23, 2010 at 12:07:29PM +0100
>
> > > +#define define_one_global_ro(_name) \
> > > +static struct global_attr _name = \
> > > +__ATTR(_name, 0444, show_##_name, NULL)
> > > +
> > > +#define define_one_global_rw(_name) \
> > > +static struct global_attr _name = \
> > > +__ATTR(_name, 0644, show_##_name, store_##_name)
> >
> > These sound like too general names in global space.
> > And are unrelated to cpufreq(.h).
>
> maybe call them cpufreq_define_(global|freq)_* then?
>
> > Eventually you get them into sysfs.h with another name
> > or just duplicate them?
>
> Well, struct freq_attr for example is cpufreq-specific attribute,
> AFAICT. So, keeping them in cpufreq.h should be fine, no?
You don't need much of these (one or two?).
I'd leave this cleanup out for your patch series.
You care about the boost and aperf/mperf stuff and not about this
cleanup?

== 4 of 7 ==
Date: Tues, Mar 23 2010 5:00 am
From: Borislav Petkov

(Adding hpa to Cc)

From: Thomas Renninger <trenn@suse.de>
Date: Tue, Mar 23, 2010 at 12:17:16PM +0100

> > +/* core performance boost */
> > +static bool cpb_capable, cpb_disabled;
> What about using a cpufeature (arch/x86/include/asm/cpufeature.h)
> instead of cpb_capable. Then people could see this feature in
> /proc/cpuinfo and other code parts could check for it easily if
> needed later.

I don't have a problem with that per se. It's just that /proc/cpuinfo
is a widely used interface and, AFAIR, changing it is not taken that
lightly.

Peter, what do you think?

> It could already be set in arch/x86/kernel/cpu/amd.c and
> powernow-k8 could use cpu_has(cpu, X86_FEATURE_CPB);

I'd still like to cache the cpb_capable value locally instead of getting
x86_cpuinfo percpu var and querying it. Especially if this happens often
and not only at driver init.

> Instead of cpb_disabled, I'd use cpb_enabled. Checking for
> !cpb_disabled whether it's enabled, is ugly to read.

Fair enough. cpb_disabled reflects the bit semantics in the MSR, but why
not; it doesn't matter to me either way.

[..]

> > + _cpb_toggle_msrs(t);
> > + printk(KERN_INFO PFX "Core Boosting enabled.\n");
> Always printk on every toggle?
> That should not happen often and a user might want to get notified if
> an app does this behind his back -> should be fine w/ or w/o, just not
> sure whether it's intended.

Well, actually, this should be on by default and the user or an app
shouldn't be fidgeting with it all the time. It's there only for
benchmarking purposes so that you can disable it when you really have
to. But I guess it actually is going to get used if it's there, so maybe
we should rethink our approach. Hmm...

--
Regards/Gruss,
Boris.

== 5 of 7 ==
Date: Tues, Mar 23 2010 5:00 am
From: Borislav Petkov

From: Thomas Renninger <trenn@suse.de>
Date: Tue, Mar 23, 2010 at 12:26:22PM +0100

> On Monday 22 March 2010 19:38:39 Borislav Petkov wrote:
> > From: Mark Langsdorf <mark.langsdorf@amd.com>
> >
> > Starting with model 10 of Family 0x10, AMD processors may have
> > support for APERF/MPERF. Add support for identifying it and using
> > it within cpufreq. Move the APERF/MPERF functions out of the
> > acpi-cpufreq code and into their own file so they can easily be
> > shared.
> >
> > Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
> > Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
> > ---
> > arch/x86/kernel/cpu/amd.c | 6 +++
> > arch/x86/kernel/cpu/cpufreq/Makefile | 4 +-
> > arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 44 +-----------------------
> > arch/x86/kernel/cpu/cpufreq/mperf.c | 50 ++++++++++++++++++++++++++++
> > arch/x86/kernel/cpu/cpufreq/mperf.h | 9 +++++
> > arch/x86/kernel/cpu/cpufreq/powernow-k8.c | 8 ++++
> > 6 files changed, 77 insertions(+), 44 deletions(-)
> > create mode 100644 arch/x86/kernel/cpu/cpufreq/mperf.c
> > create mode 100644 arch/x86/kernel/cpu/cpufreq/mperf.h
> >
> > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > index e485825..796f662 100644
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -537,6 +537,12 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
> > set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
> > }
> >
> > + if (c->cpuid_level >= 6) {
> > + unsigned ecx = cpuid_ecx(6);
> > + if (ecx & 0x01)
> > + set_cpu_cap(c, X86_FEATURE_APERFMPERF);
> > + }
> This is nearly identical (apart from the c->cpuid_level > 6 test) to
> the code in arch/x86/kernel/cpu/intel.c:
> if (c->cpuid_level > 6) {
> unsigned ecx = cpuid_ecx(6);
> if (ecx & 0x01)
> set_cpu_cap(c, X86_FEATURE_APERFMPERF);
> }
> I expect you are correct... or could it get moved to general x86 init code?

Sounds good...

--
Regards/Gruss,
Boris.

== 6 of 7 ==
Date: Tues, Mar 23 2010 5:00 am
From: Thomas Renninger

On Monday 22 March 2010 19:38:41 Borislav Petkov wrote:
> From: Borislav Petkov <borislav.petkov@amd.com>
>
> Modify the scaling_cur_freq interface to show the actual core frequency
> when boosting is supported and enabled on the core.
This looks wrong.

scaling_cur_freq should show the frequency the kernel/cpufreq
subsystem thinks the core is in.
You show the average frequency instead, and the time window of that
average depends on when the cpufreq subsystem last called getavg().
Conversely, the time frame of the average the cpufreq subsystem
gets from getavg() now depends on whether and how often userspace
reads scaling_cur_freq, which influences the switching policy.

Latest cpufrequtils (ver 006) supports cpufreq-aperf to check whether
cores enter boost mode. Len Brown afaik also has a userspace tool, but
if it has any advantages, it should IMO get integrated into cpufrequtils
which people know to use when looking at cpufreq.

I once thought about adding scaling_avg_freq with its own
aperf/mperf counter, but you don't know whether another app read out the
average freq in between, in which case your expected time frame is wrong.
You could remember aperf/mperf per pid and free the saved aperf/mperf value
when the process dies..., but why bother if this can be read out in userspace?
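
The dependence on the sampling window can be illustrated with a small user-space sketch. It assumes the usual reading of the counters (APERF counts at the actual frequency, MPERF at a fixed reference frequency) and uses made-up counter values; struct perf_sample and avg_freq_khz() are hypothetical names, not the kernel's getavg() implementation.

```c
#include <stdint.h>

/* One snapshot of the two counters (hypothetical container). */
struct perf_sample {
	uint64_t aperf;
	uint64_t mperf;
};

/*
 * Average frequency between two snapshots, scaled from the reference
 * frequency base_khz. Sketch only: the multiplication can overflow for
 * very long windows, which real code would have to guard against.
 */
static uint64_t avg_freq_khz(uint64_t base_khz,
			     struct perf_sample prev,
			     struct perf_sample now)
{
	uint64_t d_aperf = now.aperf - prev.aperf;
	uint64_t d_mperf = now.mperf - prev.mperf;

	if (d_mperf == 0)
		return base_khz;	/* no reference ticks elapsed */

	return base_khz * d_aperf / d_mperf;
}
```

With a 2000000 kHz reference, the snapshots (0, 0), (1500, 1000) and (2500, 2000) give 3000000 kHz for the first interval, 2000000 kHz for the second and 2500000 kHz over both. Whoever resets the shared "previous" snapshot in between changes what the next reader sees.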

== 7 of 7 ==
Date: Tues, Mar 23 2010 5:10 am
From: Borislav Petkov

From: Thomas Renninger <trenn@suse.de>
Date: Tue, Mar 23, 2010 at 12:55:30PM +0100

> > Well, struct freq_attr for example is cpufreq-specific attribute,
> > AFAICT. So, keeping them in cpufreq.h should be fine, no?
> You don't need much of these (one or two?).

I don't think I get what you mean here..?

> I'd leave this cleanup out for your patch series.
> You care about the boost and aperf/mperf stuff and not about this
> cleanup?

Why? What's wrong with cleaning up obviously duplicated code?

--
Regards/Gruss,
Boris.

==============================================================================
TOPIC: smp_call_function_many SMP race
http://groups.google.com/group/linux.kernel/t/90a370c7d8dd310f?hl=en
==============================================================================

== 1 of 2 ==
Date: Tues, Mar 23 2010 4:30 am
From: Anton Blanchard

I noticed a failure where we hit the following WARN_ON in
generic_smp_call_function_interrupt:

if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
continue;

data->csd.func(data->csd.info);

refs = atomic_dec_return(&data->refs);
WARN_ON(refs < 0); <-------------------------

We atomically tested and cleared our bit in the cpumask, and yet the number
of cpus left (ie refs) was 0. How can this be?

It turns out commit c0f68c2fab4898bcc4671a8fb941f428856b4ad5 (generic-ipi:
cleanup for generic_smp_call_function_interrupt()) is at fault. It removes
locking from smp_call_function_many and in doing so creates a rather
complicated race.

The problem comes about because:

- The smp_call_function_many interrupt handler walks call_function.queue
without any locking.
- We reuse a percpu data structure in smp_call_function_many.
- We do not wait for any RCU grace period before starting the next
smp_call_function_many.

Imagine a scenario where CPU A does two smp_call_functions back to back, and
CPU B does an smp_call_function in between. We concentrate on how CPU C handles
the calls:

CPU A                    CPU B                    CPU C

smp_call_function
                                                  smp_call_function_interrupt
                                                  walks call_function.queue
                                                  sees CPU A on list

                         smp_call_function

                                                  smp_call_function_interrupt
                                                  walks call_function.queue
                                                  sees (stale) CPU A on list
smp_call_function
reuses percpu *data
set data->cpumask
                                                  sees and clears bit in cpumask!
                                                  sees data->refs is 0!

set data->refs (too late!)

The important thing to note is since the interrupt handler walks a potentially
stale call_function.queue without any locking, then another cpu can view the
percpu *data structure at any time, even when the owner is in the process
of initialising it.

The following test case hits the WARN_ON 100% of the time on my PowerPC box
(having 128 threads does help :)

#include <linux/module.h>
#include <linux/init.h>

#define ITERATIONS 100

static void do_nothing_ipi(void *dummy)
{
}

static void do_ipis(struct work_struct *dummy)
{
	int i;

	for (i = 0; i < ITERATIONS; i++)
		smp_call_function(do_nothing_ipi, NULL, 1);

	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
}

static struct work_struct work[NR_CPUS];

static int __init testcase_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		INIT_WORK(&work[cpu], do_ipis);
		schedule_work_on(cpu, &work[cpu]);
	}

	return 0;
}

static void __exit testcase_exit(void)
{
}

module_init(testcase_init)
module_exit(testcase_exit)
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");

I tried to fix it by ordering the read and the write of ->cpumask and ->refs.
In doing so I missed a critical case but Paul McKenney was able to spot
my bug thankfully :) To ensure we aren't viewing previous iterations, the
interrupt handler needs to read ->refs then ->cpumask then ->refs _again_.

Thanks to Milton Miller and Paul McKenney for helping to debug this issue.

---

My head hurts. This needs some serious analysis before we can be sure it
fixes all the races. With all these memory barriers, maybe the previous
spinlocks weren't so bad after all :)

Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c 2010-03-23 05:09:08.000000000 -0500
+++ linux-2.6/kernel/smp.c 2010-03-23 06:12:40.000000000 -0500
@@ -193,6 +193,31 @@ void generic_smp_call_function_interrupt
 	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
 		int refs;
 
+		/*
+		 * Since we walk the list without any locks, we might
+		 * see an entry that was completed, removed from the
+		 * list and is in the process of being reused.
+		 *
+		 * Just checking data->refs then data->cpumask is not good
+		 * enough because we could see a non zero data->refs from a
+		 * previous iteration. We need to check data->refs, then
+		 * data->cpumask then data->refs again. Talk about
+		 * complicated!
+		 */
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
+		smp_rmb();
+
+		if (!cpumask_test_cpu(cpu, data->cpumask))
+			continue;
+
+		smp_rmb();
+
+		if (atomic_read(&data->refs) == 0)
+			continue;
+
 		if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
 			continue;
 
@@ -446,6 +471,14 @@ void smp_call_function_many(const struct
 	data->csd.info = info;
 	cpumask_and(data->cpumask, mask, cpu_online_mask);
 	cpumask_clear_cpu(this_cpu, data->cpumask);
+
+	/*
+	 * To ensure the interrupt handler gets an up to date view
+	 * we order the cpumask and refs writes and order the
+	 * read of them in the interrupt handler.
+	 */
+	smp_wmb();
+
 	atomic_set(&data->refs, cpumask_weight(data->cpumask));
 
 	raw_spin_lock_irqsave(&call_function.lock, flags);
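
The ordering the patch relies on can be mirrored in user space with C11 atomics. This is only a sketch of the check sequence (refs, cpumask, refs again, then test-and-clear), using fences in the role of smp_wmb()/smp_rmb(); struct call_data, publish(), claim() and finish() are hypothetical names, the mask is a single unsigned long, and nothing here shows that the kernel patch closes every race.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reused per-cpu call data. */
struct call_data {
	atomic_ulong cpumask;	/* one bit per target CPU */
	atomic_int refs;	/* CPUs that still have to run the function */
};

static void call_data_init(struct call_data *d)
{
	atomic_init(&d->cpumask, 0UL);
	atomic_init(&d->refs, 0);
}

/* Count set bits; stands in for cpumask_weight(). */
static int weight(unsigned long mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Sender side: write the mask, then a write barrier, then refs. */
static void publish(struct call_data *d, unsigned long mask)
{
	atomic_store_explicit(&d->cpumask, mask, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* role of smp_wmb() */
	atomic_store_explicit(&d->refs, weight(mask), memory_order_relaxed);
}

/* Handler side: refs, cpumask, refs again, then test-and-clear our bit. */
static bool claim(struct call_data *d, unsigned int cpu)
{
	unsigned long bit = 1UL << cpu;

	if (atomic_load_explicit(&d->refs, memory_order_relaxed) == 0)
		return false;
	atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */

	if (!(atomic_load_explicit(&d->cpumask, memory_order_relaxed) & bit))
		return false;
	atomic_thread_fence(memory_order_acquire);	/* role of smp_rmb() */

	if (atomic_load_explicit(&d->refs, memory_order_relaxed) == 0)
		return false;

	return (atomic_fetch_and(&d->cpumask, ~bit) & bit) != 0;
}

/* After running the function: drop our reference, return what is left. */
static int finish(struct call_data *d)
{
	return atomic_fetch_sub(&d->refs, 1) - 1;
}
```

A handler that sees refs == 0 skips the entry instead of touching a mask that may belong to the next call; the WARN_ON in the report corresponds to finish() returning a negative value.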

== 2 of 2 ==
Date: Tues, Mar 23 2010 5:30 am
From: Peter Zijlstra

On Tue, 2010-03-23 at 22:15 +1100, Anton Blanchard wrote:
>
> It turns out commit c0f68c2fab4898bcc4671a8fb941f428856b4ad5 (generic-ipi:
> cleanup for generic_smp_call_function_interrupt()) is at fault. It removes
> locking from smp_call_function_many and in doing so creates a rather
> complicated race.

A rather simple question since my brain isn't quite ready processing the
content here..

Isn't reverting that one patch a simpler solution than adding all that
extra logic? If not, then the above statement seems false and we had a
bug even with that preempt_enable/disable() pair.

Just wondering.. :-)

==============================================================================
TOPIC: KSM & hugepages
http://groups.google.com/group/linux.kernel/t/940cfb5fd65ac16f?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 4:40 am
From: Michael Tokarev

Hello.

I noticed an interesting thing here, with qemu-kvm, KSM and
hugepages.

When I initially enabled KSM, for my two windows guests I've
seen ~100 000 pages in /sys/kernel/mm/ksm/pages_shared .
That's quite good, and overall memory usage improved.

Now, I also enabled hugepages in kvm, which speeds things
up quite significantly (the speedup is noticeable).

But now, when both KSM and hugepages are activated, I don't
see KSM in action anymore. /sys/../mm/pages_shared shows
56 pages, which is nothing.

So I wonder what's up:
o that's 56 _huge_ pages (which means the actual saving
is 56*2M = 112MB, which isn't really bad). If that's
the case, /sys/../mm/ interface lacks proper units
reporting;
o due to large pages there's much less chance to find
two pages with identical contents, so very little can
be shared;
o KSM does not scan hugepages at all
o something else.
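
The unit question in the first item is simple arithmetic, sketched below. The 4 KiB base page and 2 MiB huge page sizes are assumptions (typical for x86); pages_to_bytes() is a hypothetical helper, and the sketch does not say which interpretation the counter actually uses.

```c
#include <stdint.h>

#define BASE_PAGE_SIZE	(4ULL * 1024)		/* assumed 4 KiB base page */
#define HUGE_PAGE_SIZE	(2ULL * 1024 * 1024)	/* assumed 2 MiB huge page */

/* Convert a page count into bytes for a given page size. */
static uint64_t pages_to_bytes(uint64_t pages, uint64_t page_size)
{
	return pages * page_size;
}
```

Read as huge pages, 56 pages are 117440512 bytes (112 MiB); read as base pages they are only 229376 bytes (224 KiB). The earlier figure of about 100 000 base pages corresponds to 409600000 bytes, roughly 390 MiB.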

What is the issue here?

Thanks!

/mjt

==============================================================================
TOPIC: .gitignore: ignore *.lzo files
http://groups.google.com/group/linux.kernel/t/097186c3c77a5979?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 4:40 am
From: Michal Marek

On 17.3.2010 19:52, Philipp Kohlbecher wrote:
> Ignore files compressed with lzop.
>
> Signed-off-by: Philipp Kohlbecher <xt28@gmx.de>

Applied, thanks.

Michal

==============================================================================
TOPIC: kconfig: recalc symbol value before showing search results
http://groups.google.com/group/linux.kernel/t/589540bdd321a019?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 4:40 am
From: Michal Marek

On 19.3.2010 07:57, Li Zefan wrote:
> A symbol's value won't be recalc-ed until we save config file or
> enter the menu where the symbol sits.
>
> So if I enable OPTIMIZE_FOR_SIZE, and search FUNCTION_GRAPH_TRACER:
>
> Symbol: FUNCTION_GRAPH_TRACER [=y]
> Prompt: Kernel Function Graph Tracer
> Defined at kernel/trace/Kconfig:140
> Depends on: ... [=y] && (!X86_32 [=y] || !CC_OPTIMIZE_FOR_SIZE [=y])
> ...
>
> From the dependency it should result in FUNCTION_GRAPH_TRACER=n,
> but it still shows FUNCTION_GRAPH_TRACER=y.
>
> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>

Nice, applied.
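
The expected result in the report follows from how Kconfig evaluates tristate expressions (n = 0, m = 1, y = 2; "!" is 2 - x, "&&" is min, "||" is max). A minimal sketch; the helper names are hypothetical, and the elided "..." part of the dependency is taken as y, as the "[=y]" in the search output suggests.

```c
/* Tristate values as Kconfig orders them: n < m < y. */
enum tristate { TRI_NO = 0, TRI_MOD = 1, TRI_YES = 2 };

static enum tristate tri_not(enum tristate a)
{
	return (enum tristate)(2 - a);
}

static enum tristate tri_and(enum tristate a, enum tristate b)
{
	return a < b ? a : b;	/* minimum */
}

static enum tristate tri_or(enum tristate a, enum tristate b)
{
	return a > b ? a : b;	/* maximum */
}

/* rest && (!X86_32 || !CC_OPTIMIZE_FOR_SIZE) */
static enum tristate graph_tracer_dep(enum tristate rest,
				      enum tristate x86_32,
				      enum tristate optimize_for_size)
{
	return tri_and(rest, tri_or(tri_not(x86_32),
				    tri_not(optimize_for_size)));
}
```

With X86_32=y and CC_OPTIMIZE_FOR_SIZE=y the dependency evaluates to n, so a symbol still displayed as =y is stale until its value is recalculated.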

==============================================================================
TOPIC: Can not boot with CONFIG_NO_BOOTMEM=y on i686
http://groups.google.com/group/linux.kernel/t/82c8d53557498245?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 4:40 am
From: Stanislaw Gruszka

On Sat, Mar 20, 2010 at 11:26:06AM -0700, Yinghai Lu wrote:
> > After updating to 2.6.34-rc1, I experienced strange oopses during
> > boot that looked like memory corruption. Bisection shows that the first bad
> > commit is 59be5a8e8ce765cf739ec7f07176219972de7481 ("x86: Make 32bit
> > support NO_BOOTMEM"). When I disable CONFIG_NO_BOOTMEM I'm able to start
> > the system. Not sure what info is needed to track down this issue, so please
> > let me know.
>
> can you check patch
>
> https://patchwork.kernel.org/patch/87081/

The patch helps somewhat. Instead of many random oopses, I now get one and
the same oops; here is a photo:
http://people.redhat.com/sgruszka/20100322_001.jpg

Oops is in pcpu_alloc+0x1aa, in code this is

(gdb) l *(pcpu_alloc +0x1aa)
0xc04c2272 is in prefetch (/mnt/rhel5/usr/src/kernels/linux-2.6-debuginfo/arch/x86/include/asm/processor.h:886).
881 * It's not worth to care about 3dnow prefetches for the K6
882 * because they are microcoded there and very slow.
883 */
884 static inline void prefetch(const void *x)
885 {
886 alternative_input(BASE_PREFETCH,
887 "prefetchnta (%1)",
888 X86_FEATURE_XMM,
889 "r" (x));
890 }
(gdb) l *(pcpu_alloc +0x1a0)
0xc04c2268 is in pcpu_alloc (mm/percpu.c:1137).
1132 */
1133 goto restart;
1134 }
1135
1136 off = pcpu_alloc_area(chunk, size, align);
1137 if (off >= 0)
1138 goto area_found;
1139 }
1140 }
1141
(gdb) l *(pcpu_alloc +0x1b0)
0xc04c2278 is in pcpu_alloc (mm/percpu.c:1116).
1111 }
1112
1113 restart:
1114 /* search through normal chunks */
1115 for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
1116 list_for_each_entry(chunk, &pcpu_slot[slot], list) {
1117 if (size > chunk->contig_hint)
1118 continue;
1119
1120 new_alloc = pcpu_need_to_extend(chunk);

So it seems pcpu_slot[slot] is somehow corrupted. Looking further,
pcpu_slot is allocated by:

pcpu_slot = alloc_bootmem(pcpu_nr_slots * sizeof(pcpu_slot[0]));

So we still have some problem with CONFIG_NO_BOOTMEM on 32-bit.

Stanislaw

==============================================================================
TOPIC: RFC: direct MTD support for SquashFS
http://groups.google.com/group/linux.kernel/t/ffc7606040d00b24?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 4:40 am
From: Ferenc Wagner

> A couple of specific comments...
>
> +/* A backend is initialized for each SquashFS block read operation,
> + * making further sequential reads possible from the block.
> + */
> +static void *bdev_init(struct squashfs_sb_info *msblk, u64 index,
> size_t length)
> +{
> + struct squashfs_bdev *bdev = msblk->backend_data;
> + struct buffer_head *bh;
> +
> + bh = kcalloc((msblk->block_size >> bdev->devblksize_log2) + 1,
> + sizeof(*bh), GFP_KERNEL);
>
> You should alloc against the larger of msblk->block_size and
> METADATA_SIZE (8 Kbytes). Block_size could be 4 Kbytes only.

I plugged in a max(). Couldn't that trailing +1 be converted into a +2
like this?

bh = kcalloc((max(msblk->block_size, METADATA_SIZE) + 2) >> bdev->devblksize_log2
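
One way to sanity-check a buffer-head count is to compute how many device blocks a read can touch. A minimal sketch; blocks_spanned() is a hypothetical helper, and the sizes in the note below are example values, not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of device blocks touched by a read of `length` bytes (> 0)
 * starting at byte `offset`, for a device block size of
 * 1 << devblksize_log2 bytes.
 */
static size_t blocks_spanned(uint64_t offset, size_t length,
			     unsigned int devblksize_log2)
{
	uint64_t first = offset >> devblksize_log2;
	uint64_t last = (offset + length - 1) >> devblksize_log2;

	return (size_t)(last - first + 1);
}
```

For example, with 1 KiB device blocks (devblksize_log2 = 10), an 8192-byte read starting on a block boundary touches 8 blocks, while the same read starting one byte into a block touches 9. Any allocation expression has to cover the unaligned case.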

> +static int fill_bdev_super(struct super_block *sb, void *data, int silent)
>
> This function looks rather 'back-to-front' to me. I'm assuming that
> squashfs_fill_super2() will be the current fill superblock function?
> This function wants to read data off the filesystem through the
> backend, and yet the backend (bdev, mblk->backend_data) hasn't been
> initialised when it's called...

I solved it by introducing a callback function for adding the backend.
That may be overkill, but it seems to give the most shared code.

The attached patch series survived some testing here. My only doubt:
the current backend interface necessitates a memory copy from the buffer
heads. This is no problem for mtd and lzma which copy the data anyway,
but makes this code less efficient in the bdev+zlib case.

I've got one more patch, which I forgot to export, to pull out the
common logic from the backend init functions back into squashfs_read_data().
With the bdev backend, that entails reading the first block twice in a
row most of the time. This again could be worked around by extending
the backend interface, but I'm not sure if it's worth it.

How does this look now?
--
Regards,
Feri.

==============================================================================
TOPIC: intel-agp.c: Fix crash when accessing nonexistent GTT entries in i915
http://groups.google.com/group/linux.kernel/t/bcbed4037b6559cf?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 4:50 am
From: Miguel Ojeda

On Tue, Mar 23, 2010 at 5:14 AM, Christian Kujau <lists@nerdbynature.de> wrote:
> On Mon, 22 Mar 2010 at 20:57, Andrew Morton wrote:
>> On Sun, 21 Mar 2010 16:30:20 +0100 Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> wrote:
>> > I bisected in order to find the commit 5877960869333e42ebeb733e8d9d5630ff96d350.
>
> I believe this[0] is fc61901373987ad61851ed001fe971f3ee8d96a3 upstream:

Indeed. Also in

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.32.y.git;a=commit;h=fc61901373987ad61851ed001fe971f3ee8d96a3

>
> --------
> agp/intel-agp: Clear entire GTT on startup
>
> Some BIOSes fail to initialise the GTT, which will cause DMA faults when
> the IOMMU is enabled. We need to clear the whole thing to point at the
> scratch page, not just the part that Linux is going to use.
>
> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
> [anholt: Note that this may also help with stability in the presence of
> driver bugs, by not drawing to memory we don't own]
> Signed-off-by: Eric Anholt <eric@anholt.net>
> --------
>
> Christian.
>
> [0] http://github.com/pfactum/pf-kernel/commit/5877960869333e42ebeb733e8d9d5630ff96d350
> --
> BOFH excuse #384:
>
> it's an ID-10-T error
>

==============================================================================
TOPIC: em28xx: "Empia Em28xx Audio" too long
http://groups.google.com/group/linux.kernel/t/2f36a4c45f7a79b4?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 4:50 am
From: Dan Carpenter

card->driver is 15 characters and a NUL. The original code
goes past the end of the array.

Signed-off-by: Dan Carpenter <error27@gmail.com>
---
V2: Takashi Iwai asked me to change the space to a hyphen since this is
used as an identifier in alsa-lib.

diff --git a/drivers/media/video/em28xx/em28xx-audio.c b/drivers/media/video/em28xx/em28xx-audio.c
index bd78338..e182abf 100644
--- a/drivers/media/video/em28xx/em28xx-audio.c
+++ b/drivers/media/video/em28xx/em28xx-audio.c
@@ -491,7 +491,7 @@ static int em28xx_audio_init(struct em28xx *dev)
 	strcpy(pcm->name, "Empia 28xx Capture");

 	snd_card_set_dev(card, &dev->udev->dev);
-	strcpy(card->driver, "Empia Em28xx Audio");
+	strcpy(card->driver, "Em28xx-Audio");
 	strcpy(card->shortname, "Em28xx Audio");
 	strcpy(card->longname, "Empia Em28xx Audio");
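
The overflow can be checked in user space. A minimal sketch, taking the 16-byte size from the description above (15 characters plus the terminator); fits() and copy_bounded() are hypothetical helpers, and the bounded copy is shown only as an alternative pattern, not as what the patch does.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* 15 characters plus the terminating NUL, per the description above. */
#define DRIVER_LEN 16

/* True if s, including its terminator, fits into cap bytes. */
static bool fits(const char *s, size_t cap)
{
	return strlen(s) < cap;
}

/* Copy with truncation; returns false if src had to be cut short. */
static bool copy_bounded(char *dst, size_t cap, const char *src)
{
	int n = snprintf(dst, cap, "%s", src);

	return n >= 0 && (size_t)n < cap;
}
```

"Empia Em28xx Audio" is 18 characters and needs 19 bytes, so it does not fit; "Em28xx-Audio" is 12 characters and does.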


==============================================================================
TOPIC: [PATCH 10/18] gfs2: Provide config option for enabling trace points
http://groups.google.com/group/linux.kernel/t/ccd49f7aa77bdcd3?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 5:00 am
From: Steven Whitehouse

Hi,

Now in the GFS2 -nmw git tree. Thanks,

Steve.

On Tue, 2010-03-23 at 01:32 +0100, Jan Kara wrote:
> CC: cluster-devel@redhat.com
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> fs/gfs2/Kconfig | 8 ++++++++
> fs/gfs2/trace_gfs2.h | 2 ++
> 2 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/fs/gfs2/Kconfig b/fs/gfs2/Kconfig
> index 4dcddf8..872ced2 100644
> --- a/fs/gfs2/Kconfig
> +++ b/fs/gfs2/Kconfig
> @@ -38,3 +38,11 @@ config GFS2_FS_LOCKING_DLM
> Most users of GFS2 will require this. It provides the locking
> interface between GFS2 and the DLM, which is required to use GFS2
> in a cluster environment.
> +
> +config GFS2_TRACER
> + bool "GFS2 tracing"
> + depends on GFS2_FS && EVENT_TRACING
> + help
> + Provide trace points in block allocation functions, locking, and
> + journaling code.
> +
> diff --git a/fs/gfs2/trace_gfs2.h b/fs/gfs2/trace_gfs2.h
> index 148d55c..5f4faf3 100644
> --- a/fs/gfs2/trace_gfs2.h
> +++ b/fs/gfs2/trace_gfs2.h
> @@ -1,5 +1,7 @@
> #undef TRACE_SYSTEM
> +#undef TRACE_CONFIG
> #define TRACE_SYSTEM gfs2
> +#define TRACE_CONFIG CONFIG_GFS2_TRACER
>
> #if !defined(_TRACE_GFS2_H) || defined(TRACE_HEADER_MULTI_READ)
> #define _TRACE_GFS2_H


==============================================================================
TOPIC: About ACL for IPC Object
http://groups.google.com/group/linux.kernel/t/7b33ed8f36b8a44b?hl=en
==============================================================================

== 1 of 1 ==
Date: Tues, Mar 23 2010 5:00 am
From: Christoph Hellwig

On Tue, Mar 23, 2010 at 05:01:31PM +0800, zhou peng wrote:
> Hi,
>
> I have added ACL support to POSIX msg queue on linux kernel 2.6.32.
> Casey Schaufler, would you or anyone else like to review the patch for me, please?
> The patch is attached.

There is quite a lot of boilerplate code in the patch. You might want
to take a look at fs/generic_acl.c and how mm/shmem.c uses it to get
away with a lot less code.

==============================================================================
TOPIC: Export fragmentation index via /proc/extfrag_index
http://groups.google.com/group/linux.kernel/t/dcbcf3307d0ed626?hl=en
==============================================================================

== 1 of 4 ==
Date: Tues, Mar 23 2010 5:10 am
From: Mel Gorman

On Tue, Mar 23, 2010 at 09:22:04AM +0900, KOSAKI Motohiro wrote:
> > > > + /*
> > > > + * Index is between 0 and 1 so return within 3 decimal places
> > > > + *
> > > > + * 0 => allocation would fail due to lack of memory
> > > > + * 1 => allocation would fail due to fragmentation
> > > > + */
> > > > + return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > > > +}
> > >
> > > Dumb question.
> > > your paper (http://portal.acm.org/citation.cfm?id=1375634.1375641) says
> > > fragmentation_index = 1 - (TotalFree/SizeRequested)/BlocksFree
> > > but your code has an extra '1000+'. Why?
> >
> > To get an approximation to three decimal places.
>
> Do you mean this is poor man's round up logic?

Not exactly.

The intention is to have a value of 968 instead of 0.968231. i.e.
instead of a value between 0 and 1, it'll be a value between 0 and 1000
that matches the first three digits after the decimal place.
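
The scaling described above can be tried in userspace. What follows is a
minimal sketch, not the kernel code: struct page_info is a hypothetical
stand-in for the patch's struct contig_page_info (only the field names are
taken from the quoted code), and frag_index_milli() and format_index() are
names made up for this illustration. The arithmetic and the "%d.%03d" format
are copied from the patch.

```c
#include <stdio.h>

/* Hypothetical userspace stand-in for the patch's struct contig_page_info. */
struct page_info {
	unsigned long free_pages;
	unsigned long free_blocks_total;
	unsigned long free_blocks_suitable;
};

/* Same arithmetic as the quoted function: the index scaled by 1000. */
static int frag_index_milli(unsigned int order, const struct page_info *info)
{
	unsigned long requested = 1UL << order;

	if (!info->free_blocks_total)
		return 0;
	if (info->free_blocks_suitable)
		return -1000;	/* a suitable block exists, request would succeed */

	return 1000 - ((1000 + (info->free_pages * 1000 / requested)) /
		       info->free_blocks_total);
}

/* Render the scaled integer with the patch's format string, "%d.%03d". */
static int format_index(char *buf, size_t len, int index)
{
	return snprintf(buf, len, "%d.%03d", index / 1000, index % 1000);
}
```

With free_pages = 1000, free_blocks_total = 600, no suitable blocks and
order 9 (512 pages), the integer steps are 1000 * 1000 / 512 = 1953, then
(1000 + 1953) / 600 = 4, so the function returns 996 and format_index()
renders it as 0.996.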

> Why don't you use DIV_ROUND_UP? Like the following,
>
> return 1000 - (DIV_ROUND_UP(info->free_pages * 1000 / requested) / info->free_blocks_total);
>

Because it's not doing the same thing unless I missed something.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

== 2 of 4 ==
Date: Tues, Mar 23 2010 5:30 am
From: Mel Gorman

rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that between the page being isolated
from the LRU and rcu_read_lock() being taken, the mapcount of the page
dropped to 0 and the anon_vma gets freed. This can happen during memory
compaction if pages being migrated belong to a process that exits before
migration completes. Hence, the use-after-free race looks like

1. Page isolated for migration
2. Process exits
3. page_mapcount(page) drops to zero so anon_vma is no longer reliable
4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
5. call try_to_unmap, looks up the anon_vma and "locks" it but the lock
is garbage.

This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 98eaaf2..6eb1efe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -603,6 +603,19 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	 */
 	if (PageAnon(page)) {
 		rcu_read_lock();
+
+		/*
+		 * If the page has no mappings any more, just bail. An
+		 * unmapped anon page is likely to be freed soon but worse,
+		 * it's possible its anon_vma disappeared between when
+		 * the page was isolated and when we reached here while
+		 * the RCU lock was not held
+		 */
+		if (!page_mapcount(page)) {
+			rcu_read_unlock();
+			goto uncharge;
+		}
+
 		rcu_locked = 1;
 		anon_vma = page_anon_vma(page);
 		atomic_inc(&anon_vma->migrate_refcount);
--
1.6.5
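
The control flow of the fix (enter the read-side section, re-check the
mapcount, bail out before touching the anon_vma) can be exercised outside the
kernel. This is only a sketch under loud assumptions: every fake_* name and
pin_anon_vma() are hypothetical, the "lock" is a nesting counter rather than
RCU, and nothing here is concurrent. It shows the ordering of the check, not
the race itself.

```c
#include <stddef.h>

/*
 * Hypothetical stand-ins: these only track nesting depth so the control
 * flow can be followed in userspace. They are not the kernel's RCU.
 */
static int rcu_depth;
static void fake_rcu_read_lock(void)   { rcu_depth++; }
static void fake_rcu_read_unlock(void) { rcu_depth--; }

struct fake_anon_vma {
	int migrate_refcount;
};

struct fake_page {
	int mapcount;			/* number of mappings of the page */
	struct fake_anon_vma *anon_vma;	/* may be stale once mapcount is 0 */
};

/*
 * Returns the anon_vma with its refcount raised and the read-side section
 * still held, or NULL with the section released if the page has no
 * mappings left.
 */
static struct fake_anon_vma *pin_anon_vma(struct fake_page *page)
{
	fake_rcu_read_lock();

	/* Re-check inside the section, before dereferencing page->anon_vma. */
	if (!page->mapcount) {
		fake_rcu_read_unlock();
		return NULL;
	}

	page->anon_vma->migrate_refcount++;
	return page->anon_vma;
}
```

The point of the ordering is that the unmapped case returns before
page->anon_vma is dereferenced and leaves the read-side section balanced,
which mirrors the rcu_read_unlock() and goto uncharge pair in the hunk above.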

== 3 of 4 ==
Date: Tues, Mar 23 2010 5:30 am
From: Mel Gorman

Fragmentation index is a value that makes sense when an allocation of a
given size would fail. The index indicates whether an allocation failure is
due to a lack of memory (values towards 0) or due to external fragmentation
(value towards 1). For the most part, the huge page size will be the size
of interest but not necessarily so it is exported on a per-order and per-zone
basis via /proc/extfrag_index

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
Documentation/filesystems/proc.txt | 14 ++++++-
mm/vmstat.c | 81 +++++++++++++++++++++++++++++++++
2 files changed, 94 insertions(+), 1 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 5c4b0fb..582ff3d 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -421,6 +421,7 @@ Table 1-5: Kernel info in /proc
filesystems Supported filesystems
driver Various drivers grouped here, currently rtc (2.4)
execdomains Execdomains, related to security (2.4)
+ extfrag_index Additional page allocator information (see text) (2.5)
fb Frame Buffer devices (2.4)
fs File system parameters, currently nfs/exports (2.4)
ide Directory containing info about the IDE subsystem
@@ -610,7 +611,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc...

More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo, unusable_index and extfrag_index.

> cat /proc/pagetypeinfo
Page block order: 9
@@ -661,6 +662,17 @@ value between 0 and 1. The higher the value, the more of free memory is
unusable and by implication, the worse the external fragmentation is. This
can be expressed as a percentage by multiplying by 100.

+> cat /proc/extfrag_index
+Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
+Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
+
+The external fragmentation index, is only meaningful if an allocation
+would fail and indicates what the failure is due to. A value of -1 such as
+in many of the examples above states that the allocation would succeed.
+If it would fail, the value is between 0 and 1. A value tending towards
+0 implies the allocation failed due to a lack of memory. A value tending
+towards 1 implies it failed due to external fragmentation.
+
..............................................................................

meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ca42e10..7377da6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -553,6 +553,67 @@ static int unusable_show(struct seq_file *m, void *arg)
 	return 0;
 }

+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+	unsigned long requested = 1UL << order;
+
+	if (!info->free_blocks_total)
+		return 0;
+
+	/* Fragmentation index only makes sense when a request would fail */
+	if (info->free_blocks_suitable)
+		return -1000;
+
+	/*
+	 * Index is between 0 and 1 so return within 3 decimal places
+	 *
+	 * 0 => allocation would fail due to lack of memory
+	 * 1 => allocation would fail due to fragmentation
+	 */
+	return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
+}
+
+
+static void extfrag_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+
+	/* Alloc on stack as interrupts are disabled for zone walk */
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = fragmentation_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ */
+static int extfrag_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	walk_zones_in_node(m, pgdat, extfrag_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -722,6 +783,25 @@ static const struct file_operations unusable_file_ops = {
 	.release = seq_release,
 };

+static const struct seq_operations extfrag_op = {
+	.start = frag_start,
+	.next = frag_next,
+	.stop = frag_stop,
+	.show = extfrag_show,
+};
+
+static int extfrag_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &extfrag_op);
+}
+
+static const struct file_operations extfrag_file_ops = {
+	.open = extfrag_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release,
+};
+
#ifdef CONFIG_ZONE_DMA
#define TEXT_FOR_DMA(xx) xx "_dma",
#else
@@ -1067,6 +1147,7 @@ static int __init setup_vmstat(void)
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
 	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
+	proc_create("extfrag_index", S_IRUGO, NULL, &extfrag_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);

Tuesday, March 23, 2010

linux.kernel - 26 new messages in 16 topics - digest
