twitter: linux.kernel - 25 new messages in 18 topics

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

Today's topics:

* [PATCH] perf_events, x86: PEBS support - 3 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/0acdc85baea49653?hl=en
* Improving OOM killer - 3 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/389db2dcf6479d30?hl=en
* mm-count-lowmem-rss.patch removed from -mm tree - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/1585f52119cb313e?hl=en
* PCI: try enabling "pci=use_crs" again - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/a44a2df9f48247da?hl=en
* slub: ARCH_SLAB_MINALIGN defaults to 8 on x86_32. is this too big? - 1
message, 1 author
http://groups.google.com/group/linux.kernel/t/90562c9ef16cf2a0?hl=en
* x86/agp: fix agp_amd64_init module initialization regression - 1 message, 1
author
http://groups.google.com/group/linux.kernel/t/c0f99e4bfec74792?hl=en
* agpgart-amd64 not initialized in 2.6.33-rc5 if iommu=allowed in kernel
command line - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/b9c35e386ebb7a6d?hl=en
* ARM: Change the mandatory barriers implementation - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/1873254b2b8984f3?hl=en
* [PATCH] vmscan: balance local_irq_disable() and local_irq_enable() - 1
message, 1 author
http://groups.google.com/group/linux.kernel/t/9c6cfefcada8187f?hl=en
* Re-enabling non-GPL driver access to disk partition information - 1 message,
1 author
http://groups.google.com/group/linux.kernel/t/cdc0617079a6d776?hl=en
* ACPI / EC: Remove race between EC driver and suspend process (rev. 3) - 1
message, 1 author
http://groups.google.com/group/linux.kernel/t/3c6a9b49b5586034?hl=en
* syslog: use defined constants instead of raw numbers - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/e1594790606b7db6?hl=en
* inodes: Support generic defragmentation - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/b72566b846445e03?hl=en
* mxc: Fix Drive Strength Field in the IOMUX controller - 2 messages, 2
authors
http://groups.google.com/group/linux.kernel/t/281ca1a532ca6d76?hl=en
* geode: Fix cip/blk confusion - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/f62b3ba04ecdc0f9?hl=en
* [PATCH] PM: disable nonboot cpus before suspending devices - 2 messages, 2
authors
http://groups.google.com/group/linux.kernel/t/e1dc3aa81b42a297?hl=en
* v2 accelerate grace period if last non-dynticked CPU - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/54fa4fe37724c33d?hl=en
* Hi Waiting - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/e7a082ce30089a49?hl=en

==============================================================================
TOPIC: [PATCH] perf_events, x86: PEBS support
http://groups.google.com/group/linux.kernel/t/0acdc85baea49653?hl=en
==============================================================================

== 1 of 3 ==
Date: Wed, Feb 3 2010 4:00 pm
From: Stephane Eranian

On Thu, Feb 4, 2010 at 12:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, 2010-02-03 at 15:40 +0100, Peter Zijlstra wrote:
>>
>> If only they would reset the counter on overflow instead of on record,
>> that would solve quite a few issues I imagine.
>
> So I tried enabling the regular PMC overflow interrupt and reprogramming
> the counter from that, but touching the counter seems to destroy the
> PEBS assist, so much for that idea.
>
Yes, you have to leave the INT bit off, otherwise you get an
interrupt for each overflow, thus you lose the buffer advantage.
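
As a rough illustration of the buffer advantage mentioned here, the sketch
below compares how many interrupts a sampling run would take in the two
modes. It is only an arithmetic model, not kernel code: the function names
and the records_per_buffer parameter are hypothetical, and it assumes the
PEBS buffer raises one interrupt each time it fills.

```c
#include <stdint.h>

/* Hypothetical model: with the INT bit set, every counter overflow
 * raises its own interrupt. */
static uint64_t interrupts_per_overflow(uint64_t overflows)
{
        return overflows;
}

/* Hypothetical model: with the INT bit clear, one record is written per
 * overflow and an interrupt is assumed only when the buffer fills, i.e.
 * once every records_per_buffer records. A partially filled buffer is
 * not counted as an interrupt. */
static uint64_t interrupts_buffered(uint64_t overflows,
                                    uint64_t records_per_buffer)
{
        if (records_per_buffer == 0)
                return 0;       /* degenerate input: avoid dividing by zero */
        return overflows / records_per_buffer;
}
```

Under these assumptions, 1000 overflows cost 1000 interrupts with INT set,
but only 10 with a 100-record buffer.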
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 2 of 3 ==
Date: Wed, Feb 3 2010 4:10 pm
From: Peter Zijlstra

On Thu, 2010-02-04 at 00:51 +0100, Stephane Eranian wrote:
> On Thu, Feb 4, 2010 at 12:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, 2010-02-03 at 15:40 +0100, Peter Zijlstra wrote:
> >>
> >> If only they would reset the counter on overflow instead of on record,
> >> that would solve quite a few issues I imagine.
> >
> > So I tried enabling the regular PMC overflow interrupt and reprogramming
> > the counter from that, but touching the counter seems to destroy the
> > PEBS assist, so much for that idea.
> >
> Yes, you have to leave the INT bit off, otherwise you get an
> interrupt for each overflow, thus you lose the buffer advantage.

Well sure, but that's not the point. I was thinking that if we need to
do single event pebs anyway, we might as well try to reprogram on the
PMC overflow interrupt instead of on the PEBS overflow and curb some of
that drift.

Also, it makes keeping the event count value a lot easier. But alas.


== 3 of 3 ==
Date: Wed, Feb 3 2010 4:30 pm
From: Stephane Eranian

On Thu, Feb 4, 2010 at 1:03 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-02-04 at 00:51 +0100, Stephane Eranian wrote:
>> On Thu, Feb 4, 2010 at 12:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Wed, 2010-02-03 at 15:40 +0100, Peter Zijlstra wrote:
>> >>
>> >> If only they would reset the counter on overflow instead of on record,
>> >> that would solve quite a few issues I imagine.
>> >
>> > So I tried enabling the regular PMC overflow interrupt and reprogramming
>> > the counter from that, but touching the counter seems to destroy the
>> > PEBS assist, so much for that idea.
>> >
>> Yes, you have to leave the INT bit off, otherwise you get an
>> interrupt for each overflow, thus you lose the buffer advantage.
>
> Well sure, but that's not the point. I was thinking that if we need to
> do single event pebs anyway, we might as well try to reprogram on the
> PMC overflow interrupt instead of on the PEBS overflow and curb some of
> that drift.
>
With INT on, you get the interrupt on the first overflow and incur the
regular skid. There is nothing you can do to make PEBS better from
SW. The HW has to improve. I have reported those issues to Intel
a long time ago. They understand them quite well and I am hopeful
things will improve over time.

==============================================================================
TOPIC: Improving OOM killer
http://groups.google.com/group/linux.kernel/t/389db2dcf6479d30?hl=en
==============================================================================

== 1 of 3 ==
Date: Wed, Feb 3 2010 4:10 pm
From: David Rientjes

On Wed, 3 Feb 2010, Lubos Lunak wrote:

> > unsigned int badness(struct task_struct *p,
> >                      unsigned long totalram)
> > {
> >         struct task_struct *child;
> >         struct mm_struct *mm;
> >         int forkcount = 0;
> >         long points;
> >
> >         task_lock(p);
> >         mm = p->mm;
> >         if (!mm) {
> >                 task_unlock(p);
> >                 return 0;
> >         }
> >         points = (get_mm_rss(mm) +
> >                   get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
> >                  totalram;
> >         task_unlock(p);
> >
> >         list_for_each_entry(child, &p->children, sibling)
> >                 /* No lock, child->mm won't be dereferenced */
> >                 if (child->mm && child->mm != mm)
> >                         forkcount++;
> >
> >         /* Forkbombs get penalized 10% of available RAM */
> >         if (forkcount > 500)
> >                 points += 100;
>
> As far as I'm concerned, this is a huge improvement over the current code
> (and, incidentally :), quite close to what I originally wanted). I'd be
> willing to test it in few real-world desktop cases if you provide a patch.
>

There're some things that still need to be worked out, like discounting
hugetlb pages on each allowed node, respecting current's cpuset mems,
etc., but I think it gives us a good rough draft of where we might end up.
I did use the get_mm_rss() that you suggested, but I think it's more
helpful in the context of a fraction of total memory allowed so the other
heuristics (forkbomb, root tasks, nice'd tasks, etc) are penalizing the
points in a known quantity rather than a manipulation of that baseline.

Do you have any comments about the forkbomb detector or its threshold that
I've put in my heuristic? I think detecting these scenarios is still an
important issue that we need to address instead of simply removing it from
consideration entirely.
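
To make the "fraction of total memory" idea concrete, here is a minimal
user-space sketch of the arithmetic in the quoted draft. It is not the
kernel function: struct task_model and badness_model() are hypothetical
stand-ins for the fields the draft reads from task_struct and mm_struct,
and locking is left out entirely.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task data the draft reads. */
struct task_model {
        bool has_mm;              /* false models a task with no mm */
        unsigned long rss_pages;  /* models get_mm_rss(mm) */
        unsigned long swap_pages; /* models get_mm_counter(mm, MM_SWAPENTS) */
        int forked_children;      /* children with their own, different mm */
};

/* Score in units of 1/1000 of totalram, as in the quoted draft. */
static unsigned long badness_model(const struct task_model *t,
                                   unsigned long totalram_pages)
{
        unsigned long points;

        if (!t->has_mm || totalram_pages == 0)
                return 0;

        points = (t->rss_pages + t->swap_pages) * 1000 / totalram_pages;

        /* Forkbomb penalty from the draft: 100 points is 10% of RAM. */
        if (t->forked_children > 500)
                points += 100;

        return points;
}
```

In this model a task holding a quarter of RAM scores 250, and crossing the
500-child threshold adds a flat 100 on top of that, so every penalty is
expressed in the same unit as the baseline.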

== 2 of 3 ==
Date: Wed, Feb 3 2010 4:10 pm
From: David Rientjes

On Wed, 3 Feb 2010, Lubos Lunak wrote:

>
> Given that the badness() proposal I see in your other mail uses
> get_mm_rss(), I take it that you've meanwhile changed your mind on the VmSize
> vs VmRSS argument and considered that argument irrelevant now.

The argument was never that rss should be kept out of the heuristic; the
argument was to prevent the loss of functionality of oom_adj and being
able to define memory leakers from userspace. With my proposal, I believe
the new semantics of oom_adj are even clearer than before and allow users
to either discount or bias a task with a quantity that they are familiar
with: memory.

My rough draft was written in a mail editor, so it's completely untested
and even has a couple of flaws: we need to discount free hugetlb memory
from allowed nodes, we need to intersect the passed nodemask with
current's cpuset, etc.

> I will comment
> only on the suggested use of oom_adj on the desktop here. And actually I hope
> that if something reasonably similar to your badness() proposal replaces the
> current one it will make any use of oom_adj not needed on the desktop in the
> usual case, so this may be irrelevant as well.
>

If you define "on the desktop" performance of the oom killer merely as
protecting a windows environment, then it should be helpful. I'd still
recommend using OOM_DISABLE for those tasks, though, because I agree that
for users in that environment, KDE getting oom killed is just not a viable
solution.

> > The kernel cannot possibly know what you consider a "vital" process, for
> > that understanding you need to tell it using the very powerful
> > /proc/pid/oom_adj tunable. I suspect if you were to protect all of
> > kdeinit's children by patching it to be OOM_DISABLE so that all threads it
> > forks will inherit that value you'd actually see much improved behavior.
>
> No. Almost everything in KDE is spawned by kdeinit, so everything would get
> the adjustment, which means nothing would in practice get the adjustment.
>

It depends on whether you change the oom_adj of children that you no
longer want to protect which have been forked from kdeinit.

> > I'd also encourage you to talk to the KDE developers to ensure that proper
> > precautions are taken to protect it in such conditions since people who
> > use such desktop environments typically don't want them to be sacrificed
> > for memory.
>
> I am a KDE developer, it's written in my signature. And I've already talked
> enough to the KDE developer who has done the oom_adj code that's already
> there, as that's also me. I don't know kernel internals, but that doesn't
> mean I'm completely clueless about the topic of the discussion I've started.
>

Then I'd recommend that you protect those tasks with OOM_DISABLE,
otherwise they will always be candidates for oom kill; the only way to
explicitly prevent that is by changing oom_adj or moving it to its own
memory controller cgroup. A kernel oom heuristic that is implemented for
a wide variety of platforms, including desktops, servers, and embedded
devices, will never identify KDE as a vital task that cannot possibly be
killed unless you tell the kernel it has that priority. Whether you
choose to use that power or not is up to the KDE team.
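
For readers who have not used the tunable, the following is a minimal
sketch of how a process could be marked OOM_DISABLE from userspace by
writing to /proc/<pid>/oom_adj. The helper names are hypothetical; the
numeric values (-17 for OOM_DISABLE, -16 to 15 for the adjustable range)
are assumed from kernels of this era, and lowering the value normally
requires CAP_SYS_RESOURCE.

```c
#include <stdio.h>
#include <sys/types.h>

#define OOM_DISABLE_VALUE (-17) /* assumed value of OOM_DISABLE */
#define OOM_ADJ_MIN       (-16) /* assumed lowest ordinary adjustment */
#define OOM_ADJ_MAX       15    /* assumed highest ordinary adjustment */

/* Build "/proc/<pid>/oom_adj" into buf; 0 on success, -1 if buf is too small. */
static int oom_adj_path(pid_t pid, char *buf, size_t len)
{
        int n = snprintf(buf, len, "/proc/%ld/oom_adj", (long)pid);

        return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Return 1 if value is OOM_DISABLE or inside the ordinary range. */
static int oom_adj_valid(int value)
{
        return value == OOM_DISABLE_VALUE ||
               (value >= OOM_ADJ_MIN && value <= OOM_ADJ_MAX);
}

/* Write value to the task's oom_adj file; 0 on success, -1 on any error. */
static int set_oom_adj(pid_t pid, int value)
{
        char path[64];
        FILE *f;
        int ok;

        if (!oom_adj_valid(value) ||
            oom_adj_path(pid, path, sizeof(path)) != 0)
                return -1;

        f = fopen(path, "w");
        if (!f)
                return -1;

        ok = fprintf(f, "%d\n", value) > 0;
        if (fclose(f) != 0)
                ok = 0;
        return ok ? 0 : -1;
}
```

A session manager could call set_oom_adj(getpid(), OOM_DISABLE_VALUE) once
at startup; because children inherit the value, anything it forks that
should remain killable would need its value raised again afterwards.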

> 1) I think you missed that I said that every KDE application with the current
> algorithm can be potentially a contender for selection, and I provided
> numbers to demonstrate that in a selected case. Just because such application
> is not vital does not mean it's good for it to get killed instead of an
> obvious offender.
>

This is exaggerating the point quite a bit, I don't think every single KDE
thread is going to have a badness() score that is higher than all other
system tasks all the time. I think that there are the likely candidates
that you've identified (kdeinit, ksmserver, etc) that are much more prone
to high badness() scores given their total_vm size and the number of
children they fork, but I don't think this is representative of every KDE
thread.

> 2) You probably do not realize the complexity involved in using oom_adj in a
> desktop. Even when doing that manually I would have some difficulty finding
> the right setup for my own desktop use. It'd be probably virtually impossible
> to write code that would do it at least somewhat right with all the widely
> differing various desktop setups that dynamically change.
>

Used in combination with /proc/pid/oom_score, it gives you a pretty good
snapshot of how oom killer priorities look at any moment in time. In your
particular use case, however, you seem to be arguing from a perspective of
only protecting certain tasks that you've identified from being oom killed
for desktop environments, namely KDE. For that, there is no confusion to
be had: use OOM_DISABLE. For server environments that I'm also concerned
about, the oom_adj range is much more important to define a killing
priority when used in combination with cpusets.

> 3) oom_adj is ultimately just a kludge to handle special cases where the
> heuristic doesn't get it right for whatever strange reason. But even you
> yourself in another mail presented a heuristic that I believe would make any
> use of oom_adj on the desktop unnecessary in the usual cases. The usual
> desktop is not a special case.
>

The kernel will _always_ need user input into which tasks it believes to
be vital. For you, that's KDE. For me, that's one of our job schedulers.

> > The heuristics are always well debated in this forum and there's little
> > chance that we'll ever settle on a single formula that works for all
> > possible use cases. That makes oom_adj even more vital to the overall
> > efficiency of the oom killer, I really hope you start to use it to your
> > advantage.
>
> I really hope your latest badness() heuristics proposal allows us to dump
> even the oom_adj use we already have.
>

For your environment, I hope the same. In production servers we'll still
need the ability to tune /proc/pid/oom_adj to define memory leakers and
tasks using far more memory than expected, so perhaps my rough draft can
be a launching pad into a positive discussion about the future of the
heuristic based on consensus and input from all impacted parties.

== 3 of 3 ==
Date: Wed, Feb 3 2010 4:20 pm
From: Rik van Riel

On 02/03/2010 07:05 PM, David Rientjes wrote:
> On Wed, 3 Feb 2010, Lubos Lunak wrote:

>>> /* Forkbombs get penalized 10% of available RAM */
>>> if (forkcount > 500)
>>>         points += 100;

> Do you have any comments about the forkbomb detector or its threshold that
> I've put in my heuristic? I think detecting these scenarios is still an
> important issue that we need to address instead of simply removing it from
> consideration entirely.

I believe that malicious users are best addressed in person,
or preemptively through cgroups and rlimits.

Having a process with over 500 children is quite possible
with things like apache, Oracle, postgres and other forking
daemons.

Killing the parent process can result in the service
becoming unavailable, and in some cases even data
corruption.
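
As one example of the preemptive approach, a launcher could cap the number
of processes a user may own before running untrusted work. The sketch
below uses getrlimit()/setrlimit() with RLIMIT_NPROC; cap_process_count()
is a hypothetical helper name, and the limit chosen by the caller is an
assumption, not a recommended value.

```c
#include <sys/resource.h>

/* Lower the calling process's soft RLIMIT_NPROC to at most max_procs,
 * clamped to the current hard limit. Returns 0 on success, -1 on error. */
static int cap_process_count(rlim_t max_procs)
{
        struct rlimit lim;

        if (getrlimit(RLIMIT_NPROC, &lim) != 0)
                return -1;

        /* The soft limit may never exceed the hard limit. */
        if (lim.rlim_max != RLIM_INFINITY && max_procs > lim.rlim_max)
                max_procs = lim.rlim_max;

        lim.rlim_cur = max_procs;
        return setrlimit(RLIMIT_NPROC, &lim);
}
```

The limit is inherited across fork() and counts processes per real user
ID, so a fork bomb started under it fails with EAGAIN instead of pushing
the machine toward the oom killer.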

--
All rights reversed.

==============================================================================
TOPIC: mm-count-lowmem-rss.patch removed from -mm tree
http://groups.google.com/group/linux.kernel/t/1585f52119cb313e?hl=en
==============================================================================

== 1 of 1 ==
Date: Wed, Feb 3 2010 4:10 pm
From: KAMEZAWA Hiroyuki

On Wed, 03 Feb 2010 15:22:33 -0800
akpm@linux-foundation.org wrote:

>
> The patch titled
> mm: count lowmem rss
> has been removed from the -mm tree. Its filename was
> mm-count-lowmem-rss.patch
>
> This patch was dropped because it is obsolete
>
> The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/
>

I'm sorry that
mm-add-lowmem-detection-logic.patch
mm-add-lowmem-detection-logic-fix.patch
are obsolete, too.

I think reverting will not cause any HUNK...

Regards,
-Kame

> ------------------------------------------------------
> Subject: mm: count lowmem rss
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Some cases of OOM kill are caused by a memory shortage in the lowmem area.
> For example, NORMAL_ZONE is exhausted on an x86-32/HIGHMEM kernel.
>
> Presently, oom-killer doesn't have lowmem usage information of processes
> and selects victim processes based on global memory usage information. In
> a bad case, this can cause chains of kills of innocent processes without
> progress, oom-serial-killer.
>
> For making oom-killer lowmem aware, this patch adds counters for
> accounting lowmem usage per process. (Patches for oom-killer are not
> included in this.)
>
> Adding a counter is easy, but one concern is the cost of the new counter.
> This patch doesn't add to the number of counter updates; it only adds an
> "if" statement to check if a page is lowmem. With a micro benchmark,
> there is almost no regression.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> fs/proc/task_mmu.c | 4 -
> include/linux/mm.h | 27 ++++++++++--
> include/linux/mm_types.h | 7 ++-
> mm/filemap_xip.c | 2
> mm/fremap.c | 2
> mm/memory.c | 81 ++++++++++++++++++++++++++++---------
> mm/oom_kill.c | 8 ++-
> mm/rmap.c | 10 ++--
> mm/swapfile.c | 2
> 9 files changed, 106 insertions(+), 37 deletions(-)
>
> diff -puN fs/proc/task_mmu.c~mm-count-lowmem-rss fs/proc/task_mmu.c
> --- a/fs/proc/task_mmu.c~mm-count-lowmem-rss
> +++ a/fs/proc/task_mmu.c
> @@ -68,11 +68,11 @@ unsigned long task_vsize(struct mm_struc
> int task_statm(struct mm_struct *mm, int *shared, int *text,
> int *data, int *resident)
> {
> - *shared = get_mm_counter(mm, MM_FILEPAGES);
> + *shared = get_file_rss(mm);
> *text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
> >> PAGE_SHIFT;
> *data = mm->total_vm - mm->shared_vm;
> - *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> + *resident = *shared + get_anon_rss(mm);
> return mm->total_vm;
> }
>
> diff -puN include/linux/mm.h~mm-count-lowmem-rss include/linux/mm.h
> --- a/include/linux/mm.h~mm-count-lowmem-rss
> +++ a/include/linux/mm.h
> @@ -938,11 +938,10 @@ static inline void dec_mm_counter(struct
>
>

Wednesday, February 3, 2010