Friday, April 16, 2010

linux.kernel - 25 new messages in 11 topics - digest

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

linux.kernel@googlegroups.com

Today's topics:

* Add device drivers (GbE, Packet Hub) for Topcliff - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/afb65d4a33fca38c?hl=en
* powerpc: Add rcu_read_lock() to gup_fast() implementation - 3 messages, 2
authors
http://groups.google.com/group/linux.kernel/t/925071c312df7f75?hl=en
* vmscan: Setup pagevec as late as possible in shrink_page_list() - 3 messages,
1 author
http://groups.google.com/group/linux.kernel/t/639b5b2be5a70410?hl=en
* [watchdog] combine nmi_watchdog and softlockup - 8 messages, 4 authors
http://groups.google.com/group/linux.kernel/t/88fa3f5fcc3ff0b4?hl=en
* hw-breakpoints: Separate constraint space for data and instruction
breakpoints - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/1b633f76fd51c2e0?hl=en
* 2.6.33.1: USB camera hung_task_timeout_secs >120 seconds - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/0d4e14c8db94fe73?hl=en
* 2.6.34-rc4: OOPS in unmap_vma - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/597826aa25eb10d4?hl=en
* vmscan: simplify shrink_inactive_list() - 3 messages, 1 author
http://groups.google.com/group/linux.kernel/t/019901c2ac5d237e?hl=en
* Sound goes too fast due to commit 7b3a177b0 - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/2f9470dadc204a7c?hl=en
* rmap: add exclusively owned pages to the newest anon_vma - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/eec470a501781823?hl=en
* [PATCH 1/2] microblaze: add stack unwinder - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/4744c321d1658dc9?hl=en

==============================================================================
TOPIC: Add device drivers (GbE, Packet Hub) for Topcliff
http://groups.google.com/group/linux.kernel/t/afb65d4a33fca38c?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Apr 16 2010 7:40 am
From: Jonathan Corbet


On Fri, 16 Apr 2010 19:09:04 +0900
"Masayuki Ohtake" <masa-korg@dsn.okisemi.com> wrote:

> I developed the device drivers for Linux kernel 2.6.33-1.
> This time, I added the following drivers
> - GbE device
> - Packet HUB device
>
> Would you check them?

Thanks for making your code available. May I suggest, though, that you
will get a much better response if you post the patches directly to the
mailing list? That is how our process works; it is simply harder to
review code if you have to go digging through web sites to find it.

Networking-specific patches should be posted to the netdev list as well.

More information on how to post code for review can be found in the
kernel source tree:

Documentation/HOWTO
Documentation/development-process

Thanks,

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: powerpc: Add rcu_read_lock() to gup_fast() implementation
http://groups.google.com/group/linux.kernel/t/925071c312df7f75?hl=en
==============================================================================

== 1 of 3 ==
Date: Fri, Apr 16 2010 7:40 am
From: "Paul E. McKenney"


On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
>
> > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > > > read-side critical sections must use raw spinlocks.
> > >
> > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > no?
> >
> > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
>
> Same as for a preempt one, since you'd have to be able to schedule()
> while holding it to be able to do things like mutex_lock().

So what you really want is something like rcu_read_lock_sleep() rather
than rcu_read_lock_preempt(), right? The point is that you want to do
more than merely preempt, given that it is legal to do general blocking
while holding a mutex, correct?

Thanx, Paul


== 2 of 3 ==
Date: Fri, Apr 16 2010 8:00 am
From: Peter Zijlstra


On Fri, 2010-04-16 at 07:32 -0700, Paul E. McKenney wrote:
> On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> > On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> >
> > > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > > > > read-side critical sections must use raw spinlocks.
> > > >
> > > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > > no?
> > >
> > > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> >
> > Same as for a preempt one, since you'd have to be able to schedule()
> > while holding it to be able to do things like mutex_lock().
>
> So what you really want is something like rcu_read_lock_sleep() rather
> than rcu_read_lock_preempt(), right? The point is that you want to do
> more than merely preempt, given that it is legal to do general blocking
> while holding a mutex, correct?

Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
the name to _sleep, but we've been calling it preemptible-rcu for a long
while now.


== 3 of 3 ==
Date: Fri, Apr 16 2010 8:10 am
From: "Paul E. McKenney"


On Fri, Apr 16, 2010 at 04:56:50PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 07:32 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 16, 2010 at 04:23:39PM +0200, Peter Zijlstra wrote:
> > > On Fri, 2010-04-16 at 07:17 -0700, Paul E. McKenney wrote:
> > >
> > > > > > Of course, with call_rcu_sched(), the corresponding RCU read-side critical
> > > > > > sections are non-preemptible. Therefore, in CONFIG_PREEMPT_RT, these
> > > > > > read-side critical sections must use raw spinlocks.
> > > > >
> > > > > OK, so if we fully remove CONFIG_TREE_PREEMPT_RCU (defaulting to y),
> > > > > rename all the {call_rcu, rcu_read_lock, rcu_read_unlock,
> > > > > synchronize_rcu} functions to {*}_preempt and then add a new
> > > > > CONFIG_PREEMPT_RCU that simply maps {*} to either {*}_sched or
> > > > > {*}_preempt, we've basically got what I've been asking for for a while,
> > > > > no?
> > > >
> > > > What would rcu_read_lock_preempt() do in a !CONFIG_PREEMPT kernel?
> > >
> > > Same as for a preempt one, since you'd have to be able to schedule()
> > > while holding it to be able to do things like mutex_lock().
> >
> > So what you really want is something like rcu_read_lock_sleep() rather
> > than rcu_read_lock_preempt(), right? The point is that you want to do
> > more than merely preempt, given that it is legal to do general blocking
> > while holding a mutex, correct?
>
> Right, but CONFIG_TREE_PREEMPT_RCU=y ends up being that. We could change
> the name to _sleep, but we've been calling it preemptible-rcu for a long
> while now.

It is actually not permitted to do general blocking in a preemptible RCU
read-side critical section. Otherwise, someone is going to block waiting
for a network packet that never comes, thus OOMing the system.
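
To make the distinction concrete, here is a kernel-style sketch (illustrative pseudocode only, not buildable outside a kernel tree; gp, gq and the do_something_* helpers are made-up names):

```c
/* Preemptible RCU (CONFIG_TREE_PREEMPT_RCU): the reader may be preempted,
 * and on -rt may even take a mutex, but must not block indefinitely,
 * e.g. waiting for a network packet, or grace periods stall and
 * call_rcu() callbacks pile up until the system OOMs. */
rcu_read_lock();
p = rcu_dereference(gp);
do_something_quick(p);
rcu_read_unlock();

/* Sched RCU: read-side sections are non-preemptible and pair with
 * call_rcu_sched(); under CONFIG_PREEMPT_RT they must stick to raw
 * spinlocks and may not sleep at all. */
rcu_read_lock_sched();
q = rcu_dereference(gq);
do_something_atomic(q);
rcu_read_unlock_sched();
```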

Thanx, Paul

==============================================================================
TOPIC: vmscan: Setup pagevec as late as possible in shrink_page_list()
http://groups.google.com/group/linux.kernel/t/639b5b2be5a70410?hl=en
==============================================================================

== 1 of 3 ==
Date: Fri, Apr 16 2010 7:40 am
From: Mel Gorman


On Fri, Apr 16, 2010 at 04:54:03PM +0900, KOSAKI Motohiro wrote:
> > shrink_page_list() sets up a pagevec to release pages as they become
> > free. It uses significant amounts of stack on the pagevec. This
> > patch adds pages to be freed via pagevec to a linked list which is then
> > freed en-masse at the end. This avoids using stack in the main path that
> > potentially calls writepage().
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/vmscan.c | 34 ++++++++++++++++++++++++++--------
> > 1 files changed, 26 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9bc1ede..2c22c83 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -619,6 +619,22 @@ static enum page_references page_check_references(struct page *page,
> > return PAGEREF_RECLAIM;
> > }
> >
> > +static void free_page_list(struct list_head *free_list)
> > +{
> > + struct pagevec freed_pvec;
> > + struct page *page, *tmp;
> > +
> > + pagevec_init(&freed_pvec, 1);
> > +
> > + list_for_each_entry_safe(page, tmp, free_list, lru) {
> > + list_del(&page->lru);
> > + if (!pagevec_add(&freed_pvec, page)) {
> > + __pagevec_free(&freed_pvec);
> > + pagevec_reinit(&freed_pvec);
> > + }
> > + }
>
> Don't we need these two lines here? We have to consider that the number
> of list elements may not be a multiple of 14.
>
> if (pagevec_count(&freed_pvec))
> __pagevec_free(&freed_pvec);
>

Whoops, yes indeed. Otherwise this potentially leaks, and as
SWAP_CLUSTER_MAX is 32, it's often not going to be a multiple of 14.

>
> > +}
> > +
> > /*
> > * shrink_page_list() returns the number of reclaimed pages
> > */
> > @@ -627,13 +643,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > enum pageout_io sync_writeback)
> > {
> > LIST_HEAD(ret_pages);
> > - struct pagevec freed_pvec;
> > + LIST_HEAD(free_list);
> > int pgactivate = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> >
> > - pagevec_init(&freed_pvec, 1);
> > while (!list_empty(page_list)) {
> > enum page_references references;
> > struct address_space *mapping;
> > @@ -808,10 +823,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > __clear_page_locked(page);
> > free_it:
> > nr_reclaimed++;
> > - if (!pagevec_add(&freed_pvec, page)) {
> > - __pagevec_free(&freed_pvec);
> > - pagevec_reinit(&freed_pvec);
> > - }
> > +
> > + /*
> > + * Is there need to periodically free_page_list? It would
> > + * appear not as the counts should be low
> > + */
> > + list_add(&page->lru, &free_list);
> > continue;
> >
> > cull_mlocked:
> > @@ -834,9 +851,10 @@ keep:
> > list_add(&page->lru, &ret_pages);
> > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > }
> > +
> > + free_page_list(&free_list);
> > +
> > list_splice(&ret_pages, page_list);
> > - if (pagevec_count(&freed_pvec))
> > - __pagevec_free(&freed_pvec);
> > count_vm_events(PGACTIVATE, pgactivate);
> > return nr_reclaimed;
> > }
> > --
> > 1.6.5
> >
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab


== 2 of 3 ==
Date: Fri, Apr 16 2010 7:40 am
From: Mel Gorman


On Fri, Apr 16, 2010 at 08:19:00PM +0900, KOSAKI Motohiro wrote:
> > When shrink_inactive_list() isolates pages, it updates a number of
> > counters using temporary variables to gather them. These consume stack
> > and it's in the main path that calls ->writepage(). This patch moves the
> > accounting updates outside of the main path to reduce stack usage.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/vmscan.c | 63 +++++++++++++++++++++++++++++++++++-----------------------
> > 1 files changed, 38 insertions(+), 25 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2c22c83..4225319 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1061,7 +1061,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
> > ClearPageActive(page);
> > nr_active++;
> > }
> > - count[lru]++;
> > + if (count)
> > + count[lru]++;
> > }
> >
> > return nr_active;
> > @@ -1141,12 +1142,13 @@ static int too_many_isolated(struct zone *zone, int file,
> > * TODO: Try merging with migrations version of putback_lru_pages
> > */
> > static noinline void putback_lru_pages(struct zone *zone,
> > - struct zone_reclaim_stat *reclaim_stat,
> > + struct scan_control *sc,
> > unsigned long nr_anon, unsigned long nr_file,
> > struct list_head *page_list)
> > {
> > struct page *page;
> > struct pagevec pvec;
> > + struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>
> Seems unrelated change here.
> Otherwise looks good.
>

It was somewhat needed; otherwise the split in accounting looked odd. It
could be done as two patches, but that felt trickier to review.

> - kosaki
> >
> > pagevec_init(&pvec, 1);
> >
> > @@ -1185,6 +1187,37 @@ static noinline void putback_lru_pages(struct zone *zone,
> > pagevec_release(&pvec);
> > }
> >
> > +static noinline void update_isolated_counts(struct zone *zone,
> > + struct scan_control *sc,
> > + unsigned long *nr_anon,
> > + unsigned long *nr_file,
> > + struct list_head *isolated_list)
> > +{
> > + unsigned long nr_active;
> > + unsigned int count[NR_LRU_LISTS] = { 0, };
> > + struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > +
> > + nr_active = clear_active_flags(isolated_list, count);
> > + __count_vm_events(PGDEACTIVATE, nr_active);
> > +
> > + __mod_zone_page_state(zone, NR_ACTIVE_FILE,
> > + -count[LRU_ACTIVE_FILE]);
> > + __mod_zone_page_state(zone, NR_INACTIVE_FILE,
> > + -count[LRU_INACTIVE_FILE]);
> > + __mod_zone_page_state(zone, NR_ACTIVE_ANON,
> > + -count[LRU_ACTIVE_ANON]);
> > + __mod_zone_page_state(zone, NR_INACTIVE_ANON,
> > + -count[LRU_INACTIVE_ANON]);
> > +
> > + *nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> > + *nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> > + __mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
> > + __mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
> > +
> > + reclaim_stat->recent_scanned[0] += *nr_anon;
> > + reclaim_stat->recent_scanned[1] += *nr_file;
> > +}
> > +
> > /*
> > * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
> > * of reclaimed pages
> > @@ -1196,11 +1229,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > LIST_HEAD(page_list);
> > unsigned long nr_scanned;
> > unsigned long nr_reclaimed = 0;
> > - struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > int lumpy_reclaim = 0;
> > unsigned long nr_taken;
> > unsigned long nr_active;
> > - unsigned int count[NR_LRU_LISTS] = { 0, };
> > unsigned long nr_anon;
> > unsigned long nr_file;
> >
> > @@ -1244,25 +1275,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > return 0;
> > }
> >
> > - nr_active = clear_active_flags(&page_list, count);
> > - __count_vm_events(PGDEACTIVATE, nr_active);
> > -
> > - __mod_zone_page_state(zone, NR_ACTIVE_FILE,
> > - -count[LRU_ACTIVE_FILE]);
> > - __mod_zone_page_state(zone, NR_INACTIVE_FILE,
> > - -count[LRU_INACTIVE_FILE]);
> > - __mod_zone_page_state(zone, NR_ACTIVE_ANON,
> > - -count[LRU_ACTIVE_ANON]);
> > - __mod_zone_page_state(zone, NR_INACTIVE_ANON,
> > - -count[LRU_INACTIVE_ANON]);
> > -
> > - nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> > - nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> > - __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
> > - __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
> > -
> > - reclaim_stat->recent_scanned[0] += nr_anon;
> > - reclaim_stat->recent_scanned[1] += nr_file;
> > + update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
> >
> > spin_unlock_irq(&zone->lru_lock);
> >
> > @@ -1281,7 +1294,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > * The attempt at page out may have made some
> > * of the pages active, mark them inactive again.
> > */
> > - nr_active = clear_active_flags(&page_list, count);
> > + nr_active = clear_active_flags(&page_list, NULL);
> > count_vm_events(PGDEACTIVATE, nr_active);
> >
> > nr_reclaimed += shrink_page_list(&page_list, sc,
> > @@ -1293,7 +1306,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> > __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
> >
> > - putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
> > + putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
> > return nr_reclaimed;
> > }
> >
> > --
> > 1.6.5
> >
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab


== 3 of 3 ==
Date: Fri, Apr 16 2010 8:00 am
From: Mel Gorman


On Thu, Apr 15, 2010 at 06:21:33PM +0100, Mel Gorman wrote:
> This is just an RFC to reduce some of the more obvious stack usage in page
> reclaim. It's a bit rushed and I haven't tested this yet but am sending
> it out as there may be others working on similar material and would rather
> avoid overlap. I built on some of Kosaki Motohiro's work.
>

So the first pass seems to have been reasonably well received. Kosaki,
Rik and Johannes, how do you typically test reclaim-related patches for
regressions? My initial sniff-tests look ok with the page leak sorted out,
but I am typically not searching for vmscan regressions other than lumpy
reclaim.

> On X86 bit, stack usage figures (generated using a modified bloat-o-meter

This should have been X86-64. The stack shrinkage is less on X86,
obviously, because of the different size of pointers and the like.

> that uses checkstack.pl as its input) change in the following ways after
> the series of patches.
>
> add/remove: 2/0 grow/shrink: 0/4 up/down: 804/-1688 (-884)
> function old new delta
> putback_lru_pages - 676 +676
> update_isolated_counts - 128 +128
> do_try_to_free_pages 172 128 -44
> kswapd 1324 1168 -156
> shrink_page_list 1616 1224 -392
> shrink_zone 2320 1224 -1096
>
> There are some growths there but critically they are no longer in the path
> that would call writepages. In the main path, there is about 1K of stack
> lopped off giving a small amount of breathing room.
>
> KOSAKI Motohiro (3):
> vmscan: kill prev_priority completely
> vmscan: move priority variable into scan_control
> vmscan: simplify shrink_inactive_list()
>
> Mel Gorman (7):
> vmscan: Remove useless loop at end of do_try_to_free_pages
> vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
> vmscan: Split shrink_zone to reduce stack usage
> vmscan: Remove unnecessary temporary variables in shrink_zone()
> vmscan: Setup pagevec as late as possible in shrink_inactive_list()
> vmscan: Setup pagevec as late as possible in shrink_page_list()
> vmscan: Update isolated page counters outside of main path in
> shrink_inactive_list()
>
> include/linux/mmzone.h | 15 --
> mm/page_alloc.c | 2 -
> mm/vmscan.c | 447 +++++++++++++++++++++++-------------------------
> mm/vmstat.c | 2 -
> 4 files changed, 210 insertions(+), 256 deletions(-)
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

==============================================================================
TOPIC: [watchdog] combine nmi_watchdog and softlockup
http://groups.google.com/group/linux.kernel/t/88fa3f5fcc3ff0b4?hl=en
==============================================================================

== 1 of 8 ==
Date: Fri, Apr 16 2010 7:40 am
From: Cyrill Gorcunov


On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
[...]
> > +
> > +/* Callback function for perf event subsystem */
> > +void watchdog_overflow_callback(struct perf_event *event, int nmi,
> > + struct perf_sample_data *data,
> > + struct pt_regs *regs)
> > +{
> > + int this_cpu = smp_processor_id();
> > + unsigned long touch_ts = per_cpu(watchdog_touch_ts, this_cpu);
> > +
> > + if (touch_ts == 0) {
> > + __touch_watchdog();
> > + return;
> > + }
> > +
> > + /* check for a hardlockup
> > + * This is done by making sure our timer interrupt
> > + * is incrementing. The timer interrupt should have
> > + * fired multiple times before we overflow'd. If it hasn't
> > + * then this is a good indication the cpu is stuck
> > + */
> > + if (is_hardlockup(this_cpu)) {
> > + /* only print hardlockups once */
> > + if (cpumask_test_cpu(this_cpu, to_cpumask(hardlockup_mask)))
> > + return;
> > +
> > + if (hardlockup_panic)
> > + panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > + else
> > + WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > +
> > + cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask));
>
>
>
> May be have an arch spin lock there to update your cpu mask safely.
>

Hmm, this is NMI handler path so from what we protect this per-cpu data?
Do I miss something? /me confused

-- Cyrill


== 2 of 8 ==
Date: Fri, Apr 16 2010 7:50 am
From: Frederic Weisbecker


On Fri, Apr 16, 2010 at 06:32:32PM +0400, Cyrill Gorcunov wrote:
> On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
> [...]
> > > +
> > > +/* Callback function for perf event subsystem */
> > > +void watchdog_overflow_callback(struct perf_event *event, int nmi,
> > > + struct perf_sample_data *data,
> > > + struct pt_regs *regs)
> > > +{
> > > + int this_cpu = smp_processor_id();
> > > + unsigned long touch_ts = per_cpu(watchdog_touch_ts, this_cpu);
> > > +
> > > + if (touch_ts == 0) {
> > > + __touch_watchdog();
> > > + return;
> > > + }
> > > +
> > > + /* check for a hardlockup
> > > + * This is done by making sure our timer interrupt
> > > + * is incrementing. The timer interrupt should have
> > > + * fired multiple times before we overflow'd. If it hasn't
> > > + * then this is a good indication the cpu is stuck
> > > + */
> > > + if (is_hardlockup(this_cpu)) {
> > > + /* only print hardlockups once */
> > > + if (cpumask_test_cpu(this_cpu, to_cpumask(hardlockup_mask)))
> > > + return;
> > > +
> > > + if (hardlockup_panic)
> > > + panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > + else
> > > + WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > +
> > > + cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask));
> >
> >
> >
> > May be have an arch spin lock there to update your cpu mask safely.
> >
>
> Hmm, this is NMI handler path so from what we protect this per-cpu data?
> Do I miss something? /me confused


The cpu mask is not per cpu here, this is a shared bitmap, so you
can race against other cpus NMIs.

That said, as I suggested, having a per cpu var that we set when we
warned would be much better than a spinlock here.

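
Frederic's per-cpu suggestion can be sketched like this (hypothetical kernel-style pseudocode, not the actual patch): because each CPU's NMI only ever touches its own flag, no lock is needed at all.

```c
static DEFINE_PER_CPU(bool, hardlockup_warned);

/* Inside watchdog_overflow_callback(), replacing the shared cpumask: */
if (is_hardlockup(this_cpu)) {
        /* only print hardlockups once; this CPU owns this flag */
        if (__get_cpu_var(hardlockup_warned))
                return;
        __get_cpu_var(hardlockup_warned) = true;

        if (hardlockup_panic)
                panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
        WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
}
```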


== 3 of 8 ==
Date: Fri, Apr 16 2010 7:50 am
From: Frederic Weisbecker


On Fri, Apr 16, 2010 at 10:12:13AM -0400, Don Zickus wrote:
> On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
> > > config PERF_EVENTS_NMI
> > > bool
> > > + depends on PERF_EVENTS
> > > help
> > > Arch has support for nmi_watchdog
> >
> >
> >
> > That looks too general. It's more about the fact the arch supports
> > cpu cycle events and generates NMIs on overflow.
>
> I was trying to figure out a way to add the PERF_EVENTS dependency as I
> didn't want to impose it on the CONFIG_NMI_WATCHDOG if that config
> supported softlockup (which doesn't need the PERF_EVENTS).

Yeah and this is fine. I was talking about the help description.



> > I'm confused, do we have two versions of the softlockup
> > detector now? You should drop the older one.
>
> Originally Ingo talked about a migration path, so I was going to support
> the older one in case the new one was having issues, sort of like what he
> suggested about moving the nmi code from arch/x86/kernel/apic/nmi.c to
> kernel/watchdog.c. But I can probably drop the softlockup case as the
> migration isn't as tricky as the nmi case.

Ok.

> > > + return;
> > > + }
> > > +
> > > + cpumask_clear_cpu(this_cpu, to_cpumask(hardlockup_mask));
> >
> >
> >
> > Hmm...this is probably not necessary.
>
> I was just thinking of the case where dispite the WARN above, the cpu
> actually recovered and then failed again separately. But I probably won't
> spend anymore time defending it. :-)

This is really just a corner case, I guess you don't need to
bother with it. It is actually racy against other cpus, and adding
a spinlock here (in the everything-is-fine path) would be overkill.

In fact, having two per cpu vars named hardlockup_warned and
softlockup_warned would be better than cpumasks. I'm sorry I
suggested the cpumask to you, but such per cpu vars will save you
from dealing with these synchronization issues. And one of the primary
rules is usually to never take a lock from NMIs if we can avoid it :)



> > You probably want a backtrace cpu mask here as well
> > (but better don't use the same than the hardlockup thing)
>
> yup.


So actually, per_cpu softlockup_warned would be better :)


> > Also you should half-drop the DETECT_SOFTLOCKUP thing:
> > keep it's definition but drop the ability to choose it from
> > the prompt:
> >
> > config DETECT_SOFTLOCKUP
> > bool
> > depends on DEBUG_KERNEL && !S390
> > default y
> >
> > This way we keep it for compatibility with def_configs, it will
> > enable the WATCHDOG by default if it is "y", we can schedule
> > its removal later.

> I understand the general idea but not quite the implementation idea. I will work
> on it and see what I come up with.


We currently have:

config DETECT_SOFTLOCKUP
bool "Blah"
depends on DEBUG_KERNEL && !S390
default y
help
.......

The idea is to remove the "Blah" so that the user can't select it
anymore from make menuconfig, and to remove the help text too, as it's
now useless.

That way, config WATCHDOG can be default y if DETECT_SOFTLOCKUP.
Then if someone comes along with a config that has DETECT_SOFTLOCKUP,
its new implementation (WATCHDOG) will be enabled by default.
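
In Kconfig terms the scheme would look roughly like this (the WATCHDOG entry is a guess at the shape of the new option, not the actual patch):

```
config DETECT_SOFTLOCKUP
	bool                     # prompt removed: no longer user-selectable
	depends on DEBUG_KERNEL && !S390
	default y

config WATCHDOG
	bool "Detect hard and soft lockups"
	depends on DEBUG_KERNEL
	default y if DETECT_SOFTLOCKUP   # old defconfigs keep the detector on
```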



== 4 of 8 ==
Date: Fri, Apr 16 2010 7:50 am
From: Don Zickus


On Fri, Apr 16, 2010 at 06:32:32PM +0400, Cyrill Gorcunov wrote:
> On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
> [...]
> > > +
> > > +/* Callback function for perf event subsystem */
> > > +void watchdog_overflow_callback(struct perf_event *event, int nmi,
> > > + struct perf_sample_data *data,
> > > + struct pt_regs *regs)
> > > +{
> > > + int this_cpu = smp_processor_id();
> > > + unsigned long touch_ts = per_cpu(watchdog_touch_ts, this_cpu);
> > > +
> > > + if (touch_ts == 0) {
> > > + __touch_watchdog();
> > > + return;
> > > + }
> > > +
> > > + /* check for a hardlockup
> > > + * This is done by making sure our timer interrupt
> > > + * is incrementing. The timer interrupt should have
> > > + * fired multiple times before we overflow'd. If it hasn't
> > > + * then this is a good indication the cpu is stuck
> > > + */
> > > + if (is_hardlockup(this_cpu)) {
> > > + /* only print hardlockups once */
> > > + if (cpumask_test_cpu(this_cpu, to_cpumask(hardlockup_mask)))
> > > + return;
> > > +
> > > + if (hardlockup_panic)
> > > + panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > + else
> > > + WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > +
> > > + cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask));
> >
> >
> >
> > May be have an arch spin lock there to update your cpu mask safely.
> >
>
> Hmm, this is NMI handler path so from what we protect this per-cpu data?
> Do I miss something? /me confused

It's not per-cpu, which is the problem, I believe. If you have multiple
cpus dealing with hardlockups, then they can overwrite each other's data.

Cheers,
Don


== 5 of 8 ==
Date: Fri, Apr 16 2010 8:00 am
From: Frederic Weisbecker


On Fri, Apr 16, 2010 at 04:53:11PM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-16 at 16:46 +0200, Frederic Weisbecker wrote:
> > > > May be have an arch spin lock there to update your cpu mask safely.
> > > >
> > >
> > > Hmm, this is NMI handler path so from what we protect this per-cpu data?
> > > Do I miss something? /me confused
> >
> >
> > The cpu mask is not per cpu here, this is a shared bitmap, so you
> > can race against other cpus NMIs.
> >
> > That said, as I suggested, having a per cpu var that we set when we
> > warned would be much better than a spinlock here.
>
> Every time you think NMI and spinlock think FAIL.


In fact I was first inspired by the x86 nmi watchdog handler, which
takes this spinlock to protect the cpumask, but realized just
afterwards that it was my FAIL ;-)



== 6 of 8 ==
Date: Fri, Apr 16 2010 8:00 am
From: Peter Zijlstra


On Fri, 2010-04-16 at 16:46 +0200, Frederic Weisbecker wrote:
> > > May be have an arch spin lock there to update your cpu mask safely.
> > >
> >
> > Hmm, this is NMI handler path so from what we protect this per-cpu data?
> > Do I miss something? /me confused
>
>
> The cpu mask is not per cpu here, this is a shared bitmap, so you
> can race against other cpus NMIs.
>
> That said, as I suggested, having a per cpu var that we set when we
> warned would be much better than a spinlock here.

Every time you think NMI and spinlock think FAIL.


== 7 of 8 ==
Date: Fri, Apr 16 2010 8:00 am
From: Cyrill Gorcunov


On Fri, Apr 16, 2010 at 04:46:17PM +0200, Frederic Weisbecker wrote:
...
> > > > + if (hardlockup_panic)
> > > > + panic("Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > > + else
> > > > + WARN(1, "Watchdog detected hard LOCKUP on cpu %d", this_cpu);
> > > > +
> > > > + cpumask_set_cpu(this_cpu, to_cpumask(hardlockup_mask));
> > >
> > >
> > >
> > > May be have an arch spin lock there to update your cpu mask safely.
> > >
> >
> > Hmm, this is NMI handler path so from what we protect this per-cpu data?
> > Do I miss something? /me confused
>
>
> The cpu mask is not per cpu here, this is a shared bitmap, so you
> can race against other cpus NMIs.
>
> That said, as I suggested, having a per cpu var that we set when we
> warned would be much better than a spinlock here.
>

yeah, I saw DECLARE_BITMAP but read it as DEFINE_PER_CPU for some reason.
Having any spinlock in an irq handler is really suspicious.

-- Cyrill


== 8 of 8 ==
Date: Fri, Apr 16 2010 8:10 am
From: Don Zickus


On Fri, Apr 16, 2010 at 04:43:04PM +0200, Frederic Weisbecker wrote:
> On Fri, Apr 16, 2010 at 10:12:13AM -0400, Don Zickus wrote:
> > On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:
> > > > config PERF_EVENTS_NMI
> > > > bool
> > > > + depends on PERF_EVENTS
> > > > help
> > > > Arch has support for nmi_watchdog
> > >
> > >
> > >
> > > That looks too general. It's more about the fact the arch supports
> > > cpu cycle events and generates NMIs on overflow.
> >
> > I was trying to figure out a way to add the PERF_EVENTS dependency as I
> > didn't want to impose it on the CONFIG_NMI_WATCHDOG if that config
> > supported softlockup (which doesn't need the PERF_EVENTS).
>
>
>
> Yeah and this is fine. I was talking about the help description.

Oh. heh. ok, will expand that.

>
>
>
> > > I'm confused, do we have two versions of the softlockup
> > > detector now? You should drop the older one.
> >
> > Originally Ingo talked about a migration path, so I was going to support
> > the older one in case the new one was having issues, sort of like what he
> > suggested about moving the nmi code from arch/x86/kernel/apic/nmi.c to
> > kernel/watchdog.c. But I can probably drop the softlockup case as the
> > migration isn't as tricky as the nmi case.
>
>
>
> Ok.
>
> > > > + return;
> > > > + }
> > > > +
> > > > + cpumask_clear_cpu(this_cpu, to_cpumask(hardlockup_mask));
> > >
> > >
> > >
> > > Hmm...this is probably not necessary.
> >
> > I was just thinking of the case where, despite the WARN above, the cpu
> > actually recovered and then failed again separately. But I probably won't
> > spend any more time defending it. :-)
>
>
>
> This is really just a corner case, I guess you don't need to
> bother with that. It is actually racy against other cpus and adding
> a spinlock here (in the everything is fine path) would be an overkill.
>
> In fact, having two per cpu vars named hardlockup_warned and
> softlockup_warned would be better than cpumasks. I'm sorry I
> suggested the cpumask, but such per cpu vars will save you from
> dealing with these synchronization issues. And one of the primary
> rules is to never take a lock from an NMI if we can avoid it :)

Yeah, I guess per cpu is better. I agree that locks in NMI are frowned
upon, but I wasn't sure if that was dealt with.

I'll try to implement this. Any objections if I combined hardlockup and
softlockup with per cpu watchdog_warn and have bit masks for HARDLOCKUP
and SOFTLOCKUP? I hate to just waste per cpu space for this.

>
>
>
> > > You probably want a backtrace cpu mask here as well
> > > (but better don't use the same than the hardlockup thing)
> >
> > yup.
>
>
> So actually, per_cpu softlockup_warned would be better :)
>
>
> > > Also you should half-drop the DETECT_SOFTLOCKUP thing:
> > > keep it's definition but drop the ability to choose it from
> > > the prompt:
> > >
> > > config DETECT_SOFTLOCKUP
> > > bool
> > > depends on DEBUG_KERNEL && !S390
> > > default y
> > >
> > > This way we keep it for compatibility with def_configs, it will
> > > enable the WATCHDOG by default if it is "y", we can schedule
> > > its removal later.
>
> > I understand the general idea but not quite the implementation idea. I will work
> > on it and see what I come up with.
>
>
> We current have:
>
> config DETECT_SOFTLOCKUP
> bool "Blah"
> depends on DEBUG_KERNEL && !S390
> default y
> help
> .......
>
> The idea is to remove the "Blah" so that the user can't select it
> anymore from make menuconfig, and to remove the help too as it's useless
> too.
>
> So that config WATCHDOG can be default y if DETECT_SOFTLOCKUP.
> Then if someone comes along with a config that has DETECT_SOFTLOCKUP,
> its new implementation (WATCHDOG) will be enabled by default.

Ah, I missed the bool part. I got it. Thanks for the clarification.

Cheers,
Don

>

==============================================================================
TOPIC: hw-breakpoints: Separate constraint space for data and instruction
breakpoints
http://groups.google.com/group/linux.kernel/t/1b633f76fd51c2e0?hl=en
==============================================================================

== 1 of 2 ==
Date: Fri, Apr 16 2010 7:40 am
From: Will Deacon


Hello Frederic,

Thanks for this.

On Tue, 2010-04-13 at 00:01 +0100, Frederic Weisbecker wrote:
> There are two outstanding fashions for archs to implement hardware
> breakpoints.
>
> The first is to separate breakpoint address pattern definition
> space between data and instruction breakpoints. We then have
> typically distinct instruction address breakpoint registers
> and data address breakpoint registers, delivered with
> separate control registers for data and instruction breakpoints
> as well. This is the case for PowerPC and ARM, for example.
>
> The second consists in having merged breakpoint address space
> definition between data and instruction breakpoint. Address
> registers can host either instruction or data address and
> the access mode for the breakpoint is defined in a control
> register. This is the case of x86.
>
> This patch adds a new CONFIG_HAVE_MIXED_BREAKPOINTS_REGS config
> that archs can select if they belong to the second case. Those
> will have their slot allocation merged for instructions and
> data breakpoints.
>
> The others will have a separate slot tracking between data and
> instruction breakpoints.

This looks useful for supporting architectures with separate
data/instruction breakpoints.

A couple of points:

1.) Will this affect the arch backend at all?

2.) On ARM, it is possible to have different numbers of breakpoint registers
and watchpoint registers [which we do not know until runtime]. This
means that we have to define HBP_NUM as a potential upper bound,
which seems a bit wasteful. Perhaps there could be a mechanism to
register the available resources at runtime?

Thanks,

Will



== 2 of 2 ==
Date: Fri, Apr 16 2010 8:00 am
From: Frederic Weisbecker


On Fri, Apr 16, 2010 at 03:35:30PM +0100, Will Deacon wrote:
> Hello Frederic,
>
> Thanks for this.
>
> On Tue, 2010-04-13 at 00:01 +0100, Frederic Weisbecker wrote:
> > There are two outstanding fashions for archs to implement hardware
> > breakpoints.
> >
> > The first is to separate breakpoint address pattern definition
> > space between data and instruction breakpoints. We then have
> > typically distinct instruction address breakpoint registers
> > and data address breakpoint registers, delivered with
> > separate control registers for data and instruction breakpoints
> > as well. This is the case for PowerPC and ARM, for example.
> >
> > The second consists in having merged breakpoint address space
> > definition between data and instruction breakpoint. Address
> > registers can host either instruction or data address and
> > the access mode for the breakpoint is defined in a control
> > register. This is the case of x86.
> >
> > This patch adds a new CONFIG_HAVE_MIXED_BREAKPOINTS_REGS config
> > that archs can select if they belong to the second case. Those
> > will have their slot allocation merged for instructions and
> > data breakpoints.
> >
> > The others will have a separate slot tracking between data and
> > instruction breakpoints.
>
> This looks useful for supporting architectures with separate
> data/instruction breakpoints.
>
> A couple of points:
>
> 1.) Will this affect the arch backend at all?


No change is required from the backend. In fact this config probably
only fits for x86, unless there are other archs that have shared
register addresses... Archs that only support data or inst breakpoints
should also select it, but only to save a bit of memory, there would
be no functional changes.



> 2.) On ARM, it is possible to have different numbers of breakpoint registers
> and watchpoint registers [which we do not know until runtime]. This
> means that we have to define HBP_NUM as a potential upper bound,
> which seems a bit wasteful. Perhaps there could be a mechanism to
> register the available resources at runtime?


Ah that's interesting. Ok, I will update my patch to integrate that.


==============================================================================
TOPIC: 2.6.33.1: USB camera hung_task_timeout_secs >120 seconds
http://groups.google.com/group/linux.kernel/t/0d4e14c8db94fe73?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Apr 16 2010 7:50 am
From: Alan Stern


On Fri, 16 Apr 2010, Martin Mokrejs wrote:

> > If 2.6.34-rc doesn't work, please post a usbmon trace showing what
> > happens when the device is plugged in. Instructions are in
> > Documentation/usb/usbmon.txt.
>
> I am unable to reproduce this with any of 2.6.30.10, 2.6.31.12,
> 2.6.33.1, or 2.6.34-rc4-git2. I turned off the camera many times and removed
> the PCMCIA card while the USB media was recognized by the kernel (not
> mounted); no luck. I use it maybe once a week and saw the crash for the
> first time, so I am not very optimistic about reproducibility. :(

That's good then -- you're running with no problems. :-)

> Could it be related to the fact that, because the laptop has USB 1.1 onboard,
> it loads uhci_hcd first? Eventually, when I plug in the PCMCIA card, it loads
> ehci_hcd and the kernel complains it should have been loaded before.
> Is there a way to specify this in /etc/modules.d/somefile? I fear I have to
> compile them into the kernel statically, on this particular setup.

Don't worry about that. That warning doesn't indicate any errors; it's
just there to let people know what they _should_ do. The system will
work okay even when the drivers are loaded in the wrong order. Bad
effects are possible only if the UHCI and EHCI controllers share some
ports, and on your system they don't.

Alan Stern


==============================================================================
TOPIC: 2.6.34-rc4 : OOPS in unmap_vma
http://groups.google.com/group/linux.kernel/t/597826aa25eb10d4?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Apr 16 2010 7:50 am
From: Maciej Rutecki


On Wednesday, 14 April 2010 at 03:53:46 Parag Warudkar wrote:
> Not sure if this is related to the recent mm/vma fixes - got this
> while rebooting (kexec) latest git -
>

I created a Bugzilla entry at
https://bugzilla.kernel.org/show_bug.cgi?id=15795
for your bug report, please add your address to the CC list in there, thanks!

--
Maciej Rutecki
http://www.maciek.unixy.pl

==============================================================================
TOPIC: vmscan: simplify shrink_inactive_list()
http://groups.google.com/group/linux.kernel/t/019901c2ac5d237e?hl=en
==============================================================================

== 1 of 3 ==
Date: Fri, Apr 16 2010 8:00 am
From: Mel Gorman


On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > It's a buying-time venture, I'll agree, but as both approaches are only
> > about reducing stack usage they wouldn't be long-term solutions by your
> > criteria. What do you suggest?
>
> (from easy to more complicated):
>
> - Disable direct reclaim with 4K stacks

Do not like. While I can see why 4K stacks are a serious problem, I'd
sooner see 4K stacks disabled than have the kernel behave so differently
for direct reclaim. It'd be tricky to spot regressions in reclaim that
were due to this .config option.

> - Do direct reclaim only on separate stacks

This is looking more and more attractive.

> - Add interrupt stacks to any 8K stack architectures.

This is a similar but separate problem. It's similar in that interrupt
stacks can splice subsystems together in terms of stack usage.

> - Get rid of 4K stacks completely

Why would we *not* do this? I can't remember the original reasoning
behind 4K stacks but am guessing it helped fork-oriented workloads'
startup times in the days before lumpy reclaim and better fragmentation
control.

Who typically enables this option?

> - Think about any other stackings that could give large scale recursion
> and find ways to run them on separate stacks too.

The patch series I threw up about reducing stack was a cut-down
approach. Instead of using separate stacks, keep the stack usage out of
the main caller path where possible.

> - Long term: maybe we need 16K stacks at some point, depending on how
> good the VM gets. Alternative would be to stop making Linux more complicated,
> but that's unlikely to happen.
>

Make this Plan D if nothing else works out and we still hit a wall?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab


== 2 of 3 ==
Date: Fri, Apr 16 2010 8:00 am
From: Mel Gorman


On Fri, Apr 16, 2010 at 09:40:13AM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2010 at 06:54:16PM +0200, Andi Kleen wrote:
> > > It's a buying-time venture, I'll agree, but as both approaches are only
> > > about reducing stack usage they wouldn't be long-term solutions by your
> > > criteria. What do you suggest?
> >
> > (from easy to more complicated):
> >
> > - Disable direct reclaim with 4K stacks
>
> Just to re-iterate: we're blowing the stack with direct reclaim on
> x86_64 w/ 8k stacks.

Yep, that is not being disputed. By the way, what did you use to
generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
I used a modified bloat-o-meter to gather my data but it'd be nice to
be sure I'm seeing the same things as you (minus XFS unless I
specifically set it up).

> The old i386/4k stack problem is a red
> herring.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab


== 3 of 3 ==
Date: Fri, Apr 16 2010 8:10 am
From: Mel Gorman


On Thu, Apr 15, 2010 at 09:42:17AM -0400, Chris Mason wrote:
> On Thu, Apr 15, 2010 at 11:28:37AM +0100, Mel Gorman wrote:
> > On Thu, Apr 15, 2010 at 11:34:36AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 14, 2010 at 09:51:33AM +0100, Mel Gorman wrote:
> > > > On Wed, Apr 14, 2010 at 05:28:30PM +1000, Dave Chinner wrote:
> > > > > On Wed, Apr 14, 2010 at 03:52:44PM +0900, KOSAKI Motohiro wrote:
> > > > > > > On Tue, Apr 13, 2010 at 04:20:21PM -0400, Chris Mason wrote:
> > > > > > > > On Tue, Apr 13, 2010 at 08:34:29PM +0100, Mel Gorman wrote:
> > > > > > > > > > Basically, there is not enough stack space available to allow direct
> > > > > > > > > > reclaim to enter ->writepage _anywhere_ according to the stack usage
> > > > > > > > > > profiles we are seeing here....
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm not denying the evidence but how has it been gotten away with for years
> > > > > > > > > then? Prevention of writeback isn't the answer without figuring out how
> > > > > > > > > direct reclaimers can queue pages for IO and in the case of lumpy reclaim
> > > > > > > > > doing sync IO, then waiting on those pages.
> > > > > > > >
> > > > > > > > So, I've been reading along, nodding my head to Dave's side of things
> > > > > > > > because seeks are evil and direct reclaim makes seeks. I'd really love
> > > > > > > > for direct reclaim to somehow trigger writepages on large chunks instead
> > > > > > > > of doing page by page spatters of IO to the drive.
> > > > > >
> > > > > > I agree that "seeks are evil and direct reclaim makes seeks". Actually,
> > > > > > making 4k io is not must for pageout. So, probably we can improve it.
> > > > > >
> > > > > >
> > > > > > > Perhaps drop the lock on the page if it is held and call one of the
> > > > > > > helpers that filesystems use to do this, like:
> > > > > > >
> > > > > > > filemap_write_and_wait(page->mapping);
> > > > > >
> > > > > > Sorry, I'm lost what you talk about. Why do we need per-file
> > > > > > waiting? If file is 1GB file, do we need to wait 1GB writeout?
> > > > >
> > > > > So use filemap_fdatawrite(page->mapping), or if it's better only
> > > > > to start IO on a segment of the file, use
> > > > > filemap_fdatawrite_range(page->mapping, start, end)....
> > > >
> > > > That does not help the stack usage issue, the caller ends up in
> > > > ->writepages. From an IO perspective, it'll be better from a seek point of
> > > > view but from a VM perspective, it may or may not be cleaning the right pages.
> > > > So I think this is a red herring.
> > >
> > > If you ask it to clean a bunch of pages around the one you want to
> > > reclaim on the LRU, there is a good chance it will also be cleaning
> > > pages that are near the end of the LRU or physically close by as
> > > well. It's not a guarantee, but for the additional IO cost of about
> > > 10% wall time on that IO to clean the page you need, you also get
> > > 1-2 orders of magnitude other pages cleaned. That sounds like a
> > > win any way you look at it...
> > >
> >
> > At worst, it'll distort the LRU ordering slightly. Let's say the
> > file-adjacent page you clean was near the end of the LRU. Before such a
> > patch, it may have gotten cleaned and done another lap of the LRU.
> > After, it would be reclaimed sooner. I don't know if we depend on such
> > behaviour (very doubtful) but it's a subtle enough change. I can't
> > predict what it'll do for IO congestion. Simplistically, there is more
> > IO so it's bad but if the write pattern is less seeky and we needed to
> > write the pages anyway, it might be improved.
> >
> > > I agree that it doesn't solve the stack problem (Chris' suggestion
> > > that we enable the bdi flusher interface would fix this);
> >
> > I'm afraid I'm not familiar with this interface. Can you point me at
> > some previous discussion so that I am sure I am looking at the right
> > thing?
>
> vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
> real code needs to go....just look for the ~ marks.
>

I must be blind. What tree is this in? I can't see it v2.6.34-rc4,
mmotm or google.

> I mostly meant that the bdi helper threads were the best place to add
> knowledge about which pages we want to write for reclaim. We might need
> to add a thread dedicated to just doing the VM's dirty work, but that's
> where I would start discussing fancy new interfaces.
>
> >
> > > what I'm
> > > pointing out is that the arguments that it is too hard or there are
> > > no interfaces available to issue larger IO from reclaim are not at
> > > all valid.
> > >
> >
> > Sure, I'm not resisting fixing this, just your first patch :) There are four
> > goals here
> >
> > 1. Reduce stack usage
> > 2. Avoid the splicing of subsystem stack usage with direct reclaim
> > 3. Preserve lumpy reclaims cleaning of contiguous pages
> > 4. Try and not drastically alter LRU aging
> >
> > 1 and 2 are important for you, 3 is important for me and 4 will have to
> > be dealt with on a case-by-case basis.
> >
> > Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
> > guess dirty pages can cycle around more so it'd need to be cared for.
>
> I'd like to add one more:
>
> 5. Don't dive into filesystem locks during reclaim.
>

Good add. It's not a new problem either. This came up at least two years
ago, around the first VM/FS summit, and the response was along the lines
of shuffling uncomfortably :/

> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk. We push off work as late as we
> can until just before the pages go down to disk.
>
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS. Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
>
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() triggered by now), stuff it
> all into the file's metadata, log that, and finally return.
>
> Each of the steps above can block for a long time. Ext4 solves
> this by not doing them. ext4_writepage only writes pages that
> are already fully allocated on disk.
>
> Btrfs is much more efficient at not doing them; it just returns right
> away for PF_MEMALLOC.
>
> This is a long way of saying the filesystem writepage code is the
> opposite of what direct reclaim wants. Direct reclaim wants to
> find free ram now, and if it does end up in the mess describe above,
> it'll just get stuck for a long time on work entirely unrelated to
> finding free pages.
>

Ok, good summary, thanks. I was only partially aware of some of these.
i.e. I knew it was a problem but was not sensitive to how bad it was.
Your last point is interesting because lumpy reclaim for large orders under
heavy pressure can make the system stutter badly (e.g. during a huge
page pool resize). I had blamed just plain IO but messing around with
locks and transactions could have been a large factor and I didn't go
looking for it.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

==============================================================================
TOPIC: Sound goes too fast due to commit 7b3a177b0
http://groups.google.com/group/linux.kernel/t/2f9470dadc204a7c?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Apr 16 2010 8:00 am
From: Maciej Rutecki


On Tuesday, 13 April 2010 at 23:54:26 Éric Piel wrote:
> Hello,
>
> Since 2.6.34-rc*, I have a regression on alsa which prevents the sound
> to be played correctly. When playing, the music goes too fast, skipping
> some parts. Typically, it's very easy to reproduce by doing:
> time mplayer -endpos 30 sound-file-which-lasts-more-than-thirty-sec.mp3
>

I created a Bugzilla entry at
https://bugzilla.kernel.org/show_bug.cgi?id=15796
for your bug report, please add your address to the CC list in there, thanks!

--
Maciej Rutecki
http://www.maciek.unixy.pl

==============================================================================
TOPIC: rmap: add exclusively owned pages to the newest anon_vma
http://groups.google.com/group/linux.kernel/t/eec470a501781823?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Apr 16 2010 8:00 am
From: Linus Torvalds


On Fri, 16 Apr 2010, Felipe Balbi wrote:
>
> while at that, would it make sense to first provide list_last_entry() since we
> already have list_first_entry() ??

Yeah, it probably would make sense. Especially as doing a simple grep for
'list_entry.*prev' does seem to imply that there might be quite a few
places that would be able to use it. Although some of them do seem to be
about finding the previous entry rather than the last in a list.

That said, doing the same grep for 'next' shows that a lot of places don't
use the list_first_entry() that we _do_ have, so..

Linus

==============================================================================
TOPIC: [PATCH 1/2] microblaze: add stack unwinder
http://groups.google.com/group/linux.kernel/t/4744c321d1658dc9?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Apr 16 2010 8:10 am
From: Michal Simek


Steven J. Magnani wrote:
> On Fri, 2010-04-16 at 10:32 +0200, Michal Simek wrote:
>> Please apply this patch and do the same testing as I did.
>>
>> diff --git a/arch/microblaze/kernel/reset.c b/arch/microblaze/kernel/reset.c
>> index a1721a3..76f6587 100644
>> --- a/arch/microblaze/kernel/reset.c
>> +++ b/arch/microblaze/kernel/reset.c
>> @@ -112,8 +112,8 @@ void of_platform_reset_gpio_probe(void)
>> void machine_restart(char *cmd)
>> {
>> printk(KERN_NOTICE "Machine restart...\n");
>> + machine_shutdown();
>> gpio_system_reset();
>> - dump_stack();
>> while (1)
>> ;
>> }
>> @@ -121,6 +121,7 @@ void machine_restart(char *cmd)
>> void machine_shutdown(void)
>> {
>> printk(KERN_NOTICE "Machine shutdown...\n");
>> + dump_stack();
>> while (1)
>> ;
>> }
>
> I get this:
>
> Call Trace:
> [<20003008>] microblaze_unwind+0x68/0x94
> [<20002c24>] show_stack+0x134/0x174
> [<20002c80>] dump_stack+0x1c/0x3c
> [<20001d84>] machine_shutdown+0x30/0x40
> [<20001dc0>] machine_restart+0x2c/0x48
> [<2002a540>] kernel_restart+0x5c/0x80
> [<2002a6b8>] sys_reboot+0x11c/0x218
> SYSCALL
> [<2d5c0248>] PID 118 [reboot]
>
> Is your problem on a MMU or noMMU system?

Interesting. It was a noMMU system.

>
> If the unwinder stops early, it found a return address outside the
> kernel text, or couldn't find the addik that creates a stack frame - but
> there's not enough information here to tell what happened. In your case
> the backtrace goes off the rails before getting to the frame for
> dump_stack(), which is where the stack dump begins. We'd need to see the
> 21 words preceding the dump (show_stack's frame is 13 words, and
> microblaze_unwind's is 8) to know why.
>
> Can you uncomment the #define DEBUG in unwind.c and try it again?

will do next week and let you know.

Thanks,
Michal


>
> Thanks,
> ------------------------------------------------------------------------
> Steven J. Magnani "I claim this network for MARS!
> www.digidescorp.com Earthling, return my space modulator!"
>
> #include <standard.disclaimer>
>
>


--
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian


==============================================================================

You received this message because you are subscribed to the Google Groups "linux.kernel"
group.

To post to this group, visit http://groups.google.com/group/linux.kernel?hl=en

To unsubscribe from this group, send email to linux.kernel+unsubscribe@googlegroups.com

To change the way you get mail from this group, visit:
http://groups.google.com/group/linux.kernel/subscribe?hl=en

To report abuse, send email explaining the problem to abuse@googlegroups.com

==============================================================================
Google Groups: http://groups.google.com/?hl=en
