Saturday, March 27, 2010

linux.kernel - 25 new messages in 18 topics - digest

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

linux.kernel@googlegroups.com

Today's topics:

* slab: add memory hotplug support - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/a8beda1232363b5e?hl=en
* : IDE - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/cee55107e2740b65?hl=en
* oom killer: break from infinite loop - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/5b8c0541d70dad4c?hl=en
* 2.6.34-rc2 breaks via82cxxx Host Protected Area - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/bb26a8e0e5a99f7b?hl=en
* combine nmi_watchdog and softlockup - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/e11a086865962e7a?hl=en
* rcu: make dead code really dead - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/1f065150d5466b0b?hl=en
* rcu: remove lock acquirement when very eary boot - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/44c83b0ed1dfd4e1?hl=en
* rcu: move some code from macro to function - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/ce36937e57029812?hl=en
* rcu: check dynticks idle cpu when start gp - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/98850a6337ec3cec?hl=en
* NOTIFICATION OF WINNING - 3 messages, 1 author
http://groups.google.com/group/linux.kernel/t/0353c529457a2871?hl=en
* Poor interactive performance with I/O loads with fsync()ing - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/fe5efe1c50050034?hl=en
* arch/sparc/kernel: Use set_cpus_allowed_ptr - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/2f4f6971f59a6748?hl=en
* mmotm 2010-03-23 - IPv6 warnings... - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/d02b2bd3795cb031?hl=en
* rcu head debugobjects - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/96b5558aa328e4a6?hl=en
* fault while using perf callchains in sparc64 - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/346e22651c8452f2?hl=en
* ahci rebase in libata #upstream - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/e08700d6e442a816?hl=en
* linux-next: manual merge of the libata tree with Linus' tree - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/2c40ea07e1fe71fc?hl=en
* pata_via: HDD of VT6410/6415/6330 cannot be detected issue - 1 message, 1 author
http://groups.google.com/group/linux.kernel/t/9f6aefd68273e22a?hl=en

==============================================================================
TOPIC: slab: add memory hotplug support
http://groups.google.com/group/linux.kernel/t/a8beda1232363b5e?hl=en
==============================================================================

== 1 of 2 ==
Date: Sat, Mar 27 2010 7:20 pm
From: David Rientjes


On Wed, 10 Mar 2010, Nick Piggin wrote:

> On Mon, Mar 08, 2010 at 03:19:48PM -0800, David Rientjes wrote:
> > On Fri, 5 Mar 2010, Nick Piggin wrote:
> >
> > > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > > +/*
> > > > + * Drains and frees nodelists for a node on each slab cache, used for memory
> > > > + * hotplug. Returns -EBUSY if all objects cannot be drained on memory
> > > > + * hot-remove so that the node is not removed. When used because memory
> > > > + * hot-add is canceled, the only result is the freed kmem_list3.
> > > > + *
> > > > + * Must hold cache_chain_mutex.
> > > > + */
> > > > +static int __meminit free_cache_nodelists_node(int node)
> > > > +{
> > > > +	struct kmem_cache *cachep;
> > > > +	int ret = 0;
> > > > +
> > > > +	list_for_each_entry(cachep, &cache_chain, next) {
> > > > +		struct array_cache *shared;
> > > > +		struct array_cache **alien;
> > > > +		struct kmem_list3 *l3;
> > > > +
> > > > +		l3 = cachep->nodelists[node];
> > > > +		if (!l3)
> > > > +			continue;
> > > > +
> > > > +		spin_lock_irq(&l3->list_lock);
> > > > +		shared = l3->shared;
> > > > +		if (shared) {
> > > > +			free_block(cachep, shared->entry, shared->avail, node);
> > > > +			l3->shared = NULL;
> > > > +		}
> > > > +		alien = l3->alien;
> > > > +		l3->alien = NULL;
> > > > +		spin_unlock_irq(&l3->list_lock);
> > > > +
> > > > +		if (alien) {
> > > > +			drain_alien_cache(cachep, alien);
> > > > +			free_alien_cache(alien);
> > > > +		}
> > > > +		kfree(shared);
> > > > +
> > > > +		drain_freelist(cachep, l3, l3->free_objects);
> > > > +		if (!list_empty(&l3->slabs_full) ||
> > > > +		    !list_empty(&l3->slabs_partial)) {
> > > > +			/*
> > > > +			 * Continue to iterate through each slab cache to free
> > > > +			 * as many nodelists as possible even though the
> > > > +			 * offline will be canceled.
> > > > +			 */
> > > > +			ret = -EBUSY;
> > > > +			continue;
> > > > +		}
> > > > +		kfree(l3);
> > > > +		cachep->nodelists[node] = NULL;
> > >
> > > What's stopping races of other CPUs trying to access l3 and array
> > > caches while they're being freed?
> > >
> >
> > numa_node_id() will not return an offlined nodeid and cache_alloc_node()
> > already does a fallback to other onlined nodes in case a nodeid is passed
> > to kmalloc_node() that does not have a nodelist. l3->shared and l3->alien
> > cannot be accessed without l3->list_lock (drain, cache_alloc_refill,
> > cache_flusharray) or cache_chain_mutex (kmem_cache_destroy, cache_reap).
>
> Yeah, but can't it _have_ a nodelist (ie. before it is set to NULL here)
> while it is being accessed by another CPU and concurrently being freed
> on this one?
>

You're right, we can't free cachep->nodelists[node] for any node that is
being hot-removed to avoid a race in cache_alloc_node(). I thought we had
protection for this under cache_chain_mutex for most dereferences and
could disregard cache_alloc_refill() because numa_node_id() would never
return a node being removed under memory hotplug; that would be the
responsibility of cpu hotplug instead (offline the cpu first, then ensure
numa_node_id() can't return a node under hot-remove).

Thanks for pointing that out, it's definitely broken here.
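Spelled out, the racy interleaving in question looks like this (an
illustrative trace, not code from the patch; ____cache_alloc_node() is the
allocation path in mm/slab.c):

```
CPU 0: memory hot-remove                CPU 1: kmalloc_node(nid)
------------------------                ------------------------
free_cache_nodelists_node(node)         ____cache_alloc_node(cachep, nid)
  l3 = cachep->nodelists[node]            l3 = cachep->nodelists[nid]
  ...drain and free objects...            (l3 observed non-NULL)
  kfree(l3)
  cachep->nodelists[node] = NULL
                                          spin_lock(&l3->list_lock)
                                            -> use-after-free of l3
```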

As an alternative, I think we should do something like this on
MEM_GOING_OFFLINE:

	int ret = 0;

	mutex_lock(&cache_chain_mutex);
	list_for_each_entry(cachep, &cache_chain, next) {
		struct kmem_list3 *l3;

		l3 = cachep->nodelists[node];
		if (!l3)
			continue;
		drain_freelist(cachep, l3, l3->free_objects);

		ret = !list_empty(&l3->slabs_full) ||
		      !list_empty(&l3->slabs_partial);
		if (ret)
			break;
	}
	mutex_unlock(&cache_chain_mutex);
	return ret ? NOTIFY_BAD : NOTIFY_OK;

to preempt hot-remove of a node where there are slabs on the partial or
free list that can't be freed.

Then, for MEM_OFFLINE, we leave cachep->nodelists[node] to be valid in
case there are cache_alloc_node() racers or the node ever comes back
online; subsequent callers to kmalloc_node() for the offlined node would
actually return objects from fallback_alloc() since kmem_getpages() would
fail for a node without present pages.

If slab is allocated after the drain_freelist() above, we'll never
actually get MEM_OFFLINE since not all pages can be isolated for memory
hot-remove; thus, the node will never be offlined. kmem_getpages() can't
allocate isolated pages, so this race must happen after drain_freelist()
and prior to the pageblock being isolated.

So the MEM_GOING_OFFLINE check above is really more of a convenience to
short-circuit the hot-remove if we know we can't free all slab on that
node to avoid all the subsequent work that would happen only to run into
isolation failure later.

We don't need to do anything for MEM_CANCEL_OFFLINE since the only effect
of MEM_GOING_OFFLINE is to drain the freelist.
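For context, this check would be wired up through a memory hotplug notifier
along these lines (a sketch only; slab_mem_going_offline_callback() is an
illustrative name for the drain loop above, and the notifier itself would be
registered with hotplug_memory_notifier() from <linux/memory.h>):

```c
/*
 * Sketch: route memory hotplug events to slab.  Only MEM_GOING_OFFLINE
 * does real work here; slab_mem_going_offline_callback() stands for the
 * drain loop shown above, which takes cache_chain_mutex itself and
 * returns NOTIFY_BAD if full or partial slabs remain on the node.
 */
static int __meminit slab_memory_callback(struct notifier_block *self,
					  unsigned long action, void *arg)
{
	struct memory_notify *mnb = arg;
	int nid = mnb->status_change_nid;
	int ret = NOTIFY_OK;

	if (nid < 0)
		return NOTIFY_DONE;

	switch (action) {
	case MEM_GOING_OFFLINE:
		ret = slab_mem_going_offline_callback(nid);
		break;
	case MEM_GOING_ONLINE:
	case MEM_ONLINE:
	case MEM_OFFLINE:
	case MEM_CANCEL_ONLINE:
	case MEM_CANCEL_OFFLINE:
		break;
	}
	return ret;
}
```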
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


== 2 of 2 ==
Date: Sat, Mar 27 2010 7:50 pm
From: David Rientjes


Slab lacks any memory hotplug support for nodes that are hotplugged
without cpus being hotplugged. This is possible at least on x86
CONFIG_MEMORY_HOTPLUG_SPARSE kernels where SRAT entries are marked
ACPI_SRAT_MEM_HOT_PLUGGABLE and the regions of RAM represent a separate
node. It can also be done manually by writing the start address to
/sys/devices/system/memory/probe for kernels that have
CONFIG_ARCH_MEMORY_PROBE set, which is how this patch was tested, and
then onlining the new memory region.

When a node is hotadded, a nodelist for that node is allocated and
initialized for each slab cache. If this isn't completed due to a lack
of memory, the hotadd is aborted: we have a reasonable expectation that
kmalloc_node(nid) will work for all caches if nid is online and memory is
available.

Since nodelists must be allocated and initialized prior to the new node's
memory actually being online, the struct kmem_list3 is allocated off-node
due to kmalloc_node()'s fallback.

When an entire node would be offlined, its nodelists are subsequently
drained. If slab objects still exist and cannot be freed, the offline is
aborted. Objects may still be allocated between this drain and page
isolation, however, so the offline can still fail at isolation time.

Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/slab.c | 157 ++++++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 125 insertions(+), 32 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -115,6 +115,7 @@
#include <linux/reciprocal_div.h>
#include <linux/debugobjects.h>
#include <linux/kmemcheck.h>
+#include <linux/memory.h>

#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
@@ -1102,6 +1103,52 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
}
