Sunday, February 21, 2010

linux.kernel - 10 new messages in 7 topics - digest

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

linux.kernel@googlegroups.com

Today's topics:

* x86, irq: use 0x20 for the IRQ_MOVE_CLEANUP_VECTOR instead of 0x1f - 2
messages, 2 authors
http://groups.google.com/group/linux.kernel/t/6c329bfaac9767ba?hl=en
* x86 embedded - Problem getting past 'move compressed kernel before
decompression' - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/720d67a961384dce?hl=en
* net: reserve ports for applications using fixed port numbers - 2 messages, 2
authors
http://groups.google.com/group/linux.kernel/t/c3a56d26e9f97e5e?hl=en
* sysctl: add proc_do_large_bitmap - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/7fe3356887639418?hl=en
* fs/exec.c: restrict initial stack space expansion to rlimit - 1 messages, 1
author
http://groups.google.com/group/linux.kernel/t/f35189a652748353?hl=en
* module param_call: fix potential NULL pointer dereference - 1 messages, 1
author
http://groups.google.com/group/linux.kernel/t/5465c2ebc0f01b35?hl=en
* tracing: fix typo in prof_sysexit_enable() - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/81d1398e4bb51767?hl=en

==============================================================================
TOPIC: x86, irq: use 0x20 for the IRQ_MOVE_CLEANUP_VECTOR instead of 0x1f
http://groups.google.com/group/linux.kernel/t/6c329bfaac9767ba?hl=en
==============================================================================

== 1 of 2 ==
Date: Sat, Feb 20 2010 9:30 pm
From: "Maciej W. Rozycki"


Hi,

I have finally managed to get back to it -- sorry for the delay, I'm
running out of my time.

On Mon, 1 Feb 2010, H. Peter Anvin wrote:

> > As we are using the code from 2.6.28 and no one noticed/complained about
> > this issue for more than 1.5 years, probably the pentium APIC issue is
> > not wide-spread.

Correct, the problem only affected B1, B3 and B5 steppings of the P54C
Pentium processor. These are probably extremely rare these days. It was
fixed later on.

But they can be run-time detected -- if we don't support them anymore
(assuming keeping them supported is too much of maintenance hassle; Linux
used to be proud to support hardware nobody else seemed to care of
anymore, so it's really disappointing to see it go), we should panic() on
bootstrap and print an appropriate message. They are CPUID family 5,
model 2 and steppings 1, 2 and 4, respectively.
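As a rough illustration of the run-time detection mentioned above, the check could be sketched as below. This is not kernel code: the function name is hypothetical, and the sketch assumes the usual CPUID leaf 1 EAX layout (stepping in bits 3:0, model in bits 7:4, family in bits 11:8) and ignores the extended family/model fields, which do not matter for family 5.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: given a CPUID leaf 1 EAX signature, report whether
 * it matches family 5, model 2, stepping 1, 2 or 4 (the B1, B3 and B5
 * steppings of the P54C named in the message).
 */
bool is_affected_p54c(uint32_t sig)
{
	uint32_t stepping = sig & 0xf;
	uint32_t model = (sig >> 4) & 0xf;
	uint32_t family = (sig >> 8) & 0xf;

	if (family != 5 || model != 2)
		return false;
	return stepping == 1 || stepping == 2 || stepping == 4;
}
```

A caller would obtain the signature from CPUID leaf 1 and then decide whether to warn or panic; how that decision is reported is left open here.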

Also the note in arch/x86/kernel/smp.c should be adjusted accordingly
stating that the erratum is no longer worked around (preferably stating
the last Linux version it was).

> I *think* it's applicable to all CPUs Pentium III or earlier (but not
> Pentium 4 -- I'm unsure about the Pentium M.) I don't know about
> non-Intel CPUs; I have a vague memory of the Transmeta Efficeon (the
> only Transmeta chip with an APIC) *not* having this limitation.
>
> The exact reference is SDM vol 3A 10.8.4, page 10-41 [rev 033US Dec 2009]:
>
> For the P6 family and Pentium processors, the IRR and ISR registers can
> queue no more than two interrupts per priority level, and will reject
> other interrupts that are received within the same priority level.
>
> However, section 10.8.2 bullet 3 on page 10-38 (and the flowchart on
> page 10-37) indicate that such an interrupt is returned to the IOAPIC
> for a later retry, i.e. it's not lost. As such, it's not clear to me
> from reading the SDM that there is actually a problem here...

Here's the text of the relevant erratum:

"4AP. Three Interrupts of the Same Priority Causes Lost Local Interrupt

PROBLEM: If three interrupts of the same priority level (priority is
defined in the 4MSB of the interrupt vector), arrive in the following
circumstance:

1. A interrupt is being serviced by the CPU, and the proper bit is set in
the ISR register.

2. A second interrupt is received from the serial bus.

3. At the same time a third interrupt is received from a local interrupt
source, which could include local pins (LVT), an APIC timer (Timer),
self-interrupt, or an APIC error interrupt.

If the first two conditions are met the third interrupt will be lost, and
not serviced.

IMPLICATION: The third interrupt will be ignored and not serviced if the
specific scenario happens as listed above.

WORKAROUND: The problem can be avoided if different priority levels are
assigned to serial interrupts, than to local interrupts.

STATUS: For the steppings affected see the Summary Table of Changes at the
beginning of this section."

so you can see the retry mechanism is not the problem here (or, to be
exact, the lack of an equivalent for local interrupts seems to be).

I'm not sure how fatal for Linux the implications are though; even then
it looks to me the approach we took was an overkill. It's enough to
guarantee that the APIC error interrupt, the APIC timer interrupt and
self-IPIs (do we use any at all though?) do not share their priority
level(s) with any external interrupt (but they can share the level with
one another). We only ever use LINT0/1 interrupts as NMIs (for the NMI
watchdog and the system error, respectively), or ExtINT (in the case of
LINT0), so this erratum does not apply to them.

So what priority level(s) do we use for the APIC error and timer
interrupts (and self-IPIs, if any) these days and how does it correspond
to the priorities of external interrupts? It looks like we can work
around this erratum indefinitely quite cheaply (and should document it
decently so that newcomers do not break it like it happened with many bits
in our APIC code many times already; yes, lost hope, I know...).

Maciej
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


== 2 of 2 ==
Date: Sat, Feb 20 2010 9:40 pm
From: "H. Peter Anvin"


On 02/20/2010 09:20 PM, Maciej W. Rozycki wrote:
>
> Correct, the problem only affected B1, B3 and B5 steppings of the P54C
> Pentium processor. These are probably extremely rare these days. It was
> fixed later on.
>
> But they can be run-time detected -- if we don't support them anymore
> (assuming keeping them supported is too much of maintenance hassle; Linux
> used to be proud to support hardware nobody else seemed to care of
> anymore, so it's really disappointing to see it go), we should panic() on
> bootstrap and print an appropriate message. They are CPUID family 5,
> model 2 and steppings 1, 2 and 4, respectively.
>
> Also the note in arch/x86/kernel/smp.c should be adjusted accordingly
> stating that the erratum is no longer worked around (preferably stating
> the last Linux version it was).
>

My philosophy is generally that I'm happy to keep old hardware (that
actually exists in any kind of meaningful quantity) alive, but I'm not
willing to go through herculean efforts nor willing to make widely
available modern hardware suck over it.

It looks like this really isn't too hard to deal with, though.

> I'm not sure how fatal for Linux the implications are though; even then
> it looks to me the approach we took was an overkill. It's enough to
> guarantee that the APIC error interrupt, the APIC timer interrupt and
> self-IPIs (do we use any at all though?) do not share their priority
> level(s) with any external interrupt (but they can share the level with
> one another). We only ever use LINT0/1 interrupts as NMIs (for the NMI
> watchdog and the system error, respectively), or ExtINT (in the case of
> LINT0), so this erratum does not apply to them.

The APIC error is on vector 0xfe, the APIC timer is on vector 0xef, and
self IPI (vector 0xeb) we only use for MCA, which wouldn't be supported
on these processors.

However, these are mixed with externally-generated IPIs which will be
seen as serial interrupts: in particular 0xf0-0xfd are all different IPIs
which share with the error vector 0xfe, and 0xeb shares with 0xed, which
is used for "platform" IPIs.
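Since the erratum defines the priority level as the 4 MSB of the vector, the sharing described here can be checked mechanically. The following is a minimal sketch with hypothetical helper names; the vector numbers used with it are only the ones quoted in this message (0xfe, 0xef, 0xeb, 0xed and the 0xf0-0xfd range).

```c
#include <stdbool.h>
#include <stdint.h>

/* The priority level is the upper 4 bits of the 8-bit vector number. */
unsigned int vector_priority_level(uint8_t vector)
{
	return vector >> 4;
}

/* True if both vectors fall into the same priority level. */
bool vectors_share_level(uint8_t a, uint8_t b)
{
	return vector_priority_level(a) == vector_priority_level(b);
}
```

For example, vectors_share_level(0xfe, 0xf0) and vectors_share_level(0xeb, 0xed) both hold, while vectors_share_level(0xfe, 0xef) does not.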

It sounds like the right solution for supporting these processors would
be to reshuffle the special vectors so that we use one level (presumably
0xfx) for locally generated interrupts and one (presumably 0xex) for
external IPIs, and make sure that it is not possible for external
interrupts to get assigned to the local-only level. The assignment of
external interrupts, which seems to be where focus has been in the past,
is actually irrelevant (but might still be good for performance, by
maximizing the number of interrupts that can be held in the LAPIC and
not bounced.)

Either way, this doesn't exactly sound too bad. A bigger question is if
we want to do this globally or end up having different vector
assignments for these oddball CPUs. Testing it, too, will be almost
impossible...

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.


==============================================================================
TOPIC: x86 embedded - Problem getting past 'move compressed kernel before
decompression'
http://groups.google.com/group/linux.kernel/t/720d67a961384dce?hl=en
==============================================================================

== 1 of 2 ==
Date: Sat, Feb 20 2010 9:50 pm
From: "H. Peter Anvin"


On 02/20/2010 06:03 PM, Graeme Russ wrote:
>
> The following is something I have hacked together to jump into the 32-bit
> start address of the Linux Kernel:
>
> struct boot_params boot_params __attribute__((aligned(16)));
> struct setup_header *hdr = (struct setup_header *)(0x90000 + 0x1f1);
>
> void boot_zimage(void *setup_base)
> {
> memset(&boot_params, 0x00, sizeof boot_params);
> memcpy(&boot_params.hdr, hdr, sizeof (*hdr));
>
> boot_params.alt_mem_k = 128 * 1024;
> boot_params.e820_entries = 1;
> boot_params.e820_map[0].addr = 0x00000000;
> boot_params.e820_map[0].size = 128 * 1024;
> boot_params.e820_map[0].type = 1;
>
> asm( "movw $0x18, %%cx\n" \
> "movl %%ecx, %%ds\n" \
> "movl %%ecx, %%es\n" \
> "movl %%ecx, %%fs\n" \
> "movl %%ecx, %%gs\n" \
> "movl %%ecx, %%ss\n" \
> "xorl %%ebp, %%ebp\n" \
> "xorl %%edi, %%edi\n" \
> "xorl %%ebx, %%ebx\n" \
> "movl %0, %%esi\n"
^^
> "movl $0x100000, %%eax\n" \
> "jmpl *%%eax" : : "r"(&boot_params));
^

At this point you have probably clobbered the register that you have
your boot_params in.

Instead, do something like:

asm volatile(
"movl %0, %%ds\n" \
"movl %0, %%es\n" \
"movl %0, %%fs\n" \
"movl %0, %%gs\n" \
"movl %0, %%ss\n" \
"xorl %ebp, %ebp\n" \
"xorl %ebx, %ebx\n" \
"movl $0x100000, %%eax\n" \
"ljmpl $0x10,$0x100000"
: : "S" (&boot_params), "D" (0), "c" (0x18));

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.



== 2 of 2 ==
Date: Sat, Feb 20 2010 10:00 pm
From: "H. Peter Anvin"


On 02/20/2010 09:45 PM, H. Peter Anvin wrote:
>
> Instead, do something like:
>
> asm volatile(
> "movl %0, %%ds\n" \
> "movl %0, %%es\n" \
> "movl %0, %%fs\n" \
> "movl %0, %%gs\n" \
> "movl %0, %%ss\n" \
> "xorl %ebp, %ebp\n" \
> "xorl %ebx, %ebx\n" \
> "movl $0x100000, %%eax\n" \
> "ljmpl $0x10,$0x100000"
> : : "S" (&boot_params), "D" (0), "c" (0x18));
>
> -hpa
>

Make that:

asm volatile(
"movl %2, %%ds\n" \
"movl %2, %%es\n" \
"movl %2, %%fs\n" \
"movl %2, %%gs\n" \
"movl %2, %%ss\n" \
"xorl %%ebp, %%ebp\n" \
"xorl %%ebx, %%ebx\n" \
"ljmpl $0x10,$0x100000"
: : "S" (&boot_params), "D" (0), "r" (0x18));

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.


==============================================================================
TOPIC: net: reserve ports for applications using fixed port numbers
http://groups.google.com/group/linux.kernel/t/c3a56d26e9f97e5e?hl=en
==============================================================================

== 1 of 2 ==
Date: Sat, Feb 20 2010 10:30 pm
From: Bill Fink


On Sat, 20 Feb 2010, Octavian Purdila wrote:

> On Saturday 20 February 2010 10:11:40 you wrote:
> > Octavian Purdila wrote:
> > > This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
> > > allows users to reserve ports for third-party applications.
> > >
> > > The reserved ports will not be used by automatic port assignments
> > > (e.g. when calling connect() or bind() with port number 0). Explicit
> > > port allocation behavior is unchanged.
> > >
> > > Changes from the previous version:
> > > - switch the /proc entry format to comma separated list of port ranges
> > > - treat -EFAULT just like any other error and acknowledge written values
> > > - use isdigit() in proc_get_ulong
> > >
> > > Octavian Purdila (3):
> > > sysctl: refactor integer handling proc code
> > > sysctl: add proc_do_large_bitmap
> > > net: reserve ports for applications using fixed port numbers
> >
> > Hi,
> >
> > This version looks fine for me, but I need to give them a test, and
> > I will put feedbacks asap. Thanks for your work!
> >
> > Still two things:
> >
> > 1) bitops are always atomic on every arch, right? If yes, then ok.
>
> AFAIK, yes.
>
> > 2) I hope you could add some documentation to show the relations
> > between ip_local_port_range and ip_local_reserved_ports.
> >
>
> How does this sound:
>
> ip_local_reserved_ports - list of comma separated ranges
> Specify the ports which are reserved for known third-party
> applications. These ports will not be used by automatic port
> assignments (e.g. when calling connect() or bind() with port
> number 0). Explicit port allocation behavior is unchanged.
>
> The format used for both input and output is a comma separated
> list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
> 10). Writing to the file will clear all previously reserved
> ports and update the current list with the one given in the
> input.
>
> Note that ip_local_port_range and ip_local_port_range settings

Change second ip_local_port_range to ip_local_reserved_ports.

-Bill

> are independent and both are considered by the kernel when
> determining which ports are available for automatic port
> assignments.
>
> You can reserve ports which are not in the current
> ip_local_port_range, e.g.:
>
> $ cat /proc/sys/net/ipv4/ip_local_port_range
> 32000 61000
> $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
> 8080,9148
>
> although this is redundant. However such a setting is useful
> if later the port range is changed to a value that will
> include the reserved ports.
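The interaction of the two settings described in the quoted text can be modelled in userspace roughly as follows. This is only a sketch of the documented behaviour, not the kernel's implementation: the struct and function names are hypothetical, the reserved ports are kept in a plain array rather than a bitmap, and the range bounds are assumed to be inclusive.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two sysctl settings. */
struct port_policy {
	unsigned int range_low;		/* ip_local_port_range, lower bound */
	unsigned int range_high;	/* ip_local_port_range, upper bound */
	const unsigned int *reserved;	/* ip_local_reserved_ports entries */
	size_t nr_reserved;
};

bool port_is_reserved(const struct port_policy *p, unsigned int port)
{
	size_t i;

	for (i = 0; i < p->nr_reserved; i++)
		if (p->reserved[i] == port)
			return true;
	return false;
}

/* A port is a candidate for automatic assignment only if both checks pass. */
bool port_auto_assignable(const struct port_policy *p, unsigned int port)
{
	if (port < p->range_low || port > p->range_high)
		return false;
	return !port_is_reserved(p, port);
}
```

With the range 32000-61000 and reserved ports 8080 and 9148, port 8080 is excluded by the range alone. If the range is later widened to start at 1024, 8080 stays excluded because of the reservation, while 8081 becomes assignable.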


== 2 of 2 ==
Date: Sat, Feb 20 2010 10:40 pm
From: Cong Wang


Octavian Purdila wrote:
> This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
> allows users to reserve ports for third-party applications.
>
> The reserved ports will not be used by automatic port assignments
> (e.g. when calling connect() or bind() with port number 0). Explicit
> port allocation behavior is unchanged.
>
> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
> Signed-off-by: WANG Cong <amwang@redhat.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>


My test case shows this works as expected, I mean reserving local ports.
So, for this one,

Acked-by: WANG Cong <amwang@redhat.com>

Thanks for your work!

==============================================================================
TOPIC: sysctl: add proc_do_large_bitmap
http://groups.google.com/group/linux.kernel/t/7fe3356887639418?hl=en
==============================================================================

== 1 of 1 ==
Date: Sat, Feb 20 2010 10:40 pm
From: Cong Wang


Octavian Purdila wrote:
> The new function can be used to read/write large bitmaps via /proc. A
> comma separated range format is used for compact output and input
> (e.g. 1,3-4,10-10).
>
> Writing into the file will first reset the bitmap then update it
> based on the given input.
>
> Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
> Cc: WANG Cong <amwang@redhat.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> ---
> include/linux/sysctl.h | 2 +
> kernel/sysctl.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 124 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index f66014c..7bb5cb6 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -980,6 +980,8 @@ extern int proc_doulongvec_minmax(struct ctl_table *, int,
> void __user *, size_t *, loff_t *);
> extern int proc_doulongvec_ms_jiffies_minmax(struct ctl_table *table, int,
> void __user *, size_t *, loff_t *);
> +extern int proc_do_large_bitmap(struct ctl_table *, int,
> + void __user *, size_t *, loff_t *);
>
> /*
> * Register a set of sysctl names by calling register_sysctl_table
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 5259727..ef2c13d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2635,6 +2635,128 @@ static int proc_do_cad_pid(struct ctl_table *table, int write,
> return 0;
> }
>
> +/**
> + * proc_do_large_bitmap - read/write from/to a large bitmap
> + * @table: the sysctl table
> + * @write: %TRUE if this is a write to the sysctl file
> + * @buffer: the user buffer
> + * @lenp: the size of the user buffer
> + * @ppos: file position
> + *
> + * The bitmap is stored at table->data and the bitmap length (in bits)
> + * in table->maxlen.
> + *
> + * We use a range comma separated format (e.g. 1,3-4,10-10) so that
> + * large bitmaps may be represented in a compact manner. Writing into
> + * the file will clear the bitmap then update it with the given input.

My test shows it still accepts spaces, e.g.

echo '50000 50003 50005' > ip_local_reserved_ports

works same as

echo '50000,50003,50005' > ip_local_reserved_ports

Is this expected? We will only accept commas, right?


Also, if I write an invalid value, it does reject this, but the previous
value in that file is cleared, shouldn't we keep the previous one?
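If the answer to both questions were "commas only" and "keep the old value on a bad write", the parsing could be structured along these lines. This is a userspace sketch, not the code from the patch: all names are hypothetical, the bitmap size of 65536 bits is an assumption (one bit per port number), and handling of the trailing newline that echo appends is left out.

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NBITS 65536	/* assumed size: one bit per port number */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define NWORDS (NBITS / BITS_PER_WORD)

struct bitmap {
	unsigned long words[NWORDS];
};

void bitmap_set_bit(struct bitmap *bm, unsigned long bit)
{
	bm->words[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

bool bitmap_test_bit(const struct bitmap *bm, unsigned long bit)
{
	return (bm->words[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL;
}

/* Parse one unsigned decimal number; reject signs, spaces and overflow. */
static bool parse_number(const char **s, unsigned long *out)
{
	char *end;

	if (!isdigit((unsigned char)**s))
		return false;
	*out = strtoul(*s, &end, 10);
	if (*out >= NBITS)
		return false;
	*s = end;
	return true;
}

/*
 * Parse a list such as "1,3-4,10-10" into *bm. Returns 0 on success and
 * -1 on any syntax or range error. The new value is built in a temporary
 * and copied over only if the whole input is valid, so a rejected write
 * leaves the previous contents of *bm in place.
 */
int bitmap_parse_ranges(struct bitmap *bm, const char *s)
{
	struct bitmap tmp;
	unsigned long lo, hi, bit;

	memset(&tmp, 0, sizeof(tmp));
	while (*s != '\0') {
		if (!parse_number(&s, &lo))
			return -1;
		hi = lo;
		if (*s == '-') {
			s++;
			if (!parse_number(&s, &hi) || hi < lo)
				return -1;
		}
		for (bit = lo; bit <= hi; bit++)
			bitmap_set_bit(&tmp, bit);
		if (*s == ',') {
			s++;
			if (*s == '\0')
				return -1;	/* trailing comma */
		} else if (*s != '\0') {
			return -1;	/* only ',' may separate entries */
		}
	}
	*bm = tmp;
	return 0;
}
```

With this structure, "50000,50003,50005" is accepted, "50000 50003 50005" is rejected, and a rejected write does not disturb what was stored before.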


==============================================================================
TOPIC: fs/exec.c: restrict initial stack space expansion to rlimit
http://groups.google.com/group/linux.kernel/t/f35189a652748353?hl=en
==============================================================================

== 1 of 1 ==
Date: Sat, Feb 20 2010 10:50 pm
From: Michael Neuling


In message <20100219163238.671588178@kvm.kroah.org> you wrote:
> 2.6.32-stable review patch. If anyone has any objections, please let us
> know.
>
> ------------------
>
> From: Michael Neuling <mikey@neuling.org>
>
> commit 803bf5ec259941936262d10ecc84511b76a20921 upstream.
>
> When reserving stack space for a new process, make sure we're not
> attempting to expand the stack by more than rlimit allows.

This breaks UML, so you need to take this also:

http://patchwork.kernel.org/patch/79365/

It's in akpm's tree only so far.

Mikey


>
> This fixes a bug caused by b6a2fea39318e43fee84fa7b0b90d68bed92d2ba ("mm:
> variable length argument support") and unmasked by
> fc63cf237078c86214abcb2ee9926d8ad289da9b ("exec: setup_arg_pages() fails
> to return errors").
>
> This bug means that when limiting the stack to less than 20*PAGE_SIZE (eg.
> 80K on 4K pages or 'ulimit -s 79') all processes will be killed before
> they start. This is particularly bad with 64K pages, where a ulimit below
> 1280K will kill every process.
>
> To test, do:
>
> 'ulimit -s 15; ls'
>
> before and after the patch is applied. Before it's applied, 'ls' should
> be killed. After the patch is applied, 'ls' should no longer be killed.
>
> A stack limit of 15KB is used since it's small enough to trigger the
> 20*PAGE_SIZE case. Also, 15KB is not a multiple of PAGE_SIZE, which is a
> trickier case to handle correctly with this code.
>
> 4K pages should be fine to test with.
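The numbers in the commit message can be reproduced with a little arithmetic. The sketch below is a userspace illustration only: the helper names are hypothetical, and the value 20 for EXTRA_STACK_VM_PAGES is taken from the "20*PAGE_SIZE" wording above rather than from the kernel headers.

```c
#define EXTRA_STACK_VM_PAGES 20UL	/* the 20 in "20*PAGE_SIZE" */

/* Bytes of initial stack expansion requested for a given page size. */
unsigned long stack_expand_bytes(unsigned long page_size)
{
	return EXTRA_STACK_VM_PAGES * page_size;
}

/* Round a byte limit down to a page boundary; page_size is a power of two. */
unsigned long align_down_to_page(unsigned long limit, unsigned long page_size)
{
	return limit & ~(page_size - 1);
}
```

With 4K pages the expansion is 81920 bytes (80K), with 64K pages it is 1310720 bytes (1280K), and a 15KB limit rounds down to 12KB on 4K pages, which is why 'ulimit -s 15' exercises both the small-limit case and the unaligned case.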
>
> [kosaki.motohiro@jp.fujitsu.com: cleanup]
> [akpm@linux-foundation.org: cleanup cleanup]
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Americo Wang <xiyou.wangcong@gmail.com>
> Cc: Anton Blanchard <anton@samba.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: James Morris <jmorris@namei.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Serge Hallyn <serue@us.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
>
> ---
> fs/exec.c | 21 +++++++++++++++++++--
> 1 file changed, 19 insertions(+), 2 deletions(-)
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -572,6 +572,9 @@ int setup_arg_pages(struct linux_binprm
> struct vm_area_struct *prev = NULL;
> unsigned long vm_flags;
> unsigned long stack_base;
> + unsigned long stack_size;
> + unsigned long stack_expand;
> + unsigned long rlim_stack;
>
> #ifdef CONFIG_STACK_GROWSUP
> /* Limit stack size to 1GB */
> @@ -628,10 +631,24 @@ int setup_arg_pages(struct linux_binprm
> goto out_unlock;
> }
>
> + stack_expand = EXTRA_STACK_VM_PAGES * PAGE_SIZE;
> + stack_size = vma->vm_end - vma->vm_start;
> + /*
> + * Align this down to a page boundary as expand_stack
> + * will align it up.
> + */
> + rlim_stack = rlimit(RLIMIT_STACK) & PAGE_MASK;
> + rlim_stack = min(rlim_stack, stack_size);
> #ifdef CONFIG_STACK_GROWSUP
> - stack_base = vma->vm_end + EXTRA_STACK_VM_PAGES * PAGE_SIZE;
> + if (stack_size + stack_expand > rlim_stack)
> + stack_base = vma->vm_start + rlim_stack;
> + else
> + stack_base = vma->vm_end + stack_expand;
> #else
> - stack_base = vma->vm_start - EXTRA_STACK_VM_PAGES * PAGE_SIZE;
> + if (stack_size + stack_expand > rlim_stack)
> + stack_base = vma->vm_end - rlim_stack;
> + else
> + stack_base = vma->vm_start - stack_expand;
>
