twitter: linux.kernel - 26 new messages in 15 topics

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

Today's topics:

* tracing/kprobes: Make Kconfig dependencies generic - 3 messages, 3 authors
http://groups.google.com/group/linux.kernel/t/f25a85a28b5e0740?hl=en
* x86-32: use SSE for atomic64_read/set if available - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/c7fe1bc8eb70e0f9?hl=en
* net: TCP thin-stream detection - 4 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/45be4449975ad2af?hl=en
* net: TCP thin-stream latency-improving modifications - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/fc7b6fca6e8d0f83?hl=en
* s2disk hang update - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/69e5c9798a1fe4e7?hl=en
* powerpc: implement arch_scale_smt_power for Power7 - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/891f3a14ac88e3fb?hl=en
* 33-rc8 Running aplay with pulse as the default - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/0905202c42c4b8f3?hl=en
* kernel/xserver-xorg: X crash while either idle or busy - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/22f68f2333578c8b?hl=en
* PROBLEM: oops w/ bridge in 2.6.32.7 - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/64824c5fceda670a?hl=en
* perf record and multiple events - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/ee0cb0c44ccbb5e0?hl=en
* cpuset,mm: update tasks' mems_allowed in time (58568d2) - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/c97c21f117bf365d?hl=en
* KVM: SVM: Don't use kmap_atomic in nested_svm_map - 4 messages, 1 author
http://groups.google.com/group/linux.kernel/t/b608877f8fa7c926?hl=en
* Stupid futex question - 2.6.33-rc7-mmotm0210 - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/3418d4e896d1113f?hl=en
* Linux mdadm superblock question. - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/f58e89a4f371364a?hl=en
* Call +234 802 972 9104 - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/6432b553efc79d0c?hl=en

==============================================================================
TOPIC: tracing/kprobes: Make Kconfig dependencies generic
http://groups.google.com/group/linux.kernel/t/f25a85a28b5e0740?hl=en
==============================================================================

== 1 of 3 ==
Date: Thurs, Feb 18 2010 4:40 am
From: Frederic Weisbecker

On Thu, Feb 18, 2010 at 07:12:08AM -0500, Mike Frysinger wrote:
> On Thu, Feb 18, 2010 at 07:09, Heiko Carstens wrote:
> > On Thu, Feb 18, 2010 at 06:18:20AM -0500, Mike Frysinger wrote:
> >> On Thu, Feb 18, 2010 at 06:13, Frederic Weisbecker wrote:
> >> > --- a/arch/Kconfig
> >> > +++ b/arch/Kconfig
> >> > @@ -121,6 +121,9 @@ config HAVE_DMA_ATTRS
> >> > config USE_GENERIC_SMP_HELPERS
> >> > bool
> >> >
> >> > +config HAVE_REGS_AND_STACK_ACCESS_API
> >> > + bool
> >> > +
> >>
> >> could you add an appropriate help/comment so arch peeps know what
> >> needs to be implemented before they can select this
> >
> > That's why I added the commit ID for the regs and stack access api
> > to the changelog. imho that should be sufficient.
> > Besides that the next commit would implement it for s390 as a
> > blueprint for others. That is... for those that missed the initial
> > x86 implementation.
>
> people shouldnt have to dive into the changelog to try and divine
> documentation. it's hardly standard, so people fall on it in a
> pima-last-resort kind of way. being explicit in the file up front by
> writing real documentation saves other people a lot more time.
> -mike

Yeah, would be nice to have a comment above the config definition
to explain what it implies.

Heiko, mind sending a delta patch for that?

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 2 of 3 ==
Date: Thurs, Feb 18 2010 5:30 am
From: Heiko Carstens

On Thu, Feb 18, 2010 at 01:34:57PM +0100, Frederic Weisbecker wrote:
> On Thu, Feb 18, 2010 at 07:12:08AM -0500, Mike Frysinger wrote:
> > On Thu, Feb 18, 2010 at 07:09, Heiko Carstens wrote:
> > > On Thu, Feb 18, 2010 at 06:18:20AM -0500, Mike Frysinger wrote:
> > >> On Thu, Feb 18, 2010 at 06:13, Frederic Weisbecker wrote:
> > >> > --- a/arch/Kconfig
> > >> > +++ b/arch/Kconfig
> > >> > @@ -121,6 +121,9 @@ config HAVE_DMA_ATTRS
> > >> > config USE_GENERIC_SMP_HELPERS
> > >> > bool
> > >> >
> > >> > +config HAVE_REGS_AND_STACK_ACCESS_API
> > >> > + bool
> > >> > +
> > >>
> > >> could you add an appropriate help/comment so arch peeps know what
> > >> needs to be implemented before they can select this
> > >
> > > That's why I added the commit ID for the regs and stack access api
> > > to the changelog. imho that should be sufficient.
> > > Besides that the next commit would implement it for s390 as a
> > > blueprint for others. That is... for those that missed the initial
> > > x86 implementation.
> >
> > people shouldnt have to dive into the changelog to try and divine
> > documentation. it's hardly standard, so people fall on it in a
> > pima-last-resort kind of way. being explicit in the file up front by
> > writing real documentation saves other people a lot more time.
> > -mike
>
>
> Yeah, would be nice to have a comment above the config definition
> to explain what it implies.
>
> Heiko, mind sending a delta patch for that?

Subject: [PATCH] tracing/kprobes: add short documentation for HAVE_REGS_AND_STACK_ACCESS_API

From: Heiko Carstens <heiko.carstens@de.ibm.com>

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
arch/Kconfig | 4 ++++
1 file changed, 4 insertions(+)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -123,6 +123,10 @@ config USE_GENERIC_SMP_HELPERS

config HAVE_REGS_AND_STACK_ACCESS_API
bool
+ help
+ This symbol should be selected by an architecture if it supports
+ the API needed to access registers and stack entries from pt_regs.
+ For example the kprobes-based event tracer needs this API.

config HAVE_CLK
bool

== 3 of 3 ==
Date: Thurs, Feb 18 2010 6:10 am
From: Mike Frysinger

On Thu, Feb 18, 2010 at 08:25, Heiko Carstens wrote:
> On Thu, Feb 18, 2010 at 01:34:57PM +0100, Frederic Weisbecker wrote:
>> On Thu, Feb 18, 2010 at 07:12:08AM -0500, Mike Frysinger wrote:
>> > On Thu, Feb 18, 2010 at 07:09, Heiko Carstens wrote:
>> > > On Thu, Feb 18, 2010 at 06:18:20AM -0500, Mike Frysinger wrote:
>> > >> On Thu, Feb 18, 2010 at 06:13, Frederic Weisbecker wrote:
>> > >> > --- a/arch/Kconfig
>> > >> > +++ b/arch/Kconfig
>> > >> > @@ -121,6 +121,9 @@ config HAVE_DMA_ATTRS
>> > >> > config USE_GENERIC_SMP_HELPERS
>> > >> > bool
>> > >> >
>> > >> > +config HAVE_REGS_AND_STACK_ACCESS_API
>> > >> > + bool
>> > >> > +
>> > >>
>> > >> could you add an appropriate help/comment so arch peeps know what
>> > >> needs to be implemented before they can select this
>> > >
>> > > That's why I added the commit ID for the regs and stack access api
>> > > to the changelog. imho that should be sufficient.
>> > > Besides that the next commit would implement it for s390 as a
>> > > blueprint for others. That is... for those that missed the initial
>> > > x86 implementation.
>> >
>> > people shouldnt have to dive into the changelog to try and divine
>> > documentation. it's hardly standard, so people fall on it in a
>> > pima-last-resort kind of way. being explicit in the file up front by
>> > writing real documentation saves other people a lot more time.
>> > -mike
>>
>>
>> Yeah, would be nice to have a comment above the config definition
>> to explain what it implies.
>>
>> Heiko, mind sending a delta patch for that?
>
> Subject: [PATCH] tracing/kprobes: add short documentation for HAVE_REGS_AND_STACK_ACCESS_API
>
> From: Heiko Carstens <heiko.carstens@de.ibm.com>
>
> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> ---
> arch/Kconfig | 4 ++++
> 1 file changed, 4 insertions(+)
>
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -123,6 +123,10 @@ config USE_GENERIC_SMP_HELPERS
>
> config HAVE_REGS_AND_STACK_ACCESS_API
> bool
> + help
> + This symbol should be selected by an architecture if it supports
> + the API needed to access registers and stack entries from pt_regs.
> + For example the kprobes-based event tracer needs this API.

a bit vague ... arent there headers/functions people could look at ?
perhaps you're talking about the regset functions (which is an API to
access registers in pt_regs) ? or you're talking about asm/syscall.h
(which is an API to access registers in pt_regs) ?

i'm not asking to be a pain, i'm asking because i really havent a
clue. if i wanted to add support for this stuff to the Blackfin arch,
i wouldnt know where to start. even after reading this help i'd fall
back to grepping arch/x86/ and trying to divine a starting point from
there.
-mike
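
For readers with the same question: as far as can be told from the x86 tree of that period, the API in question is the set of pt_regs helpers in arch/x86/include/asm/ptrace.h, such as regs_query_register_offset() and regs_get_register(), which map a register name to a byte offset and then read the value at that offset. The user-space sketch below only illustrates that name-to-offset idea. The mock_ prefix, the four-register struct and the bounds check are assumptions made for illustration, not kernel code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down register set; a real struct pt_regs is arch-specific. */
struct mock_pt_regs {
	unsigned long ax;
	unsigned long bx;
	unsigned long sp;
	unsigned long ip;
};

struct mock_reg_offset {
	const char *name;
	int offset;
};

#define MOCK_REG_OFFSET(r) { #r, (int)offsetof(struct mock_pt_regs, r) }

static const struct mock_reg_offset mock_reg_offsets[] = {
	MOCK_REG_OFFSET(ax),
	MOCK_REG_OFFSET(bx),
	MOCK_REG_OFFSET(sp),
	MOCK_REG_OFFSET(ip),
};

/* Map a register name to its byte offset in the struct, or -1 if unknown. */
static inline int mock_regs_query_register_offset(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(mock_reg_offsets) / sizeof(mock_reg_offsets[0]); i++)
		if (strcmp(mock_reg_offsets[i].name, name) == 0)
			return mock_reg_offsets[i].offset;
	return -1;
}

/* Read a register by offset; returns 0 for an out-of-range or misaligned offset. */
static inline unsigned long
mock_regs_get_register(const struct mock_pt_regs *regs, unsigned int offset)
{
	if (offset > offsetof(struct mock_pt_regs, ip) ||
	    offset % sizeof(unsigned long) != 0)
		return 0;
	return *(const unsigned long *)((const char *)regs + offset);
}
```

An architecture selecting HAVE_REGS_AND_STACK_ACCESS_API would provide the real equivalents for its own struct pt_regs, plus the helpers for reading kernel stack entries, which this sketch leaves out.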

==============================================================================
TOPIC: x86-32: use SSE for atomic64_read/set if available
http://groups.google.com/group/linux.kernel/t/c7fe1bc8eb70e0f9?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Feb 18 2010 4:40 am
From: Luca Barbieri

>> If we use the stack instead of per-CPU variables, all IRQs and NMIs
>> preserve CR0 and the SSE registers, so this would be safe, right?
>
> You'd have to take special care to deal with nested IRQs I think.

Could you elaborate on that?
Which issue could there be with nested IRQs?

== 2 of 2 ==
Date: Thurs, Feb 18 2010 5:50 am
From: Luca Barbieri

> Depends on where on the stack you're going to save things, I thought
> you'd take space in the thread_info struct, but I guess if you're simply
> going to push the reg onto the stack it should be fine.

Yes, this seems the best solution.
With frame pointers enabled, it's just a single andl $-8, %esp to
align the stack (otherwise, frame pointers are forced by gcc).
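
For reference, the effect of `andl $-8, %esp` is to round the stack pointer down to a multiple of 8, since -8 in two's complement is the mask ~7. A minimal C sketch of the same arithmetic follows. The function name is made up for illustration.

```c
#include <stdint.h>

/* Round an address down to a multiple of 8, as `andl $-8, %esp` does. */
static inline uintptr_t align_down_8(uintptr_t addr)
{
	return addr & ~(uintptr_t)7;	/* -8 in two's complement is ~7 */
}
```

Because the stack grows downwards on x86, rounding down never overwrites data already pushed. It only skips up to 7 bytes.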

==============================================================================
TOPIC: net: TCP thin-stream detection
http://groups.google.com/group/linux.kernel/t/45be4449975ad2af?hl=en
==============================================================================

== 1 of 4 ==
Date: Thurs, Feb 18 2010 4:50 am
From: Andreas Petlund

Inline function to dynamically detect thin streams based on
the number of packets in flight. Used to dynamically trigger
thin-stream mechanisms if enabled by ioctl or sysctl.

Signed-off-by: Andreas Petlund <apetlund@simula.no>
---
Documentation/networking/tcp-thin.txt | 47 +++++++++++++++++++++++++++++++++
include/net/tcp.h | 8 +++++
2 files changed, 55 insertions(+), 0 deletions(-)
create mode 100644 Documentation/networking/tcp-thin.txt

diff --git a/Documentation/networking/tcp-thin.txt b/Documentation/networking/tcp-thin.txt
new file mode 100644
index 0000000..151e229
--- /dev/null
+++ b/Documentation/networking/tcp-thin.txt
@@ -0,0 +1,47 @@
+Thin-streams and TCP
+====================
+A wide range of Internet-based services that use reliable transport
+protocols display what we call thin-stream properties. This means
+that the application sends data with such a low rate that the
+retransmission mechanisms of the transport protocol are not fully
+effective. In time-dependent scenarios (like online games, control
+systems, stock trading etc.) where the user experience depends
+on the data delivery latency, packet loss can be devastating for
+the service quality. Extreme latencies are caused by TCP's
+dependency on the arrival of new data from the application to trigger
+retransmissions effectively through fast retransmit instead of
+waiting for long timeouts.
+
+After analysing a large number of time-dependent interactive
+applications, we have seen that they often produce thin streams
+and also stay with this traffic pattern throughout its entire
+lifespan. The combination of time-dependency and the fact that the
+streams provoke high latencies when using TCP is unfortunate.
+
+In order to reduce application-layer latency when packets are lost,
+a set of mechanisms has been made, which address these latency issues
+for thin streams. In short, if the kernel detects a thin stream,
+the retransmission mechanisms are modified in the following manner:
+
+1) If the stream is thin, fast retransmit on the first dupACK.
+2) If the stream is thin, do not apply exponential backoff.
+
+These enhancements are applied only if the stream is detected as
+thin. This is accomplished by defining a threshold for the number
+of packets in flight. If there are less than 4 packets in flight,
+fast retransmissions can not be triggered, and the stream is prone
+to experience high retransmission latencies.
+
+Since these mechanisms are targeted at time-dependent applications,
+they must be specifically activated by the application using the
+TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK IOCTLS or the
+tcp_thin_linear_timeouts and tcp_thin_dupack sysctls. Both
+modifications are turned off by default.
+
+References
+==========
+More information on the modifications, as well as a wide range of
+experimental data can be found here:
+"Improving latency for interactive, thin-stream applications over
+reliable transport"
+http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 75a00c8..0bdc3f6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1386,6 +1386,14 @@ static inline void tcp_highest_sack_combine(struct sock *sk,
tcp_sk(sk)->highest_sack = new;
}

+/* Determines whether this is a thin stream (which may suffer from
+ * increased latency). Used to trigger latency-reducing mechanisms.
+ */
+static inline unsigned int tcp_stream_is_thin(struct tcp_sock *tp)
+{
+ return tp->packets_out < 4 && !tcp_in_initial_slowstart(tp);
+}
+
/* /proc */
enum tcp_seq_states {
TCP_SEQ_STATE_LISTENING,
--
1.6.3.3
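
The detection rule in the patch above can be tried outside the kernel with a small stand-in. This is a sketch under stated assumptions: the sketch_ names and the two-field struct are hypothetical, and the slow-start test assumes that tcp_in_initial_slowstart() compares snd_ssthresh against an "infinite" sentinel of 0x7fffffff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of the kernel's "ssthresh not set yet" sentinel. */
#define SKETCH_INFINITE_SSTHRESH 0x7fffffffu

/* Hypothetical stand-in for the two struct tcp_sock fields the check reads. */
struct sketch_tcp_sock {
	uint32_t packets_out;	/* packets currently in flight */
	uint32_t snd_ssthresh;	/* slow-start threshold */
};

static inline bool sketch_in_initial_slowstart(const struct sketch_tcp_sock *tp)
{
	return tp->snd_ssthresh >= SKETCH_INFINITE_SSTHRESH;
}

/* Thin: fewer than 4 packets in flight and past the initial slow start. */
static inline bool sketch_stream_is_thin(const struct sketch_tcp_sock *tp)
{
	return tp->packets_out < 4 && !sketch_in_initial_slowstart(tp);
}
```

The slow-start exclusion matters because every new connection starts with few packets in flight. Without it, bulk transfers would be classified as thin during their first round trips.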

== 2 of 4 ==
Date: Thurs, Feb 18 2010 4:50 am
From: Andreas Petlund

This patch enables fast retransmissions after one dupACK for
TCP if the stream is identified as thin. This will reduce
latencies for thin streams that are not able to trigger fast
retransmissions due to high packet interarrival time. This
mechanism is only active if enabled by iocontrol or syscontrol
and the stream is identified as thin.

Signed-off-by: Andreas Petlund <apetlund@simula.no>
---
Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
include/linux/tcp.h | 4 +++-
include/net/tcp.h | 1 +
net/ipv4/sysctl_net_ipv4.c | 7 +++++++
net/ipv4/tcp.c | 7 +++++++
net/ipv4/tcp_input.c | 12 ++++++++++++
6 files changed, 42 insertions(+), 1 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index f147310..2571a62 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -499,6 +499,18 @@ tcp_thin_linear_timeouts - BOOLEAN
Documentation/networking/tcp-thin.txt
Default: 0

+tcp_thin_dupack - BOOLEAN
+ Enable dynamic triggering of retransmissions after one dupACK
+ for thin streams. If set, a check is performed upon reception
+ of a dupACK to determine if the stream is thin (less than 4
+ packets in flight). As long as the stream is found to be thin,
+ data is retransmitted on the first received dupACK. This
+ improves retransmission latency for non-aggressive thin
+ streams, often found to be time-dependent.
+ For more information on thin streams, see
+ Documentation/networking/tcp-thin.txt
+ Default: 0
+
UDP variables:

udp_mem - vector of 3 INTEGERs: min, pressure, max
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 3ba8b07..a778ee0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -104,6 +104,7 @@ enum {
#define TCP_MD5SIG 14 /* TCP MD5 Signature (RFC2385) */
#define TCP_COOKIE_TRANSACTIONS 15 /* TCP Cookie Transactions */
#define TCP_THIN_LINEAR_TIMEOUTS 16 /* Use linear timeouts for thin streams*/
+#define TCP_THIN_DUPACK 17 /* Fast retrans. after 1 dupack */

/* for TCP_INFO socket option */
#define TCPI_OPT_TIMESTAMPS 1
@@ -343,7 +344,8 @@ struct tcp_sock {
u8 frto_counter; /* Number of new acks after RTO */
u8 nonagle : 4,/* Disable Nagle algorithm? */
thin_lto : 1,/* Use linear timeouts for thin streams */
- unused : 3;
+ thin_dupack : 1,/* Fast retransmit on first dupack */
+ unused : 2;

/* RTT measurement */
u32 srtt; /* smoothed round trip time << 3 */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6278fc7..56f0aec 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -245,6 +245,7 @@ extern int sysctl_tcp_slow_start_after_idle;
extern int sysctl_tcp_max_ssthresh;
extern int sysctl_tcp_cookie_size;
extern int sysctl_tcp_thin_linear_timeouts;
+extern int sysctl_tcp_thin_dupack;

extern atomic_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e6a2460..c1bc074 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -582,6 +582,13 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "tcp_thin_dupack",
+ .data = &sysctl_tcp_thin_dupack,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
{
.procname = "udp_mem",
.data = &sysctl_udp_mem,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 21bae9a..5901010 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2236,6 +2236,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
tp->thin_lto = val;
break;

+ case TCP_THIN_DUPACK:
+ if (val < 0 || val > 1)
+ err = -EINVAL;
+ else
+ tp->thin_dupack = val;
+ break;
+
case TCP_CORK:
/* When set indicates to always queue non-full frames.
* Later the user clears this option and we transmit
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3fddc69..8d950b9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -89,6 +89,8 @@ int sysctl_tcp_frto __read_mostly = 2;
int sysctl_tcp_frto_response __read_mostly;
int sysctl_tcp_nometrics_save __read_mostly;

+int sysctl_tcp_thin_dupack __read_mostly;
+
int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
int sysctl_tcp_abc __read_mostly;

@@ -2447,6 +2449,16 @@ static int tcp_time_to_recover(struct sock *sk)
return 1;
}

+ /* If a thin stream is detected, retransmit after first
+ * received dupack. Employ only if SACK is supported in order
+ * to avoid possible corner-case series of spurious retransmissions
+ * Use only if there are no unsent data.
+ */
+ if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
+ tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
+ tcp_is_sack(tp) && sk->sk_send_head == NULL)
+ return 1;
+
return 0;
}

--
1.6.3.3
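
Although the descriptions speak of an "ioctl", both patches add the per-socket switch in do_tcp_setsockopt(), so an application would presumably enable the mechanisms with setsockopt() at the IPPROTO_TCP level. A minimal user-space sketch follows. The function name is hypothetical, and the fallback defines simply repeat the option numbers from the patches in case the installed headers lack them.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Option numbers as assigned in the patches; defined here only if missing. */
#ifndef TCP_THIN_LINEAR_TIMEOUTS
#define TCP_THIN_LINEAR_TIMEOUTS 16
#endif
#ifndef TCP_THIN_DUPACK
#define TCP_THIN_DUPACK 17
#endif

/*
 * Enable both thin-stream mechanisms on a TCP socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt().
 */
static inline int enable_thin_stream_opts(int fd)
{
	int one = 1;

	if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_LINEAR_TIMEOUTS,
		       &one, sizeof(one)) < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_THIN_DUPACK,
		       &one, sizeof(one)) < 0)
		return -1;
	return 0;
}
```

Per the patches, only the values 0 and 1 are accepted. Anything else yields -EINVAL. Setting the option only arms the mechanism: the kernel still applies it only while the stream is detected as thin.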

== 3 of 4 ==
Date: Thurs, Feb 18 2010 4:50 am
From: Andreas Petlund

This patch will make TCP use only linear timeouts if the
stream is thin. This will help to avoid the very high latencies
that thin streams suffer because of exponential backoff. This
mechanism is only active if enabled by iocontrol or syscontrol
and the stream is identified as thin. A maximum of 6 linear
timeouts is tried before exponential backoff is resumed.

Signed-off-by: Andreas Petlund <apetlund@simula.no>
---
Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
include/linux/tcp.h | 5 ++++-
include/net/tcp.h | 4 ++++
net/ipv4/sysctl_net_ipv4.c | 7 +++++++
net/ipv4/tcp.c | 7 +++++++
net/ipv4/tcp_timer.c | 21 ++++++++++++++++++++-
6 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 2dc7a1d..f147310 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -487,6 +487,18 @@ tcp_dma_copybreak - INTEGER
and CONFIG_NET_DMA is enabled.
Default: 4096

+tcp_thin_linear_timeouts - BOOLEAN
+ Enable dynamic triggering of linear timeouts for thin streams.
+ If set, a check is performed upon retransmission by timeout to
+ determine if the stream is thin (less than 4 packets in flight).
+ As long as the stream is found to be thin, up to 6 linear
+ timeouts may be performed before exponential backoff mode is
+ initiated. This improves retransmission latency for
+ non-aggressive thin streams, often found to be time-dependent.
+ For more information on thin streams, see
+ Documentation/networking/tcp-thin.txt
+ Default: 0
+
UDP variables:

udp_mem - vector of 3 INTEGERs: min, pressure, max
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7fee8a4..3ba8b07 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -103,6 +103,7 @@ enum {
#define TCP_CONGESTION 13 /* Congestion control algorithm */
#define TCP_MD5SIG 14 /* TCP MD5 Signature (RFC2385) */
#define TCP_COOKIE_TRANSACTIONS 15 /* TCP Cookie Transactions */
+#define TCP_THIN_LINEAR_TIMEOUTS 16 /* Use linear timeouts for thin streams*/

/* for TCP_INFO socket option */
#define TCPI_OPT_TIMESTAMPS 1
@@ -340,7 +341,9 @@ struct tcp_sock {
u32 frto_highmark; /* snd_nxt when RTO occurred */
u16 advmss; /* Advertised MSS */
u8 frto_counter; /* Number of new acks after RTO */
- u8 nonagle; /* Disable Nagle algorithm? */
+ u8 nonagle : 4,/* Disable Nagle algorithm? */
+ thin_lto : 1,/* Use linear timeouts for thin streams */
+ unused : 3;

/* RTT measurement */
u32 srtt; /* smoothed round trip time << 3 */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0bdc3f6..6278fc7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -196,6 +196,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define TCP_NAGLE_CORK 2 /* Socket is corked */
#define TCP_NAGLE_PUSH 4 /* Cork is overridden for already queued data */

+/* TCP thin-stream limits */
+#define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */
+
extern struct inet_timewait_death_row tcp_death_row;

/* sysctl variables for tcp */
@@ -241,6 +244,7 @@ extern int sysctl_tcp_workaround_signed_windows;
extern int sysctl_tcp_slow_start_after_idle;
extern int sysctl_tcp_max_ssthresh;
extern int sysctl_tcp_cookie_size;
+extern int sysctl_tcp_thin_linear_timeouts;

extern atomic_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 7e3712c..e6a2460 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -576,6 +576,13 @@ static struct ctl_table ipv4_table[] = {
.proc_handler = proc_dointvec
},
{
+ .procname = "tcp_thin_linear_timeouts",
+ .data = &sysctl_tcp_thin_linear_timeouts,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
+ {
.procname = "udp_mem",
.data = &sysctl_udp_mem,
.maxlen = sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e471d03..21bae9a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2229,6 +2229,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
}
break;

+ case TCP_THIN_LINEAR_TIMEOUTS:
+ if (val < 0 || val > 1)
+ err = -EINVAL;
+ else
+ tp->thin_lto = val;
+ break;
+
case TCP_CORK:
/* When set indicates to always queue non-full frames.
* Later the user clears this option and we transmit
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index de7d1bf..a17629b 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -29,6 +29,7 @@ int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
int sysctl_tcp_orphan_retries __read_mostly;
+int sysctl_tcp_thin_linear_timeouts __read_mostly;

static void tcp_write_timer(unsigned long);
static void tcp_delack_timer(unsigned long);
@@ -415,7 +416,25 @@ void tcp_retransmit_timer(struct sock *sk)
icsk->icsk_retransmits++;

out_reset_timer:
- icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
+ /* If stream is thin, use linear timeouts. Since 'icsk_backoff' is
+ * used to reset timer, set to 0. Recalculate 'icsk_rto' as this
+ * might be increased if the stream oscillates between thin and thick,
+ * thus the old value might already be too high compared to the value
+ * set by 'tcp_set_rto' in tcp_input.c which resets the rto without
+ * backoff. Limit to TCP_THIN_LINEAR_RETRIES before initiating
+ * exponential backoff behaviour to avoid continue hammering
+ * linear-timeout retransmissions into a black hole
+ */
+ if (sk->sk_state == TCP_ESTABLISHED &&
+ (tp->thin_lto || sysctl_tcp_thin_linear_timeouts) &&
+ tcp_stream_is_thin(tp) &&
+ icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
+ icsk->icsk_backoff = 0;
+ icsk->icsk_rto = min(__tcp_set_rto(tp), TCP_RTO_MAX);
+ } else {
+ /* Use normal (exponential) backoff */
+ icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
+ }
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
__sk_dst_reset(sk);
--
1.6.3.3
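
The timer decision in the tcp_timer.c hunk above can be modelled as a pure function, which makes the difference between linear and exponential timeouts easy to see. This is a simplified sketch: the sketch_ names are hypothetical, base_rto stands in for whatever __tcp_set_rto() would return, and the TCP_ESTABLISHED check from the patch is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_THIN_LINEAR_RETRIES 6	/* TCP_THIN_LINEAR_RETRIES in the patch */

static inline uint32_t sketch_min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * cur_rto:      timeout that just expired
 * base_rto:     RTO without any backoff applied
 * retransmits:  retransmission count after the increment
 * thin_enabled: socket option or sysctl is set
 * Callers are assumed to keep cur_rto small enough that doubling cannot wrap.
 */
static inline uint32_t sketch_next_rto(uint32_t cur_rto, uint32_t base_rto,
				       uint32_t retransmits, bool thin_enabled,
				       bool stream_is_thin, uint32_t rto_max)
{
	if (thin_enabled && stream_is_thin &&
	    retransmits <= SKETCH_THIN_LINEAR_RETRIES)
		return sketch_min_u32(base_rto, rto_max);	/* linear timeout */

	return sketch_min_u32(cur_rto << 1, rto_max);		/* exponential backoff */
}
```

With a base RTO of 200 and the mechanism enabled, the first six timeouts of a thin stream stay at 200. From the seventh onward the value doubles again, as in unmodified TCP.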

== 4 of 4 ==
Date: Thurs, Feb 18 2010 5:00 am
From: "Ilpo Järvinen"

On Thu, 18 Feb 2010, Andreas Petlund wrote:

> This patch enables fast retransmissions after one dupACK for
> TCP if the stream is identified as thin. This will reduce
> latencies for thin streams that are not able to trigger fast
> retransmissions due to high packet interarrival time. This
> mechanism is only active if enabled by iocontrol or syscontrol
> and the stream is identified as thin.
>
>
> Signed-off-by: Andreas Petlund <apetlund@simula.no>
> ---
> Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
> include/linux/tcp.h | 4 +++-
> include/net/tcp.h | 1 +
> net/ipv4/sysctl_net_ipv4.c | 7 +++++++
> net/ipv4/tcp.c | 7 +++++++
> net/ipv4/tcp_input.c | 12 ++++++++++++
> 6 files changed, 42 insertions(+), 1 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index f147310..2571a62 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -499,6 +499,18 @@ tcp_thin_linear_timeouts - BOOLEAN
> Documentation/networking/tcp-thin.txt
> Default: 0
>
> +tcp_thin_dupack - BOOLEAN
> + Enable dynamic triggering of retransmissions after one dupACK
> + for thin streams. If set, a check is performed upon reception
> + of a dupACK to determine if the stream is thin (less than 4
> + packets in flight). As long as the stream is found to be thin,
> + data is retransmitted on the first received dupACK. This
> + improves retransmission latency for non-aggressive thin
> + streams, often found to be time-dependent.
> + For more information on thin streams, see
> + Documentation/networking/tcp-thin.txt
> + Default: 0
> +
> UDP variables:
>
> udp_mem - vector of 3 INTEGERs: min, pressure, max
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 3ba8b07..a778ee0 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -104,6 +104,7 @@ enum {
> #define TCP_MD5SIG 14 /* TCP MD5 Signature (RFC2385) */
> #define TCP_COOKIE_TRANSACTIONS 15 /* TCP Cookie Transactions */
> #define TCP_THIN_LINEAR_TIMEOUTS 16 /* Use linear timeouts for thin streams*/
> +#define TCP_THIN_DUPACK 17 /* Fast retrans. after 1 dupack */
>
> /* for TCP_INFO socket option */
> #define TCPI_OPT_TIMESTAMPS 1
> @@ -343,7 +344,8 @@ struct tcp_sock {
> u8 frto_counter; /* Number of new acks after RTO */
> u8 nonagle : 4,/* Disable Nagle algorithm? */
> thin_lto : 1,/* Use linear timeouts for thin streams */
> - unused : 3;
> + thin_dupack : 1,/* Fast retransmit on first dupack */
> + unused : 2;
>
> /* RTT measurement */
> u32 srtt; /* smoothed round trip time << 3 */
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 6278fc7..56f0aec 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -245,6 +245,7 @@ extern int sysctl_tcp_slow_start_after_idle;
> extern int sysctl_tcp_max_ssthresh;
> extern int sysctl_tcp_cookie_size;
> extern int sysctl_tcp_thin_linear_timeouts;
> +extern int sysctl_tcp_thin_dupack;
>
> extern atomic_t tcp_memory_allocated;
> extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index e6a2460..c1bc074 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -582,6 +582,13 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> + {
> + .procname = "tcp_thin_dupack",
> + .data = &sysctl_tcp_thin_dupack,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec
> + },
> {
> .procname = "udp_mem",
> .data = &sysctl_udp_mem,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 21bae9a..5901010 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2236,6 +2236,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> tp->thin_lto = val;
> break;
>
> + case TCP_THIN_DUPACK:
> + if (val < 0 || val > 1)
> + err = -EINVAL;
> + else
> + tp->thin_dupack = val;
> + break;
> +
> case TCP_CORK:
> /* When set indicates to always queue non-full frames.
> * Later the user clears this option and we transmit
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 3fddc69..8d950b9 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -89,6 +89,8 @@ int sysctl_tcp_frto __read_mostly = 2;
> int sysctl_tcp_frto_response __read_mostly;
> int sysctl_tcp_nometrics_save __read_mostly;
>
> +int sysctl_tcp_thin_dupack __read_mostly;
> +
> int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
> int sysctl_tcp_abc __read_mostly;
>
> @@ -2447,6 +2449,16 @@ static int tcp_time_to_recover(struct sock *sk)
> return 1;
> }
>
> + /* If a thin stream is detected, retransmit after first
> + * received dupack. Employ only if SACK is supported in order
> + * to avoid possible corner-case series of spurious retransmissions
> + * Use only if there are no unsent data.
> + */
> + if ((tp->thin_dupack || sysctl_tcp_thin_dupack) &&
> + tcp_stream_is_thin(tp) && tcp_dupack_heuristics(tp) > 1 &&
> + tcp_is_sack(tp) && sk->sk_send_head == NULL)

Use tcp_send_head(sk) instead.

> + return 1;
> +
> return 0;
> }

Other than that,

Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

--
i.

==============================================================================
TOPIC: net: TCP thin-stream latency-improving modifications
http://groups.google.com/group/linux.kernel/t/fc7b6fca6e8d0f83?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 4:50 am
From: Andreas Petlund

This is a series of patches enabling non-intrusive, dynamically
triggered modifications that improve retransmission latencies
for thin streams.

The patch set was modified according to the feedback received.

Major change:
-Used bitfields to compact the nonagle variable
in the tcp_sock struct. nonagle, thin_lto and
thin_dupack are now contained in the same u8.

I decided to use bitfields to handle this as it is
already done similarly in the tcp_options_received struct.
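
As a rough illustration of that packing (a sketch only, not the patch itself): the struct name, the helper, and the field widths below are assumptions, and bit-fields of 8-bit type rely on compiler support that GCC and Clang provide.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of the tcp_sock struct.
 * The widths are assumed for illustration; all fields share one byte. */
struct demo_tcp_flags {
	uint8_t nonagle     : 4;  /* Nagle-related flags */
	uint8_t thin_lto    : 1;  /* linear timeouts for thin streams */
	uint8_t thin_dupack : 1;  /* retransmit after the first dupack */
	uint8_t unused      : 2;
};

/* Applies the same 0/1 range check as the TCP_THIN_DUPACK setsockopt hunk.
 * Returns 0 on success or -EINVAL for an out-of-range value. */
static int demo_set_thin_dupack(struct demo_tcp_flags *f, int val)
{
	if (val < 0 || val > 1)
		return -EINVAL;
	f->thin_dupack = (uint8_t)val;
	return 0;
}
```

Because a 1-bit field can only hold 0 or 1, the range check has to happen before the assignment; storing a larger value would silently truncate it.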

Also corrected some formatting issues.

Cheers,
Andreas Petlund
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: s2disk hang update
http://groups.google.com/group/linux.kernel/t/69e5c9798a1fe4e7?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 5:00 am
From: Alan Jenkins

On 2/17/10, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Wednesday 17 February 2010, Alan Jenkins wrote:
>> On 2/16/10, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> > On Tuesday 16 February 2010, Alan Jenkins wrote:
>> >> On 2/16/10, Alan Jenkins <sourcejedi.lkml@googlemail.com> wrote:
>> >> > On 2/15/10, Rafael J. Wysocki <rjw@sisk.pl> wrote:
>> >> >> On Tuesday 09 February 2010, Alan Jenkins wrote:
>> >> >>> Perhaps I spoke too soon. I see the same hang if I run too many
>> >> >>> applications. The first hibernation fails with "not enough swap"
>> >> >>> as
>> >> >>> expected, but the second or third attempt hangs (with the same
>> >> >>> backtrace
>> >> >>> as before).
>> >> >>>
>> >> >>> The patch definitely helps though. Without the patch, I see a hang
>> >> >>> the
>> >> >>> first time I try to hibernate with too many applications running.
>> >> >>
>> >> >> Well, I have an idea.
>> >> >>
>> >> >> Can you try to apply the appended patch in addition and see if that
>> >> >> helps?
>> >> >>
>> >> >> Rafael
>> >> >
>> >> > It doesn't seem to help.
>> >>
>> >> To be clear: It doesn't stop the hang when I hibernate with too many
>> >> applications.
>> >>
>> >> It does stop the same hang in a different case though.
>> >>
>> >> 1. boot with init=/bin/bash
>> >> 2. run s2disk
>> >> 3. cancel the s2disk
>> >> 4. repeat steps 2&3
>> >>
>> >> With the patch, I can run 10s of iterations, with no hang.
>> >> Without the patch, it soon hangs, (in disable_nonboot_cpus(), as
>> >> always).
>> >>
>> >> That's what happens on 2.6.33-rc7. On 2.6.30, there is no problem.
>> >> On 2.6.31 and 2.6.32 I don't get a hang, but dmesg shows an allocation
>> >> failure after a couple of iterations ("kthreadd: page allocation
>> >> failure. order:1, mode:0xd0"). It looks like it might be the same
>> >> stop_machine thread allocation failure that causes the hang.
>> >
>> > Have you tested it alone or on top of the previous one? If you've
>> > tested it
>> > alone, please apply the appended one in addition to it and retest.
>> >
>> > Rafael
>>
>> I did test with both patches applied together -
>>
>> 1. [Update] MM / PM: Force GFP_NOIO during suspend/hibernation and resume
>> 2. "reducing the number of pages that we're going to keep preallocated by
>> 20%"
>
> In that case you can try to reduce the number of preallocated pages even
> more,
> ie. change "/ 5" to "/ 2" (for example) in the second patch.

It still hangs if I try to hibernate a couple of times with too many
applications.

Alan

==============================================================================
TOPIC: powerpc: implement arch_scale_smt_power for Power7
http://groups.google.com/group/linux.kernel/t/891f3a14ac88e3fb?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Feb 18 2010 5:20 am
From: Peter Zijlstra

On Thu, 2010-02-18 at 14:17 +0100, Peter Zijlstra wrote:
>
> There's one fundamental assumption, and one weakness in the
> implementation.
>
Aside from bugs and the like.. ;-)

== 2 of 2 ==
Date: Thurs, Feb 18 2010 5:20 am
From: Peter Zijlstra

On Thu, 2010-02-18 at 09:20 +1100, Michael Neuling wrote:
> > Suppose for a moment we have 2 threads (hot-unplugged thread 1 and 3, we
> > can construct an equivalent but more complex example for 4 threads), and
> > we have 4 tasks, 3 SCHED_OTHER of equal nice level and 1 SCHED_FIFO, the
> > SCHED_FIFO task will consume exactly 50% walltime of whatever cpu it
> > ends up on.
> >
> > In that situation, provided that each cpu's cpu_power is of equal
> > measure, scale_rt_power() ensures that we run 2 SCHED_OTHER tasks on the
> > cpu that doesn't run the RT task, and 1 SCHED_OTHER task next to the RT
> > task, so that each task consumes 50%, which is all fair and proper.
> >
> > However, if you do the above, thread 0 will have +75% = 1.75 and thread
> > 2 will have -75% = 0.25, then if the RT task will land on thread 0,
> > we'll be having: 0.875 vs 0.25, or on thread 2, 1.75 vs 0.125. In either
> > case thread 0 will receive too many (if not all) SCHED_OTHER tasks.
> >
> > That is, unless these threads 2 and 3 really are _that_ weak, at which
> > point one wonders why IBM bothered with the silicon ;-)
>
> Peter,
>
> 2 & 3 aren't weaker than 0 & 1 but....
>
> The core has dynamic SMT mode switching which is controlled by the
> hypervisor (IBM's PHYP). There are 3 SMT modes:
> SMT1 uses thread 0
> SMT2 uses threads 0 & 1
> SMT4 uses threads 0, 1, 2 & 3
> When in any particular SMT mode, all threads have the same performance
> as each other (ie. at any moment in time, all threads perform the same).
>
> The SMT mode switching works such that when linux has threads 2 & 3 idle
> and 0 & 1 active, it will cede (H_CEDE hypercall) threads 2 and 3 in the
> idle loop and the hypervisor will automatically switch to SMT2 for that
> core (independent of other cores). The opposite is not true, so if
> threads 0 & 1 are idle and 2 & 3 are active, we will stay in SMT4 mode.
>
> Similarly if thread 0 is active and threads 1, 2 & 3 are idle, we'll go
> into SMT1 mode.
>
> If we can get the core into a lower SMT mode (SMT1 is best), the threads
> will perform better (since they share less core resources). Hence when
> we have idle threads, we want them to be the higher ones.

Just out of curiosity, is this a hardware constraint or a hypervisor
constraint?
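
The mode selection Michael describes can be summarised in a few lines of C. This is only a sketch of the behaviour as stated in the quoted text; the function name and the bitmask encoding are assumptions, and the result for a fully idle core is a guess (the text does not cover it).

```c
/* Bit n of active_mask set means hardware thread n of the core is busy.
 * Returns the SMT mode (1, 2 or 4) the core would be expected to run in,
 * given that only idle *higher* threads let the core drop to a lower mode. */
static int demo_smt_mode(unsigned int active_mask)
{
	if (active_mask & 0xCu)   /* thread 2 or 3 busy: stays in SMT4 */
		return 4;
	if (active_mask & 0x2u)   /* thread 1 busy, 2 and 3 idle: SMT2 */
		return 2;
	return 1;                 /* at most thread 0 busy: SMT1 (assumed for idle) */
}
```

For two runnable tasks, demo_smt_mode(0x3) gives SMT2 while demo_smt_mode(0xC) gives SMT4, which is why the idle threads should be the higher-numbered ones.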

> So to answer your question, threads 2 and 3 aren't weaker than the other
> threads when in SMT4 mode. It's that if we idle threads 2 & 3, threads
> 0 & 1 will speed up since we'll move to SMT2 mode.
>
> I'm pretty vague on linux scheduler details, so I'm a bit at sea as to
> how to solve this. Can you suggest any mechanisms we currently have in
> the kernel to reflect these properties, or do you think we need to
> develop something new? If so, any pointers as to where we should look?

Well there currently isn't one, and I've been telling people to create a
new SD_flag to reflect this and influence the f_b_g() behaviour.

Something like the below perhaps, totally untested and without comments
so that you'll have to reverse engineer and validate my thinking.

There's one fundamental assumption, and one weakness in the
implementation.

---

include/linux/sched.h | 2 +-
kernel/sched_fair.c | 61 +++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0eef87b..42fa5c6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -849,7 +849,7 @@ enum cpu_idle_type {
#define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */
#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
-
+#define SD_ASYM_PACKING 0x0800
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */

enum powersavings_balance_level {
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ff7692c..7e42bfe 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -2086,6 +2086,7 @@ struct sd_lb_stats {
struct sched_group *this; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_pwr; /* Total power of all groups in sd */
+ unsigned long total_nr_running;
unsigned long avg_load; /* Average load across all groups in sd */

/** Statistics of this group */
@@ -2414,10 +2415,10 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
int *balance, struct sg_lb_stats *sgs)
{
unsigned long load, max_cpu_load, min_cpu_load;
- int i;
unsigned int balance_cpu = -1, first_idle_cpu = 0;
unsigned long sum_avg_load_per_task;
unsigned long avg_load_per_task;
+ int i;

if (local_group)
balance_cpu = group_first_cpu(group);
@@ -2493,6 +2494,28 @@ static inline void update_sg_lb_stats(struct sched_domain *sd,
DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
}

+static int update_sd_pick_busiest(struct sched_domain *sd,
+ struct sd_lb_stats *sds,
+ struct sched_group *sg,
+ struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running > sgs->group_capacity)
+ return 1;
+
+ if (sgs->group_imb)
+ return 1;
+
+ if ((sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running) {
+ if (!sds->busiest)
+ return 1;
+
+ if (group_first_cpu(sds->busiest) < group_first_cpu(sg))
+ return 1;
+ }
+
+ return 0;
+}
+
/**
* update_sd_lb_stats - Update sched_group's statistics for load balancing.
* @sd: sched_domain whose statistics are to be updated.
@@ -2533,6 +2556,7 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,

sds->total_load += sgs.group_load;
sds->total_pwr += group->cpu_power;
+ sds->total_nr_running += sgs.sum_nr_running;

/*
* In case the child domain prefers tasks go to siblings
@@ -2547,9 +2571,8 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
sds->this = group;
sds->this_nr_running = sgs.sum_nr_running;
sds->this_load_per_task = sgs.sum_weighted_load;
- } else if (sgs.avg_load > sds->max_load &&
- (sgs.sum_nr_running > sgs.group_capacity ||
- sgs.group_imb)) {
+ } else if (sgs.avg_load >= sds->max_load &&
+ update_sd_pick_busiest(sd, sds, group, &sgs)) {
sds->max_load = sgs.avg_load;
sds->busiest = group;
sds->busiest_nr_running = sgs.sum_nr_running;
@@ -2562,6 +2585,33 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
} while (group != sd->groups);
}

+static int check_asym_packing(struct sched_domain *sd,
+ struct sd_lb_stats *sds,
+ int this_cpu, unsigned long *imbalance)
+{
+ int i, cpu, busiest_cpu;
+
+ if (!(sd->flags & SD_ASYM_PACKING))
+ return 0;
+
+ if (!sds->busiest)
+ return 0;
+
+ i = 0;
+ busiest_cpu = group_first_cpu(sds->busiest);
+ for_each_cpu(cpu, sched_domain_span(sd)) {
+ i++;
+ if (cpu == busiest_cpu)
+ break;
+ }
+
+ if (sds->total_nr_running > i)
+ return 0;
+
+ *imbalance = sds->max_load;
+ return 1;
+}
+
/**
* fix_small_imbalance - Calculate the minor imbalance that exists
* amongst the groups of a sched_domain, during
@@ -2761,6 +2811,9 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
return sds.busiest;

out_balanced:
+ if (check_asym_packing(sd, &sds, this_cpu, imbalance))
+ return sds.busiest;
+
/*
* There is no obvious imbalance. But check if we can do some balancing
* to save power.
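
To make the intent of check_asym_packing() easier to validate, here is a standalone model of its counting logic. It is a sketch under assumptions: the domain span is passed as a plain array, the SD_ASYM_PACKING flag test and the NULL check on the busiest group are left out, and all demo_* names are hypothetical.

```c
#include <stddef.h>

/* span[]            : CPUs of the sched domain, in iteration order
 * busiest_first_cpu : first CPU of the group picked as busiest
 * total_nr_running  : runnable tasks summed over the whole domain
 * Returns 1 and sets *imbalance when the patch would ask for a move. */
static int demo_check_asym_packing(const int *span, size_t span_len,
				   int busiest_first_cpu,
				   unsigned long total_nr_running,
				   unsigned long max_load,
				   unsigned long *imbalance)
{
	unsigned long pos = 0;
	size_t k;

	/* 1-based position of the busiest group's first CPU in the span */
	for (k = 0; k < span_len; k++) {
		pos++;
		if (span[k] == busiest_first_cpu)
			break;
	}

	/* More tasks than CPUs up to that position: nothing to pack down */
	if (total_nr_running > pos)
		return 0;

	*imbalance = max_load;
	return 1;
}
```

With a span of {0, 1, 2, 3}, a single task on CPU 2 yields a position of 3, so the function reports an imbalance equal to max_load; with four runnable tasks it reports none.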

==============================================================================
TOPIC: 33-rc8 Running aplay with pulse as the default
http://groups.google.com/group/linux.kernel/t/0905202c42c4b8f3?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 5:30 am
From: Takashi Iwai

At Thu, 18 Feb 2010 07:25:38 -0500,
Ed Tomlinson wrote:
>
> On Wednesday 17 February 2010 09:29:37 Takashi Iwai wrote:
> > At Wed, 17 Feb 2010 08:16:32 -0500,
> > Ed Tomlinson wrote:
> > >
> > > On Tuesday 16 February 2010 08:37:46 Takashi Iwai wrote:
> > > > > Thanks for the patch. It helps in that it eliminates the oops but lockdep still triggers and aplay still fails.
> > > > > Here is the new traceback.
> > > >
> > > > Hmm, fixing this isn't so trivial. The same problem occurs on other
> > > > subsystems like NFS over years. And it's still there, AFAIK.
> > > > The mmap mutex appears suddenly in the strange code path at close.
> > > >
> > > > The patch below might fix, but I'm not 100% sure whether this has no
> > > > side effect.
> > > >
> > > > Anyway, I doubt very much it being a regression. There is no change
> > > > in ALSA core side, and also in V4L em28xx code. Maybe the lockdep
> > > > wasn't triggered by some reason. And, this lockdep warning is almost
> > > > harmless...
> > >
> > > Takashi,
> > >
> > > The second patch eliminating the lock causes oopses every time (one follows just in case
> > > it's helpful).
> >
> > Are you sure? The patch should cause a compile error, so you must have
> > patched manually in a wrong place ;)
>
> Yes I am sure. I fixed the compile error the same way it is fixed below.

But the Oops looks pretty irrelevant from the code path.

Takashi

==============================================================================
TOPIC: kernel/xserver-xorg: X crash while either idle or busy
http://groups.google.com/group/linux.kernel/t/22f68f2333578c8b?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 5:30 am
From: lkml@Think-Future.de

Hi,

This bug report has been posted to the debian bugtracker for the
xserver-xorg deb package:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=570124

This link contains an extensive bug-report together with any relevant
information available.

The description is:

# Package: xserver-xorg
# Version: 1:7.5+2
# Severity: severe
#
# Data loss guaranteed. While writing a book, X crashed, part of the book had
# to be re-written.
#
# Whether using the system or not, X crashes semi-reproducibly. Be it after a
# suspend-resume cycle or while just browsing the net, using openoffice writer,
# doing nothing.. It just keeps crashing one day or the other. Mean time between
# crashes is from 2 hours to 2 days, sometimes more, sometimes less.
#
# When the system crashes after opening the cover/lid it usually takes the whole kernel
# down. Reboot only.
#
# Less often but often enough, the crashing X just hangs, locking the
# system up. Reboot only.
#
# This has been occurring since January 2009 (switch-over to linux) over a variety of
# kernel, kms, intel-drv and X versions. May well be different causes but
# seems quite the same to us..
#
# System is a rather usual ACER Extensa 5220 notebook.

A complete bug-report is available via the above link.

The answer - among other ideas - was that if the kernel is
hanging/freezing it is supposed to be reported to the lkml.

So posting here, fyi.

Thank you.

Nils

==============================================================================
TOPIC: PROBLEM: oops w/ bridge in 2.6.32.7
http://groups.google.com/group/linux.kernel/t/64824c5fceda670a?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 5:40 am
From: lkml@Think-Future.de

Hi,

Reported to virtualbox bugtracker.

The crash so far only happens when starting virtualbox. Without bridging,
everything works, both plain and within a box.

Thank you.

Nils

PS: We once had quite unstable kernels using bridge and ebtables modules.
When loaded, those kernels crashed quite fast. This was _some_ versions ago,
maybe mid-summer 2009? Anyhow, nothing to dig into now unless you have a hunch.
;) Just felt I should let you know while talking about bridging...

==============================================================================
TOPIC: perf record and multiple events
http://groups.google.com/group/linux.kernel/t/ee0cb0c44ccbb5e0?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 5:50 am
From: Peter Zijlstra

On Wed, 2010-02-17 at 17:38 +0000, Eric B Munson wrote:
> While testing perf record I found that when multiple events are specified
> on the command line, i.e. perf record -e dTLB-misses -e cache-misses,
> they are lumped into a single category on output from perf report. The
> same event specification gives two event categories when using perf stat.
> Is this working as expected or should I see separate entries in the report
> for each event?

The tools currently lack support for this, but patches are welcome.

==============================================================================
TOPIC: cpuset,mm: update tasks' mems_allowed in time (58568d2)
http://groups.google.com/group/linux.kernel/t/c97c21f117bf365d?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 5:50 am
From: Nick Piggin

Hi,

The patch cpuset,mm: update tasks' mems_allowed in time (58568d2) causes
a regression uncovered by SGI. Basically it is allowing possible but not
online nodes in the task_struct.mems_allowed nodemask (which is contrary
to several comments still in kernel/cpuset.c), and that causes
cpuset_mem_spread_node() to return an offline node to slab, causing an
oops.

Easy to reproduce if you have a machine with !online nodes.

- mkdir /dev/cpuset
- mount cpuset -t cpuset /dev/cpuset
- echo 1 > /dev/cpuset/memory_spread_slab

kernel BUG at
/usr/src/packages/BUILD/kernel-default-2.6.32/linux-2.6.32/mm/slab.c:3271!
bash[6885]: bugcheck! 0 [1]
Pid: 6885, CPU 5, comm: bash
psr : 00001010095a2010 ifs : 800000000000038b ip : [<a00000010020cf00>]
Tainted: G W (2.6.32-0.6.8-default)
ip is at ____cache_alloc_node+0x440/0x500

unat: 0000000000000000 pfs : 000000000000038b rsc : 0000000000000003
rnat: 0000000000283d85 bsps: 0000000000000001 pr : 99596aaa69aa6999
ldrs: 0000000000000000 ccv : 0000000000000018 fpsr: 0009804c0270033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a00000010020cf00 b6 : a0000001004962c0 b7 : a000000100493240
f6 : 000000000000000000000 f7 : 000000000000000000000
f8 : 000000000000000000000 f9 : 000000000000000000000
f10 : 000000000000000000000 f11 : 000000000000000000000
r1 : a0000001015c6fc0 r2 : 000000000000e662 r3 : 000000000000fffe
r8 : 000000000000005c r9 : 0000000000000000 r10 : 0000000000004000
r11 : 0000000000000000 r12 : e000003c3904fcc0 r13 : e000003c39040000
r14 : 000000000000e662 r15 : a00000010138ed88 r16 : ffffffffffff65c8
r17 : a00000010138ed80 r18 : a0000001013c7ad0 r19 : a0000001013d3b60
r20 : e00001b03afdfe18 r21 : 0000000000000001 r22 : e0000130030365c8
r23 : e000013003040000 r24 : ffffffffffff0400 r25 : 00000000000068ef
r26 : 00000000000068ef r27 : a0000001029621d0 r28 : 00000000000068f0
r29 : 00000000000068f0 r30 : 00000000000068f0 r31 : 000000000000000a

Call Trace:
[<a000000100017a80>] show_stack+0x80/0xa0
[<a0000001000180e0>] show_regs+0x640/0x920
[<a000000100029a90>] die+0x190/0x2e0
[<a000000100029c30>] die_if_kernel+0x50/0x80
[<a000000100904af0>] ia64_bad_break+0x470/0x760
[<a00000010000cb60>] ia64_native_leave_kernel+0x0/0x270
[<a00000010020cf00>] ____cache_alloc_node+0x440/0x500
[<a00000010020ffa0>] kmem_cache_alloc+0x360/0x660

A simple bandaid is to skip !online nodes in cpuset_mem_spread_node().
However I'm a bit worried about 58568d2.
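
The bandaid could look roughly like the following userspace model. It is a sketch, not kernel code: nodes are held in a single 64-bit mask, the rotor is assumed to be in the range 0..63, and demo_spread_node() is a hypothetical name.

```c
#include <stdint.h>

#define DEMO_MAX_NODES 64

/* Round-robin pick of the next node after 'rotor' that is both allowed
 * for the task and currently online. Returns -1 if no such node exists. */
static int demo_spread_node(int rotor, uint64_t allowed, uint64_t online)
{
	uint64_t usable = allowed & online;  /* drop possible-but-offline nodes */
	int step;

	for (step = 1; step <= DEMO_MAX_NODES; step++) {
		int node = (rotor + step) % DEMO_MAX_NODES;

		if (usable & (UINT64_C(1) << node))
			return node;
	}
	return -1;
}
```

Masking with the online set before the search is the whole fix in this model; it hides the symptom but leaves offline nodes in the allowed mask, which is the part that still looks worrying.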

It is doing a lot of stuff. It is removing the callback_mutex from
around several seemingly unrelated places (eg. from around
guarantee_online_cpus, which explicitly asks to be called with that
lock held), and other places, so I don't know how it is not racy
with hotplug.

Then it also says that the fastpath doesn't use any locking, so the
update-path first adds the newly allowed nodes, then removes the
newly prohibited nodes. Unfortunately there are no barriers apparent
(and none added), and cpumask/nodemask can be larger than one word,
so it seems there could be races.
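
One interleaving of the kind hinted at here can be replayed deterministically. This is a single-threaded sketch with hypothetical names, using a two-word mask; whether the real code can hit this exact ordering is an assumption, not something established in this thread.

```c
#include <stdint.h>

/* Two-word mask standing in for a nodemask wider than one machine word. */
struct demo_mask {
	uint64_t w[2];
};

/* Writer step 1: add the newly allowed nodes. */
static void demo_add_new(struct demo_mask *m, const struct demo_mask *newm)
{
	m->w[0] |= newm->w[0];
	m->w[1] |= newm->w[1];
}

/* Writer step 2: remove the newly prohibited nodes. */
static void demo_drop_old(struct demo_mask *m, const struct demo_mask *newm)
{
	m->w[0] &= newm->w[0];
	m->w[1] &= newm->w[1];
}

/* Replays a lockless reader that loads word 1 before the update and
 * word 0 after it. Returns nonzero if the reader's snapshot is empty. */
static int demo_torn_read_is_empty(void)
{
	struct demo_mask cur  = { { UINT64_C(1), 0 } };  /* old set: node 0  */
	struct demo_mask newm = { { 0, UINT64_C(1) } };  /* new set: node 64 */
	struct demo_mask seen;

	seen.w[1] = cur.w[1];        /* reader: word 1, before the update */
	demo_add_new(&cur, &newm);   /* writer: nodes 0 and 64 allowed */
	demo_drop_old(&cur, &newm);  /* writer: only node 64 allowed */
	seen.w[0] = cur.w[0];        /* reader: word 0, after the update */

	return seen.w[0] == 0 && seen.w[1] == 0;
}
```

Neither the old nor the new mask is empty, and the writer never stores an empty mask, yet the reader's combined view contains no nodes at all. A sequence counter around the update would let such a reader detect the overlap and retry.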

It also seems like the exported cpuset_mems_allowed and
cpuset_cpus_allowed APIs are just broken wrt hotplug because the
hotplug lock is dropped before returning.

I'd just like to get opinions or comments from people who know the
code better before wading in too far myself. I'd be really keen on
making the locking simpler, using seqlocks for fastpaths, etc.

Thanks,
Nick

==============================================================================
TOPIC: KVM: SVM: Don't use kmap_atomic in nested_svm_map
http://groups.google.com/group/linux.kernel/t/b608877f8fa7c926?hl=en
==============================================================================

== 1 of 4 ==
Date: Thurs, Feb 18 2010 5:50 am
From: Avi Kivity

On 02/18/2010 01:38 PM, Joerg Roedel wrote:
> Use of kmap_atomic disables preemption but if we run in
> shadow-shadow mode the vmrun emulation executes kvm_set_cr3
> which might sleep or fault. So use kmap instead for
> nested_svm_map.
>
>
>
> -static void nested_svm_unmap(void *addr, enum km_type idx)
> +static void nested_svm_unmap(void *addr)
> {
> struct page *page;
>
> @@ -1443,7 +1443,7 @@ static void nested_svm_unmap(void *addr, enum km_type idx)
>
> page = kmap_atomic_to_page(addr);
>
> - kunmap_atomic(addr, idx);
> + kunmap(addr);
> kvm_release_page_dirty(page);
> }
>

kunmap() takes a struct page *, not the virtual address (a consistent
source of bugs).

kmap() is generally an unloved interface, it is slow and possibly
deadlock prone, but it's better than sleeping in atomic context. If you
can hack your way around it, that is preferred.

--
error compiling committee.c: too many arguments to function

== 2 of 4 ==
Date: Thurs, Feb 18 2010 6:00 am
From: Avi Kivity

On 02/18/2010 01:38 PM, Joerg Roedel wrote:
> Move the actual vmexit routine out of code that runs with
> irqs and preemption disabled.
>
> Cc: stable@kernel.org
> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
> ---
> arch/x86/kvm/svm.c | 20 +++++++++++++++++---
> 1 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 7c96b8b..25d26ec 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -128,6 +128,7 @@ static void svm_flush_tlb(struct kvm_vcpu *vcpu);
> static void svm_complete_interrupts(struct vcpu_svm *svm);
>
> static int nested_svm_exit_handled(struct vcpu_svm *svm);
> +static int nested_svm_exit_handled_atomic(struct vcpu_svm *svm);
> static int nested_svm_vmexit(struct vcpu_svm *svm);
> static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
> bool has_error_code, u32 error_code);
> @@ -1386,7 +1387,7 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
> svm->vmcb->control.exit_info_1 = error_code;
> svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
>
> - return nested_svm_exit_handled(svm);
> + return nested_svm_exit_handled_atomic(svm);
> }
>

What do you say to

if (nested_svm_intercepts(svm))
svm->nested.exit_required = true;

here, and recoding nested_svm_exit_handled() to call
nested_svm_intercepts()? I think it improves readability a little by
avoiding a function that changes behaviour according to how it is called.

Long term, we may want to split out the big switch into the individual
handlers, to avoid decoding the exit reason twice.

--
error compiling committee.c: too many arguments to function

== 3 of 4 ==
Date: Thurs, Feb 18 2010 6:40 am
From: Avi Kivity

On 02/18/2010 01:38 PM, Joerg Roedel wrote:
> TDB.
>
>

...

> @@ -973,6 +973,7 @@ static void svm_decache_cr4_guest_bits(struct kvm_vcpu *vcpu)
>
> static void update_cr0_intercept(struct vcpu_svm *svm)
> {
> + struct vmcb *vmcb = svm->vmcb;
> ulong gcr0 = svm->vcpu.arch.cr0;
> u64 *hcr0 = &svm->vmcb->save.cr0;
>
> @@ -984,11 +985,25 @@ static void update_cr0_intercept(struct vcpu_svm *svm)
>
>
> if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
> - svm->vmcb->control.intercept_cr_read &= ~INTERCEPT_CR0_MASK;
> - svm->vmcb->control.intercept_cr_write &= ~INTERCEPT_CR0_MASK;
> + vmcb->control.intercept_cr_read &= ~INTERCEPT_CR0_MASK;
> + vmcb->control.intercept_cr_write &= ~INTERCEPT_CR0_MASK;
> + if (is_nested(svm)) {
> + struct vmcb *hsave = svm->nested.hsave;
> +
> + hsave->control.intercept_cr_read &= ~INTERCEPT_CR0_MASK;
> + hsave->control.intercept_cr_write &= ~INTERCEPT_CR0_MASK;
> + vmcb->control.intercept_cr_read |= svm->nested.intercept_cr_read;
> + vmcb->control.intercept_cr_write |= svm->nested.intercept_cr_write;
>

Why are the last two lines needed?

> + }
> } else {
> svm->vmcb->control.intercept_cr_read |= INTERCEPT_CR0_MASK;
> svm->vmcb->control.intercept_cr_write |= INTERCEPT_CR0_MASK;
> + if (is_nested(svm)) {
> + struct vmcb *hsave = svm->nested.hsave;
> +
> + hsave->control.intercept_cr_read |= INTERCEPT_CR0_MASK;
> + hsave->control.intercept_cr_write |= INTERCEPT_CR0_MASK;
> + }
> }
> }
>

Maybe it's better to call update_cr0_intercept() after a vmexit instead,
to avoid this repetition, and since the if () may take a different
branch for the nested guest and guest cr0.

--
error compiling committee.c: too many arguments to function

== 4 of 4 ==
Date: Thurs, Feb 18 2010 6:40 am
From: Avi Kivity

On 02/18/2010 01:38 PM, Joerg Roedel wrote:
> Hi,
>
> here are a couple of fixes for the nested SVM implementation. I collected these
> fixes mostly when trying to get Windows 7 64bit running as an L2 guest. The most
> important fixes in this set make lazy fpu switching work with nested SVM and
> fix the nested tpr handling. Without the latter fix the l1 guest freezes when
> trying to run win7 as l2 guest. Please review and comment on these patches :-)
>

Overall looks good. Would appreciate Alex looking over these as well.

--
error compiling committee.c: too many arguments to function

==============================================================================
TOPIC: Stupid futex question - 2.6.33-rc7-mmotm0210
http://groups.google.com/group/linux.kernel/t/3418d4e896d1113f?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Feb 18 2010 6:10 am
From: Valdis.Kletnieks@vt.edu

Kernel: x86_64 2.6.33-rc7-mmotm0210

I'm debugging a problem where pulseaudio is getting killed with a SIGKILL
out of the blue. It appears to be a problem where pulseaudio sets
RLIMIT_RTTIME and the bound gets exceeded. Analysis with 'top' shows
a short spike of 96% system time, and the tail end of strace shows this:

[pid 25065] 01:50:20.371484 ioctl(28, USBDEVFS_CONTROL, 0x7fd3d76f630c) = 0 <0.000015>
[pid 25065] 01:50:20.371548 ioctl(28, 0x40045532, 0x7fd3d76f636c) = 0 <0.000016>
[pid 25065] 01:50:20.371611 open("/dev/snd/pcmC0D0p", O_RDWR|O_NONBLOCK|O_CLOEXEC <unfinished ...>
[pid 25064] 01:50:20.371678 <... write resumed> ) = 8 <0.002104>
[pid 25064] 01:50:20.371718 futex(0xc2ec00, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 25066] 01:50:21.408392 +++ killed by SIGKILL +++
PANIC: handle_group_exit: 25066 leader 25064
[pid 25065] 01:50:21.408442 +++ killed by SIGKILL +++
PANIC: handle_group_exit: 25065 leader 25064
01:50:21.420354 +++ killed by SIGKILL +++

thread 25064 apparently gets gunned down due to RTTIME because it spent a whole
second in a futex() call - is it reasonable for futex() to not return for that
long?

In other words - kernel bug because futex() should return, or pulseaudio bug
for not understanding futex() can snooze a while?

If a kernel bug, anybody got a better idea than nuking the RLIMIT_RTTIME call,
waiting for it to repeat (takes between 1 minute and 1 hour or so), and
whomping it a few times with sysrq-T?
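
For reference, the limit in question is set through the ordinary rlimit interface. The helper below is a minimal sketch with a hypothetical name; it is not taken from pulseaudio. Per getrlimit(2), RLIMIT_RTTIME is Linux-specific, is expressed in microseconds of CPU time, applies to tasks under a real-time scheduling policy, and the consumed-time count is reset whenever the task makes a blocking system call.

```c
#include <sys/resource.h>

/* Sets the soft RLIMIT_RTTIME limit to 'usec' microseconds and leaves
 * the hard limit untouched. Returns 0 on success, -1 on failure. */
static int demo_set_rttime_soft(rlim_t usec)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_RTTIME, &rl) != 0)
		return -1;

	rl.rlim_cur = usec;  /* must not exceed rl.rlim_max */
	return setrlimit(RLIMIT_RTTIME, &rl);
}
```

Exceeding the soft limit delivers SIGXCPU and exceeding the hard limit delivers SIGKILL, so a SIGKILL points at the hard limit being reached. Since the limit counts CPU time rather than wall-clock time, a thread merely sleeping in futex() should not by itself use it up.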

== 2 of 2 ==
Date: Thurs, Feb 18 2010 6:40 am
From: Peter Zijlstra

On Thu, 2010-02-18 at 09:04 -0500, Valdis.Kletnieks@vt.edu wrote:
> Kernel: x86_64 2.6.33-rc7-mmotm0210
>
> I'm debugging a problem where pulseaudio is getting killed with a SIGKILL
> out of the blue. It appears to be a problem where pulseaudio sets
> RLIMIT_RTTIME and the bound gets exceeded. Analysis with 'top' shows
> a short spike of 96% system time, and the tail end of strace shows this:
>
> [pid 25065] 01:50:20.371484 ioctl(28, USBDEVFS_CONTROL, 0x7fd3d76f630c) = 0 <0.000015>
> [pid 25065] 01:50:20.371548 ioctl(28, 0x40045532, 0x7fd3d76f636c) = 0 <0.000016>
> [pid 25065] 01:50:20.371611 open("/dev/snd/pcmC0D0p", O_RDWR|O_NONBLOCK|O_CLOEXEC <unfinished ...>
> [pid 25064] 01:50:20.371678 <... write resumed> ) = 8 <0.002104>
> [pid 25064] 01:50:20.371718 futex(0xc2ec00, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
> [pid 25066] 01:50:21.408392 +++ killed by SIGKILL +++
> PANIC: handle_group_exit: 25066 leader 25064
> [pid 25065] 01:50:21.408442 +++ killed by SIGKILL +++
> PANIC: handle_group_exit: 25065 leader 25064
> 01:50:21.420354 +++ killed by SIGKILL +++
>
> thread 25064 apparently gets gunned down due to RTTIME because it spent a whole
> second in a futex() call - is it reasonable for futex() to not return for that
> long?
>
> In other words - kernel bug because futex() should return, or pulseaudio bug
> for not understanding futex() can snooze a while?
>
> If a kernel bug, anybody got a better idea than nuking the RLIMIT_RTTIME call,
> waiting for it to repeat (takes between 1 minute and 1 hour or so), and
> whomping it a few times with sysrq-T?

Is that second spent processing sysrq-T?

==============================================================================
TOPIC: Linux mdadm superblock question.
http://groups.google.com/group/linux.kernel/t/f58e89a4f371364a?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Feb 18 2010 6:20 am
From: Nick Bowler

On 04:33 Thu 18 Feb , Goswin von Brederlow wrote:
> Nick Bowler <nbowler@elliptictech.com> writes:
>
> > On 09:41 Wed 17 Feb , david@lang.hm wrote:
> >> however for people who run systems that are known ahead of time and
> >> static (and who build their own kernels instead of just relying on the
> >> distro default kernel), all of this is unnecessary complication, which
> >> leaves more room for problems to creep in.
> >
> > Such people can easily construct an initramfs containing busybox and
> > mdadm with a shell script hardcoded to mount their root fs and run
> > switch_root. It's a ~10 minute jobbie that only needs to be done once.
>
> Except when mdadm, cryptsetup, lvm change you need to update it.
> Especially when you set up a new system that might have newer
> metadata.

I meant "once per system". One typically doesn't _need_ to update the
mdadm in the initramfs, as long as it's capable of assembling the root
array.

> Also at least Debian doesn't (yet) support a common initramfs for their
> kernel packaging. You either build a kernel without need for one or you
> have a per kernel initramfs that is automatically built and updated
> whenever anything in the initramfs changes. Not often, but still too
> often, the initramfs then doesn't work.

The scenario was when users configure and build their own kernel. These
users are presumably capable of using grub's "initrd" command or the
CONFIG_INITRAMFS_SOURCE kernel option.

--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

Thursday, February 18, 2010

linux.kernel - 26 new messages in 15 topics - digest