twitter: linux.kernel - 26 new messages in 19 topics

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

linux.kernel@googlegroups.com

Today's topics:

* Drivers: hv: Implement the file copy service - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/7fcaa7ccab7e2a1c?hl=en
* Implement new PTRACE_EVENT_SYSCALL_{ENTER,EXIT} - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/b5d27cb9eafa4d04?hl=en
* Crashdump-Accepting-Active-IOMMU-Copy-Translations - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/39176c9cda27fdff?hl=en
* mm: show message when updating min_free_kbytes in thp - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/6e7b9f828e3b2d2d?hl=en
* Terrible performance of sequential O_DIRECT 4k writes in SAN environment. ~3
times slower then Solars 10 with the same HBA/Storage. - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/c2d4038a5c79058e?hl=en
* vfs: Fix possible NULL pointer dereference in inode_permission() - 3
messages, 2 authors
http://groups.google.com/group/linux.kernel/t/fe01f34c3b356d71?hl=en
* kdump failed because of hotplug memory adding in kdump kernel - 1 messages,
1 author
http://groups.google.com/group/linux.kernel/t/1d614525bf91dfa1?hl=en
* netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get - 2 messages,
2 authors
http://groups.google.com/group/linux.kernel/t/1f5bd631cfffab0f?hl=en
* HID logitech DJ fixes and preps for enabling extended features - 1 messages,
1 author
http://groups.google.com/group/linux.kernel/t/c2d6d10acec135d4?hl=en
* pci/iov: VFs are never multifunction - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/c117099b558ec734?hl=en
* macvlan: forbid L2 fowarding offload for macvtap - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/e488fba6679952ea?hl=en
* mm: fix the theoretical compound_lock() vs prep_new_page() race - 1 messages,
1 author
http://groups.google.com/group/linux.kernel/t/bff24967685e62c5?hl=en
* blk-mq: Compile fix for null_blk - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/4b82ed12395483b1?hl=en
* hyperv: Add support for physically discontinuous receive buffer - 1 messages,
1 author
http://groups.google.com/group/linux.kernel/t/1484eaf6bfef9ffe?hl=en
* acpi memory hotplug, add parameter to disable memory hotplug for kexec - 2
messages, 2 authors
http://groups.google.com/group/linux.kernel/t/83ff201aba618199?hl=en
* mm, memcg: avoid oom notification when current needs access to memory
reserves - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/ddffce70f2a0b5fc?hl=en
* at91: dt: smc: Added smc bus driver - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/88f86d5983c318d0?hl=en
* Fix ebizzy performance regression due to X86 TLB range flush v3 - 1 messages,
1 author
http://groups.google.com/group/linux.kernel/t/b9228736277d4509?hl=en
* i2c: Add message transfer tracepoints for I2C - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/1983299b36043b13?hl=en

==============================================================================
TOPIC: Drivers: hv: Implement the file copy service
http://groups.google.com/group/linux.kernel/t/7fcaa7ccab7e2a1c?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:20 pm
From: KY Srinivasan

> -----Original Message-----
> From: gregkh@linuxfoundation.org [mailto:gregkh@linuxfoundation.org]
> Sent: Thursday, January 09, 2014 12:15 PM
> To: Dan Carpenter
> Cc: KY Srinivasan; olaf@aepfle.de; jasowang@redhat.com; linux-
> kernel@vger.kernel.org; apw@canonical.com; devel@linuxdriverproject.org
> Subject: Re: [PATCH 1/1] Drivers: hv: Implement the file copy service
>
> On Thu, Jan 09, 2014 at 11:09:21PM +0300, Dan Carpenter wrote:
> > We've had this discussion before where you urge me to trust the host...
> >
> > Problem: This code is racy.
> > Solution: The host will only send one message at a time.
> >
> > Now I have to audit the user space code on the host and I don't feel
> > like doing that so you win.
> >
> > I wish we had a better way to do IPC. If kdbus were ready, that might
> > have worked for this, and it's a better solution because both sender and
> > reciever code will be written in a less trusting way.
>
> kdbus is almost ready, it might make 3.15, depending on the result of
> work that is happening at linux.conf.au and FOSDEM.
>
> If it would be a better solution for this, that's even more reason to
> get kdbus merged soon, no need to add something that doesn't really
> work.
>
> But, how will kdbus help with this? It's a userspace <-> userspace
> message transmission bus, would you want the kernel to be a receiver or
> sender here?

Greg,

The transport between the kernel and the user in this driver is a regular char device and
Vmbus is the transport between the host and the guest. As you observe, I am not sure
how kdbus would be useful.

K. Y
>
> thanks,
>
> greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: Implement new PTRACE_EVENT_SYSCALL_{ENTER,EXIT}
http://groups.google.com/group/linux.kernel/t/b5d27cb9eafa4d04?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:20 pm
From: Roland McGrath

> I won't argue, but it is not clear to me if this is really useful,
> given that the debugger can read the regs.

I do see the utility of having a consistent machine-independent way to get
the syscall number from userspace. But note that this could be probably
accomplished with just a uapi header change, providing an inline or macro
to extract the syscall number from the regset data. (It was once the case
that there were some machines where the syscall number is not in any
register and has to be extracted by decoding the instruction. I'm not sure
if that is still an issue anywhere.)

Also note that adding a separate copy of the syscall number introduces a
new wrinkle into the interface. This might be considered to be good, bad,
or indifferent, but I think it should at least be considered explicitly.
That is, at the entry stop the syscall number (when it's in a register) can
be changed via ptrace. So if the number is delivered via ptrace_message,
that's a separate copy of the original number that does not reflect any
changes made via ptrace--so it reflects what userland asked for, as opposed
to what the kernel actually acted on.

I don't have a particular opinion about which way to go with that.
I just wanted all the issues (I'm aware of) to be considered.

I absolutely concur that all new tracing features should work in the
existing style of PTRACE_EVENT_* that a PTRACE_O_* flag enables it and
then it applies for all kinds of running under ptrace. Things like
PTRACE_SYSCALL that combine what-to-report with how-to-run are terrible
and should never be done again.

Thanks,
Roland
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: Crashdump-Accepting-Active-IOMMU-Copy-Translations
http://groups.google.com/group/linux.kernel/t/39176c9cda27fdff?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:20 pm
From: "Sumner, William"

Thank you for testing and reviewing this patch -- and for previously testing the RFC version of the patch.
I am currently preparing version 3 of the patch which will include your recommendations.

You mentioned two concerns in your email:

About this item:
...................................
> + INIT_LIST_HEAD(&domain->devices);
> + domain->id = (int) context_get_did(ce);
> + domain->agaw = (int) context_get_aw(ce);
> + domain->pgd = NULL;

Here the domain is created and initialized, but its member iovad is not
initialized. This will cause domain->iovad.iova_rbtree_lock deadlock
because its initial value is random. And the rbtree operation will
crash. This happen in reserve_iova invoked in copy_page_addr in high
frequency.

I add 1 line of code as below and it works well.

init_iova_domain(&domain->iovad, DMA_32BIT_PFN);
...................................
Yes, this line is necessary. I have added this line into the next version of the patch.

I also investigated why my earlier testing missed this. Looks like I was always getting zeroed memory which acts as though the structure is properly initialized for the fields of the structure that my code uses.

About this item: (And in your follow-up email about the same item)
.................................................
> > + ppap->page_addr = page_addr; /* Addr(new page) */
> > + ppap->page_size = page_size; /* Size(new page) */
> > +
> > + ppap->domain = domain; /* adr(domain for the new range) */
>
> Here I am confused, ppap is used to collect the copied ranges and
> necessary information. To my understanding, this domain is the
> dmar_domain which the 1st device is on. When ppat->last is set to 1,
> among the whole collected range, there may be many domains. I just feel
> not good for this. Is it OK to define a specific lock for ppap
> structure, possibly? Please correct me if I am wrong.

Well, check it again. It's not about the lock. Just all address is
recorded in one dmar_domain.
.................................................
Yes: no lock is required. This code operates only during Linux initialization when only one CPU is active.

Yes: ppap->domain holds the domain to be used when the address range accumulated in ppap->page_addr and ppap->page_size is eventually added to the tree (either when a new range is started or when ppap->last is set.)

Bill

-----Original Message-----
From: Baoquan He [mailto:bhe@redhat.com]
Sent: Friday, December 20, 2013 12:04 AM
To: Sumner, William
Cc: dwmw2@infradead.org; indou.takao@jp.fujitsu.com; iommu@lists.linux-foundation.org; kexec@lists.infradead.org; alex.williamson@redhat.com; linux-pci@vger.kernel.org; linux-kernel@vger.kernel.org; ddutile@redhat.com; ishii.hironobu@jp.fujitsu.com; bhelgaas@google.com; Hatch, Douglas B (HPS Linux PM)
Subject: Re: [PATCHv2 4/6] Crashdump-Accepting-Active-IOMMU-Copy-Translations

On 12/19/13 at 07:49pm, Bill Sumner wrote:

> +static int copy_page_addr(u64 page_addr, u32 shift, u32 bus, u32 devfn,
> + u64 pte, struct dmar_domain *domain,
> + void *parms)
> +{
> + struct copy_page_addr_parms *ppap = parms;
> +
> + u64 page_size = ((u64)1 << shift); /* page_size */
> + u64 pfn_lo; /* For reserving IOVA range */
> + u64 pfn_hi; /* For reserving IOVA range */
> + struct iova *iova_p; /* For reserving IOVA range */
> +
> + if (!ppap) {
> + pr_err("ERROR: ppap is NULL: 0x%3.3x(%3.3d) DevFn: 0x%3.3x(%3.3d) Page: 0x%16.16llx Size: 0x%16.16llx(%lld)\n",
> + bus, bus, devfn, devfn, page_addr,
> + page_size, page_size);
> + return 0;
> + }
> +

> + /* Prepare for a new page */
> + ppap->first = 0; /* Not first-time anymore */
> + ppap->bus = bus;
> + ppap->devfn = devfn;
> + ppap->shift = shift;
> + ppap->pte = pte;
> + ppap->next_addr = page_addr + page_size; /* Next-expected page_addr */
> +
> + ppap->page_addr = page_addr; /* Addr(new page) */
> + ppap->page_size = page_size; /* Size(new page) */
> +
> + ppap->domain = domain; /* adr(domain for the new range) */

Here I am confused, ppap is used to collect the copied ranges and
necessary information. To my understanding, this domain is the
dmar_domain which the 1st device is on. When ppat->last is set to 1,
among the whole collected range, there may be many domains. I just feel
not good for this. Is it OK to define a specific lock for ppap
structure, possibly? Please correct me if I am wrong.

> +
> + return 0;
> +}
> +

> +
> +
> +
> +static int copy_context_entry(struct intel_iommu *iommu, u32 bus, u32 devfn,
> + void *ppap, struct context_entry *ce)
> +{
> + int ret = 0; /* Integer Return Code */
> + u32 shift = 0; /* bits to shift page_addr */
> + u64 page_addr = 0; /* Address of translated page */
> + struct dma_pte *pgt_old_phys; /* Adr(page_table in the old kernel) */
> + struct dma_pte *pgt_new_phys; /* Adr(page_table in the new kernel) */
> + unsigned long asr; /* New asr value for new context */
> + u8 t; /* Translation-type from context */
> + u8 aw; /* Address-width from context */
> + u32 aw_shift[8] = {
> + 12+9+9, /* [000b] 30-bit AGAW (2-level page table) */
> + 12+9+9+9, /* [001b] 39-bit AGAW (3-level page table) */
> + 12+9+9+9+9, /* [010b] 48-bit AGAW (4-level page table) */
> + 12+9+9+9+9+9, /* [011b] 57-bit AGAW (5-level page table) */
> + 12+9+9+9+9+9+9, /* [100b] 64-bit AGAW (6-level page table) */
> + 0, /* [111b] Reserved */
> + 0, /* [110b] Reserved */
> + 0, /* [111b] Reserved */
> + };
> +
> + struct dmar_domain *domain = NULL; /* To hold domain & device */
> + /* values from old kernel */
> + struct device_domain_info *info = NULL; /* adr(new for this device) */
> + struct device_domain_info *i = NULL; /* iterator for foreach */
> +
> +
+ /* info->segment = segment; May need this later */
> + info->bus = bus;
> + info->devfn = devfn;
> + info->iommu = iommu;
> +
> + list_for_each_entry(i, &device_domain_values_list[iommu->seq_id],
> + global) {
> + if (i->domain->id == (int) context_get_did(ce)) {
> + domain = i->domain;
> + pr_debug("CTXT B:D:F:%2.2x:%2.2x:%1.1x Found did=0x%4.4x\n",
> + bus, devfn >> 3, devfn & 0x7, i->domain->id);
> + break;
> + }
> + }
> +
> + if (!domain) {
> + domain = alloc_domain();
> + if (!domain) {
> + ret = -ENOMEM;
> + goto exit;
> + }
> + INIT_LIST_HEAD(&domain->devices);
> + domain->id = (int) context_get_did(ce);
> + domain->agaw = (int) context_get_aw(ce);
> + domain->pgd = NULL;

Here the domain is created and initialized, but its member iovad is not
initialized. This will cause domain->iovad.iova_rbtree_lock deadlock
because its initial value is random. And the rbtree operation will
crash. This happen in reserve_iova invoked in copy_page_addr in high
frequency.

I add 1 line of code as below and it works well.

init_iova_domain(&domain->iovad, DMA_32BIT_PFN);
> +
> + pr_debug("CTXT Allocated new list entry, did:%d\n",
> + domain->id);
> + }
> +
> + info->domain = domain;
> + list_add(&info->link, &domain->devices);
> + list_add(&info->global, &device_domain_values_list[iommu->seq_id]);
> +
> + if (domain->pgd) {
> + asr = virt_to_phys(domain->pgd) >> VTD_PAGE_SHIFT;
> + context_put_asr(ce, asr);
> + ret = 4;
> + goto exit;
> + }
> +
> + t = context_get_t(ce);
> +
> + if (t == 0 || t == 1) { /* If (context has page tables) */
> + aw = context_get_aw(ce);
> + shift = aw_shift[aw];
> +
> + pgt_old_phys = (struct dma_pte *)(context_get_asr(ce) << 12);
> +
> + ret = copy_page_table(&pgt_new_phys, pgt_old_phys,
> + shift-9, page_addr, iommu, bus, devfn, domain, ppap);
> +
> + if (ret) /* if (problem) bail out */
> + goto exit;
> +
> + asr = ((unsigned long)(pgt_new_phys)) >> VTD_PAGE_SHIFT;
> + context_put_asr(ce, asr);
> + domain->pgd = phys_to_virt((unsigned long)pgt_new_phys);
> + ret = 1;
> + goto exit;
> + }
> +
> + return ret;
> +}
> +
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: mm: show message when updating min_free_kbytes in thp
http://groups.google.com/group/linux.kernel/t/6e7b9f828e3b2d2d?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:20 pm
From: David Rientjes

On Thu, 9 Jan 2014, Han Pingtian wrote:

> min_free_kbytes may be raised during THP's initialization. Sometimes,
> this will change the value being set by user. Showing message will
> clarify this confusion.
>
> Only show this message when changing the value set by user according to
> Michal Hocko's suggestion.
>
> Showing the old value of min_free_kbytes according to Dave Hansen's
> suggestion. This will give user the chance to restore old value of
> min_free_kbytes.
>
> Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
> ---
> mm/huge_memory.c | 9 ++++++++-
> mm/page_alloc.c | 2 +-
> 2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7de1bf8..e0e4e29 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -100,6 +100,7 @@ static struct khugepaged_scan khugepaged_scan = {
> .mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
> };
>
> +extern int user_min_free_kbytes;
>

We don't add extern declarations to .c files. How many other examples of
this can you find in mm/?

> static int set_recommended_min_free_kbytes(void)
> {
> @@ -130,8 +131,14 @@ static int set_recommended_min_free_kbytes(void)
> (unsigned long) nr_free_buffer_pages() / 20);
> recommended_min <<= (PAGE_SHIFT-10);
>
> - if (recommended_min > min_free_kbytes)
> + if (recommended_min > min_free_kbytes) {
> + if (user_min_free_kbytes >= 0)
> + pr_info("raising min_free_kbytes from %d to %lu "
> + "to help transparent hugepage allocations\n",
> + min_free_kbytes, recommended_min);
> +
> min_free_kbytes = recommended_min;
> + }
> setup_per_zone_wmarks();
> return 0;
> }

Does this even ever trigger since set_recommended_min_free_kbytes() is
called via late_initcall()?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: Terrible performance of sequential O_DIRECT 4k writes in SAN
environment. ~3 times slower then Solars 10 with the same HBA/Storage.
http://groups.google.com/group/linux.kernel/t/c2d4038a5c79058e?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Jan 9 2014 1:30 pm
From: Sergey Meirovich

Hi Duglas,

On 9 January 2014 21:54, Douglas Gilbert <dgilbert@interlog.com> wrote:
> On 14-01-08 08:57 AM, Sergey Meirovich wrote:
...
>>
>> The strangest thing to me that this is the problem with sequential
>> write. For example the fnic one machine is zoned to EMC XtremIO and
>> had results: 14.43Mb/sec 3693.65 Requests/sec for sequential 4k. The
>> same fnic machine perfrormed rather impressive for random 4k
>> 451.11Mb/sec 115485.02 Requests/sec
>
>
> You could bypass O_DIRECT and use ddpt together with
> a bsg pass-through (bsg is a little faster than sg
> for these purposes).
>
> For example:
>
> # lsscsi -g
> [0:0:0:0] disk ATA INTEL SSDSC2CW12 400i /dev/sda /dev/sg0
> [14:0:0:0] disk Linux scsi_debug 0004 - /dev/sg1
>
> # ddpt if=/dev/bsg/14:0:0:0 bs=512 bpt=128 count=1m
> Output file not specified so no copy, just reading input
> 1048576+0 records in
> 0+0 records out
> time to read data: 0.283566 secs at 1893.28 MB/sec
>
> bs= should match the block size of the storage device and
> the size of each SCSI READ is dictated by bpt= (so 64 KB
> in this case).
>
> Such a test should show you if your performance problem
> is in the block layer or below, or above the block layer
> (at least the point where pass-through commands are
> injected).
>
> Doug Gilbert

Thanks for an excellent idea!

Seems like this is not Direct IO issue.Just tried it against
fnic/XtremIO. 4k over via bsg is still 17.278 Mb/s for write.
.
[root@dca-poc-gtsxdb3 bsg]# /tmp/ddpt-0.91/src/ddpt if=/dev/zero
of=/dev/bsg/0:0:4:1 bs=512 bpt=8 count=1m
1048576+0 records in
1048576+0 records out
time to transfer data: 31.076487 secs at 17.28 MB/sec

[root@dca-poc-gtsxdb3 bsg]# /tmp/ddpt-0.91/src/ddpt if=/dev/zero
of=/dev/bsg/0:0:4:1 bs=512 bpt=16 count=512k
524288+0 records in
524288+0 records out
time to transfer data: 8.511421 secs at 31.54 MB/sec

[root@dca-poc-gtsxdb3 bsg]# /tmp/ddpt-0.91/src/ddpt if=/dev/zero
of=/dev/bsg/0:0:4:1 bs=512 bpt=32 count=256k
262144+0 records in
262144+0 records out
time to transfer data: 2.426037 secs at 55.32 MB/sec
[root@dca-poc-gtsxdb3 bsg]#
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 2 of 2 ==
Date: Thurs, Jan 9 2014 1:50 pm
From: Sergey Meirovich

Hi,

On 9 January 2014 23:26, Sergey Meirovich <rathamahata@gmail.com> wrote:
> Hi Duglas,
>
> On 9 January 2014 21:54, Douglas Gilbert <dgilbert@interlog.com> wrote:
>> On 14-01-08 08:57 AM, Sergey Meirovich wrote:
> ...
>>>
>>> The strangest thing to me that this is the problem with sequential
>>> write. For example the fnic one machine is zoned to EMC XtremIO and
>>> had results: 14.43Mb/sec 3693.65 Requests/sec for sequential 4k. The
>>> same fnic machine perfrormed rather impressive for random 4k
>>> 451.11Mb/sec 115485.02 Requests/sec
>>
>>
>> You could bypass O_DIRECT and use ddpt together with
>> a bsg pass-through (bsg is a little faster than sg
>> for these purposes).
>>
>> For example:
>>
>> # lsscsi -g
>> [0:0:0:0] disk ATA INTEL SSDSC2CW12 400i /dev/sda /dev/sg0
>> [14:0:0:0] disk Linux scsi_debug 0004 - /dev/sg1
>>
>> # ddpt if=/dev/bsg/14:0:0:0 bs=512 bpt=128 count=1m
>> Output file not specified so no copy, just reading input
>> 1048576+0 records in
>> 0+0 records out
>> time to read data: 0.283566 secs at 1893.28 MB/sec
>>
>> bs= should match the block size of the storage device and
>> the size of each SCSI READ is dictated by bpt= (so 64 KB
>> in this case).
>>
>> Such a test should show you if your performance problem
>> is in the block layer or below, or above the block layer
>> (at least the point where pass-through commands are
>> injected).
>>
>> Doug Gilbert
>
> Thanks for an excellent idea!
>
> Seems like this is not Direct IO issue.Just tried it against
> fnic/XtremIO. 4k over via bsg is still 17.278 Mb/s for write.

At a second glance seems to be natural ddpt - is suffering from the
same SAN latencies for the small chunks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: vfs: Fix possible NULL pointer dereference in inode_permission()
http://groups.google.com/group/linux.kernel/t/fe01f34c3b356d71?hl=en
==============================================================================

== 1 of 3 ==
Date: Thurs, Jan 9 2014 1:30 pm
From: Steven Rostedt

While running stress tests on adding and deleting ftrace instances I
hit this bug:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160
PGD 63681067 PUD 7ddbe067 PMD 0
Oops: 0000 [#1] PREEMPT
CPU: 0 PID: 5634 Comm: ftrace-test-mki Not tainted 3.13.0-rc4-test-00033-gd2a6dde-dirty #20
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
task: ffff880078375800 ti: ffff88007ddb0000 task.ti: ffff88007ddb0000
RIP: 0010:[<ffffffff812d8bc5>] [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160
RSP: 0018:ffff88007ddb1c48 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000800000 RCX: ffff88006dd43840
RDX: 0000000000000001 RSI: 0000000000000081 RDI: ffff88006ee46000
RBP: ffff88007ddb1c88 R08: 0000000000000000 R09: ffff88007ddb1c54
R10: 6e6576652f6f6f66 R11: 0000000000000003 R12: 0000000000000000
R13: 0000000000000081 R14: ffff88006ee46000 R15: 0000000000000000
FS: 00007f217b5b6700(0000) GS:ffffffff81e21000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
CR2: 0000000000000020 CR3: 000000006a0fe000 CR4: 00000000000007f0
Stack:
0000000000000081 ffff88006ee46000 0000000000000081 ffffffff812d8b45
ffff88006ee46000 0000000000000081 ffff880078375800 ffff880078375800
ffff88007ddb1c98 ffffffff812d358c ffff88007ddb1cb8 ffffffff811364f1
Call Trace:^M
[<ffffffff812d8b45>] ? selinux_inode_permission+0x5/0x160
[<ffffffff812d358c>] security_inode_permission+0x1c/0x30
[<ffffffff811364f1>] __inode_permission+0x41/0xa0
[<ffffffff81136568>] inode_permission+0x18/0x50
[<ffffffff811378b6>] link_path_walk+0x66/0x920
[<ffffffff810875a5>] ? __rcu_read_lock+0x5/0x20
[<ffffffff8113a9e6>] path_openat+0xa6/0x6c0
[<ffffffff8113b000>] ? path_openat+0x6c0/0x6c0
[<ffffffff810c4e69>] ? __trace_graph_entry+0x49/0xc0
[<ffffffff8112b196>] ? do_sys_open+0x146/0x240
[<ffffffff8113b043>] do_filp_open+0x43/0xa0
[<ffffffff8113b005>] ? do_filp_open+0x5/0xa0
[<ffffffff8112b196>] do_sys_open+0x146/0x240
[<ffffffff8112b2ae>] SyS_open+0x1e/0x20
[<ffffffff81948cd0>] system_call_fastpath+0x16/0x1b
Code: 84 a1 00 00 00 81 e3 00 20 00 00 89 d8 83 c8 02 40 f6 c6 04 0f 45 d8 40 f6 c6 08 74 71 80 cf 02 49 8b 46 38 4c 8d 4d cc 45 31 c0 <0f> b7 50 20 8b 70 1c 48 8b 41 70 89 d9 8b 78 04 e8 36 cf ff ff
RIP [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160
RSP <ffff88007ddb1c48>
CR2: 0000000000000020
---[ end trace 9d800e5ac5059462 ]---

Investigating, I found that the inode->i_security was NULL, and the
dereference of it caused the oops.

in selinux_inode_permission():
----
isec = inode->i_security;

rc = avc_has_perm_noaudit(sid, isec->sid, isec->sclass, perms, 0, &avd);
----

Note, the crash came from stressing the deletion and reading of debugfs
files. I was not able to recreate this via normal files. But I'm not
sure they are safe. It may just be that the race window is much harder
to hit.

What seems to have happened (and what I have traced), is the file is
being opened at the same time the file or directory is being deleted.
As the dentry and inode locks are not held during the path walk, nor is
the inodes ref counts being incremented, there is nothing saving these
structures from being discarded except for an rcu_read_lock().

The rcu_read_lock() protects against freeing of the inode, but it does
not protect freeing of the inode_security_struct. Talking with Eric
Paris about this, it seems to be a generic issue with the
destroy_inode() calling security_inode_free() when the inode may still
be in use (protected by rcu). It seems that the true destruction of the
inode (done by __destroy_inode()) should also be protect by rcu.

Cc: stable@vger.kernel.org
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---

This is based off of this thread:

https://lkml.org/lkml/2014/1/9/349

And perhaps is the true fix for:

http://oss.sgi.com/archives/xfs/2013-11/msg00709.html

diff --git a/fs/inode.c b/fs/inode.c
index 4bcdad3..a8f3b88 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -252,16 +252,17 @@ EXPORT_SYMBOL(__destroy_inode);
static void i_callback(struct rcu_head *head)
{
struct inode *inode = container_of(head, struct inode, i_rcu);
+ __destroy_inode(inode);
kmem_cache_free(inode_cachep, inode);
}

static void destroy_inode(struct inode *inode)
{
BUG_ON(!list_empty(&inode->i_lru));
- __destroy_inode(inode);
- if (inode->i_sb->s_op->destroy_inode)
+ if (inode->i_sb->s_op->destroy_inode) {
+ __destroy_inode(inode);
inode->i_sb->s_op->destroy_inode(inode);
- else
+ } else
call_rcu(&inode->i_rcu, i_callback);
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 2 of 3 ==
Date: Thurs, Jan 9 2014 1:50 pm
From: Matthew Wilcox

On Thu, Jan 09, 2014 at 04:27:31PM -0500, Steven Rostedt wrote:
> Note, the crash came from stressing the deletion and reading of debugfs
> files. I was not able to recreate this via normal files. But I'm not
> sure they are safe. It may just be that the race window is much harder
> to hit.

But "normal" files have a 'destroy_inode' method. So you've basically
only fixed it for debugfs (and maybe a few other unusual filesystems).
Why doesn't the code look like this:

static void i_callback(struct rcu_head *head)
{
struct inode *inode = container_of(head, struct inode, i_rcu);
__destroy_inode(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
else
kmem_cache_free(inode_cachep, inode);
}

static void destroy_inode(struct inode *inode)
{
BUG_ON(!list_empty(&inode->i_lru));
call_rcu(&inode->i_rcu, i_callback);
}

We'd then have to get rid of all the call_rcu() invocations in individual
filesystems' destroy_inode methods, but that doesn't sound like a bad
thing to me.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 3 of 3 ==
Date: Thurs, Jan 9 2014 2:00 pm
From: Steven Rostedt

On Thu, 9 Jan 2014 14:42:39 -0700
Matthew Wilcox <matthew@wil.cx> wrote:

> On Thu, Jan 09, 2014 at 04:27:31PM -0500, Steven Rostedt wrote:
> > Note, the crash came from stressing the deletion and reading of debugfs
> > files. I was not able to recreate this via normal files. But I'm not
> > sure they are safe. It may just be that the race window is much harder
> > to hit.
>
> But "normal" files have a 'destroy_inode' method. So you've basically
> only fixed it for debugfs (and maybe a few other unusual filesystems).
> Why doesn't the code look like this:

Because I thought of that after I sent the email ;-)

Well, that's not really true. I don't know the semantics of the
destroy_inode() call. But I should have asked that in my change log.

>
> static void i_callback(struct rcu_head *head)
> {
> struct inode *inode = container_of(head, struct inode, i_rcu);
> __destroy_inode(inode);
> if (inode->i_sb->s_op->destroy_inode)
> inode->i_sb->s_op->destroy_inode(inode);
> else
> kmem_cache_free(inode_cachep, inode);
> }
>
> static void destroy_inode(struct inode *inode)
> {
> BUG_ON(!list_empty(&inode->i_lru));
> call_rcu(&inode->i_rcu, i_callback);
> }
>
> We'd then have to get rid of all the call_rcu() invocations in individual
> filesystems' destroy_inode methods, but that doesn't sound like a bad
> thing to me.
>

Which is another reason that I didn't do it, as I didn't know all the
happenings inside the ->destroy_inode() calls. But yeah, I agree with
this.

Also, can iput() sleep? If not then we are OK. Otherwise, we need to be
careful about any mutex being grabbed in those call backs, as the
rcu_callback can't sleep either.

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: kdump failed because of hotplug memory adding in kdump kernel
http://groups.google.com/group/linux.kernel/t/1d614525bf91dfa1?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:30 pm
From: Vivek Goyal

On Thu, Jan 09, 2014 at 11:34:30AM -0700, Toshi Kani wrote:
> On Thu, 2014-01-09 at 13:23 -0500, Vivek Goyal wrote:
> > On Thu, Jan 09, 2014 at 10:24:25AM -0700, Toshi Kani wrote:
> >
> > [..]
> > > > I think creating a new command line option is simpler as compared to
> > > > creating a new flag in bootparam which in turn disables memory hotplug.
> > > > More users can use that option. For example, if for some reason hotplug
> > > > code is crashing, one can just disable it on command line as work around
> > > > and move on.
> > >
> > > I do not have a strong opinion about having such option. However, I
> > > think it is more user friendly to keep the exactmap option works alone
> > > on any platforms.
> >
> > I think we should create internally a variable which will disable memory
> > hotplug. And set that variable based on memmap=exactmap, mem=X and also
> > provide a way to disable memory hotplug directly using command line
> > option.
> >
> > Current kexec-tools can use memmap=exactmap and be happy. I am writing
> > a new kexec syscall and will not be using memmap=exactmap and would need
> > to use that command line option to disable memory hotplug behavior.
>
> Sounds good to me.

Nobody responded to my other question, so I would ask it again.

Assume we have disabled hotplug memory in second kernel. First kernel
saw hotplug memory and assume crash kernel reserved region came from
there. We will pass this memory in bootparams to second kernel and it
will show up in E820 map. It should still be accessible in second kernel,
is that right?

Or there is some dependency on ACPI doing some magic before this memory
range is available in second kernel?

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: netfilter: nf_conntrack: fix RCU race in nf_conntrack_find_get
http://groups.google.com/group/linux.kernel/t/1f5bd631cfffab0f?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Jan 9 2014 1:30 pm
From: Florian Westphal

Andrew Vagin <avagin@parallels.com> wrote:
> On Thu, Jan 09, 2014 at 09:56:22PM +0100, Florian Westphal wrote:
> > Andrew Vagin <avagin@parallels.com> wrote:
> > > Can we allocate conntrack with zero ct_general.use and increment it at
> > > the first time before inserting the conntrack into the hash table?
> > > When conntrack is allocated it is attached exclusively to one skb.
> > > It must be destroyed with skb, if it has not been confirmed, so we
> > > don't need refcnt on this stage.
> > >
> > > I found only one place, where a reference counter of unconfirmed
> > > conntract can incremented. It's ctnetlink_dump_table().
> >
> > What about skb_clone, etc? They will also increment the refcnt
> > if a conntrack entry is attached to the skb.
>
> We can not attach an unconfirmed conntrack to a few skb, because

s/few/new/?

> nf_nat_setup_info can be executed concurrently for the same conntrack.
>
> How do we avoid this race condition for cloned skb-s?

Simple, the assumption is that only one cpu owns the nfct, so it does
not matter if the skb is cloned in between, as there are no parallel
users.

The only possibility (that I know of) to violate this is to create
a bridge, enable call-iptables sysctl, add -j NFQUEUE rules and then
wait for packets that need to be forwarded to several recipients,
e.g. multicast traffic.

see
http://marc.info/?l=netfilter-devel&m=131471083501656&w=2

or search 'netfilter: nat: work around shared nfct struct in bridge
case'

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 2 of 2 ==
Date: Thurs, Jan 9 2014 1:50 pm
From: Andrey Wagin

2014/1/9 Eric Dumazet <eric.dumazet@gmail.com>:
> On Thu, 2014-01-09 at 09:24 +0400, Andrew Vagin wrote:
>
>> I have one more question. Looks like I found another problem.
>>
>> init_conntrack:
>> hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
>> &net->ct.unconfirmed);
>>
>> __nf_conntrack_hash_insert:
>> hlist_nulls_add_head_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
>> &net->ct.hash[hash]);
>>
>> We use one hlist_nulls_node to add a conntrack into two different lists.
>> As far as I understand, it's unacceptable in case of
>> SLAB_DESTROY_BY_RCU.
>
> I guess you missed :
>
> net/netfilter/nf_conntrack_core.c:1598: INIT_HLIST_NULLS_HEAD(&net->ct.unconfirmed, UNCONFIRMED_NULLS_VAL);

but we can look up something suitable and return this value, but it
will be unconfirmed. Ok, I see. This situation is fixed by this patch
too.

I don't understand the result of your discussion with Florian. Here
are a few states of conntracts: it can be used and it's initialized.
The sign of the first state is non-zero refcnt and the sign of the
second state is the confirmed bit.

In the first state conntrack is attached to skb and inserted in the
unconfirmed list. In this state the use count can be incremented in
ctnetlink_dump_list() and skb_clone().

In the second state conntrack may be attached to a few skb-s and
inserted to net->ct.hash.

I have read all emails again and I can't understand when this patch
doesn't work. Maybe you could give a sequence of actions? Thanks.

>
>
>>
>> Lets imagine that we have two threads. The first one enumerates objects
>> from a first list, the second one recreates an object and insert it in a
>> second list. If a first thread in this moment stays on the object, it
>> can read "next", when it's in the second list, so it will continue
>> to enumerate objects from the second list. It is not what we want, isn't
>> it?
>>
>> cpu1 cpu2
>> hlist_nulls_for_each_entry_rcu(ct)
>> destroy_conntrack
>> kmem_cache_free
>>
>> init_conntrack
>> hlist_nulls_add_head_rcu
>> ct = ct->next
>>
>
> This will be fine.
>
> I think we even have a counter to count number of occurrence of this
> rare event. (I personally never read a non null search_restart value)
>
> NF_CT_STAT_INC(net, search_restart);

Thank you for explanation.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: HID logitech DJ fixes and preps for enabling extended features
http://groups.google.com/group/linux.kernel/t/c2d6d10acec135d4?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:30 pm
From: Benjamin Tissoires

On 09/01/14 16:08, Andrew de los Reyes wrote:
> In general, I'm positive on the change to fix the USB3 issue (yay!),
> and for the others I'm happy it's going upstream. It seem to open up
> the possibility of user-space drivers, which is great, even though we
> don't need this on our team.
>
> One thing I want to double-check: on some devices (T651, at least),
> the raw data comes in not via HID++, but tacked onto the end of the
> normal mouse reports. Will a driver for this device be able to get all
> packets, not just HID++ ones? Sorry if this was clear and I missed it.

Yeah, that is partly why I can not send the rest of the series right
now. Nestor already warned me about those funny devices, so I need to
double check how to implement HID++.

On a technical point of view, a driver connected through the unifying
receiver currently only get the regular input reports, and this series
adds the HID++ reports to these ones. So yes, a device will receive all
reports dedicated to it.

Cheers,
Benjamin

>
> -andrew
>
> On Wed, Jan 8, 2014 at 2:18 PM, Benjamin Tissoires
> <benjamin.tissoires@redhat.com> wrote:
>> Hi Jiri,
>>
>> Well, this work is _not_ for 3.14 (except maybe patch 1), especially since it is
>> missing the biggest part where we enable the capabilities of Logitech devices.
>>
>> Long story short:
>> This work is based on the work I did back in Summer 2011. I worked at Logitech
>> for a few weeks to show up a demo of a driver for the Logitech Wireless Touchpad.
>> At that time, a first draft has been done, but due to a lack of resources, noone
>> upstreamed it.
>> Since then, the code marinated a little at Logitech and in the ChromeOS kernel
>> tree, but nobody tried to push it upstream. So here I am, trying to push this stuff
>> upstream.
>>
>> I can not send the full series right now because I am lacking most of the
>> testing hardware (I mean I only have the oldest Wireless Touchpad).
>> Hopefully, I should receive some other soon, and I'll be able to send the second
>> part then.
>>
>> Now, let me roughly explain the patches:
>> - patch 1 can be applied right now I think, but it's entirely up to you Jiri.
>> This patch should fix the missing notifications with some USB 3.0 boards.
>> - patches 2 to 5 allows to forward in both direction the proprietary protocol
>> used by Logitech (HID++ [1]) between the driver and the hardware.
>> - later patches will introduce a transport layer for HID++ and also a driver
>> to support full multitouch on various Logitech touchpads.
>>
>> Nestor, Andrew, feel free to add your "Signed-off-by" whereever you want, I lost
>> a little bit the track of who added what.
>>
>> Cheers,
>> Benjamin
>>
>> [1] HID++: Documentation is provided at
>> https://drive.google.com/a/logitech.com/?tab=mo#folders/0BxbRzx7vEV7eWmgwazJ3NUFfQ28
>>
>> Benjamin Tisssoires (5):
>> HID: logitech-dj: Fix USB 3.0 issue
>> HID: core: do not scan reports if the group is already set
>> HID: logitech-dj: rely on hid groups to separate receivers from dj
>> devices
>> HID: logitech-dj: forward incoming HID++ reports to the correct dj
>> device
>> HID: logitech-dj: add .request callback
>>
>> drivers/hid/hid-core.c | 3 +-
>> drivers/hid/hid-logitech-dj.c | 161 +++++++++++++++++++++++++++++++++---------
>> drivers/hid/hid-logitech-dj.h | 16 ++---
>> include/linux/hid.h | 1 +
>> 4 files changed, 136 insertions(+), 45 deletions(-)
>>
>> --
>> 1.8.4.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-input" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: pci/iov: VFs are never multifunction
http://groups.google.com/group/linux.kernel/t/c117099b558ec734?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Jan 9 2014 1:40 pm
From: Bjorn Helgaas

On Thu, Jan 9, 2014 at 11:25 AM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Thu, 2014-01-09 at 11:08 -0700, Bjorn Helgaas wrote:
>> On Thu, Jan 9, 2014 at 8:36 AM, Alex Williamson
>> <alex.williamson@redhat.com> wrote:
>> > Per the SR-IOV spec rev 1.1:
>> >
>> > 3.4.1.9 Header Type (Offset 0Eh)
>> >
>> > "... For VFs, this register must be RO Zero."
>> >
>> > Unfortunately some devices get this wrong, ex. Emulex OneConnect 10Gb
>> > NIC. When they do it makes us handle ACS testing and therefore IOMMU
>> > groups as if they were actual multifunction devices and require ACS
>> > capabilities to make sure there's no peer-to-peer between functions.
>> > VFs are never traditional multifunction devices, so simply clear this
>> > bit before we get any further into setup.
>>
>> This seems reasonable. Do you have "lspci -vvxxxx" output for this
>> device? I'd like to save it for future reference.
>
> Sure, here's a VF:
>
> 09:04.0 Ethernet controller: Emulex Corporation OneConnect 10Gb NIC (be3) (rev 01)
> Subsystem: Emulex Corporation Device e722

Thanks! I put this in
https://bugzilla.kernel.org/show_bug.cgi?id=68431, and I'll add a
reference to the changelog.

But I wonder if we can make this slightly more generic by doing
something like this in pci_setup_device():

dev->multifunction = (PCI_FUNC(dev->devfn) == 0) && (hdr_type & 0x80);

That's basically what lspci does in pci_generic_scan_bus(), and
section 3.2.2.3.4 of the PCI 3.0 spec sort of implies that we should
only look at the bit 7 of the header type for function 0:

If a single function device is detected (i.e., bit 7 in the Header
Type register of function 0 is 0), no more functions for that
Device Number will be checked. If a multi-function device is
detected (i.e., bit 7 in the Header Type register of function 0
is 1), then all remaining Function Numbers will be checked.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 2 of 2 ==
Date: Thurs, Jan 9 2014 2:00 pm
From: Alex Williamson

On Thu, 2014-01-09 at 14:39 -0700, Bjorn Helgaas wrote:
> On Thu, Jan 9, 2014 at 11:25 AM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Thu, 2014-01-09 at 11:08 -0700, Bjorn Helgaas wrote:
> >> On Thu, Jan 9, 2014 at 8:36 AM, Alex Williamson
> >> <alex.williamson@redhat.com> wrote:
> >> > Per the SR-IOV spec rev 1.1:
> >> >
> >> > 3.4.1.9 Header Type (Offset 0Eh)
> >> >
> >> > "... For VFs, this register must be RO Zero."
> >> >
> >> > Unfortunately some devices get this wrong, ex. Emulex OneConnect 10Gb
> >> > NIC. When they do it makes us handle ACS testing and therefore IOMMU
> >> > groups as if they were actual multifunction devices and require ACS
> >> > capabilities to make sure there's no peer-to-peer between functions.
> >> > VFs are never traditional multifunction devices, so simply clear this
> >> > bit before we get any further into setup.
> >>
> >> This seems reasonable. Do you have "lspci -vvxxxx" output for this
> >> device? I'd like to save it for future reference.
> >
> > Sure, here's a VF:
> >
> > 09:04.0 Ethernet controller: Emulex Corporation OneConnect 10Gb NIC (be3) (rev 01)
> > Subsystem: Emulex Corporation Device e722
>
> Thanks! I put this in
> https://bugzilla.kernel.org/show_bug.cgi?id=68431, and I'll add a
> reference to the changelog.
>
> But I wonder if we can make this slightly more generic by doing
> something like this in pci_setup_device():
>
> dev->multifunction = (PCI_FUNC(dev->devfn) == 0) && (hdr_type & 0x80);
>
> That's basically what lspci does in pci_generic_scan_bus(), and
> section 3.2.2.3.4 of the PCI 3.0 spec sort of implies that we should
> only look at the bit 7 of the header type for function 0:
>
> If a single function device is detected (i.e., bit 7 in the Header
> Type register of function 0 is 0), no more functions for that
> Device Number will be checked. If a multi-function device is
> detected (i.e., bit 7 in the Header Type register of function 0
> is 1), then all remaining Function Numbers will be checked.

We could do that and rely only on pci_scan_slot() to set multifunction=1
for the other functions, but that doesn't completely solve this problem.
VFs can occupy function zero and the example device would still set
multifunction with that test. Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: macvlan: forbid L2 fowarding offload for macvtap
http://groups.google.com/group/linux.kernel/t/e488fba6679952ea?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:40 pm
From: Stephen Hemminger

On Thu, 09 Jan 2014 16:55:07 +0800
Jason Wang <jasowang@redhat.com> wrote:

> What if use do want a qdisc and want to change the its queue length for
> tun/macvlan? And the the name tx_queue_length is misleading. For tun it
> may make sense since it was used in transmission path. For macvtap it
> was not. So maybe what we need is just a new ioctl for both tun/macvtap
> and a new feature flag. If user create the device with new feature flag,
> the socket receive queue length could be changed by ioctl instead of
> dev->tx_queue_length. If not, the old behaviour could be kept.

The overloading of tx_queue_len in macvtap was the original design mistake.
Can't this just be undone and expose rx_queue_len as sysfs attribute?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: mm: fix the theoretical compound_lock() vs prep_new_page() race
http://groups.google.com/group/linux.kernel/t/bff24967685e62c5?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:40 pm
From: Andrea Arcangeli

On Thu, Jan 09, 2014 at 08:43:50PM +0100, Oleg Nesterov wrote:
> get_page_unless_zero(), and recently it was documented that the kernel can
> rely on the control dependency to serialize LOAD + STORE.

Ok, that's cheap then.

>
> But we probably need barrier() in between, we can't use ACCESS_ONCE().

After get_page_unless_zero I don't think there's any need of
barrier(). barrier() should have been implicit in __atomic_add_unless
in fact it should be a full smp_mb() equivalent too. Memory is always
clobbered there and the asm is volatile.

My wondering was only about the runtime (not compiler) barrier after
running PageTail and before compound_lock, because bit_spin_lock has
only acquire semantics so in absence of the branch that bails out the
lock, the spinlock could run before PageTail. If the branch is good
enough guarantee for all archs it's good and cheap solution. Clearly
this is not an x86 concern, which is always fine without anything when
surrounded by locked ops like here.

> Yes. But in this case I really think we should cleanup get/put first
> and add the helper, like the patch I mentioned does.

Ok, up to you, I'll check it.

> Oh, I don't think this highly theoreitical fix should be backported
> but I agree, lets fix the bug first.

I think it should, but the risk free version of it, so either a few
liner addition before compound_lock or the previous with smp_wmb()
inside the ifdef (considering it only matters on x86 where smp_wmb is
zero cost and the only operational change is actually the reordering
of the set_page_refcounted not the smp_wmb irrelevant for x86).
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: blk-mq: Compile fix for null_blk
http://groups.google.com/group/linux.kernel/t/4b82ed12395483b1?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:40 pm
From: Muthu Kumar

Thanks Matias. Yes, Ming Lei's 4th patch does make the function internal.

So, which branch has the laest patches... i am checking for-3.14/core...

Regards,
Muthu

On Thu, Jan 9, 2014 at 12:58 PM, Matias Bjorling <m@bjorling.me> wrote:
> On 01/09/2014 07:54 PM, Muthu Kumar wrote:
>>
>> Jens,
>>
>> Compiling null_blk.ko failed with error that blk_mq_free_queue() was
>> defined implicitly. So, moved the declaration from block/blk-mq.h to
>> include/linux/blk-mq.h and exported it.
>>
>
> The patch from Ming Lei is missing in -rc6
>
> 4af48694451676403188a62385dd1a2849fc05c5
> block: null_blk: fix queue leak inside removing device
>
> Its queued for -rc7. It removes the usage of blk_mq_free_queue in blk_null.
>
>
>> Signed-off-by: Muthukumar Ratty <muthur@gmail.com>
>>
>> ----------------------
>>
>> block/blk-mq.c | 1 +
>> block/blk-mq.h | 1 -
>> include/linux/blk-mq.h | 1 +
>> 3 files changed, 2 insertions(+), 1 deletions(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 57039fc..3e08b87 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -1422,6 +1422,7 @@ void blk_mq_free_queue(struct request_queue *q)
>> list_del_init(&q->all_q_node);
>> mutex_unlock(&all_q_mutex);
>> }
>> +EXPORT_SYMBOL(blk_mq_free_queue);
>>
>> /* Basically redo blk_mq_init_queue with queue frozen */
>> static void blk_mq_queue_reinit(struct request_queue *q)
>> diff --git a/block/blk-mq.h b/block/blk-mq.h
>> index 5c39179..35ff4f7 100644
>> --- a/block/blk-mq.h
>> +++ b/block/blk-mq.h
>> @@ -28,7 +28,6 @@ void blk_mq_run_request(struct request *rq, bool
>> run_queue, bool async);
>> void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
>> void blk_mq_init_flush(struct request_queue *q);
>> void blk_mq_drain_queue(struct request_queue *q);
>> -void blk_mq_free_queue(struct request_queue *q);
>>
>> /*
>> * CPU hotplug helpers
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 851d34b..51109b8 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -113,6 +113,7 @@ enum {
>> };
>>
>> struct request_queue *blk_mq_init_queue(struct blk_mq_reg *, void *);
>> +void blk_mq_free_queue(struct request_queue *q);
>> int blk_mq_register_disk(struct gendisk *);
>> void blk_mq_unregister_disk(struct gendisk *);
>> void blk_mq_init_commands(struct request_queue *, void (*init)(void
>> *data, struct blk_mq_hw_ctx *, struct request *, unsigned int), void
>> *data);
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: hyperv: Add support for physically discontinuous receive buffer
http://groups.google.com/group/linux.kernel/t/1484eaf6bfef9ffe?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:40 pm
From: Haiyang Zhang

This will allow us to use bigger receive buffer, and prevent allocation failure
due to fragmented memory.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
---
drivers/hv/channel.c | 14 ++++++++------
drivers/net/hyperv/hyperv_net.h | 2 +-
drivers/net/hyperv/netvsc.c | 7 ++-----
3 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/hv/channel.c b/drivers/hv/channel.c
index cea623c..69ea36f 100644
--- a/drivers/hv/channel.c
+++ b/drivers/hv/channel.c
@@ -209,7 +209,6 @@ static int create_gpadl_header(void *kbuffer, u32 size,
{
int i;
int pagecount;
- unsigned long long pfn;
struct vmbus_channel_gpadl_header *gpadl_header;
struct vmbus_channel_gpadl_body *gpadl_body;
struct vmbus_channel_msginfo *msgheader;
@@ -219,7 +218,6 @@ static int create_gpadl_header(void *kbuffer, u32 size,
int pfnsum, pfncount, pfnleft, pfncurr, pfnsize;

pagecount = size >> PAGE_SHIFT;
- pfn = virt_to_phys(kbuffer) >> PAGE_SHIFT;

/* do we need a gpadl body msg */
pfnsize = MAX_SIZE_CHANNEL_MESSAGE -
@@ -248,7 +246,8 @@ static int create_gpadl_header(void *kbuffer, u32 size,
gpadl_header->range[0].byte_offset = 0;
gpadl_header->range[0].byte_count = size;
for (i = 0; i < pfncount; i++)
- gpadl_header->range[0].pfn_array[i] = pfn+i;
+ gpadl_header->range[0].pfn_array[i] = slow_virt_to_phys(
+ kbuffer + PAGE_SIZE * i) >> PAGE_SHIFT;
*msginfo = msgheader;
*messagecount = 1;

@@ -301,7 +300,9 @@ static int create_gpadl_header(void *kbuffer, u32 size,
* so the hypervisor gurantees that this is ok.
*/
for (i = 0; i < pfncurr; i++)
- gpadl_body->pfn[i] = pfn + pfnsum + i;
+ gpadl_body->pfn[i] = slow_virt_to_phys(
+ kbuffer + PAGE_SIZE * (pfnsum + i)) >>
+ PAGE_SHIFT;

/* add to msg header */
list_add_tail(&msgbody->msglistentry,
@@ -327,7 +328,8 @@ static int create_gpadl_header(void *kbuffer, u32 size,
gpadl_header->range[0].byte_offset = 0;
gpadl_header->range[0].byte_count = size;
for (i = 0; i < pagecount; i++)
- gpadl_header->range[0].pfn_array[i] = pfn+i;
+ gpadl_header->range[0].pfn_array[i] = slow_virt_to_phys(
+ kbuffer + PAGE_SIZE * i) >> PAGE_SHIFT;

*msginfo = msgheader;
*messagecount = 1;
@@ -344,7 +346,7 @@ nomem:
* vmbus_establish_gpadl - Estabish a GPADL for the specified buffer
*
* @channel: a channel
- * @kbuffer: from kmalloc
+ * @kbuffer: from kmalloc or vmalloc
* @size: page-size multiple
* @gpadl_handle: some funky thing
*/
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index a26eecb..7b594ce 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -462,7 +462,7 @@ struct nvsp_message {

#define NETVSC_MTU 65536

-#define NETVSC_RECEIVE_BUFFER_SIZE (1024*1024*2) /* 2MB */
+#define NETVSC_RECEIVE_BUFFER_SIZE (1024*1024*16) /* 16MB */

#define NETVSC_RECEIVE_BUFFER_ID 0xcafe

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 93b485b..03a2c6e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -136,8 +136,7 @@ static int netvsc_destroy_recv_buf(struct netvsc_device *net_device)

if (net_device->recv_buf) {
/* Free up the receive buffer */
- free_pages((unsigned long)net_device->recv_buf,
- get_order(net_device->recv_buf_size));
+ vfree(net_device->recv_buf);
net_device->recv_buf = NULL;
}

@@ -163,9 +162,7 @@ static int netvsc_init_recv_buf(struct hv_device *device)
return -ENODEV;
ndev = net_device->ndev;

- net_device->recv_buf =
- (void *)__get_free_pages(GFP_KERNEL|__GFP_ZERO,
- get_order(net_device->recv_buf_size));
+ net_device->recv_buf = vzalloc(net_device->recv_buf_size);
if (!net_device->recv_buf) {
netdev_err(ndev, "unable to allocate receive "
"buffer of size %d\n", net_device->recv_buf_size);
--
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: acpi memory hotplug, add parameter to disable memory hotplug for kexec
http://groups.google.com/group/linux.kernel/t/83ff201aba618199?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Jan 9 2014 1:40 pm
From: KOSAKI Motohiro

On Thu, Jan 9, 2014 at 10:00 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Jan 09, 2014 at 12:00:29AM +0100, Rafael J. Wysocki wrote:
>
> [..]
>> > The system then panics and the kdump/kexec kernel boots. During this boot
>> > ACPi is initialized and the kernel (as can be seen above)
>>
>> Which is a bug. You're not supposed to initialize ACPI twice in a row.
>
> [CC lkml, kexec mailing list, dave young]
>
> It is a fresh instance of kernel booting and it is initializing its data
> structures fresh. It is *not* re-initializing ACPI in same kernel.
>
>> > This patchset resolves the problem by adding a kernel parameter,
>> > no_memory_hotplug, to disable ACPI memory hotplug. It can be added by default
>> > as a parameter to the kexec/kdump kernel so the kernel boots correctly.
>>
>> This problem is specific to kexec/kdump, so please don't add *generic* command
>> line parameters to address this.
>>
>
> There are other command line options to solve kdump problems. In general
> one might want to disable memory hogplug on the fly even if it is compiled
> in the kernel. So it can act as a good debugging aid.
>
> Secondly, it can be specified with memmap=exactmap and mem=X paramters to
> make sure no memory is hot added in the system.
>
> So I can see other usages of this parameter. To me it makes sense to have
> a separate command line option to disable memory hotplug feature on the
> fly.

I'm ok this option. But note, even if this option is specified, SH,
Power and S390 still
be able to use memory hotplug because their firmware are totally
different from ACPI.

Maybe, adding acpi prefix provides good clarification.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 2 of 2 ==
Date: Thurs, Jan 9 2014 1:50 pm
From: Vivek Goyal

On Thu, Jan 09, 2014 at 04:34:16PM -0500, KOSAKI Motohiro wrote:
> On Thu, Jan 9, 2014 at 10:00 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Thu, Jan 09, 2014 at 12:00:29AM +0100, Rafael J. Wysocki wrote:
> >
> > [..]
> >> > The system then panics and the kdump/kexec kernel boots. During this boot
> >> > ACPi is initialized and the kernel (as can be seen above)
> >>
> >> Which is a bug. You're not supposed to initialize ACPI twice in a row.
> >
> > [CC lkml, kexec mailing list, dave young]
> >
> > It is a fresh instance of kernel booting and it is initializing its data
> > structures fresh. It is *not* re-initializing ACPI in same kernel.
> >
> >> > This patchset resolves the problem by adding a kernel parameter,
> >> > no_memory_hotplug, to disable ACPI memory hotplug. It can be added by default
> >> > as a parameter to the kexec/kdump kernel so the kernel boots correctly.
> >>
> >> This problem is specific to kexec/kdump, so please don't add *generic* command
> >> line parameters to address this.
> >>
> >
> > There are other command line options to solve kdump problems. In general
> > one might want to disable memory hogplug on the fly even if it is compiled
> > in the kernel. So it can act as a good debugging aid.
> >
> > Secondly, it can be specified with memmap=exactmap and mem=X paramters to
> > make sure no memory is hot added in the system.
> >
> > So I can see other usages of this parameter. To me it makes sense to have
> > a separate command line option to disable memory hotplug feature on the
> > fly.
>
> I'm ok this option. But note, even if this option is specified, SH,
> Power and S390 still
> be able to use memory hotplug because their firmware are totally
> different from ACPI.
>
> Maybe, adding acpi prefix provides good clarification.

Makes sense. Something like "acpi_no_memhotplug" or "acpi_disable_memhotplug"
or something else.

Thanks
Vivek

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: mm, memcg: avoid oom notification when current needs access to memory
reserves
http://groups.google.com/group/linux.kernel/t/ddffce70f2a0b5fc?hl=en
==============================================================================

== 1 of 2 ==
Date: Thurs, Jan 9 2014 1:40 pm
From: David Rientjes

On Tue, 7 Jan 2014, Andrew Morton wrote:

> I just spent a happy half hour reliving this thread and ended up
> deciding I agreed with everyone! I appears that many more emails are
> needed so I think I'll drop
> http://ozlabs.org/~akpm/mmots/broken-out/mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
> for now.
>
> The claim that
> mm-memcg-avoid-oom-notification-when-current-needs-access-to-memory-reserves.patch
> will impact existing userspace seems a bit dubious to me.
>

I'm not sure why this was dropped since it's vitally needed for any sane
userspace oom handler to be effective.

Without the patch, a userspace oom handler waiting on memory.oom_control
will be triggered when any process with a pending SIGKILL or in the exit()
path simply needs access to memory reserves to make forward progress. The
kernel oom killer itself is preempted since nothing is actionable other
than giving current access to memory reserves by setting the TIF_MEMDIE
bit. Userspace does not have the privilege to set this bit itself, so in
such cases there is absolutely nothing actionable for the userspace oom
handler.

The problem is that the userspace oom handler doesn't know that.

It would be ludicrous to require that a userspace oom handler must wait
for some arbitrary amount of time to determine if it is actionable or not;
what is a sane amount of time to wait? Should we reliably expect that
multiple oom notifications will be sent over a period of time if we are in
a situation where current doesn't require memory reserves to make forward
progress? How long should the userspace oom handler store this state to
determine how many times it has woken up?

Userspace oom handling implementations are fragile enough as it is, they
should be made as trivial as possible to ensure they can do what is needed
to make memory available, have the smallest memory footprint possible, and
be as reliable as possible. Requiring them to determine when a
notification is actionable is troublesome.

Furthermore, Section 10 of Documentation/cgroups/memory.txt does not imply
that any of this checking needs to be done and lists possible actions that
a userspace oom handler can do upon being notified such as raising a limit
or killing a process itself. This is what userspace _expects_ to do when
notified.

Giving current access to memory reserves so that it may make forward
progress is something only the kernel can do and is a part of both the VM
and memcg implementations to allow forward progress to be made. It is not
something userspace is involved in.

Additionally, you're not losing any functionality by merging the patch, if
you really want to know simply when the limit has been reached and not
something actionable as stated by the memcg documentation, you can do so
with memory thresholds or VMPRESSURE_CRITICAL.

Google relies on this behavior so that userspace oom handlers can be
implemented to respond to oom conditions and not cause unnecessary oom
killing. We'd like to know why you refuse to provide such an interface in
a responsible and reliable way.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

== 2 of 2 ==
Date: Thurs, Jan 9 2014 1:50 pm
From: David Rientjes

On Thu, 9 Jan 2014, Michal Hocko wrote:

> Eric has reported that he can see task(s) stuck in memcg OOM handler
> regularly. The only way out is to
> echo 0 > $GROUP/memory.oom_controll
> His usecase is:
> - Setup a hierarchy with memory and the freezer
> (disable kernel oom and have a process watch for oom).
> - In that memory cgroup add a process with one thread per cpu.
> - In one thread slowly allocate once per second I think it is 16M of ram
> and mlock and dirty it (just to force the pages into ram and stay there).
> - When oom is achieved loop:
> * attempt to freeze all of the tasks.
> * if frozen send every task SIGKILL, unfreeze, remove the directory in
> cgroupfs.
>
> Eric has then pinpointed the issue to be memcg specific.
>
> All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
> Those that have received fatal signal will bypass the charge and should
> continue on their way out. The tricky part is that the exit path might
> trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
> while its memcg is still under OOM because nobody has released any
> charges yet.
> Unlike with the in-kernel OOM handler the exiting task doesn't get
> TIF_MEMDIE set so it doesn't shortcut futher charges of the killed task
> and falls to the memcg OOM again without any way out of it as there are
> no fatal signals pending anymore.
>
> This patch fixes the issue by checking PF_EXITING early in
> __mem_cgroup_try_charge and bypass the charge same as if it had fatal
> signal pending or TIF_MEMDIE set.
>
> Normally exiting tasks (aka not killed) will bypass the charge now but
> this should be OK as the task is leaving and will release memory and
> increasing the memory pressure just to release it in a moment seems
> dubious wasting of cycles. Besides that charges after exit_signals
> should be rare.
>
> Reported-by: Eric W. Biederman <ebiederm@xmission.com>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Is this tested?

> ---
> mm/memcontrol.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b8dfed1b9d87..b86fbb04b7c6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2685,7 +2685,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> * MEMDIE process.
> */
> if (unlikely(test_thread_flag(TIF_MEMDIE)
> - || fatal_signal_pending(current)))
> + || fatal_signal_pending(current))
> + || current->flags & PF_EXITING)
> goto bypass;
>
> if (unlikely(task_in_memcg_oom(current)))

This would become problematic if significant amount of memory is charged
in the exit() path. I don't know of an egregious amount of memory being
allocated and charged after PF_EXITING is set, but if it happens in the
future then this could potentially cause system oom conditions even in
memcg configurations that are designed such as the one Tejun suggested to
be able to handle such conditions in userspace:

___root___
/ \
user oom
/ \ / \
A B C D

where the limit of user is equal to the amount of system memory minus
whatever amount of memory is needed by the system oom handler attached as
a descendant of oom and still allows the limits of A + B to exceed the
limit of user.

So how do we ensure that memory allocations in the exit() path don't cause
system oom conditions whereas the above configuration no longer provides
any strict guarantee?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

==============================================================================
TOPIC: at91: dt: smc: Added smc bus driver
http://groups.google.com/group/linux.kernel/t/88f86d5983c318d0?hl=en
==============================================================================

== 1 of 1 ==
Date: Thurs, Jan 9 2014 1:50 pm
From: Jean-Jacques Hiblot

Hi Boris,

2014/1/9 boris brezillon <b.brezillon@overkiz.com>:
> Hello JJ,
>
>
> On 09/01/2014 13:31, Jean-Jacques Hiblot wrote:
>>
>> The EBI/SMC external interface is used to access external peripherals
>> (NAND
>> and Ethernet controller in the case of sam9261ek). Different
>> configurations and
>> timings are required for those peripherals. This bus driver can be used to
>> setup the bus timings/configuration from the device tree.
>> It currently accepts timings in clock period and in nanoseconds.
>>
>> Signed-off-by: Jean-Jacques Hiblot <jjhiblot@traphandler.com>
>> ---
>> drivers/memory/Kconfig | 10 ++
>> drivers/memory/Makefile | 1 +
>> drivers/memory/atmel-smc.c | 431
>> +++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 442 insertions(+)
>> create mode 100644 drivers/memory/atmel-smc.c
>>
>> diff --git a/drivers/memory/Kconfig b/drivers/memory/Kconfig
>> index 29a11db..fbdfd63 100644
>> --- a/drivers/memory/Kconfig
>> +++ b/drivers/memory/Kconfig
>> @@ -50,4 +50,14 @@ config TEGRA30_MC
>> analysis, especially for IOMMU/SMMU(System Memory Management
>> Unit) module.
>> +config ATMEL_SMC
>> + bool "Atmel SMC/EBI driver"
>> + default y
>> + depends on SOC_AT91SAM9 && OF
>> + help
>> + Driver for Atmel SMC/EBI controller.
>> + Used to configure the EBI (external bus interface) when the
>> device-
>> + tree is used. This bus supports NANDs, external ethernet
>> controller,
>> + SRAMs, ATA devices, etc.
>> +
>> endif
>> diff --git a/drivers/memory/Makefile b/drivers/memory/Makefile
>> index 969d923..101abc4 100644
>> --- a/drivers/memory/Makefile
>> +++ b/drivers/memory/Makefile
>> @@ -9,3 +9,4 @@ obj-$(CONFIG_TI_EMIF) += emif.o
>> obj-$(CONFIG_MVEBU_DEVBUS) += mvebu-devbus.o
>> obj-$(CONFIG_TEGRA20_MC) += tegra20-mc.o
>> obj-$(CONFIG_TEGRA30_MC) += tegra30-mc.o
>> +obj-$(CONFIG_ATMEL_SMC) += atmel-smc.o
>> diff --git a/drivers/memory/atmel-smc.c b/drivers/memory/atmel-smc.c
>> new file mode 100644
>> index 0000000..0a1d9ba
>> --- /dev/null
>> +++ b/drivers/memory/atmel-smc.c
>> @@ -0,0 +1,431 @@
>> +/*
>> + * EBI driver for Atmel SAM9 chips
>> + * inspired by the fsl weim bus driver
>> + *
>> + * Copyright (C) 2013 JJ Hiblot.
>> + *
>> + * This file is licensed under the terms of the GNU General Public
>> + * License version 2. This program is licensed "as is" without any
>> + * warranty of any kind, whether express or implied.
>> + */
>> +#include <linux/module.h>
>> +#include <linux/clk.h>
>> +#include <linux/io.h>
>> +#include <linux/of_device.h>
>> +#include <mach/at91sam9_smc.h>
>
>
> You should avoid machine specific headers inclusions: we're trying to get
> rid of them.
>
> Duplicate the code and macros you need in your driver instead.

Is this the right way? We usually try to avoid duplication.

>
>
>> +
>> +struct smc_data {
>> + struct clk *bus_clk;
>> + void __iomem *base;
>> + struct device *dev;
>> +};
>> +
>> +struct at91_smc_devtype {
>> + unsigned int cs_count;
>> +};
>> +
>> +static const struct at91_smc_devtype sam9261_smc_devtype = {
>> + .cs_count = 6,
>> +};
>> +
>> +static const struct of_device_id smc_id_table[] = {
>> + { .compatible = "atmel,at91sam9261-smc", .data =
>> &sam9261_smc_devtype},
>> + { }
>> +};
>> +MODULE_DEVICE_TABLE(of, smc_id_table);
>> +
>> +struct smc_parameters_type {
>> + const char *name;
>> + u16 width;
>> + u16 shift;
>> +};
>> +
>> +static const struct smc_parameters_type smc_parameters[] = {
>> + {"smc,burst_size", 2, 28},
>> + {"smc,burst_enabled", 1, 24},
>> + {"smc,tdf_mode", 1, 20},
>> + {"smc,bus_width", 2, 12},
>> + {"smc,byte_access_type", 1, 8},
>> + {"smc,nwait_mode", 2, 4},
>> + {"smc,write_mode", 1, 0},
>> + {"smc,read_mode", 1, 1},
>> + {NULL}
>> +};
>> +
>> +static int get_mode_register_from_dt(struct smc_data *smc,
>> + struct device_node *np,
>> + struct sam9_smc_config *cfg)
>> +{
>> + int ret;
>> + u32 val;
>> + struct device *dev = smc->dev;
>> + const struct smc_parameters_type *p = smc_parameters;
>> + u32 mode = cfg->mode;
>> +
>> + while (p->name) {
>> + ret = of_property_read_u32(np, p->name , &val);
>> + if (ret == -EINVAL) {
>> + dev_dbg(dev, "%s: property %s not set.\n",
>> np->name,
>> + p->name);
>> + p++;
>> + continue;
>> + } else if (ret) {
>> + dev_err(dev, "%s: can't get property %s.\n",
>> np->name,
>> + p->name);
>> + return ret;
>> + }
>> + if (val >= (1<<p->width)) {
>> + dev_err(dev, "%s: property %s out of range.\n",
>> + np->name, p->name);
>> + return -ERANGE;
>> + }
>> + mode &= ~(((1<<p->width)-1) << p->shift);
>> + mode |= (val << p->shift);
>> + p++;
>> + }
>> + cfg->mode = mode;
>> + return 0;
>> +}
>> +
>> +static int generic_timing_from_dt(struct smc_data *smc, struct
>> device_node *np,
>> + struct sam9_smc_config *cfg)
>> +{
>> + u32 val;
>> +
>> + if (!of_property_read_u32(np, "smc,ncs_read_setup" , &val))
>> + cfg->ncs_read_setup = val;
>> +
>> + if (!of_property_read_u32(np, "smc,nrd_setup" , &val))
>> + cfg->nrd_setup = val;
>> +
>> + if (!of_property_read_u32(np, "smc,ncs_write_setup" , &val))
>> + cfg->ncs_write_setup = val;
>> +
>> + if (!of_property_read_u32(np, "smc,nwe_setup" , &val))
>> + cfg->nwe_setup = val;
>> +
>> + if (!of_property_read_u32(np, "smc,ncs_read_pulse" , &val))
>> + cfg->ncs_read_pulse = val;
>> +
>> + if (!of_property_read_u32(np, "smc,nrd_pulse" , &val))
>> + cfg->nrd_pulse = val;
>> +
>> + if (!of_property_read_u32(np, "smc,ncs_write_pulse" , &val))
>> + cfg->ncs_write_pulse = val;
>> +
>> + if (!of_property_read_u32(np, "smc,nwe_pulse" , &val))
>> + cfg->nwe_pulse = val;
>> +
>> + if (!of_property_read_u32(np, "smc,read_cycle" , &val))
>> + cfg->read_cycle = val;
>> +
>> + if (!of_property_read_u32(np, "smc,write_cycle" , &val))
>> + cfg->write_cycle = val;
>> +
>> + if (!of_property_read_u32(np, "smc,tdf_cycles" , &val))
>> + cfg->tdf_cycles = val;
>> +
>> + return get_mode_register_from_dt(smc, np, cfg);
>> +}
>> +
>> +/* convert the time in ns in a number of clock cycles */
>> +static u32 ns_to_cycles(u32 ns, u32 clk)
>> +{
>> + /*
>> + * convert the clk to kHz for the rest of the calculation to avoid
>> + * overflow
>> + */
>> + u32 clk_kHz = clk / 1000;
>> + u32 ncycles = ((ns * clk_kHz) + 1000000 - 1) / 1000000;
>
> What about using an u64 type and do_div ?

easier and faster (though it's not the point here) this way, and kHz
ist not so imprecise :-)

>
>> + return ncycles;
>> +}
>> +
>> +static u32 cycles_to_coded_cycle(u32 cycles, int a, int b)
>> +{
>> + u32 mask_high = (1 << a) - 1;
>> + u32 mask_low = (1 << b) - 1;
>> + u32 coded;
>> +
>> + /* check if the value can be described with the coded format */
>> + if (cycles & (mask_high & ~mask_low)) {
>> + /* not representable. we need to round up */
>> + cycles |= mask_high;
>> + cycles += 1;
>> + }
>> + /* Now the value can be represented in the coded format */
>> + coded = (cycles & ~mask_high) >> (a - b);
>> + coded |= (cycles & mask_low);
>> + return coded;
>> +}
>> +
>> +static u32 ns_to_rw_cycles(u32 ns, u32 clk)
>> +{
>> + return cycles_to_coded_cycle(ns_to_cycles(ns, clk), 8, 7);
>> +}
>> +
>> +static u32 ns_to_pulse_cycles(u32 ns, u32 clk)
>> +{
>> + return cycles_to_coded_cycle(ns_to_cycles(ns, clk), 8, 6);
>> +}
>> +
>> +static u32 ns_to_setup_cycles(u32 ns, u32 clk)
>> +{
>> + return cycles_to_coded_cycle(ns_to_cycles(ns, clk), 7, 5);
>> +}
>> +
>> +static u32 cycles_to_ns(u32 cycles, u32 clk)
>> +{
>> + /*
>> + * convert the clk to kHz for the rest of the calculation to avoid
>> + * overflow
>> + */
>> + u32 clk_kHz = clk / 1000;
>
>
> Ditto (u64 + do_div).
>
>> + return (cycles * 1000000) / clk_kHz;
>> +}
>> +
>> +static u32 coded_cycle_to_cycles(u32 coded, int a, int b)
>> +{
>> + u32 cycles = (coded >> b) << a;
>> + u32 mask_low = (1 << b) - 1;
>> + cycles |= (coded & mask_low);
>> + return cycles;
>> +}
>> +
>> +static u32 rw_cycles_to_ns(u32 reg, u32 clk)
>> +{
>> + return cycles_to_ns(coded_cycle_to_cycles(reg, 8, 7), clk);
>> +}
>> +
>> +static u32 pulse_cycles_to_ns(u32 reg, u32 clk)
>> +{
>> + return cycles_to_ns(coded_cycle_to_cycles(reg, 8, 6), clk);
>> +}
>> +
>> +static u32 setup_cycles_to_ns(u32 reg, u32 clk)
>> +{
>> + return cycles_to_ns(coded_cycle_to_cycles(reg, 7, 5), clk);
>> +}
>> +
>> +static void dump_timing(struct smc_data *smc, struct sam9_smc_config
>> *config)
>> +{
>> + u32 freq = clk_get_rate(smc->bus_clk);
>> + struct device *dev = smc->dev;
>> +
>> +#define DUMP(fn, y) dev_info(dev, "%s : 0x%x (%d ns)\n", #y,
>> config->y,\
>> + fn(config->y, freq))
>> +#define DUMP_PULSE(y) DUMP(pulse_cycles_to_ns, y)
>> +#define DUMP_RWCYCLE(y) DUMP(rw_cycles_to_ns, y)
>> +#define DUMP_SETUP(y) DUMP(setup_cycles_to_ns, y)
>> +#define DUMP_SIMPLE(y) DUMP(cycles_to_ns, y)
>> +
>> + DUMP_SETUP(nwe_setup);
>> + DUMP_SETUP(ncs_write_setup);
>> + DUMP_SETUP(nrd_setup);
>> + DUMP_SETUP(ncs_read_setup);
>> + DUMP_PULSE(nwe_pulse);
>> + DUMP_PULSE(ncs_write_pulse);
>> + DUMP_PULSE(nrd_pulse);
>> + DUMP_PULSE(ncs_read_pulse);
>> + DUMP_RWCYCLE(write_cycle);
>> + DUMP_RWCYCLE(read_cycle);
>> + DUMP_SIMPLE(tdf_cycles);
>> +}
>> +
>> +static int ns_timing_from_dt(struct smc_data *smc, struct device_node
>> *np,
>> + struct sam9_smc_config *cfg)
>> +{
>> + u32 t_ns;
>> + u32 freq = clk_get_rate(smc->bus_clk);
>> +
>> + if (!of_property_read_u32(np, "smc,ncs_read_setup" , &t_ns))
>> + cfg->ncs_read_setup = ns_to_setup_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,nrd_setup" , &t_ns))
>> + cfg->nrd_setup = ns_to_setup_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,ncs_write_setup" , &t_ns))
>> + cfg->ncs_write_setup = ns_to_setup_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,nwe_setup" , &t_ns))
>> + cfg->nwe_setup = ns_to_setup_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,ncs_read_pulse" , &t_ns))
>> + cfg->ncs_read_pulse = ns_to_pulse_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,nrd_pulse" , &t_ns))
>> + cfg->nrd_pulse = ns_to_pulse_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,ncs_write_pulse" , &t_ns))
>> + cfg->ncs_write_pulse = ns_to_pulse_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,nwe_pulse" , &t_ns))
>> + cfg->nwe_pulse = ns_to_pulse_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,read_cycle" , &t_ns))
>> + cfg->read_cycle = ns_to_rw_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,write_cycle" , &t_ns))
>> + cfg->write_cycle = ns_to_rw_cycles(t_ns, freq);
>> +
>> + if (!of_property_read_u32(np, "smc,tdf_cycles" , &t_ns))
>> + cfg->tdf_cycles = ns_to_cycles(t_ns, freq);
>> +
>> + return get_mode_register_from_dt(smc, np, cfg);
>> +}
>> +
>> +struct converter {
>> + const char *name;
>> + int (*fn) (struct smc_data *smc, struct device_node *np,
>> + struct sam9_smc_config *cfg);
>> +};
>> +static const struct converter converters[] = {
>> + {"raw", generic_timing_from_dt},
>> + {"nanosec", ns_timing_from_dt},
>> +};
>
>
> Could you use more specific names like:
> "atmel,smc-converter-generic"
> "atmel,smc-converter-nand"
> ...
Isn't it a bit redudant? smc,converter = "atmel,smc-converter-generic";

>
> IMHO the timing unit should be specified in the property names:
> smc,ncs_read_setup-ns
> smc,ncs_read_setup-cycles
>
True. Although cycles is misleading. It's more a raw register value.
For pulse, setup and rw cycle, the register value is not identical to
the number of cycles.
>
>
>
>> +
>> +/* Parse and set the timing for this device. */
>> +static int smc_timing_setup(struct smc_data *smc, struct device_node *np,
>> + const struct at91_smc_devtype *devtype)
>> +{
>> + int ret;
>> + u32 cs;
>> + int i;
>> + struct device *dev = smc->dev;
>> + const struct converter *converter;
>> + const char *converter_name = NULL;
>> + struct sam9_smc_config cfg;
>> +
>> + ret = of_property_read_u32(np, "smc,cs" , &cs);
>
>
> Shouldn't this be stored in the reg property ?
> After all, in your DM9000 patch you use "@cs,offset" to identify the node...
True

>
>
>> + if (ret < 0) {
>> + dev_err(dev, "missing mandatory property : smc,cs\n");
>> + return ret;
>> + }
>> + if (cs >= devtype->cs_count) {
>> + dev_err(dev, "invalid value for property smc,cs (=%d)."
>> + "Must be in range 0 to %d\n", cs, devtype->cs_count-1);
>> + return -EINVAL;
>> + }
>> +
>> + of_property_read_string(np, "smc,converter", &converter_name);
>
>
> What about using the "compatible" property + struct of_device_id instead of
> "smc,converter" property + struct converter ?

Because one instance of the driver handles several chip selects and
each may use a different converter.

>
>
>> + if (converter_name) {
>> + for (i = 0; i < ARRAY_SIZE(converters); i++)
>> + if (strcmp(converters[i].name, converter_name) ==
>> 0)
>> + converter = &converters[i];
>> + if (!converter) {
>> + dev_info(dev, "unknown converter. aborting\n");
>> + return -EINVAL;
>> + }
>> + } else {
>> + dev_dbg(dev, "cs %d: no smc converter provided. using "
>> + "raw register values\n", cs);
>> + converter = &converters[0];
>> + }
>> + dev_dbg(dev, "cs %d using converter : %s\n", cs, converter->name);
>> + sam9_smc_cs_read(smc->base + (0x10 * cs), &cfg);
>> + converter->fn(smc, np, &cfg);
>> + ret = sam9_smc_check_cs_configuration(&cfg);
>> + if (ret < 0) {
>> + dev_info(dev, "invalid smc configuration for cs %d."
>> + "clipping values\n", cs);
>> + sam9_smc_clip_cs_configuration(&cfg);
>> + dump_timing(smc, &cfg);
>> + }
>> +#ifdef DEBUG
>> + else
>> + dump_timing(smc, &cfg);
>> +

twitter

Thursday, January 9, 2014

linux.kernel - 26 new messages in 19 topics - digest

0 Comments:

Post a Comment

About Me

Previous Posts