Friday, January 17, 2014

linux.kernel - 26 new messages in 17 topics - digest

linux.kernel
http://groups.google.com/group/linux.kernel?hl=en

linux.kernel@googlegroups.com

Today's topics:

* [PATCH] Remove bus dependency for iommu_domain_alloc. - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/ace715b4db0f0ed7?hl=en
* net: rfkill: gpio: Add device tree support - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/502551497c79a009?hl=en
* Change khugepaged to respect MMF_THP_DISABLE flag - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/eff632e7496a6e85?hl=en
* cgroup: make CONFIG_NET_CLS_CGROUP and CONFIG_NETPRIO_CGROUP bool instead of
tristate - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/3a3c8584cdd64136?hl=en
* vfio/iommu_type1: Multi-IOMMU domain support - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/ec14a94720cba15e?hl=en
* KVM: SVM: fix NMI window after iret - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/2825e0c4359b20ed?hl=en
* f2fs: clean checkpatch warnings - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/b78c3ed4b7d93688?hl=en
* Phase out pci_enable_msi_block() - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/7ded004c0804c466?hl=en
* linux-next: Tree for Jan 14 (lowpan, 802.15.4) - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/a5d16a1f0a317f20?hl=en
* dcache: fix d_splice_alias handling of aliases - 2 messages, 1 author
http://groups.google.com/group/linux.kernel/t/599bb447f08d6704?hl=en
* Why is (2 < 2) true? Is it a gcc bug? - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/587453bf46273e70?hl=en
* x86, mm, perf: Allow recursive faults from interrupts - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/3e53decd100762a7?hl=en
* Why does kexec use device_shutdown rather than ubind them - 1 messages, 1
author
http://groups.google.com/group/linux.kernel/t/45a18d6ad21d78a6?hl=en
* numa,sched: tracepoints for NUMA balancing active nodemask changes - 8
messages, 1 author
http://groups.google.com/group/linux.kernel/t/34ca8de3d5507b71?hl=en
* tracing/README: Add event file usage to tracing mini-HOWTO - 1 messages, 1
author
http://groups.google.com/group/linux.kernel/t/7df94aa1598c0f55?hl=en
* use ether_addr_equal_64bits - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/c126115ff68bc79d?hl=en
* 5769cf63: LTP semget02 TFAILs - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/2bc8ef4ef0e9a3a3?hl=en

==============================================================================
TOPIC: [PATCH] Remove bus dependency for iommu_domain_alloc.
http://groups.google.com/group/linux.kernel/t/ace715b4db0f0ed7?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 12:30 pm
From: Varun Sethi


> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Saturday, January 18, 2014 1:39 AM
> To: Sethi Varun-B16395
> Cc: joro@8bytes.org; iommu@lists.linux-foundation.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [RFC][PATCH] Remove bus dependency for iommu_domain_alloc.
>
> On Sat, 2014-01-18 at 01:00 +0530, Varun Sethi wrote:
> > This patch attempts to remove iommu_domain_alloc function's dependency
> > on the bus type.
> > This dependency is quite restrictive in case of vfio, where it's
> > possible to bind multiple iommu groups (from different bus types) to
> > the same iommu domain.
> >
> > This patch is based on the assumption, that there is a single iommu
> > for all bus types on the system.
> >
> > We maintain a list of bus types (for which iommu ops are registered).
> > In the iommu_domain_alloc function we ensure that all bus types
> > correspond to the same set of iommu operations.
>
> Seems like this just kicks the problem down the road a little ways as I
> expect the assumption isn't going to last long.  I think there's another
> way to do this and we can do it entirely from within vfio_iommu_type1.
> We have a problem on x86 that the IOMMU driver can be backed by multiple
> IOMMU hardware devices.  These separate devices are architecturally
> allowed to have different properties.  The property causing us trouble is
> cache coherency.  Some hardware devices allow us to use IOMMU_CACHE as a
> mapping attribute, others do not.  Therefore we cannot use a single IOMMU
> domain to optimally handle all devices in a heterogeneous environment.
>
> I think the solution to this is to have vfio_iommu_type1 transparently
> support multiple IOMMU domains.  In the implementation of that, it seems
> to make sense to move the iommu_domain_alloc() to the point where we
> attach a group to the domain.  That means we can scan the devices in the
[Sethi Varun-B16395] Multiple iommu groups can also share the same domain
(as a part of the same VFIO container). I am not sure how we can handle the
case of iommu groups from different bus types in vfio.

-Varun

> domain to determine the bus.  I suppose there is still an assumption that
> all the devices in a group are on the same bus, but since the group is
> determined by the IOMMU and we already assume only a single IOMMU per
> bus, I think we're ok.  I spent some time working on a patch to do this,
> but it isn't quite finished.  I'll try to bandage the rough edges and
> send it out as an RFC so you can see what I'm talking about.  Thanks,
>
> Alex
>
> > Signed-off-by: Varun Sethi <Varun.Sethi@freescale.com>
> > ---
> >  arch/arm/mm/dma-mapping.c             |    2 +-
> >  drivers/gpu/drm/msm/msm_gpu.c         |    2 +-
> >  drivers/iommu/amd_iommu_v2.c          |    2 +-
> >  drivers/iommu/iommu.c                 |   32 +++++++++++++++++++++++++++++---
> >  drivers/media/platform/omap3isp/isp.c |    2 +-
> >  drivers/remoteproc/remoteproc_core.c  |    2 +-
> >  drivers/vfio/vfio_iommu_type1.c       |    2 +-
> >  include/linux/device.h                |    2 ++
> >  include/linux/iommu.h                 |    4 ++--
> >  virt/kvm/iommu.c                      |    2 +-
> >  10 files changed, 40 insertions(+), 12 deletions(-)
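The approach Alex outlines in this thread, allocating the IOMMU domain when a group is attached and scanning the group's devices to determine the bus, relies on every device in a group sharing one bus. A minimal userspace sketch of that check follows. `struct fake_bus`, `struct fake_device` and `group_bus_type()` are hypothetical stand-ins for the kernel's `struct bus_type`, `struct device` and a per-device callback; they are not real kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct bus_type and struct device. */
struct fake_bus { const char *name; };
struct fake_device { const struct fake_bus *bus; };

/*
 * Return the single bus shared by every device in a group, or NULL if
 * the group is empty or its devices sit on more than one bus (the case
 * where no single set of IOMMU ops can be chosen for the domain).
 */
static const struct fake_bus *group_bus_type(const struct fake_device *devs,
                                             size_t ndevs)
{
    const struct fake_bus *bus = NULL;

    for (size_t i = 0; i < ndevs; i++) {
        if (bus && bus != devs[i].bus)
            return NULL;    /* mixed buses within one group */
        bus = devs[i].bus;
    }
    return bus;
}
```

The vfio_bus_type() callback in Alex's later RFC performs the same comparison, one device at a time, and returns -EINVAL on a mismatch.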
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/





==============================================================================
TOPIC: net: rfkill: gpio: Add device tree support
http://groups.google.com/group/linux.kernel/t/502551497c79a009?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 12:30 pm
From: Johannes Berg


On Fri, 2014-01-17 at 14:47 +0800, Chen-Yu Tsai wrote:

> This patch series adds device tree support to rfkill-gpio, and
> fixes some issues I ran into. This is so we can define and control
> RF devices through the device tree, such as the Broadcom BCM20710
> UART-based Bluetooth device found on the CubieTruck,

> Comments, please?

Does anyone else want to maintain rfkill-gpio? :)

I'm not up to par on all the DT, ACPI and even GPIO stuff it does, and
the rfkill bits in it are really small ...

I'll happily apply the patches if everyone else is happy with them, but
please don't expect me to actually be able to say anything about them.

johannes






==============================================================================
TOPIC: Change khugepaged to respect MMF_THP_DISABLE flag
http://groups.google.com/group/linux.kernel/t/eff632e7496a6e85?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 12:40 pm
From: Oleg Nesterov


On 01/16, Alex Thorlton wrote:
>
> static inline int khugepaged_test_exit(struct mm_struct *mm)
> {
> - return atomic_read(&mm->mm_users) == 0;
> + return atomic_read(&mm->mm_users) == 0 ||
> + (mm->flags & MMF_THP_DISABLE_MASK);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

test_bit(MMF_THP_DISABLE) ?

And I am not sure this and another check in transparent_hugepage_enabled
is actually right...

I think that MMF_THP_DISABLE_MASK should not disable thp if this
vma has VM_HUGEPAGE set, iow perhaps madvise() should work even
after PR_SET_THP_DISABLE?

IOW, MMF_THP_DISABLE should act as khugepaged_req_madv().
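Oleg's two points, testing the flag with a bitop rather than a mask and letting a vma marked with madvise(MADV_HUGEPAGE) keep THP after PR_SET_THP_DISABLE, can be sketched in userspace as below. The constant values, `test_bit_ul()` and `thp_allowed()` are hypothetical illustrations, not kernel code; the sketch also assumes the global THP mode is "always" and ignores VM_NOHUGEPAGE.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; the kernel's own constants may differ. */
#define MMF_THP_DISABLE      24                      /* a bit number */
#define MMF_THP_DISABLE_MASK (1UL << MMF_THP_DISABLE)
#define VM_HUGEPAGE          0x20000000UL            /* a vma flag   */

/* Hypothetical userspace stand-in for the kernel's test_bit(nr, addr). */
static bool test_bit_ul(int nr, const unsigned long *addr)
{
    return (*addr >> nr) & 1UL;
}

/*
 * Policy suggested in the reply: MMF_THP_DISABLE acts like the
 * "madvise" mode, so only vmas explicitly marked VM_HUGEPAGE still get
 * huge pages. Without the flag, THP is allowed (assumes mode "always").
 */
static bool thp_allowed(unsigned long mm_flags, unsigned long vm_flags)
{
    if (vm_flags & VM_HUGEPAGE)
        return true;
    return !test_bit_ul(MMF_THP_DISABLE, &mm_flags);
}
```

The mask test and the bit test give the same answer; the bitop form simply matches how other MMF_* flags in mm->flags are usually tested.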

But again, I won't argue.

Oleg.






==============================================================================
TOPIC: cgroup: make CONFIG_NET_CLS_CGROUP and CONFIG_NETPRIO_CGROUP bool
instead of tristate
http://groups.google.com/group/linux.kernel/t/3a3c8584cdd64136?hl=en
==============================================================================

== 1 of 2 ==
Date: Fri, Jan 17 2014 12:40 pm
From: Neil Horman


On Fri, Jan 17, 2014 at 01:11:52PM -0500, Tejun Heo wrote:
> net_cls and net_prio are the only cgroups which are allowed to be
> built as modules. The savings from allowing the two controllers to be
> built as modules are tiny especially given that cgroup module support
> itself adds quite a bit of complexity.
>
> The following are the sizes of vmlinux with both built as module and
> both built as part of the kernel image with cgroup module support
> removed.
>
>     text    data      bss      dec
> 20292207 2411496 10784768 33488471
> 20293421 2412568 10784768 33490757
>
> The total difference is 2286 bytes. Given that none of the other
> controllers has much chance of being made a module and that we're
> unlikely to add new modular controllers, the added complexity is
> simply not justifiable.
>
> As a first step to drop cgroup module support, this patch changes the
> two config options to bool from tristate and drops module related code
> from the two controllers.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
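The size table quoted above is `size vmlinux` output, where dec is text + data + bss. The sketch below recomputes the per-column growth, assuming the first row is the build with both controllers as modules and the second is the build with both built in, in the order the text lists them. `struct size_row` and `size_delta()` are hypothetical helpers for illustration.

```c
#include <assert.h>

/* One row of `size vmlinux` output; dec is text + data + bss. */
struct size_row { long text, data, bss, dec; };

/* Rows from the quoted table (order assumed as described above). */
static const struct size_row as_modules = { 20292207, 2411496, 10784768, 33488471 };
static const struct size_row built_in   = { 20293421, 2412568, 10784768, 33490757 };

/* Per-column growth going from row a to row b. */
static struct size_row size_delta(struct size_row a, struct size_row b)
{
    struct size_row d = { b.text - a.text, b.data - a.data,
                          b.bss - a.bss, b.dec - a.dec };
    return d;
}
```

size_delta(as_modules, built_in) gives text +1214, data +1072, bss 0 and dec +2286, which matches the 2286-byte total in the patch description.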





== 2 of 2 ==
Date: Fri, Jan 17 2014 1:00 pm
From: Neil Horman


On Fri, Jan 17, 2014 at 01:11:54PM -0500, Tejun Heo wrote:
> cgroup_subsys is a bit messier than it needs to be.
>
> * The name of a subsys can be different from its internal identifier
> defined in cgroup_subsys.h. Most subsystems use the matching name
> but three - cpu, memory and perf_event - use different ones.
>
> * cgroup_subsys_id enums are postfixed with _subsys_id and each
> cgroup_subsys is postfixed with _subsys. cgroup.h is widely
> included throughout various subsystems; it doesn't and shouldn't
> have a claim on such generic names which don't have any qualifier
> indicating that they belong to cgroup.
>
> * cgroup_subsys->subsys_id should always equal the matching
> cgroup_subsys_id enum; however, we require each controller to
> initialize it and then BUG if they don't match, which is a bit
> silly.
>
> This patch cleans up cgroup_subsys names and initialization by doing
> the following.
>
> * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
> cgroup_subsys with _cgrp_subsys.
>
> * With the above, renaming subsys identifiers to match the userland
> visible names doesn't cause any naming conflicts. All non-matching
> identifiers are renamed to match the official names.
>
> cpu_cgroup -> cpu
> mem_cgroup -> memory
> perf -> perf_event
>
> * controllers no longer need to initialize ->subsys_id and ->name.
> They're generated in cgroup core and set automatically during boot.
>
> * Redundant cgroup_subsys declarations removed.
>
> * While updating BUG_ON()s in cgroup_init_early(), convert them to
> WARN()s. BUGging that early during boot is stupid - the kernel
> can't print anything, even through serial console and the trap
> handler doesn't even link stack frames properly for back-tracing.
>
> This patch doesn't introduce any behavior changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Aristeu Rozanski <aris@redhat.com>
> Cc: Serge E. Hallyn <serue@us.ibm.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: "David S. Miller" <davem@davemloft.net>
For the netprio and net_cls bits
Acked-by: Neil Horman <nhorman@tuxdriver.com>
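The renaming and the automatically generated ->subsys_id and ->name described in this patch follow the X-macro pattern: a single list of controller names is expanded several times, so ids and names cannot drift apart. In the kernel that list is cgroup_subsys.h, essentially a sequence of SUBSYS() entries. The self-contained sketch below instead uses a hypothetical four-entry FOR_EACH_SUBSYS() list so it compiles on its own; `subsys_id_by_name()` is likewise illustrative.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the SUBSYS() list in cgroup_subsys.h. */
#define FOR_EACH_SUBSYS(SUBSYS) \
    SUBSYS(cpu)                 \
    SUBSYS(memory)              \
    SUBSYS(perf_event)          \
    SUBSYS(net_cls)

/* Expands to cpu_cgrp_id, memory_cgrp_id, ... as in the patch. */
#define SUBSYS_ID(_x) _x##_cgrp_id,
enum cgroup_subsys_id {
    FOR_EACH_SUBSYS(SUBSYS_ID)
    CGROUP_SUBSYS_COUNT,
};
#undef SUBSYS_ID

/* Userland-visible names generated from the same list. */
#define SUBSYS_NAME(_x) [_x##_cgrp_id] = #_x,
static const char *const cgroup_subsys_name[] = {
    FOR_EACH_SUBSYS(SUBSYS_NAME)
};
#undef SUBSYS_NAME

/* Look up a controller id by its userland name; -1 if unknown. */
static int subsys_id_by_name(const char *name)
{
    for (int i = 0; i < CGROUP_SUBSYS_COUNT; i++)
        if (strcmp(cgroup_subsys_name[i], name) == 0)
            return i;
    return -1;
}
```

Because the identifiers now carry the _cgrp_id suffix, the list entries can use the official names (memory, perf_event) without colliding with other identifiers, which is the point of the rename.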





==============================================================================
TOPIC: vfio/iommu_type1: Multi-IOMMU domain support
http://groups.google.com/group/linux.kernel/t/ec14a94720cba15e?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 12:40 pm
From: Alex Williamson


RFC: This is not complete but I want to share with Varun the
direction I'm thinking about. In particular, I'm really not
sure if we want to introduce a "v2" interface version with
slightly different unmap semantics. QEMU doesn't care about
the difference, but other users might. Be warned, I'm not even
sure if this code works at the moment. Thanks,

Alex


We currently have a problem that we cannot support advanced features
of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
that those features will be supported by all of the hardware units
involved with the domain over its lifetime. For instance, the Intel
VT-d architecture does not require that all DRHDs support snoop
control. If we create a domain based on a device behind a DRHD that
does support snoop control and enable SNP support via the IOMMU_CACHE
mapping option, we cannot then add a device behind a DRHD which does
not support snoop control or we'll get reserved bit faults from the
SNP bit in the pagetables. To add to the complexity, we can't know
the properties of a domain until a device is attached.
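In the RFC, the snoop-control capability becomes a per-domain prot value (vfio_domain.prot holding IOMMU_CACHE) that is ORed into every mapping, as the `prot | d->prot` calls in the diff show. The userspace model below sketches that idea. The flag values, `struct hw_unit`, `struct domain_model` and the rule that a unit shares a domain only when it would yield identical page-table flags are all assumptions of this sketch, not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values for the IOMMU_* prot bits. */
#define IOMMU_READ  (1 << 0)
#define IOMMU_WRITE (1 << 1)
#define IOMMU_CACHE (1 << 2)

/* Hypothetical model of one hardware unit (e.g. a VT-d DRHD). */
struct hw_unit { bool snoop_control; };

/* Hypothetical model of per-domain state (cf. vfio_domain.prot). */
struct domain_model { int prot; };

/* Extra prot bits a domain backed by this unit can safely use. */
static int unit_prot(const struct hw_unit *hw)
{
    return hw->snoop_control ? IOMMU_CACHE : 0;
}

static void domain_init(struct domain_model *d, const struct hw_unit *hw)
{
    d->prot = unit_prot(hw);
}

/*
 * Assumed policy: a unit joins an existing domain only if it would
 * produce the same page-table flags; otherwise it needs its own domain,
 * avoiding the reserved-bit faults described above.
 */
static bool domain_can_share(const struct domain_model *d,
                             const struct hw_unit *hw)
{
    return d->prot == unit_prot(hw);
}

/* Flags written into this domain's page tables for a user mapping. */
static int domain_map_prot(const struct domain_model *d, int user_prot)
{
    return user_prot | d->prot;
}
```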

We could pass this problem off to userspace and require that a
separate vfio container be used, but we don't know how to handle page
accounting in that case. How do we know that a page pinned in one
container is the same page as one pinned in a different container, and
avoid double billing the user for the page?

The solution is therefore to support multiple IOMMU domains per
container. In the majority of cases, only one domain will be required
since hardware is typically consistent within a system. However, this
provides us the ability to validate compatibility of domains and
support mixed environments where page table flags can be different
between domains.
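One compatibility question across domains is page size. The patch's vfio_pgsize_bitmap() ANDs the pgsize_bitmap of every domain (starting from PAGE_MASK), and the map and unmap paths derive an alignment mask from the lowest set bit. The sketch below reproduces that arithmetic in userspace. It assumes 4 KiB pages, and it uses the GCC/Clang builtin `__builtin_ctzl` in place of the kernel's `__ffs()`; `pgsize_bitmap()` and `align_mask()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL                  /* assume 4 KiB pages */
#define MODEL_PAGE_MASK (~(MODEL_PAGE_SIZE - 1))

/* Page sizes supported by every domain (intersection of bitmaps). */
static unsigned long pgsize_bitmap(const unsigned long *bitmaps, size_t n)
{
    unsigned long bitmap = MODEL_PAGE_MASK;

    for (size_t i = 0; i < n; i++)
        bitmap &= bitmaps[i];
    return bitmap;
}

/* Alignment mask from the smallest common page size. */
static uint64_t align_mask(unsigned long bitmap)
{
    assert(bitmap != 0);    /* ctz of 0 is undefined */
    return ((uint64_t)1 << __builtin_ctzl(bitmap)) - 1;
}
```

For example, domains supporting {4K, 2M, 1G} and {4K, 2M} intersect to {4K, 2M}, so iova and vaddr need only 4 KiB alignment; if one domain supported only 2M, every mapping would need 2 MiB alignment.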

To do this, our DMA tracking needs to change. We currently try to
coalesce user mappings into as few tracking entries as possible. The
problem then becomes that we lose granularity of user mappings. We've
never guaranteed that a user is able to unmap at a finer granularity
than the original mapping, but we must honor the granularity of the
original mapping. This coalescing code is therefore removed, allowing
only unmaps covering complete maps. The change in accounting is
fairly small here; a typical QEMU VM will start out with roughly a
dozen entries, so it's arguable whether this coalescing was ever needed.
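The stricter "v2" rule described in the diff (an unmap may cover several whole mappings but must not bisect any of them) can be modelled over a plain array of mappings. This is a sketch under assumptions. `struct mapping`, `find_containing()` and `unmap_v2()` are hypothetical stand-ins for the rb-tree of vfio_dma entries, and rejecting a zero-sized request is this sketch's own choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct vfio_dma: one user mapping. */
struct mapping { uint64_t iova, size; bool live; };

/* Index of the live mapping containing addr, or -1. */
static int find_containing(const struct mapping *m, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (m[i].live && addr >= m[i].iova && addr - m[i].iova < m[i].size)
            return (int)i;
    return -1;
}

/*
 * v2-style unmap of [iova, iova + size): fails with -EINVAL if either
 * end falls strictly inside a mapping; otherwise removes every mapping
 * that starts in the range and reports the total in *unmapped (0 if
 * nothing was mapped there).
 */
static int unmap_v2(struct mapping *m, size_t n, uint64_t iova,
                    uint64_t size, uint64_t *unmapped)
{
    int i;

    *unmapped = 0;
    if (!size)
        return -EINVAL;

    i = find_containing(m, n, iova);
    if (i >= 0 && m[i].iova != iova)
        return -EINVAL;    /* range starts inside a mapping */
    i = find_containing(m, n, iova + size - 1);
    if (i >= 0 && m[i].iova + m[i].size != iova + size)
        return -EINVAL;    /* range ends inside a mapping */

    /* Both ends are clean, so any mapping starting here is fully covered. */
    for (size_t j = 0; j < n; j++) {
        if (m[j].live && m[j].iova >= iova && m[j].iova - iova < size) {
            *unmapped += m[j].size;
            m[j].live = false;
        }
    }
    return 0;
}
```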

We also move IOMMU domain creation to the point where a group is
attached to the container. An interesting side-effect of this is that
we now have access to the device at the time of domain creation and
can probe the devices within the group to determine the bus_type.
This finally makes vfio_iommu_type1 completely device/bus agnostic.
In fact, each IOMMU domain can host devices on different buses managed
by different physical IOMMUs, and present a single DMA mapping
interface to the user. When a new domain is created, mappings are
replayed to bring the IOMMU pagetables up to the state of the current
container. And of course, DMA mapping and unmapping automatically
traverse all of the configured IOMMU domains.
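Mapping into every domain has to be all-or-nothing, or the domains' page tables would diverge. The diff's vfio_iommu_map() walks the domain list and, on failure, unmaps the domains already done in reverse order. The userspace model below keeps only that control flow and omits the -EBUSY / map_try_harder() fallback. `struct fake_domain`, `model_map()`, `model_unmap()` and `map_all()` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of one IOMMU domain. */
struct fake_domain {
    unsigned long mapped;   /* bytes currently mapped */
    int fail;               /* test hook: make the next map fail */
};

static int model_map(struct fake_domain *d, unsigned long size)
{
    if (d->fail)
        return -ENOMEM;
    d->mapped += size;
    return 0;
}

static void model_unmap(struct fake_domain *d, unsigned long size)
{
    d->mapped -= size;
}

/*
 * Map one chunk into every domain. If domain i fails, undo domains
 * i-1 .. 0 (newest first) so all domains stay identical.
 */
static int map_all(struct fake_domain *doms, size_t n, unsigned long size)
{
    size_t i;
    int ret = 0;

    for (i = 0; i < n; i++) {
        ret = model_map(&doms[i], size);
        if (ret)
            break;
    }
    if (ret) {
        while (i-- > 0)
            model_unmap(&doms[i], size);
    }
    return ret;
}
```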

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

drivers/vfio/vfio_iommu_type1.c | 631 ++++++++++++++++++++-------------------
include/uapi/linux/vfio.h | 1
2 files changed, 329 insertions(+), 303 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4fb7a8f..983aae5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -30,7 +30,6 @@
#include <linux/iommu.h>
#include <linux/module.h>
#include <linux/mm.h>
-#include <linux/pci.h> /* pci_bus_type */
#include <linux/rbtree.h>
#include <linux/sched.h>
#include <linux/slab.h>
@@ -55,11 +54,18 @@ MODULE_PARM_DESC(disable_hugepages,
"Disable VFIO IOMMU support for IOMMU hugepages.");

struct vfio_iommu {
- struct iommu_domain *domain;
+ struct list_head domain_list;
struct mutex lock;
struct rb_root dma_list;
+ bool v2;
+};
+
+struct vfio_domain {
+ struct iommu_domain *domain;
+ struct bus_type *bus;
+ struct list_head next;
struct list_head group_list;
- bool cache;
+ int prot; /* IOMMU_CACHE */
};

struct vfio_dma {
@@ -99,7 +105,7 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
return NULL;
}

-static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
+static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
{
struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
struct vfio_dma *dma;
@@ -118,7 +124,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
rb_insert_color(&new->node, &iommu->dma_list);
}

-static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
+static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
{
rb_erase(&old->node, &iommu->dma_list);
}
@@ -322,32 +328,39 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
return unlocked;
}

-static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
- dma_addr_t iova, size_t *size)
+static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
- dma_addr_t start = iova, end = iova + *size;
+ dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
+ struct vfio_domain *domain, *d;
long unlocked = 0;

+ if (!dma->size)
+ return;
+ /*
+ * We use the IOMMU to track the physical addresses, otherwise we'd
+ * need a much more complicated tracking system. Unfortunately that
+ * means we need to use one of the iommu domains to figure out the
+ * pfns to unpin. The rest need to be unmapped in advance so we have
+ * no iommu translations remaining when the pages are unpinned.
+ */
+ domain = d = list_first_entry(&iommu->domain_list,
+ struct vfio_domain, next);
+
+ list_for_each_entry_continue(d, &iommu->domain_list, next)
+ iommu_unmap(d->domain, dma->iova, dma->size);
+
while (iova < end) {
size_t unmapped;
phys_addr_t phys;

- /*
- * We use the IOMMU to track the physical address. This
- * saves us from having a lot more entries in our mapping
- * tree. The downside is that we don't track the size
- * used to do the mapping. We request unmap of a single
- * page, but expect IOMMUs that support large pages to
- * unmap a larger chunk.
- */
- phys = iommu_iova_to_phys(iommu->domain, iova);
+ phys = iommu_iova_to_phys(domain->domain, iova);
if (WARN_ON(!phys)) {
iova += PAGE_SIZE;
continue;
}

- unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
- if (!unmapped)
+ unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
+ if (WARN_ON(!unmapped))
break;

unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
@@ -357,119 +370,26 @@ static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
}

vfio_lock_acct(-unlocked);
-
- *size = iova - start;
-
- return 0;
}

-static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
- size_t *size, struct vfio_dma *dma)
+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
{
- size_t offset, overlap, tmp;
- struct vfio_dma *split;
- int ret;
-
- if (!*size)
- return 0;
-
- /*
- * Existing dma region is completely covered, unmap all. This is
- * the likely case since userspace tends to map and unmap buffers
- * in one shot rather than multiple mappings within a buffer.
- */
- if (likely(start <= dma->iova &&
- start + *size >= dma->iova + dma->size)) {
- *size = dma->size;
- ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
- if (ret)
- return ret;
-
- /*
- * Did we remove more than we have? Should never happen
- * since a vfio_dma is contiguous in iova and vaddr.
- */
- WARN_ON(*size != dma->size);
-
- vfio_remove_dma(iommu, dma);
- kfree(dma);
- return 0;
- }
-
- /* Overlap low address of existing range */
- if (start <= dma->iova) {
- overlap = start + *size - dma->iova;
- ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
- if (ret)
- return ret;
-
- vfio_remove_dma(iommu, dma);
-
- /*
- * Check, we may have removed to whole vfio_dma. If not
- * fixup and re-insert.
- */
- if (overlap < dma->size) {
- dma->iova += overlap;
- dma->vaddr += overlap;
- dma->size -= overlap;
- vfio_insert_dma(iommu, dma);
- } else
- kfree(dma);
-
- *size = overlap;
- return 0;
- }
-
- /* Overlap high address of existing range */
- if (start + *size >= dma->iova + dma->size) {
- offset = start - dma->iova;
- overlap = dma->size - offset;
-
- ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
- if (ret)
- return ret;
-
- dma->size -= overlap;
- *size = overlap;
- return 0;
- }
-
- /* Split existing */
-
- /*
- * Allocate our tracking structure early even though it may not
- * be used. An Allocation failure later loses track of pages and
- * is more difficult to unwind.
- */
- split = kzalloc(sizeof(*split), GFP_KERNEL);
- if (!split)
- return -ENOMEM;
-
- offset = start - dma->iova;
-
- ret = vfio_unmap_unpin(iommu, dma, start, size);
- if (ret || !*size) {
- kfree(split);
- return ret;
- }
-
- tmp = dma->size;
+ vfio_unmap_unpin(iommu, dma);
+ vfio_unlink_dma(iommu, dma);
+ kfree(dma);
+}

- /* Resize the lower vfio_dma in place, before the below insert */
- dma->size = offset;
+static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
+{
+ struct vfio_domain *domain;
+ unsigned long bitmap = PAGE_MASK;

- /* Insert new for remainder, assuming it didn't all get unmapped */
- if (likely(offset + *size < tmp)) {
- split->size = tmp - offset - *size;
- split->iova = dma->iova + offset + *size;
- split->vaddr = dma->vaddr + offset + *size;
- split->prot = dma->prot;
- vfio_insert_dma(iommu, split);
- } else
- kfree(split);
+ mutex_lock(&iommu->lock);
+ list_for_each_entry(domain, &iommu->domain_list, next)
+ bitmap &= domain->domain->ops->pgsize_bitmap;
+ mutex_unlock(&iommu->lock);

- return 0;
+ return bitmap;
}

static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
@@ -477,10 +397,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
{
uint64_t mask;
struct vfio_dma *dma;
- size_t unmapped = 0, size;
+ size_t unmapped = 0;
int ret = 0;

- mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+ mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;

if (unmap->iova & mask)
return -EINVAL;
@@ -491,20 +411,61 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,

mutex_lock(&iommu->lock);

+ /*
+ * vfio-iommu-type1 (v1) - User mappings were coalesced together to
+ * avoid tracking individual mappings. This means that the granularity
+ * of the original mapping was lost and the user was allowed to attempt
+ * to unmap any range. Depending on the contiguousness of physical
+ * memory and page sizes supported by the IOMMU, arbitrary unmaps may
+ * or may not have worked. We only guaranteed unmap granularity
+ * matching the original mapping; even though it was untracked here,
+ * the original mappings are reflected in IOMMU mappings. This
+ * resulted in a couple unusual behaviors. First, if a range is not
+ * able to be unmapped, ex. a set of 4k pages that was mapped as a
+ * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
+ * a zero sized unmap. Also, if an unmap request overlaps the first
+ * address of a hugepage, the IOMMU will unmap the entire hugepage.
+ * This also returns success and the returned unmap size reflects the
+ * actual size unmapped.
+ *
+ * We attempt to maintain compatibility with this interface, but we
+ * take control out of the hands of the IOMMU. An unmap request offset
+ * from the beginning of the original mapping will return success with
+ * zero sized unmap. An unmap request covering the first iova of
+ * mapping will unmap the entire range.
+ *
+ * The v2 version of this interface intends to be more deterministic.
+ * Unmap requests must fully cover previous mappings. Multiple
+ * mappings may still be unmaped by specifying large ranges, but there
+ * must not be any previous mappings bisected by the range. An error
+ * will be returned if these conditions are not met. The v2 interface
+ * will only return success and a size of zero if there were no
+ * mappings within the range.
+ */
+ if (iommu->v2) {
+ dma = vfio_find_dma(iommu, unmap->iova, 0);
+ if (dma && dma->iova != unmap->iova) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+ dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
+ if (dma && dma->iova + dma->size != unmap->iova + unmap->size) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+ }
+
while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
- size = unmap->size;
- ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size, dma);
- if (ret || !size)
+ if (!iommu->v2 && unmap->iova > dma->iova)
break;
- unmapped += size;
+ unmapped += dma->size;
+ vfio_remove_dma(iommu, dma);
}

+unlock:
mutex_unlock(&iommu->lock);

- /*
- * We may unmap more than requested, update the unmap struct so
- * userspace can know.
- */
+ /* Report how much was unmapped */
unmap->size = unmapped;

return ret;
@@ -516,22 +477,47 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
* soon, so this is just a temporary workaround to break mappings down into
* PAGE_SIZE. Better to map smaller pages than nothing.
*/
-static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
+static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
unsigned long pfn, long npage, int prot)
{
long i;
int ret;

for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
- ret = iommu_map(iommu->domain, iova,
+ ret = iommu_map(domain->domain, iova,
(phys_addr_t)pfn << PAGE_SHIFT,
- PAGE_SIZE, prot);
+ PAGE_SIZE, prot | domain->prot);
if (ret)
break;
}

for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
- iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+ iommu_unmap(domain->domain, iova, PAGE_SIZE);
+
+ return ret;
+}
+
+static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
+ unsigned long pfn, long npage, int prot)
+{
+ struct vfio_domain *d;
+ int ret;
+
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
+ npage << PAGE_SHIFT, prot | d->prot);
+ if (ret) {
+ if (ret != -EBUSY ||
+ map_try_harder(d, iova, pfn, npage, prot))
+ goto unwind;
+ }
+ }
+
+ return 0;
+
+unwind:
+ list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
+ iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);

return ret;
}
@@ -545,12 +531,12 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
long npage;
int ret = 0, prot = 0;
uint64_t mask;
- struct vfio_dma *dma = NULL;
+ struct vfio_dma *dma;
unsigned long pfn;

end = map->iova + map->size;

- mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+ mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;

/* READ/WRITE from device perspective */
if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
@@ -561,9 +547,6 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
if (!prot)
return -EINVAL; /* No READ/WRITE? */

- if (iommu->cache)
- prot |= IOMMU_CACHE;
-
if (vaddr & mask)
return -EINVAL;
if (map->iova & mask)
@@ -588,180 +571,249 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
return -EEXIST;
}

- for (iova = map->iova; iova < end; iova += size, vaddr += size) {
- long i;
+ dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+ if (!dma) {
+ mutex_unlock(&iommu->lock);
+ return -ENOMEM;
+ }

+ dma->iova = map->iova;
+ dma->vaddr = map->vaddr;
+ dma->prot = prot;
+
+ /* Insert zero-sized and grow as we map chunks of it */
+ vfio_link_dma(iommu, dma);
+
+ for (iova = map->iova; iova < end; iova += size, vaddr += size) {
/* Pin a contiguous chunk of memory */
npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
prot, &pfn);
if (npage <= 0) {
WARN_ON(!npage);
ret = (int)npage;
- goto out;
- }
-
- /* Verify pages are not already mapped */
- for (i = 0; i < npage; i++) {
- if (iommu_iova_to_phys(iommu->domain,
- iova + (i << PAGE_SHIFT))) {
- ret = -EBUSY;
- goto out_unpin;
- }
+ break;
}

- ret = iommu_map(iommu->domain, iova,
- (phys_addr_t)pfn << PAGE_SHIFT,
- npage << PAGE_SHIFT, prot);
+ /* Map it! */
+ ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
if (ret) {
- if (ret != -EBUSY ||
- map_try_harder(iommu, iova, pfn, npage, prot)) {
- goto out_unpin;
- }
+ vfio_unpin_pages(pfn, npage, prot, true);
+ break;
}

size = npage << PAGE_SHIFT;
+ dma->size += size;
+ }

- /*
- * Check if we abut a region below - nothing below 0.
- * This is the most likely case when mapping chunks of
- * physically contiguous regions within a virtual address
- * range. Update the abutting entry in place since iova
- * doesn't change.
- */
- if (likely(iova)) {
- struct vfio_dma *tmp;
- tmp = vfio_find_dma(iommu, iova - 1, 1);
- if (tmp && tmp->prot == prot &&
- tmp->vaddr + tmp->size == vaddr) {
- tmp->size += size;
- iova = tmp->iova;
- size = tmp->size;
- vaddr = tmp->vaddr;
- dma = tmp;
- }
- }
+ if (ret)
+ vfio_remove_dma(iommu, dma);

- /*
- * Check if we abut a region above - nothing above ~0 + 1.
- * If we abut above and below, remove and free. If only
- * abut above, remove, modify, reinsert.
- */
- if (likely(iova + size)) {
- struct vfio_dma *tmp;
- tmp = vfio_find_dma(iommu, iova + size, 1);
- if (tmp && tmp->prot == prot &&
- tmp->vaddr == vaddr + size) {
- vfio_remove_dma(iommu, tmp);
- if (dma) {
- dma->size += tmp->size;
- kfree(tmp);
- } else {
- size += tmp->size;
- tmp->size = size;
- tmp->iova = iova;
- tmp->vaddr = vaddr;
- vfio_insert_dma(iommu, tmp);
- dma = tmp;
- }
- }
- }
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static int vfio_bus_type(struct device *dev, void *data)
+{
+ struct vfio_domain *domain = data;
+
+ if (domain->bus && domain->bus != dev->bus)
+ return -EINVAL;
+
+ domain->bus = dev->bus;
+
+ return 0;
+}
+
+static int vfio_iommu_replay(struct vfio_iommu *iommu,
+ struct vfio_domain *domain)
+{
+ struct vfio_domain *d;
+ struct rb_node *n;
+ int ret;
+
+ /* Arbitrarily pick the first domain in the list for lookups */
+ d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
+ n = rb_first(&iommu->dma_list);
+
+ /* If there's not a domain, there better not be any mappings */
+ if (WARN_ON(n && !d))
+ return -EINVAL;
+
+ for (; n; n = rb_next(n)) {
+ struct vfio_dma *dma;
+ dma_addr_t iova;
+
+ dma = rb_entry(n, struct vfio_dma, node);
+ iova = dma->iova;

- if (!dma) {
- dma = kzalloc(sizeof(*dma), GFP_KERNEL);
- if (!dma) {
- iommu_unmap(iommu->domain, iova, size);
- ret = -ENOMEM;
- goto out_unpin;
+ while (iova < dma->iova + dma->size) {
+ phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
+ size_t size;
+
+ if (WARN_ON(!phys)) {
+ iova += PAGE_SIZE;
+ continue;
}

- dma->size = size;
- dma->iova = iova;
- dma->vaddr = vaddr;
- dma->prot = prot;
- vfio_insert_dma(iommu, dma);
- }
- }
+ size = PAGE_SIZE;

- WARN_ON(ret);
- mutex_unlock(&iommu->lock);
- return ret;
+ while (iova + size < dma->iova + dma->size &&
+ phys + size == iommu_iova_to_phys(d->domain,
+ iova + size))
+ size += PAGE_SIZE;

-out_unpin:
- vfio_unpin_pages(pfn, npage, prot, true);
+ ret = iommu_map(domain->domain, iova, phys,
+ size, dma->prot | domain->prot);
+ if (ret)
+ return ret;

-out:
- iova = map->iova;
- size = map->size;
- while ((dma = vfio_find_dma(iommu, iova, size))) {
- int r = vfio_remove_dma_overlap(iommu, iova,
- &size, dma);
- if (WARN_ON(r || !size))
- break;
+ iova += size;
+ }
}

- mutex_unlock(&iommu->lock);
- return ret;
+ return 0;
}

static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
struct vfio_iommu *iommu = iommu_data;
- struct vfio_group *group, *tmp;
+ struct vfio_group *group, *g;
+ struct vfio_domain *domain, *d;
int ret;

- group = kzalloc(sizeof(*group), GFP_KERNEL);
- if (!group)
- return -ENOMEM;
-
mutex_lock(&iommu->lock);

- list_for_each_entry(tmp, &iommu->group_list, next) {
- if (tmp->iommu_group == iommu_group) {
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ list_for_each_entry(g, &d->group_list, next) {
+ if (g->iommu_group != iommu_group)
+ continue;
+
mutex_unlock(&iommu->lock);
- kfree(group);
return -EINVAL;
}
}

- /*
- * TODO: Domain have capabilities that might change as we add
- * groups (see iommu->cache, currently never set). Check for
- * them and potentially disallow groups to be attached when it
- * would change capabilities (ugh).
- */
- ret = iommu_attach_group(iommu->domain, iommu_group);
- if (ret) {
- mutex_unlock(&iommu->lock);
- kfree(group);
- return ret;
+ group = kzalloc(sizeof(*group), GFP_KERNEL);
+ domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+ if (!group || !domain) {
+ ret = -ENOMEM;
+ goto out_free;
}

group->iommu_group = iommu_group;
- list_add(&group->next, &iommu->group_list);
+
+ /* Determine bus_type in order to allocate a domain */
+ ret = iommu_group_for_each_dev(iommu_group, domain, vfio_bus_type);
+ if (ret)
+ goto out_free;
+
+ domain->domain = iommu_domain_alloc(domain->bus);
+ if (!domain->domain) {
+ ret = -EIO;
+ goto out_free;
+ }
+
+ ret = iommu_attach_group(domain->domain, iommu_group);
+ if (ret)
+ goto out_domain;
+
+ INIT_LIST_HEAD(&domain->group_list);
+ list_add(&group->next, &domain->group_list);
+
+ if (!allow_unsafe_interrupts &&
+ !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
+ pr_warn("%s: No interrupt remapping support. Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+ __func__);
+ ret = -EPERM;
+ goto out_detach;
+ }
+
+ if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
+ domain->prot |= IOMMU_CACHE;
+
+ /* Try to match an existing compatible domain. */
+ list_for_each_entry(d, &iommu->domain_list, next) {
+ if (d->bus == domain->bus && d->prot == domain->prot) {
+ iommu_detach_group(domain->domain, iommu_group);
+ if (!iommu_attach_group(d->domain, iommu_group)) {
+ list_add(&group->next, &d->group_list);
+ iommu_domain_free(domain->domain);
+ kfree(domain);
+ mutex_unlock(&iommu->lock);
+ return 0;
+ }
+
+ ret = iommu_attach_group(domain->domain, iommu_group);
+ if (ret)
+ goto out_domain;
+ }
+ }
+
+ /* replay mappings on new domains */
+ ret = vfio_iommu_replay(iommu, domain);
+ if (ret)
+ goto out_detach;
+
+ list_add(&domain->next, &iommu->domain_list);

mutex_unlock(&iommu->lock);

return 0;
+
+out_detach:
+ iommu_detach_group(domain->domain, iommu_group);
+out_domain:
+ iommu_domain_free(domain->domain);
+out_free:
+ kfree(domain);
+ kfree(group);
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
+{
+ struct rb_node *node;
+
+ while ((node = rb_first(&iommu->dma_list)))
+ vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
}

static void vfio_iommu_type1_detach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
struct vfio_iommu *iommu = iommu_data;
+ struct vfio_domain *domain;
struct vfio_group *group;

mutex_lock(&iommu->lock);

- list_for_each_entry(group, &iommu->group_list, next) {
- if (group->iommu_group == iommu_group) {
- iommu_detach_group(iommu->domain, iommu_group);
+ list_for_each_entry(domain, &iommu->domain_list, next) {
+ list_for_each_entry(group, &domain->group_list, next) {
+ if (group->iommu_group != iommu_group)
+ continue;
+
+ iommu_detach_group(domain->domain, iommu_group);
list_del(&group->next);
kfree(group);
- break;
+ /*
+ * Group ownership provides privilege, if the group
+ * list is empty, the domain goes away. If it's the
+ * last domain, then all the mappings go away too.
+ */
+ if (list_empty(&domain->group_list)) {
+ if (list_is_singular(&iommu->domain_list))
+ vfio_iommu_unmap_unpin_all(iommu);
+ iommu_domain_free(domain->domain);
+ list_del(&domain->next);
+ kfree(domain);
+ }
+ goto done;
}
}

+done:
mutex_unlock(&iommu->lock);
}

@@ -769,40 +821,17 @@ static void *vfio_iommu_type1_open(unsigned long arg)
{
struct vfio_iommu *iommu;

- if (arg != VFIO_TYPE1_IOMMU)
+ if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
return ERR_PTR(-EINVAL);

iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
if (!iommu)
return ERR_PTR(-ENOMEM);

- INIT_LIST_HEAD(&iommu->group_list);
+ INIT_LIST_HEAD(&iommu->domain_list);
iommu->dma_list = RB_ROOT;
mutex_init(&iommu->lock);
-
- /*
- * Wish we didn't have to know about bus_type here.
- */
- iommu->domain = iommu_domain_alloc(&pci_bus_type);
- if (!iommu->domain) {
- kfree(iommu);
- return ERR_PTR(-EIO);
- }
-
- /*
- * Wish we could specify required capabilities rather than create
- * a domain, see what comes out and hope it doesn't change along
- * the way. Fortunately we know interrupt remapping is global for
- * our iommus.
- */
- if (!allow_unsafe_interrupts &&
- !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
- pr_warn("%s: No interrupt remapping support. Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
- __func__);
- iommu_domain_free(iommu->domain);
- kfree(iommu);
- return ERR_PTR(-EPERM);
- }
+ iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);

return iommu;
}
@@ -810,25 +839,24 @@ static void *vfio_iommu_type1_open(unsigned long arg)
static void vfio_iommu_type1_release(void *iommu_data)
{
struct vfio_iommu *iommu = iommu_data;
+ struct vfio_domain *domain, *domain_tmp;
struct vfio_group *group, *group_tmp;
- struct rb_node *node;

- list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
- iommu_detach_group(iommu->domain, group->iommu_group);
- list_del(&group->next);
- kfree(group);
- }
+ vfio_iommu_unmap_unpin_all(iommu);

- while ((node = rb_first(&iommu->dma_list))) {
- struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
- size_t size = dma->size;
- vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
- if (WARN_ON(!size))
- break;
+ list_for_each_entry_safe(domain, domain_tmp,
+ &iommu->domain_list, next) {
+ list_for_each_entry_safe(group, group_tmp,
+ &domain->group_list, next) {
+ iommu_detach_group(domain->domain, group->iommu_group);
+ list_del(&group->next);
+ kfree(group);
+ }
+ iommu_domain_free(domain->domain);
+ list_del(&domain->next);
+ kfree(domain);
}

- iommu_domain_free(iommu->domain);
- iommu->domain = NULL;
kfree(iommu);
}

@@ -858,7 +886,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,

info.flags = 0;

- info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
+ info.iova_pgsizes = vfio_pgsize_bitmap(iommu);

return copy_to_user((void __user *)arg, &info, minsz);

@@ -911,9 +939,6 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {

static int __init vfio_iommu_type1_init(void)
{
- if (!iommu_present(&pci_bus_type))
- return -ENODEV;
-
return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
}

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0fd47f5..460fdf2 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -23,6 +23,7 @@

#define VFIO_TYPE1_IOMMU 1
#define VFIO_SPAPR_TCE_IOMMU 2
+#define VFIO_TYPE1v2_IOMMU 3

/*
* The IOCTL interface is designed for extensibility by embedding the

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
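A userspace sketch of the run-coalescing step in vfio_iommu_replay() above, under stated assumptions. It walks an existing mapping page by page and extends each run while the next page is physically contiguous, so each iommu_map() call on the newly attached domain covers as large a range as possible. `struct fake_domain`, `fake_iova_to_phys()`, `struct run`, `replay_runs()` and the 4 KiB `SKETCH_PAGE_SIZE` are hypothetical stand-ins for d->domain, iommu_iova_to_phys() and PAGE_SIZE. The recorded runs take the place of the real iommu_map() calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page size for the sketch (4 KiB). */
#define SKETCH_PAGE_SIZE ((uint64_t)4096)

/*
 * Hypothetical stand-in for an existing IOMMU domain: one physical
 * address per IOVA page starting at 'base'; 0 means "not mapped".
 */
struct fake_domain {
	uint64_t base;
	const uint64_t *phys;
	size_t npages;
};

/* Stand-in for iommu_iova_to_phys(): returns 0 outside the table. */
static uint64_t fake_iova_to_phys(const struct fake_domain *d, uint64_t iova)
{
	uint64_t idx;

	if (iova < d->base)
		return 0;
	idx = (iova - d->base) / SKETCH_PAGE_SIZE;
	return idx < d->npages ? d->phys[idx] : 0;
}

/* One (iova, phys, size) range that would be handed to iommu_map(). */
struct run {
	uint64_t iova, phys, size;
};

/*
 * Walk [start, start + len) and record maximal physically contiguous
 * runs, mirroring the inner loops of vfio_iommu_replay(). Returns the
 * number of runs found; at most 'max' of them are stored in 'out'.
 */
static size_t replay_runs(const struct fake_domain *d, uint64_t start,
			  uint64_t len, struct run *out, size_t max)
{
	uint64_t iova = start, end = start + len;
	size_t n = 0;

	assert(out != NULL || max == 0);

	while (iova < end) {
		uint64_t phys = fake_iova_to_phys(d, iova);
		uint64_t size;

		if (!phys) {
			/* The patch WARNs here and skips one page. */
			iova += SKETCH_PAGE_SIZE;
			continue;
		}

		/* Grow the run while the next page follows physically. */
		size = SKETCH_PAGE_SIZE;
		while (iova + size < end &&
		       phys + size == fake_iova_to_phys(d, iova + size))
			size += SKETCH_PAGE_SIZE;

		if (n < max) {
			out[n].iova = iova;
			out[n].phys = phys;
			out[n].size = size;
		}
		n++;
		iova += size;
	}
	return n;
}
```

With pages backed by physical addresses 0x100000, 0x101000, 0x102000, 0x200000 and 0x201000, this yields two runs: one of three pages and one of two. As in the patch, a physical address of 0 is treated as "not mapped" and skipped one page at a time.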





==============================================================================
TOPIC: KVM: SVM: fix NMI window after iret
http://groups.google.com/group/linux.kernel/t/2825e0c4359b20ed?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 12:40 pm
From: Radim Krčmář


2014-01-17 12:18-0800, Greg KH:
> On Fri, Jan 17, 2014 at 08:52:42PM +0100, Radim Krčmář wrote:
> > We should open NMI window right after an iret, but SVM exits before it.
> > We wanted to single step using the trap flag and then open it.
> > (or we could emulate the iret instead)
> > We don't do it since commit 3842d135ff2 (likely), because the iret exit
> > handler does not request an event, so NMI window remains closed until
> > the next exit.
> >
> > Fix this by making KVM_REQ_EVENT request in the iret handler.
> >
> > Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> > ---
> > (btw. kvm-unit-tests weren't executed on SVM since Nov 2010, at least)
> >
> > arch/x86/kvm/svm.c | 1 +
> > 1 file changed, 1 insertion(+)
>
>
> <formletter>
>
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree. Please read Documentation/stable_kernel_rules.txt
> for how to do this properly.
>
> </formletter>

Welp, at the last second, I decided it is not that critical to have it
in stable and forgot to clean the git-send-email command line too.

Please ignore this patch in stable.





==============================================================================
TOPIC: f2fs: clean checkpatch warnings
http://groups.google.com/group/linux.kernel/t/b78c3ed4b7d93688?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 12:50 pm
From: Chris Fries


From: Chris Fries <cfries@motorola.com>

Fixed a variety of trivial checkpatch warnings. The only delta should
be some minor formatting on log strings that were split / too long.

Signed-off-by: Chris Fries <cfries@motorola.com>
---
fs/f2fs/data.c | 2 +-
fs/f2fs/debug.c | 2 +-
fs/f2fs/dir.c | 5 +++--
fs/f2fs/f2fs.h | 6 +++---
fs/f2fs/file.c | 2 +-
fs/f2fs/inode.c | 12 ++++++++----
fs/f2fs/node.c | 2 +-
fs/f2fs/recovery.c | 12 ++++++------
fs/f2fs/segment.c | 6 ++++--
fs/f2fs/segment.h | 13 +++++++------
fs/f2fs/super.c | 8 +++++---
11 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index bda889e..f5fac16 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -792,7 +792,7 @@ static int f2fs_write_data_page(struct page *page,
int err = 0;
struct f2fs_io_info fio = {
.type = DATA,
- .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC: WRITE,
+ .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
};

if (page->index < end_index)
diff --git a/fs/f2fs/debug.c b/fs/f2fs/debug.c
index 6e4ac9a..63cb7e2 100644
--- a/fs/f2fs/debug.c
+++ b/fs/f2fs/debug.c
@@ -245,7 +245,7 @@ static int stat_show(struct seq_file *s, void *v)
seq_printf(s, " - node blocks : %d\n", si->node_blks);
seq_printf(s, "\nExtent Hit Ratio: %d / %d\n",
si->hit_ext, si->total_ext);
- seq_printf(s, "\nBalancing F2FS Async:\n");
+ seq_puts(s, "\nBalancing F2FS Async:\n");
seq_printf(s, " - nodes: %4d in %4d\n",
si->ndirty_node, si->node_pages);
seq_printf(s, " - dents: %4d in dirs:%4d\n",
diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index f815ca0..cd055b6 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -430,7 +430,8 @@ next:
* Caller should grab and release a rwsem by calling f2fs_lock_op() and
* f2fs_unlock_op().
*/
-int __f2fs_add_link(struct inode *dir, const struct qstr *name, struct inode *inode)
+int __f2fs_add_link(struct inode *dir, const struct qstr *name,
+ struct inode *inode)
{
unsigned int bit_pos;
unsigned int level;
@@ -631,7 +632,7 @@ static int f2fs_readdir(struct file *file, struct dir_context *ctx)

bit_pos = ((unsigned long)ctx->pos % NR_DENTRY_IN_BLOCK);

- for ( ; n < npages; n++) {
+ for (; n < npages; n++) {
dentry_page = get_lock_data_page(inode, n);
if (IS_ERR(dentry_page))
continue;
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index ee304fb..5ab3981 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -607,9 +607,9 @@ static inline int check_nid_range(struct f2fs_sb_info *sbi, nid_t nid)
static inline int F2FS_HAS_BLOCKS(struct inode *inode)
{
if (F2FS_I(inode)->i_xattr_nid)
- return (inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS + 1);
+ return inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS + 1;
else
- return (inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS);
+ return inode->i_blocks > F2FS_DEFAULT_ALLOCATED_BLOCKS;
}

static inline bool inc_valid_block_count(struct f2fs_sb_info *sbi,
@@ -1231,7 +1231,7 @@ struct f2fs_stat_info {

static inline struct f2fs_stat_info *F2FS_STAT(struct f2fs_sb_info *sbi)
{
- return (struct f2fs_stat_info*)sbi->stat_info;
+ return (struct f2fs_stat_info *)sbi->stat_info;
}

#define stat_inc_call_count(si) ((si)->call_count++)
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 14511b0..85e91ca 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -202,7 +202,7 @@ int truncate_data_blocks_range(struct dnode_of_data *dn, int count)
raw_node = F2FS_NODE(dn->node_page);
addr = blkaddr_in_node(raw_node) + ofs;

- for ( ; count > 0; count--, addr++, dn->ofs_in_node++) {
+ for (; count > 0; count--, addr++, dn->ofs_in_node++) {
block_t blkaddr = le32_to_cpu(*addr);
if (blkaddr == NULL_ADDR)
continue;
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index ffa4c6d..4d67ed7 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -42,9 +42,11 @@ static void __get_inode_rdev(struct inode *inode, struct f2fs_inode *ri)
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode) ||
S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
if (ri->i_addr[0])
- inode->i_rdev = old_decode_dev(le32_to_cpu(ri->i_addr[0]));
+ inode->i_rdev =
+ old_decode_dev(le32_to_cpu(ri->i_addr[0]));
else
- inode->i_rdev = new_decode_dev(le32_to_cpu(ri->i_addr[1]));
+ inode->i_rdev =
+ new_decode_dev(le32_to_cpu(ri->i_addr[1]));
}
}

@@ -52,11 +54,13 @@ static void __set_inode_rdev(struct inode *inode, struct f2fs_inode *ri)
{
if (S_ISCHR(inode->i_mode) || S_ISBLK(inode->i_mode)) {
if (old_valid_dev(inode->i_rdev)) {
- ri->i_addr[0] = cpu_to_le32(old_encode_dev(inode->i_rdev));
+ ri->i_addr[0] =
+ cpu_to_le32(old_encode_dev(inode->i_rdev));
ri->i_addr[1] = 0;
} else {
ri->i_addr[0] = 0;
- ri->i_addr[1] = cpu_to_le32(new_encode_dev(inode->i_rdev));
+ ri->i_addr[1] =
+ cpu_to_le32(new_encode_dev(inode->i_rdev));
ri->i_addr[2] = 0;
}
}
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index b8c9301..226a05a 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1196,7 +1196,7 @@ static int f2fs_write_node_page(struct page *page,
struct node_info ni;
struct f2fs_io_info fio = {
.type = NODE,
- .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC: WRITE,
+ .rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE,
};

if (unlikely(sbi->por_doing))
diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
index 655791e..976a7a9 100644
--- a/fs/f2fs/recovery.c
+++ b/fs/f2fs/recovery.c
@@ -95,9 +95,9 @@ out_unmap_put:
kunmap(page);
f2fs_put_page(page, 0);
out:
- f2fs_msg(inode->i_sb, KERN_NOTICE, "recover_inode and its dentry: "
- "ino = %x, name = %s, dir = %lx, err = %d",
- ino_of_node(ipage), raw_inode->i_name,
+ f2fs_msg(inode->i_sb, KERN_NOTICE,
+ "%s: ino = %x, name = %s, dir = %lx, err = %d",
+ __func__, ino_of_node(ipage), raw_inode->i_name,
IS_ERR(dir) ? 0 : dir->i_ino, err);
return err;
}
@@ -366,9 +366,9 @@ err:
f2fs_put_dnode(&dn);
f2fs_unlock_op(sbi);
out:
- f2fs_msg(sbi->sb, KERN_NOTICE, "recover_data: ino = %lx, "
- "recovered_data = %d blocks, err = %d",
- inode->i_ino, recovered, err);
+ f2fs_msg(sbi->sb, KERN_NOTICE,
+ "recover_data: ino = %lx, recovered = %d blocks, err = %d",
+ inode->i_ino, recovered, err);
return err;
}

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index a934e6f..e82423fb 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -946,7 +946,8 @@ void write_data_page(struct page *page, struct dnode_of_data *dn,
do_write_page(sbi, page, dn->data_blkaddr, new_blkaddr, &sum, fio);
}

-void rewrite_data_page(struct page *page, block_t old_blkaddr, struct f2fs_io_info *fio)
+void rewrite_data_page(struct page *page, block_t old_blkaddr,
+ struct f2fs_io_info *fio)
{
struct inode *inode = page->mapping->host;
struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);
@@ -1647,7 +1648,8 @@ static void build_sit_entries(struct f2fs_sb_info *sbi)

mutex_lock(&curseg->curseg_mutex);
for (i = 0; i < sits_in_cursum(sum); i++) {
- if (le32_to_cpu(segno_in_journal(sum, i)) == start) {
+ if (le32_to_cpu(segno_in_journal(sum, i))
+ == start) {
sit = sit_in_journal(sum, i);
mutex_unlock(&curseg->curseg_mutex);
goto got_it;
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
index e9a10bd..5731682 100644
--- a/fs/f2fs/segment.h
+++ b/fs/f2fs/segment.h
@@ -448,8 +448,8 @@ static inline int reserved_sections(struct f2fs_sb_info *sbi)

static inline bool need_SSR(struct f2fs_sb_info *sbi)
{
- return ((prefree_segments(sbi) / sbi->segs_per_sec)
- + free_sections(sbi) < overprovision_sections(sbi));
+ return (prefree_segments(sbi) / sbi->segs_per_sec)
+ + free_sections(sbi) < overprovision_sections(sbi);
}

static inline bool has_not_enough_free_secs(struct f2fs_sb_info *sbi, int freed)
@@ -460,18 +460,19 @@ static inline bool has_not_enough_free_secs(struct f2fs_sb_info *sbi, int freed)
if (unlikely(sbi->por_doing))
return false;

- return ((free_sections(sbi) + freed) <= (node_secs + 2 * dent_secs +
- reserved_sections(sbi)));
+ return (free_sections(sbi) + freed) <= (node_secs + 2 * dent_secs +
+ reserved_sections(sbi));
}

static inline bool excess_prefree_segs(struct f2fs_sb_info *sbi)
{
- return (prefree_segments(sbi) > SM_I(sbi)->rec_prefree_segments);
+ return prefree_segments(sbi) > SM_I(sbi)->rec_prefree_segments;
}

static inline int utilization(struct f2fs_sb_info *sbi)
{
- return div_u64((u64)valid_user_blocks(sbi) * 100, sbi->user_block_count);
+ return div_u64((u64)valid_user_blocks(sbi) * 100,
+ sbi->user_block_count);
}

/*
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index b070f30..1a85f83 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -535,7 +535,8 @@ static int segment_info_seq_show(struct seq_file *seq, void *offset)
{
struct super_block *sb = seq->private;
struct f2fs_sb_info *sbi = F2FS_SB(sb);
- unsigned int total_segs = le32_to_cpu(sbi->raw_super->segment_count_main);
+ unsigned int total_segs =
+ le32_to_cpu(sbi->raw_super->segment_count_main);
int i;

for (i = 0; i < total_segs; i++) {
@@ -816,8 +817,9 @@ retry:
/* sanity checking of raw super */
if (sanity_check_raw_super(sb, *raw_super)) {
brelse(*raw_super_buf);
- f2fs_msg(sb, KERN_ERR, "Can't find a valid F2FS filesystem "
- "in %dth superblock", block + 1);
+ f2fs_msg(sb, KERN_ERR,
+ "Can't find valid F2FS filesystem in %dth superblock",
+ block + 1);
if (block == 0) {
block++;
goto retry;
--
1.8.0






==============================================================================
TOPIC: Phase out pci_enable_msi_block()
http://groups.google.com/group/linux.kernel/t/7ded004c0804c466?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 1:10 pm
From: Bjorn Helgaas


On Fri, Jan 17, 2014 at 9:02 AM, Alexander Gordeev <agordeev@redhat.com> wrote:
> This series is against "next" branch in Bjorn's repo:
> git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git
>
> Changes from v1 to v2:
> - added a regression fix "ahci: Fix broken fallback to single
> MSI mode" as patch 1/9;
> - the series is reordered to move the regression fix in front;
> - at Bjorn's request pci_enable_msi() is un-deprecated;
> - as result, pci_enable_msi_range(pdev, 1, 1) styled calls
> rolled back to pci_enable_msi(pdev);
> - nvme bug fix moved out as a separate patch 5/9 "nvme: Fix
> invalid call to irq_set_affinity_hint()"
> - patches changelog elaborated a bit;
>
> Bjorn,
>
> As the release is supposedly this weekend, do you prefer
> the patches to go to your tree or to individual trees after
> the release?

I'd be happy to merge them, except for the fact that they probably
wouldn't have any time in -next before I ask Linus to pull them. So
how about if we wait until after the release, ask the area maintainers
to take them, and if they don't take them, I'll put them in my tree
for v3.15?

Bjorn





==============================================================================
TOPIC: linux-next: Tree for Jan 14 (lowpan, 802.15.4)
http://groups.google.com/group/linux.kernel/t/a5d16a1f0a317f20?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 1:10 pm
From: Dmitry Eremin-Solenikov


Hello,

On Fri, Jan 17, 2014 at 11:13 PM, Stephen Warren <swarren@wwwdotorg.org> wrote:
> On 01/14/2014 03:54 PM, Dmitry Eremin-Solenikov wrote:
>> Hello,
>>
>>
>> On Tue, Jan 14, 2014 at 9:49 PM, Randy Dunlap <rdunlap@infradead.org> wrote:
>>>
>>> On 01/13/2014 09:51 PM, Stephen Rothwell wrote:
>>>> Hi all,
>>>>
>>>> This tree fails (more than usual) the powerpc allyesconfig build.
>>>>
>>>> Changes since 20140113:
>>>>
>>>
>>>
>>> on i386:
>>>
>>> net/built-in.o: In function `header_create':
>>> 6lowpan.c:(.text+0x166149): undefined reference to `lowpan_header_compress'
>>> net/built-in.o: In function `bt_6lowpan_recv':
>>> (.text+0x166b3c): undefined reference to `lowpan_process_data'
>>
>> Ah, nice Makefile hack there.
>> David, Marcel, could you please consider the attached patch.
>
> I think you forgot to "git add net/bluetooth/Makefile" into that patch;
> don't you need the following too (I certainly do, to build next-20140117)
>
>> diff --git a/net/bluetooth/Makefile b/net/bluetooth/Makefile
>> index cc6827e2ce68..80cb215826e8 100644
>> --- a/net/bluetooth/Makefile
>> +++ b/net/bluetooth/Makefile
>> @@ -12,8 +12,4 @@ bluetooth-y := af_bluetooth.o hci_core.o hci_conn.o hci_event.o mgmt.o \
>> hci_sock.o hci_sysfs.o l2cap_core.o l2cap_sock.o smp.o sco.o lib.o \
>> a2mp.o amp.o 6lowpan.o
>>
>> -ifeq ($(CONFIG_IEEE802154_6LOWPAN),)
>> - bluetooth-y += ../ieee802154/6lowpan_iphc.o
>> -endif
>> -
>> subdir-ccflags-y += -D__CHECK_ENDIAN__
>
> Should I send this as a separate followup patch?

Yes, please. I forgot to add it to the patch.

--
With best wishes
Dmitry





==============================================================================
TOPIC: dcache: fix d_splice_alias handling of aliases
http://groups.google.com/group/linux.kernel/t/599bb447f08d6704?hl=en
==============================================================================

== 1 of 2 ==
Date: Fri, Jan 17 2014 1:10 pm
From: "J. Bruce Fields"


On Fri, Jan 17, 2014 at 10:39:17AM -0500, J. Bruce Fields wrote:
> On Fri, Jan 17, 2014 at 04:17:23AM -0800, Christoph Hellwig wrote:
> > Also the inode == NULL case really should be split out from
> > d_materialise_unique into a separate helper. It shares almost no
> > code, is entirely undocumented to the point that I don't really
> > understand what the purpose is, and the only caller that can get
> > there (fuse) already branches around that case in the caller anyway.
>
> I think I see what you mean, I can fix that.

Actually:

- two callers (fuse and nfs) take advantage of the NULL case.

- d_splice_alias handles inode == NULL in the same way, and
almost every caller takes advantage of that.

So at least we wouldn't want to actually make the caller handle this
case.

But maybe there's still some opportunity for cleanup or documentation.

--b.




== 2 of 2 ==
Date: Fri, Jan 17 2014 1:30 pm
From: "J. Bruce Fields"


On Fri, Jan 17, 2014 at 04:03:43PM -0500, J. Bruce Fields wrote:
> - d_splice_alias handles inode == NULL in the same way,

Actually, not exactly; simplifying a bit, in the NULL case they do:

d_splice_alias:

__d_instantiate(dentry, NULL);
security_d_instantiate(dentry, NULL);
if (d_unhashed(dentry))
d_rehash(dentry);

d_materialise_unique:

BUG_ON(!d_unhashed(dentry));

__d_instantiate(dentry, NULL);
d_rehash(dentry);
security_d_instantiate(dentry, NULL);

and a comment on d_splice_alias says

Cluster filesystems may call this function with a negative,
hashed dentry. In that case, we know that the inode will be a
regular file, and also this will only occur during atomic_open.

I don't understand those callers. But I guess it would be easy enough
to handle in d_materialise_unique.

--b.





==============================================================================
TOPIC: Why is (2 < 2) true? Is it a gcc bug?
http://groups.google.com/group/linux.kernel/t/587453bf46273e70?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 1:10 pm
From: Markus Trippelsdorf


On 2014.01.17 at 11:58 -0800, Alexei Starovoitov wrote:
> On Fri, Jan 17, 2014 at 9:58 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Fri, Jan 17, 2014 at 5:37 AM, Dorau, Lukasz <lukasz.dorau@intel.com> wrote:
> >> Hi
> >>
> >> My story is very simple...
> >> I applied the following patch:
> >>
> >> diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c
> >> --- a/drivers/scsi/isci/init.c
> >> +++ b/drivers/scsi/isci/init.c
> >> @@ -698,8 +698,11 @@ static int isci_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >> if (err)
> >> goto err_host_alloc;
> >>
> >> - for_each_isci_host(i, isci_host, pdev)
> >> + for_each_isci_host(i, isci_host, pdev) {
> >> + pr_err("(%d < %d) == %d\n",\
> >> + i, SCI_MAX_CONTROLLERS, (i < SCI_MAX_CONTROLLERS));
> >> scsi_scan_host(to_shost(isci_host));
> >> + }
> >>
> >> return 0;
> >>
> >> --
> >> 1.8.3.1
> >>
> >> Then I issued the command 'modprobe isci' on a platform with two SCU controllers (Patsburg D or T chipset)
> >> and received the following, very strange, output:
> >>
> >> (0 < 2) == 1
> >> (1 < 2) == 1
> >> (2 < 2) == 1
> >>
> >> Can anyone explain why (2 < 2) is true? Is it a gcc bug?
> >
> > gcc sees that i < array_size is the same as i < 2 as part of loop condition, so
> > it optimizes (i < sci_max_controllers) into constant 1.
> > and emits printk like:
> > printk ("\13(%d < %d) == %d\n", i_382, 2, 1);
> >
> >> (The kernel was compiled using gcc version 4.8.2.)
> >
> > it actually looks to be gcc 4.8 bug.
> > Can you try gcc 4.7 ?
> >
>
> It is an interesting GCC 4.8 bug,
> since it seems to expose issues in two compiler passes.
>
> here is test case:
>
> struct isci_host;
> struct isci_orom;
>
> struct isci_pci_info {
> struct isci_host *hosts[2];
> struct isci_orom *orom;
> } v = {{(struct isci_host *)1,(struct isci_host *)1}, 0};
>
> int printf(const char *fmt, ...);
>
> int isci_pci_probe()
> {
> int i;
> struct isci_host *isci_host;
>
> for (i = 0, isci_host = v.hosts[i];
> i < 2 && isci_host;
> isci_host = v.hosts[++i]) {
> printf("(%d < %d) == %d\n", i, 2, (i < 2));
> }
>
> return 0;
> }
>
> int main()
> {
> isci_pci_probe();
> }
>
> $ gcc bug.c
> $./a.out
> (0 < 2) == 1
> (1 < 2) == 1
> $ gcc bug.c -O2
> $ ./a.out
> (0 < 2) == 1
> (1 < 2) == 1
> Segmentation fault (core dumped)

Your testcase is invalid:

markus@x4 tmp % clang -fsanitize=undefined -Wall -Wextra -O2 bug.c
markus@x4 tmp % ./a.out
(0 < 2) == 1
(1 < 2) == 1
bug.c:16:20: runtime error: index 2 out of bounds for type 'struct isci_host *[2]'

As Jakub Jelinek said on IRC, changing the loop to e.g.:

for (i = 0;
i < 2 && (isci_host = v.hosts[i]);
i++) {

fixes the issue.
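A small userspace sketch of why the loop shape matters. In the test case, the step expression `isci_host = v.hosts[++i]` runs before `i < 2` is re-checked, so on the final pass it reads `v.hosts[2]`, one element past the end of the array. That read is undefined behavior, so the compiler may assume it never happens and, from that, that `i < 2` holds whenever the body runs, which would be consistent with the folded `(2 < 2) == 1` output. The sketch replaces the real array access with a bounds-checked accessor that only records the highest index requested, so both loop shapes can be compared without invoking undefined behavior. `struct host`, `hosts[]`, `host_at()`, `NHOSTS` and `max_index_requested` are hypothetical names, not the isci driver's.

```c
#include <assert.h>
#include <stddef.h>

#define NHOSTS 2

struct host {
	int id;
};

static struct host h0 = { 0 }, h1 = { 1 };
static struct host *hosts[NHOSTS] = { &h0, &h1 };

/* Highest index a loop asked for; lets us observe the bad read safely. */
static int max_index_requested;

/*
 * Bounds-checked accessor: records the index and returns NULL past the
 * end instead of reading out of bounds (undefined behavior in C).
 */
static struct host *host_at(int i)
{
	if (i > max_index_requested)
		max_index_requested = i;
	return (i >= 0 && i < NHOSTS) ? hosts[i] : NULL;
}

/*
 * Loop shape from the test case: the step expression host_at(++i) runs
 * before the i < NHOSTS check, so it requests index NHOSTS.
 */
static int visit_original(void)
{
	int i, visited = 0;
	struct host *h;

	max_index_requested = -1;
	for (i = 0, h = host_at(i); i < NHOSTS && h; h = host_at(++i))
		if (h->id == i)
			visited++;
	return visited;
}

/*
 * Loop shape suggested on the list: the bound is tested first and &&
 * short-circuits, so no index past NHOSTS - 1 is ever requested.
 */
static int visit_fixed(void)
{
	int i, visited = 0;
	struct host *h;

	max_index_requested = -1;
	for (i = 0; i < NHOSTS && (h = host_at(i)); i++)
		if (h->id == i)
			visited++;
	return visited;
}
```

Both functions visit the same two hosts, but visit_original() leaves max_index_requested at 2 (one past the end), while visit_fixed() stops at 1.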

--
Markus





==============================================================================
TOPIC: x86, mm, perf: Allow recursive faults from interrupts
http://groups.google.com/group/linux.kernel/t/3e53decd100762a7?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 1:10 pm
From: Andy Lutomirski


On Fri, Jan 17, 2014 at 12:08 PM, Waiman Long <waiman.long@hp.com> wrote:
> On 01/17/2014 02:17 PM, Andy Lutomirski wrote:
>>
> >> On Fri, Jan 17, 2014 at 10:10 AM, Waiman Long <waiman.long@hp.com> wrote:
>>>
>>> On 01/16/2014 08:39 AM, tip-bot for Peter Zijlstra wrote:
>>>>
>>>> Commit-ID: c026b3591e4f2a4993df773183704bb31634e0bd
> >>>> Gitweb: http://git.kernel.org/tip/c026b3591e4f2a4993df773183704bb31634e0bd
> >>>> Author: Peter Zijlstra <peterz@infradead.org>
> >>>> AuthorDate: Fri, 10 Jan 2014 21:06:03 +0100
> >>>> Committer: Ingo Molnar <mingo@kernel.org>
> >>>> CommitDate: Thu, 16 Jan 2014 09:19:48 +0100
> >>>>
> >>>> x86, mm, perf: Allow recursive faults from interrupts
> >>>>
> >>>> Waiman managed to trigger a PMI while in an emulate_vsyscall() fault;
> >>>> the PMI in turn managed to trigger a fault while obtaining a stack
> >>>> trace. This triggered the sig_on_uaccess_error recursive fault logic
> >>>> and killed the process dead.
> >>>>
> >>>> Fix this by explicitly excluding interrupts from the recursive fault
> >>>> logic.
> >>>>
> >>>> Reported-and-Tested-by: Waiman Long <waiman.long@hp.com>
> >>>> Fixes: e00b12e64be9 ("perf/x86: Further optimize copy_from_user_nmi()")
> >>>> Cc: Aswin Chandramouleeswaran <aswin@hp.com>
> >>>> Cc: Scott J Norton <scott.norton@hp.com>
> >>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >>>> Cc: Andy Lutomirski <luto@amacapital.net>
> >>>> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
> >>>> Cc: Andrew Morton <akpm@linux-foundation.org>
> >>>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> >>>> Link: http://lkml.kernel.org/r/20140110200603.GJ7572@laptop.programming.kicks-ass.net
> >>>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>>>> ---
>>>> arch/x86/mm/fault.c | 18 ++++++++++++++++++
>>>> 1 file changed, 18 insertions(+)
>>>>
>>>>
>>> Will that be picked up by Linus as it is a 3.13 regression?
>>
>> Does anyone actually know why this regressed recently? The buggy code
>> has been there for quite a while.
>>
>> --Andy
>
>
> Yes, the bug was there for a while, but a recent change by Peter (see the
> "Fixes:" line above) made it much easier to hit it.

Thanks!

So I feel slightly better now -- this particular bug didn't actually
exist when I wrote the offending code :) But that also means that
this should really be fixed in 3.13.

--Andy
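The rule in the commit message (exclude interrupts from the recursive-fault logic) can be modelled as a small decision function. This is a hypothetical user-space sketch, not the code in arch/x86/mm/fault.c: the enum, struct and function names are invented for illustration, and the model only captures the ordering of the two checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space model; these names are NOT the kernel's.
 * It only mirrors the rule described in the commit message.
 */
enum fault_outcome {
	FAULT_FIXUP,       /* plain exception-table fixup: the access just fails */
	FAULT_SIGNAL_TASK, /* recursive-fault logic: signal (in the report, kill) the task */
};

struct fault_ctx {
	bool in_interrupt;         /* fault raised from interrupt context, e.g. a PMI */
	bool sig_on_uaccess_error; /* task is inside vsyscall emulation */
};

static enum fault_outcome handle_kernel_fault(const struct fault_ctx *ctx)
{
	assert(ctx != NULL);

	/*
	 * The fix: a fault taken from interrupt context always gets the
	 * plain fixup, so the recursive-fault rule only applies to faults
	 * from task context.
	 */
	if (ctx->in_interrupt)
		return FAULT_FIXUP;

	/* Task context inside vsyscall emulation: treat as a recursive fault. */
	if (ctx->sig_on_uaccess_error)
		return FAULT_SIGNAL_TASK;

	return FAULT_FIXUP;
}
```

In this model, dropping the first check makes the case Waiman hit (in_interrupt and sig_on_uaccess_error both true) end in FAULT_SIGNAL_TASK, which matches the symptom described in the commit message.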





==============================================================================
TOPIC: Why does kexec use device_shutdown rather than unbind them
http://groups.google.com/group/linux.kernel/t/45a18d6ad21d78a6?hl=en
==============================================================================

== 1 of 1 ==
Date: Fri, Jan 17 2014 1:10 pm
From: Benjamin Herrenschmidt


On Fri, 2014-01-17 at 09:13 -0500, Vivek Goyal wrote:
> On Fri, Jan 17, 2014 at 04:59:13PM +1100, Benjamin Herrenschmidt wrote:
> > On Thu, 2014-01-16 at 20:52 -0800, Eric W. Biederman wrote:
> > >
> > > I think we have largely survived until now because kdump is so popular
> > > and kdump winds up having to reinitialize devices from any random
> > > state.
> >
> > kdump also doesn't care too much if the device is still DMA'ing to the
> > old kernel memory :-)
>
> In principle kdump does not care about ongoing DMAs but in practice it
> is giving us some headaches with the IOMMU. Various kinds of issues crop up
> during IOMMU initialization in the second kernel while DMA is ongoing, and
> unfortunately no good solution has made it upstream yet.
>
> Well, ongoing DMA and IOMMU seems to be orthogonal to using ->remove()
> in kexec. So I will stop here. :-)

Right, it's an orthogonal problem. I think hot-resetting the bus might
solve it too, though. It's even worse on ppc, because the resulting IOMMU
errors trigger the "EEH freezes" we have here, which block the devices
out, etc.

Ben.

> Thanks
> Vivek







==============================================================================
TOPIC: numa,sched: tracepoints for NUMA balancing active nodemask changes
http://groups.google.com/group/linux.kernel/t/34ca8de3d5507b71?hl=en
==============================================================================

== 1 of 8 ==
Date: Fri, Jan 17 2014 1:20 pm
From: riel@redhat.com


From: Rik van Riel <riel@redhat.com>

Being able to see how the active nodemask changes over time, and why,
can be quite useful.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 8 ++++++--
2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 67e1bbf..91726b6 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -530,6 +530,40 @@ TRACE_EVENT(sched_swap_numa,
__entry->dst_pid, __entry->dst_tgid, __entry->dst_ngid,
__entry->dst_cpu, __entry->dst_nid)
);
+
+TRACE_EVENT(update_numa_active_nodes_mask,
+
+ TP_PROTO(int pid, int gid, int nid, int set, long faults, long max_faults),
+
+ TP_ARGS(pid, gid, nid, set, faults, max_faults),
+
+ TP_STRUCT__entry(
+ __field( pid_t, pid)
+ __field( pid_t, gid)
+ __field( int, nid)
+ __field( int, set)
+ __field( long, faults)
+ __field( long, max_faults)
+ ),
+
+ TP_fast_assign(
+ __entry->pid = pid;
+ __entry->gid = gid;
+ __entry->nid = nid;
+ __entry->set = set;
+ __entry->faults = faults;
+ __entry->max_faults = max_faults;
+ ),
+
+ TP_printk("pid=%d gid=%d nid=%d set=%d faults=%ld max_faults=%ld",
+ __entry->pid,
+ __entry->gid,
+ __entry->nid,
+ __entry->set,
+ __entry->faults,
+ __entry->max_faults)
+
+);
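To see what one record of the new event prints, the following user-space sketch mirrors the TP_STRUCT__entry fields and reuses the TP_printk() format string from the patch. The struct and function names are hypothetical; in the kernel, the tracing core does this formatting, not code like this.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical mirror of the fields declared in TP_STRUCT__entry above. */
struct numa_active_nodes_entry {
	pid_t pid;
	pid_t gid;
	int nid;
	int set;
	long faults;
	long max_faults;
};

/*
 * Formats one record with the TP_printk() format string from the patch.
 * Returns the string length, or -1 on an encoding error or truncation.
 */
static int format_numa_active_nodes(char *buf, size_t len,
				    const struct numa_active_nodes_entry *e)
{
	int n = snprintf(buf, len,
			 "pid=%d gid=%d nid=%d set=%d faults=%ld max_faults=%ld",
			 (int)e->pid, (int)e->gid, e->nid, e->set,
			 e->faults, e->max_faults);

	if (n < 0 || (size_t)n >= len)
		return -1;
	assert((size_t)n == strlen(buf)); /* no truncation happened */
	return n;
}
```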
