linux.kernel - 26 new messages in 23 topics - digest
linux.kernel
http://groups.google.com/group/linux.kernel?hl=en
linux.kernel@googlegroups.com
Today's topics:
* linux-next: build failure after merge of the target-updates tree - 1
messages, 1 author
http://groups.google.com/group/linux.kernel/t/49b0db6b50f07f48?hl=en
* arm64: audit: Add 32-bit (compat) syscall support - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/1000c63bf4615e52?hl=en
* Documentation: move all DMA documentations into Documentaion/dma - 1
messages, 1 author
http://groups.google.com/group/linux.kernel/t/c13767ccffe6446b?hl=en
* dmaengine: Add DMA_PRIVATE to BCM2835 driver - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/75f44be67fd320ae?hl=en
* backlight: turn backlight on/off when necessary - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/66a0e95b552bd82d?hl=en
* linux-next: build failure after merge of the gpio tree - 1 messages, 1
author
http://groups.google.com/group/linux.kernel/t/2ac85584c4bcbf4e?hl=en
* of: fix of_update_property() - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/9a8655a1d0f481a2?hl=en
* DM: dm-insitu-comp: a compressed DM target for SSD - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/a407f4d2ed42a748?hl=en
* mm: Improve documentation of page_order - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/19098c56579a03ea?hl=en
* ping [PATCH v3] WAN: Adding support for Lantiq PEF2256 E1 chipset (FALC56) -
1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/45d385c81e5849cb?hl=en
* linux-next: Tree for Jan 20 - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/a15ec5a4e12361ec?hl=en
* x86, quirks: Add workaround for AMD F16h Erratum792 - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/51623ef7875b5521?hl=en
* fix module autoloading for ACPI enumerated devices - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/f9818216bd57b323?hl=en
* doc/kmemcheck: add kmemcheck to kernel-parameters - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/88e911b0e76495c9?hl=en
* dma: Add Xilinx AXI Video Direct Memory Access Engine driver support - 3
messages, 2 authors
http://groups.google.com/group/linux.kernel/t/0b28aeecccd70bb0?hl=en
* clk: export __clk_get_hw for re-use in others - 2 messages, 2 authors
http://groups.google.com/group/linux.kernel/t/36ad3107188cecfe?hl=en
* mm/zswap: Check all pool pages instead of one pool pages - 1 messages, 1
author
http://groups.google.com/group/linux.kernel/t/81cd07f00a212f3a?hl=en
* ARM64 / ACPI: Introduce some PCI functions when PCI is enabled - 1 messages,
1 author
http://groups.google.com/group/linux.kernel/t/5f4cccbbd3b92300?hl=en
* dmaengine: Add MOXA ART DMA engine driver - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/09c0ebd99fc97169?hl=en
* Adding hyperv.h to uapi headers - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/9109a55de9e0039d?hl=en
* Adding makefile for tools/hv - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/a45d1a1bc6149eb7?hl=en
* Staging : comedi : comedidev.h Fixed warning space coding style issue - 1
messages, 1 author
http://groups.google.com/group/linux.kernel/t/5473220d247721f2?hl=en
* gpio: mcp23s08: fix casting caused build warning - 1 messages, 1 author
http://groups.google.com/group/linux.kernel/t/63baaf004b3dd79d?hl=en
==============================================================================
TOPIC: linux-next: build failure after merge of the target-updates tree
http://groups.google.com/group/linux.kernel/t/49b0db6b50f07f48?hl=en
==============================================================================
== 1 of 1 ==
Date: Sun, Jan 19 2014 9:30 pm
From: Stephen Rothwell
Hi Nicholas,
After merging the target-updates tree, today's linux-next build (x86_64
allmodconfig) failed like this:
drivers/target/target_core_iblock.c: In function 'iblock_alloc_bip':
drivers/target/target_core_iblock.c:646:5: error: 'struct bio_integrity_payload' has no member named 'bip_size'
bip->bip_size = (cmd->data_length / dev->dev_attrib.block_size) *
^
drivers/target/target_core_iblock.c:648:5: error: 'struct bio_integrity_payload' has no member named 'bip_sector'
bip->bip_sector = bio->bi_sector;
^
drivers/target/target_core_iblock.c:648:23: error: 'struct bio' has no member named 'bi_sector'
bip->bip_sector = bio->bi_sector;
^
In file included from include/linux/printk.h:243:0,
from include/linux/kernel.h:13,
from include/linux/cache.h:4,
from include/linux/time.h:4,
from include/linux/ktime.h:24,
from include/linux/timer.h:5,
from drivers/target/target_core_iblock.c:29:
drivers/target/target_core_iblock.c:650:52: error: 'struct bio_integrity_payload' has no member named 'bip_size'
pr_debug("IBLOCK BIP Size: %u Sector: %llu\n", bip->bip_size,
^
drivers/target/target_core_iblock.c:651:27: error: 'struct bio_integrity_payload' has no member named 'bip_sector'
(unsigned long long)bip->bip_sector);
^
Caused by commit ecebbf6ccbca ("target/iblock: Add blk_integrity + BIP
passthrough support") interacting with commits 4f024f3797c4 ("block:
Abstract out bvec iterator") and d57a5f7c6605 ("bio-integrity: Convert to
bvec_iter") from the block tree.
I applied the following merge fix patch:
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 20 Jan 2014 16:21:31 +1100
Subject: [PATCH] target/iblock: merge fix for bvec_iter changes
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
drivers/target/target_core_iblock.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index b7c64ef78338..554d4f75a75a 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -643,12 +643,12 @@ iblock_alloc_bip(struct se_cmd *cmd, struct bio *bio)
return -ENOMEM;
}
- bip->bip_size = (cmd->data_length / dev->dev_attrib.block_size) *
+ bip->bip_iter.bi_size = (cmd->data_length / dev->dev_attrib.block_size) *
dev->prot_length;
- bip->bip_sector = bio->bi_sector;
+ bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;
- pr_debug("IBLOCK BIP Size: %u Sector: %llu\n", bip->bip_size,
- (unsigned long long)bip->bip_sector);
+ pr_debug("IBLOCK BIP Size: %u Sector: %llu\n", bip->bip_iter.bi_size,
+ (unsigned long long)bip->bip_iter.bi_sector);
for_each_sg(cmd->t_prot_sg, sg, cmd->t_prot_nents, i) {
--
1.8.5.3
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
==============================================================================
TOPIC: arm64: audit: Add 32-bit (compat) syscall support
http://groups.google.com/group/linux.kernel/t/1000c63bf4615e52?hl=en
==============================================================================
== 1 of 1 ==
Date: Sun, Jan 19 2014 9:30 pm
From: AKASHI Takahiro
On 01/18/2014 01:46 AM, Will Deacon wrote:
> Hi Akashi,
>
> On Fri, Jan 17, 2014 at 08:13:17AM +0000, AKASHI Takahiro wrote:
>> Generic audit code also supports compat system calls now.
>> This patch adds a small piece of architecture-dependent code.
>
> [...]
>
>> static inline int syscall_get_nr(struct task_struct *task,
>> @@ -109,6 +110,15 @@ static inline void syscall_set_arguments(struct task_struct *task,
>> static inline int syscall_get_arch(struct task_struct *task,
>> struct pt_regs *regs)
>> {
>> +#ifdef CONFIG_COMPAT
>> + if (is_compat_thread(task_thread_info(task)))
>
> You can call is_compat_thread even when !CONFIG_COMPAT, so you don't need
> that #ifdef.
Right. I will remove it.
>> +#ifdef __AARCH64EB__
>> + return AUDIT_ARCH_ARMEB; /* only BE on BE */
>
> Well, actually, we only support userspace being the same endianness as the
> kernel, so that comment is slightly misleading. You could probably avoid
> these repeated ifdefs by defining things like ARM64_AUDIT_ARCH and
> ARM64_COMPAT_AUDIT_ARCH once depending on endianness.
As in the discussions about "audit(userspace)", if we don't have to care
about endianness, I will remove this #ifdef instead.
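For reference, with both #ifdefs dropped the helper could end up looking
roughly like this (assuming a little-endian kernel and userspace, so
AUDIT_ARCH_ARM and AUDIT_ARCH_AARCH64 are the only values needed; untested
sketch, not the posted patch):
static inline int syscall_get_arch(struct task_struct *task,
				   struct pt_regs *regs)
{
	/* is_compat_thread() is safe to call even when !CONFIG_COMPAT */
	if (is_compat_thread(task_thread_info(task)))
		return AUDIT_ARCH_ARM;		/* 32-bit (compat) tasks */
	return AUDIT_ARCH_AARCH64;		/* native 64-bit tasks */
}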
Thanks,
-Takahiro AKASHI
> Will
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
==============================================================================
TOPIC: Documentation: move all DMA documentations into Documentaion/dma
http://groups.google.com/group/linux.kernel/t/c13767ccffe6446b?hl=en
==============================================================================
== 1 of 1 ==
Date: Sun, Jan 19 2014 9:40 pm
From: Vinod Koul
On Sat, Jan 18, 2014 at 11:59:13AM -0600, Rob Landley wrote:
> On 01/16/14 09:59, Vinod Koul wrote:
> >On Thu, Jan 16, 2014 at 06:50:04PM +0800, hongbo.zhang@freescale.com wrote:
> >>From: Hongbo Zhang <hongbo.zhang@freescale.com>
> >>
> >>Since there are already seven DMA documentations under the top Documentation/,
> >>it is better to create one dedicated directory for them.
> >
> >Well, the problem is that not everything is the same. Some of these describe
> >how to use the DMA mapping API, a couple are related to dmaengine, so clubbing
> >everything into "dma" doesn't sound right to me!
>
> Putting everything in the world in the top level directory isn't all
> flowers and kittens either.
>
> Where would be a _better_ place to move one of those files to?
As pointed out, most of the dma* documents are about dma-mapping or dmaengine,
so it would be apt to move them into two folders, unless I overlooked something
and everything else really is dma-mapping!
--
~Vinod
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
==============================================================================
TOPIC: dmaengine: Add DMA_PRIVATE to BCM2835 driver
http://groups.google.com/group/linux.kernel/t/75f44be67fd320ae?hl=en
==============================================================================
== 1 of 1 ==
Date: Sun, Jan 19 2014 9:40 pm
From: Vinod Koul
On Fri, Jan 17, 2014 at 06:06:29PM +0100, Florian Meier wrote:
> Without DMA_PRIVATE the driver is not able to allocate more than one channel.
> Since it uses dma_get_any_slave_channel that calls private_candidate,
> the second allocation fails at
> /* some channels are already publicly allocated */
> Maybe it should be fixed in the core, but at least this fixes the bug.
>
> Signed-off-by: Florian Meier <florian.meier@koalo.de>
> ---
> drivers/dma/bcm2835-dma.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/dma/bcm2835-dma.c b/drivers/dma/bcm2835-dma.c
> index 6ae0708..a036021 100644
> --- a/drivers/dma/bcm2835-dma.c
> +++ b/drivers/dma/bcm2835-dma.c
> @@ -611,6 +611,7 @@ static int bcm2835_dma_probe(struct platform_device *pdev)
> od->base = base;
>
> dma_cap_set(DMA_SLAVE, od->ddev.cap_mask);
> + dma_cap_set(DMA_PRIVATE, od->ddev.cap_mask);
> dma_cap_set(DMA_CYCLIC, od->ddev.cap_mask);
> od->ddev.device_alloc_chan_resources = bcm2835_dma_alloc_chan_resources;
> od->ddev.device_free_chan_resources = bcm2835_dma_free_chan_resources;
Applied, thanks
--
~Vinod
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
==============================================================================
TOPIC: backlight: turn backlight on/off when necessary
http://groups.google.com/group/linux.kernel/t/66a0e95b552bd82d?hl=en
==============================================================================
== 1 of 1 ==
Date: Sun, Jan 19 2014 9:50 pm
From: Liu Ying
We don't have to turn the backlight on/off every time a
blanking or unblanking event comes in, because the backlight
may already be in the state we want. Another consideration
is that one backlight device may be shared by multiple
framebuffers. We don't want blanking one of the framebuffers
to turn the backlight off for all the other framebuffers,
since they are likely still active and displaying something.
This patch adds some logic to record each framebuffer's
backlight usage, in order to determine the backlight
device's use count and whether the backlight should be
turned on or off. To be more specific, only an unblank
operation on a currently blanked framebuffer may increase
the backlight device's use count by one, and only a blank
operation on a currently unblanked framebuffer may decrease
the use count by one, because userspace may well unblank an
already unblanked framebuffer or blank an already blanked
framebuffer.
Signed-off-by: Liu Ying <Ying.Liu@freescale.com>
---
v1 can be found at https://lkml.org/lkml/2013/5/30/139
v1->v2:
* Make the commit message be more specific about the condition
in which backlight device use count can be increased/decreased.
* Correct the setting for bd->props.fb_blank.
drivers/video/backlight/backlight.c | 28 +++++++++++++++++++++-------
include/linux/backlight.h | 6 ++++++
2 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/drivers/video/backlight/backlight.c b/drivers/video/backlight/backlight.c
index 5d05555..42044be 100644
--- a/drivers/video/backlight/backlight.c
+++ b/drivers/video/backlight/backlight.c
@@ -34,13 +34,15 @@ static const char *const backlight_types[] = {
defined(CONFIG_BACKLIGHT_CLASS_DEVICE_MODULE))
/* This callback gets called when something important happens inside a
* framebuffer driver. We're looking if that important event is blanking,
- * and if it is, we're switching backlight power as well ...
+ * and if it is and necessary, we're switching backlight power as well ...
*/
static int fb_notifier_callback(struct notifier_block *self,
unsigned long event, void *data)
{
struct backlight_device *bd;
struct fb_event *evdata = data;
+ int node = evdata->info->node;
+ int fb_blank = 0;
/* If we aren't interested in this event, skip it immediately ... */
if (event != FB_EVENT_BLANK && event != FB_EVENT_CONBLANK)
@@ -51,12 +53,24 @@ static int fb_notifier_callback(struct notifier_block *self,
if (bd->ops)
if (!bd->ops->check_fb ||
bd->ops->check_fb(bd, evdata->info)) {
- bd->props.fb_blank = *(int *)evdata->data;
- if (bd->props.fb_blank == FB_BLANK_UNBLANK)
- bd->props.state &= ~BL_CORE_FBBLANK;
- else
- bd->props.state |= BL_CORE_FBBLANK;
- backlight_update_status(bd);
+ fb_blank = *(int *)evdata->data;
+ if (fb_blank == FB_BLANK_UNBLANK &&
+ !bd->fb_bl_on[node]) {
+ bd->fb_bl_on[node] = true;
+ if (!bd->use_count++) {
+ bd->props.state &= ~BL_CORE_FBBLANK;
+ bd->props.fb_blank = FB_BLANK_UNBLANK;
+ backlight_update_status(bd);
+ }
+ } else if (fb_blank != FB_BLANK_UNBLANK &&
+ bd->fb_bl_on[node]) {
+ bd->fb_bl_on[node] = false;
+ if (!(--bd->use_count)) {
+ bd->props.state |= BL_CORE_FBBLANK;
+ bd->props.fb_blank = FB_BLANK_POWERDOWN;
+ backlight_update_status(bd);
+ }
+ }
}
mutex_unlock(&bd->ops_lock);
return 0;
diff --git a/include/linux/backlight.h b/include/linux/backlight.h
index 5f9cd96..7264742 100644
--- a/include/linux/backlight.h
+++ b/include/linux/backlight.h
@@ -9,6 +9,7 @@
#define _LINUX_BACKLIGHT_H
#include <linux/device.h>
+#include <linux/fb.h>
#include <linux/mutex.h>
#include <linux/notifier.h>
@@ -104,6 +105,11 @@ struct backlight_device {
struct list_head entry;
struct device dev;
+
+ /* Multiple framebuffers may share one backlight device */
+ bool fb_bl_on[FB_MAX];
+
+ int use_count;
};
static inline void backlight_update_status(struct backlight_device *bd)
--
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
==============================================================================
TOPIC: linux-next: build failure after merge of the gpio tree
http://groups.google.com/group/linux.kernel/t/2ac85584c4bcbf4e?hl=en
==============================================================================
== 1 of 1 ==
Date: Sun, Jan 19 2014 9:50 pm
From: Stephen Rothwell
Hi Linus,
After merging the gpio tree, today's linux-next build (x86_64
allmodconfig) failed like this:
drivers/gpio/gpio-mcp23s08.c: In function 'mcp23s08_irq_setup':
drivers/gpio/gpio-mcp23s08.c:482:46: error: 'struct gpio_chip' has no member named 'of_node'
mcp->irq_domain = irq_domain_add_linear(chip->of_node, chip->ngpio,
^
drivers/gpio/gpio-mcp23s08.c: In function 'mcp23s08_probe_one':
drivers/gpio/gpio-mcp23s08.c:651:55: error: 'struct gpio_chip' has no member named 'of_node'
mcp->irq_controller = of_property_read_bool(mcp->chip.of_node,
^
drivers/gpio/gpio-mcp23s08.c:654:43: error: 'struct gpio_chip' has no member named 'of_node'
mirror = of_property_read_bool(mcp->chip.of_node,
^
Caused by commit 4e47f91bf741 ("gpio: mcp23s08: Add irq functionality for
i2c chips"). The presence of of_node depends on CONFIG_OF_GPIO.
I have reverted that commit for today.
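For context, of_node is only compiled into struct gpio_chip when OF_GPIO is
configured, roughly:
struct gpio_chip {
	/* ... */
#if defined(CONFIG_OF_GPIO)
	/* only present when the kernel has OF GPIO support */
	struct device_node *of_node;
	/* ... */
#endif
};
so the driver will need either a CONFIG_OF_GPIO guard of its own, or to take
the of_node from the underlying struct device, which is always there.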
--
Cheers,
Stephen Rothwell sfr@canb.auug.org.au
==============================================================================
TOPIC: of: fix of_update_property()
http://groups.google.com/group/linux.kernel/t/9a8655a1d0f481a2?hl=en
==============================================================================
== 1 of 1 ==
Date: Sun, Jan 19 2014 10:00 pm
From: "Li.Xiubo@freescale.com"
> Subject: Re: [PATCH] of: fix of_update_property()
>
> On Thu, Jan 16, 2014 at 10:46 PM, Xiubo Li <Li.Xiubo@freescale.com> wrote:
> > The of_update_property() is intent to update a property in a node
>
> s/intent/intended/
>
> > and if the property does not exist, will add it to the node.
> >
> > The second search for the property possibly won't find it, because it may
> > have been removed by another thread just before the second search began;
> > if so, just retry it.
>
> How did you find this problem? Actual use or some artificial stress test?
>
Some artificial stress test at home.
> > Signed-off-by: Xiubo Li <Li.Xiubo@freescale.com>
> > ---
> > drivers/of/base.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/of/base.c b/drivers/of/base.c
> > index f807d0e..d0c53bc 100644
> > --- a/drivers/of/base.c
> > +++ b/drivers/of/base.c
> > @@ -1572,6 +1572,7 @@ int of_update_property(struct device_node *np, struct
> property *newprop)
> > if (!newprop->name)
> > return -EINVAL;
> >
> > +retry:
> > oldprop = of_find_property(np, newprop->name, NULL);
> > if (!oldprop)
> > return of_add_property(np, newprop);
>
> Isn't there also a race that if you do 2 updates for a non-existent
> property and both threads try to add the property, the first one will
> succeed and the 2nd will fail. The 2nd one needs to retry as well.
>
Well, yes, that will happen.
Maybe we could add an __of_add_property() without any locks, like
__of_find_property(), and then in of_update_property() move the searching
and adding operations in between the lock and unlock, like:
raw_spin_lock_irqsave();
oldprop = __of_find_property();
if (!oldprop) {
rc = __of_add_property(np, newprop);
...
}
...
replace the node...
...
raw_spin_unlock_irqrestore();
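Fleshed out a bit (with a hypothetical unlocked __of_add_property(), and
leaving out the sysfs/notifier handling the real function does), the idea is
roughly:
int of_update_property(struct device_node *np, struct property *newprop)
{
	struct property **next, *oldprop;
	unsigned long flags;
	int rc = 0;

	if (!newprop->name)
		return -EINVAL;

	raw_spin_lock_irqsave(&devtree_lock, flags);
	oldprop = __of_find_property(np, newprop->name, NULL);
	if (!oldprop) {
		/* not found: add it under the same lock, no second search */
		rc = __of_add_property(np, newprop);
	} else {
		/* found: splice newprop in where oldprop was */
		for (next = &np->properties; *next != oldprop;
		     next = &(*next)->next)
			;
		newprop->next = oldprop->next;
		*next = newprop;
		oldprop->next = np->deadprops;
		np->deadprops = oldprop;
	}
	raw_spin_unlock_irqrestore(&devtree_lock, flags);

	return rc;
}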
> Also, couldn't the node itself be removed while trying to do the update?
>
This is between the lock operations, so I think it doesn't matter here.
> There seem to be multiple problems with this code, but doing multiple
> simultaneous, conflicting updates seems like an unlikely case.
>
Yes, but this will happen in theory.
Thanks,
Best Regards,
Xiubo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
==============================================================================
TOPIC: DM: dm-insitu-comp: a compressed DM target for SSD
http://groups.google.com/group/linux.kernel/t/a407f4d2ed42a748?hl=en
==============================================================================
== 1 of 1 ==
Date: Sun, Jan 19 2014 10:00 pm
From: Shaohua Li
This is a simple DM target supporting compression for SSDs only. The underlying
SSD must support a 512B sector size; the target itself only supports a 4k
sector size.
Disk layout:
|super|...meta...|..data...|
The storage unit is 4k (a block). The super is 1 block, which stores the meta
and data sizes and the compression algorithm. The meta is a bitmap; for each
data block there are 5 bits of meta.
Data:
Data of a block is compressed. The compressed data is rounded up to 512B, which
forms the payload. On disk, the payload is stored at the beginning of the
block's logical sectors. Let's look at an example. Say we store data to block
A, which starts at sector B (A*8); its original size is 4k and its compressed
size is 1500 bytes. The compressed data (CD) will use 3 sectors (of 512B each).
Those 3 sectors are the payload, and the payload is stored starting at sector B.
---------------------------------------------------
... | CD1 | CD2 | CD3 | | | | | | ...
---------------------------------------------------
^B ^B+1 ^B+2 ^B+7 ^B+8
For this block, we will not use sectors B+3 to B+7 (a hole). We use 4 meta bits
to record the payload size. The compressed size (1500) isn't stored in the meta
directly. Instead, we store it in the last 32 bits of the payload, in this
example at the end of sector B+2. If the compressed size plus those 32 bits
would cross a sector boundary, the payload grows by one sector. If the payload
would use all 8 sectors, we store the uncompressed data directly.
If the IO size is bigger than one block, we can store the data as an extent.
Data of the whole extent is compressed and stored in a similar way to the above.
The first block of the extent is the head; all the others are tails. If an
extent is 1 block, that block is the head. We have 1 bit of meta to indicate
whether a block is a head or a tail. If the 4 meta bits of the head block can't
hold the extent's payload size, we borrow the tail blocks' meta bits to store
it. The maximum allowed extent size is 128k, so we don't compress/decompress
overly large chunks of data.
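To make the payload-size rule concrete, for a single block it works out like
this (an illustrative helper; the real computation is in
insitu_comp_io_range_comp() in the patch below):
static unsigned int payload_sectors(unsigned int comp_len)
{
	unsigned int payload = (comp_len + 511) & ~511U; /* round up to 512B */

	/* the real compressed length lives in the last u32 of the payload,
	 * so grow by one sector when there is no room left for it */
	if (payload - comp_len < sizeof(u32))
		payload += 512;

	/* e.g. comp_len = 1500 -> payload = 1536 -> 3 sectors; a result of
	 * 8 sectors means the block is stored uncompressed instead */
	return payload >> 9;
}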
Meta:
Modifying data modifies the meta too. Meta is written (flushed) to disk
according to the meta write policy; we support writeback and writethrough
modes. In writeback mode, meta is written to disk at an interval or on a FLUSH
request. In writethrough mode, data and metadata are written to disk together.
Advantages:
1. simple. Since we store compressed data in-place, we don't need complicated
disk data management.
2. efficient. For each 4k block, we only need 5 bits of meta; 1T of data uses
less than 200M of meta, so we can load all of it into memory. The actual
compressed size lives in the payload, so if an IO doesn't need a
read-modify-write and we use writeback meta flushing, we need no extra IO for
meta.
Disadvantages:
1. holes. Since we store compressed data in-place, there are a lot of holes (in
the example above, B+3 to B+7). Holes can impact IO, because we can't merge IO.
2. 1:1 size. Compression doesn't change the disk size. If the disk is 1T, we
can only store 1T of data even with compression.
But this target is for SSDs only. SSD firmware generally has an FTL layer that
maps disk sectors to flash NAND, and high-end SSD firmware has a
filesystem-like FTL.
1. holes. The disk has a lot of holes, but the SSD FTL can still store the data
contiguously in NAND. Even if we can't merge IO at the OS layer, the SSD
firmware can do it.
2. 1:1 size. On one side, we write compressed data to the SSD, which means less
data is written to it. This helps SSD garbage collection a lot, and therefore
write speed and device lifetime, so even with this limitation the target is
still useful. On the other side, an advanced SSD FTL can easily do thin
provisioning: for example, if the NAND is 1T, the SSD can report itself as 2T
and be used as a compressed target; with such an SSD we don't have the 1:1 size
issue.
So if the SSD FTL can map non-contiguous disk sectors to contiguous NAND and
supports thin provisioning, the compressed target will work very well.
V1->V2:
1. Change name to insitu_comp, cleanup code, add comments and doc
2. Improve performance (extent locking, dedicated workqueue)
Signed-off-by: Shaohua Li <shli@fusionio.com>
---
Documentation/device-mapper/insitu-comp.txt | 50
drivers/md/Kconfig | 6
drivers/md/Makefile | 1
drivers/md/dm-insitu-comp.c | 1483 ++++++++++++++++++++++++++++
drivers/md/dm-insitu-comp.h | 146 ++
5 files changed, 1686 insertions(+)
Index: linux/drivers/md/Kconfig
===================================================================
--- linux.orig/drivers/md/Kconfig 2014-01-17 14:37:12.186725995 +0800
+++ linux/drivers/md/Kconfig 2014-01-17 14:37:12.174726295 +0800
@@ -290,6 +290,12 @@ config DM_CACHE_CLEANER
A simple cache policy that writes back all data to the
origin. Used when decommissioning a dm-cache.
+config DM_INSITU_COMPRESSION
+ tristate "Insitu compression target"
+ depends on BLK_DEV_DM
+ ---help---
+ Allow volume managers to insitu compress data for SSD.
+
config DM_MIRROR
tristate "Mirror target"
depends on BLK_DEV_DM
Index: linux/drivers/md/Makefile
===================================================================
--- linux.orig/drivers/md/Makefile 2014-01-17 14:37:12.186725995 +0800
+++ linux/drivers/md/Makefile 2014-01-17 14:37:12.174726295 +0800
@@ -52,6 +52,7 @@ obj-$(CONFIG_DM_VERITY) += dm-verity.o
obj-$(CONFIG_DM_CACHE) += dm-cache.o
obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o
obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o
+obj-$(CONFIG_DM_INSITU_COMPRESSION) += dm-insitu-comp.o
ifeq ($(CONFIG_DM_UEVENT),y)
dm-mod-objs += dm-uevent.o
Index: linux/drivers/md/dm-insitu-comp.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/drivers/md/dm-insitu-comp.c 2014-01-20 13:42:00.417454765 +0800
@@ -0,0 +1,1483 @@
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/blkdev.h>
+#include <linux/bio.h>
+#include <linux/slab.h>
+#include <linux/device-mapper.h>
+#include <linux/dm-io.h>
+#include <linux/crypto.h>
+#include <linux/lzo.h>
+#include <linux/kthread.h>
+#include <linux/page-flags.h>
+#include <linux/completion.h>
+#include "dm-insitu-comp.h"
+
+#define DM_MSG_PREFIX "dm_insitu_comp"
+
+static struct insitu_comp_compressor_data compressors[] = {
+ [INSITU_COMP_ALG_LZO] = {
+ .name = "lzo",
+ .comp_len = lzo_comp_len,
+ },
+ [INSITU_COMP_ALG_ZLIB] = {
+ .name = "deflate",
+ },
+};
+static int default_compressor;
+
+static struct kmem_cache *insitu_comp_io_range_cachep;
+static struct kmem_cache *insitu_comp_meta_io_cachep;
+
+static struct insitu_comp_io_worker insitu_comp_io_workers[NR_CPUS];
+static struct workqueue_struct *insitu_comp_wq;
+
+/* each block has 5 bits metadata */
+static u8 insitu_comp_get_meta(struct insitu_comp_info *info, u64 block_index)
+{
+ u64 first_bit = block_index * INSITU_COMP_META_BITS;
+ int bits, offset;
+ u8 data, ret = 0;
+
+ offset = first_bit & 7;
+ bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
+
+ data = info->meta_bitmap[first_bit >> 3];
+ ret = (data >> offset) & ((1 << bits) - 1);
+
+ if (bits < INSITU_COMP_META_BITS) {
+ data = info->meta_bitmap[(first_bit >> 3) + 1];
+ bits = INSITU_COMP_META_BITS - bits;
+ ret |= (data & ((1 << bits) - 1)) <<
+ (INSITU_COMP_META_BITS - bits);
+ }
+ return ret;
+}
+
+static void insitu_comp_set_meta(struct insitu_comp_info *info,
+ u64 block_index, u8 meta, bool dirty_meta)
+{
+ u64 first_bit = block_index * INSITU_COMP_META_BITS;
+ int bits, offset;
+ u8 data;
+ struct page *page;
+
+ offset = first_bit & 7;
+ bits = min_t(u8, INSITU_COMP_META_BITS, 8 - offset);
+
+ data = info->meta_bitmap[first_bit >> 3];
+ data &= ~(((1 << bits) - 1) << offset);
+ data |= (meta & ((1 << bits) - 1)) << offset;
+ info->meta_bitmap[first_bit >> 3] = data;
+
+ /*
+ * For writethrough, we write metadata directly. For writeback, if
+ * request is FUA, we do this too; otherwise we just dirty the page,
+ * which will be flushed out in an interval
+ */
+ if (info->write_mode == INSITU_COMP_WRITE_BACK) {
+ page = vmalloc_to_page(&info->meta_bitmap[first_bit >> 3]);
+ if (dirty_meta)
+ SetPageDirty(page);
+ else
+ ClearPageDirty(page);
+ }
+
+ if (bits < INSITU_COMP_META_BITS) {
+ meta >>= bits;
+ data = info->meta_bitmap[(first_bit >> 3) + 1];
+ bits = INSITU_COMP_META_BITS - bits;
+ data = (data >> bits) << bits;
+ data |= meta & ((1 << bits) - 1);
+ info->meta_bitmap[(first_bit >> 3) + 1] = data;
+
+ if (info->write_mode == INSITU_COMP_WRITE_BACK) {
+ page = vmalloc_to_page(&info->meta_bitmap[
+ (first_bit >> 3) + 1]);
+ if (dirty_meta)
+ SetPageDirty(page);
+ else
+ ClearPageDirty(page);
+ }
+ }
+}
+
+/*
+ * set metadata for an extent since block @block_index, length is
+ * @logical_blocks. The extent uses @data_sectors sectors
+ */
+static void insitu_comp_set_extent(struct insitu_comp_req *req,
+ u64 block_index, u16 logical_blocks, sector_t data_sectors)
+{
+ int i;
+ u8 data;
+
+ for (i = 0; i < logical_blocks; i++) {
+ data = min_t(sector_t, data_sectors, 8);
+ data_sectors -= data;
+ if (i != 0)
+ data |= INSITU_COMP_TAIL_MASK;
+ /* For FUA, we write out meta data directly */
+ insitu_comp_set_meta(req->info, block_index + i, data,
+ !(req->bio->bi_rw & REQ_FUA));
+ }
+}
+
+/*
+ * get metadata for an extent covering block @block_index. @first_block_index
+ * returns the first block of the extent. @logical_sectors returns the extent
+ * length. @data_sectors returns the sectors the extent uses
+ */
+static void insitu_comp_get_extent(struct insitu_comp_info *info,
+ u64 block_index, u64 *first_block_index, u16 *logical_sectors,
+ u16 *data_sectors)
+{
+ u8 data;
+
+ data = insitu_comp_get_meta(info, block_index);
+ while (data & INSITU_COMP_TAIL_MASK) {
+ block_index--;
+ data = insitu_comp_get_meta(info, block_index);
+ }
+ *first_block_index = block_index;
+ *logical_sectors = INSITU_COMP_BLOCK_SIZE >> 9;
+ *data_sectors = data & INSITU_COMP_LENGTH_MASK;
+ block_index++;
+ while (block_index < info->data_blocks) {
+ data = insitu_comp_get_meta(info, block_index);
+ if (!(data & INSITU_COMP_TAIL_MASK))
+ break;
+ *logical_sectors += INSITU_COMP_BLOCK_SIZE >> 9;
+ *data_sectors += data & INSITU_COMP_LENGTH_MASK;
+ block_index++;
+ }
+}
+
+static int insitu_comp_access_super(struct insitu_comp_info *info,
+ void *addr, int rw)
+{
+ struct dm_io_region region;
+ struct dm_io_request req;
+ unsigned long io_error = 0;
+ int ret;
+
+ region.bdev = info->dev->bdev;
+ region.sector = 0;
+ region.count = INSITU_COMP_BLOCK_SIZE >> 9;
+
+ req.bi_rw = rw;
+ req.mem.type = DM_IO_KMEM;
+ req.mem.offset = 0;
+ req.mem.ptr.addr = addr;
+ req.notify.fn = NULL;
+ req.client = info->io_client;
+
+ ret = dm_io(&req, 1, &region, &io_error);
+ if (ret || io_error)
+ return -EIO;
+ return 0;
+}
+
+static void insitu_comp_meta_io_done(unsigned long error, void *context)
+{
+ struct insitu_comp_meta_io *meta_io = context;
+
+ meta_io->fn(meta_io->data, error);
+ kmem_cache_free(insitu_comp_meta_io_cachep, meta_io);
+}
+
+static int insitu_comp_write_meta(struct insitu_comp_info *info,
+ u64 start_page, u64 end_page, void *data,
+ void (*fn)(void *data, unsigned long error), int rw)
+{
+ struct insitu_comp_meta_io *meta_io;
+
+ BUG_ON(end_page > info->meta_bitmap_pages);
+
+ meta_io = kmem_cache_alloc(insitu_comp_meta_io_cachep, GFP_NOIO);
+ if (!meta_io) {
+ fn(data, -ENOMEM);
+ return -ENOMEM;
+ }
+ meta_io->data = data;
+ meta_io->fn = fn;
+
+ meta_io->io_region.bdev = info->dev->bdev;
+ meta_io->io_region.sector = INSITU_COMP_META_START_SECTOR +
+ (start_page << (PAGE_SHIFT - 9));
+ meta_io->io_region.count = (end_page - start_page) << (PAGE_SHIFT - 9);
+
+ atomic64_add(meta_io->io_region.count << 9, &info->meta_write_size);
+
+ meta_io->io_req.bi_rw = rw;
+ meta_io->io_req.mem.type = DM_IO_VMA;
+ meta_io->io_req.mem.offset = 0;
+ meta_io->io_req.mem.ptr.addr = info->meta_bitmap +
+ (start_page << PAGE_SHIFT);
+ meta_io->io_req.notify.fn = insitu_comp_meta_io_done;
+ meta_io->io_req.notify.context = meta_io;
+ meta_io->io_req.client = info->io_client;
+
+ dm_io(&meta_io->io_req, 1, &meta_io->io_region, NULL);
+ return 0;
+}
+
+struct writeback_flush_data {
+ struct completion complete;
+ atomic_t cnt;
+};
+
+static void writeback_flush_io_done(void *data, unsigned long error)
+{
+ struct writeback_flush_data *wb = data;
+
+ if (atomic_dec_return(&wb->cnt))
+ return;
+ complete(&wb->complete);
+}
+
+static void insitu_comp_flush_dirty_meta(struct insitu_comp_info *info,
+ struct writeback_flush_data *data)
+{
+ struct page *page;
+ u64 start = 0, index;
+ u32 pending = 0, cnt = 0;
+ bool dirty;
+ struct blk_plug plug;
+
+ blk_start_plug(&plug);
+ for (index = 0; index < info->meta_bitmap_pages; index++, cnt++) {
+ if (cnt == 256) {
+ cnt = 0;
+ cond_resched();
+ }
+
+ page = vmalloc_to_page(info->meta_bitmap +
+ (index << PAGE_SHIFT));
+ dirty = TestClearPageDirty(page);
+
+ if (pending == 0 && dirty) {
+ start = index;
+ pending++;
+ continue;
+ } else if (pending == 0)
+ continue;
+ else if (pending > 0 && dirty) {
+ pending++;
+ continue;
+ }
+
+ /* pending > 0 && !dirty */
+ atomic_inc(&data->cnt);
+ insitu_comp_write_meta(info, start, start + pending, data,
+ writeback_flush_io_done, WRITE);
+ pending = 0;
+ }
+
+ if (pending > 0) {
+ atomic_inc(&data->cnt);
+ insitu_comp_write_meta(info, start, start + pending, data,
+ writeback_flush_io_done, WRITE);
+ }
+ blkdev_issue_flush(info->dev->bdev, GFP_NOIO, NULL);
+ blk_finish_plug(&plug);
+}
+
+/* writeback thread flushes all dirty metadata to disk in an interval */
+static int insitu_comp_meta_writeback_thread(void *data)
+{
+ struct insitu_comp_info *info = data;
+ struct writeback_flush_data wb;
+
+ atomic_set(&wb.cnt, 1);
+ init_completion(&wb.complete);
+
+ while (!kthread_should_stop()) {
+ schedule_timeout_interruptible(
+ msecs_to_jiffies(info->writeback_delay * 1000));
+ insitu_comp_flush_dirty_meta(info, &wb);
+ }
+
+ insitu_comp_flush_dirty_meta(info, &wb);
+
+ writeback_flush_io_done(&wb, 0);
+ wait_for_completion(&wb.complete);
+ return 0;
+}
+
+static int insitu_comp_init_meta(struct insitu_comp_info *info, bool new)
+{
+ struct dm_io_region region;
+ struct dm_io_request req;
+ unsigned long io_error = 0;
+ struct blk_plug plug;
+ int ret;
+ ssize_t len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
+
+ len *= sizeof(unsigned long);
+
+ region.bdev = info->dev->bdev;
+ region.sector = INSITU_COMP_META_START_SECTOR;
+ region.count = (len + 511) >> 9;
+
+ req.mem.type = DM_IO_VMA;
+ req.mem.offset = 0;
+ req.mem.ptr.addr = info->meta_bitmap;
+ req.notify.fn = NULL;
+ req.client = info->io_client;
+
+ blk_start_plug(&plug);
+ if (new) {
+ memset(info->meta_bitmap, 0, len);
+ req.bi_rw = WRITE_FLUSH;
+ ret = dm_io(&req, 1, &region, &io_error);
+ } else {
+ req.bi_rw = READ;
+ ret = dm_io(&req, 1, &region, &io_error);
+ }
+ blk_finish_plug(&plug);
+
+ if (ret || io_error) {
+ info->ti->error = "Access metadata error";
+ return -EIO;
+ }
+
+ if (info->write_mode == INSITU_COMP_WRITE_BACK) {
+ info->writeback_tsk = kthread_run(
+ insitu_comp_meta_writeback_thread,
+ info, "insitu_comp_writeback");
+ if (IS_ERR(info->writeback_tsk)) {
+ info->ti->error = "Create writeback thread error";
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+static int insitu_comp_alloc_compressor(struct insitu_comp_info *info)
+{
+ int i;
+
+ for_each_possible_cpu(i) {
+ info->tfm[i] = crypto_alloc_comp(
+ compressors[info->comp_alg].name, 0, 0);
+ if (IS_ERR(info->tfm[i])) {
+ info->tfm[i] = NULL;
+ goto err;
+ }
+ }
+ return 0;
+err:
+ for_each_possible_cpu(i) {
+ if (info->tfm[i]) {
+ crypto_free_comp(info->tfm[i]);
+ info->tfm[i] = NULL;
+ }
+ }
+ return -ENOMEM;
+}
+
+static void insitu_comp_free_compressor(struct insitu_comp_info *info)
+{
+ int i;
+
+ for_each_possible_cpu(i) {
+ if (info->tfm[i]) {
+ crypto_free_comp(info->tfm[i]);
+ info->tfm[i] = NULL;
+ }
+ }
+}
+
+static int insitu_comp_read_or_create_super(struct insitu_comp_info *info)
+{
+ void *addr;
+ struct insitu_comp_super_block *super;
+ u64 total_blocks;
+ u64 data_blocks, meta_blocks;
+ u32 rem, cnt;
+ bool new_super = false;
+ int ret;
+ ssize_t len;
+
+ total_blocks = i_size_read(info->dev->bdev->bd_inode) >>
+ INSITU_COMP_BLOCK_SHIFT;
+ data_blocks = total_blocks - 1;
+ rem = do_div(data_blocks, INSITU_COMP_BLOCK_SIZE * 8 +
+ INSITU_COMP_META_BITS);
+ meta_blocks = data_blocks * INSITU_COMP_META_BITS;
+ data_blocks *= INSITU_COMP_BLOCK_SIZE * 8;
+
+ cnt = rem;
+ rem /= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
+ data_blocks += rem * (INSITU_COMP_BLOCK_SIZE * 8 /
+ INSITU_COMP_META_BITS);
+ meta_blocks += rem;
+
+ cnt %= (INSITU_COMP_BLOCK_SIZE * 8 / INSITU_COMP_META_BITS + 1);
+ meta_blocks += 1;
+ data_blocks += cnt - 1;
+
+ info->data_blocks = data_blocks;
+ info->data_start = (1 + meta_blocks) << INSITU_COMP_BLOCK_SECTOR_SHIFT;
+
+ addr = kzalloc(INSITU_COMP_BLOCK_SIZE, GFP_KERNEL);
+ if (!addr) {
+ info->ti->error = "Cannot allocate super";
+ return -ENOMEM;
+ }
+
+ super = addr;
+ ret = insitu_comp_access_super(info, addr, READ);
+ if (ret)
+ goto out;
+
+ if (le64_to_cpu(super->magic) == INSITU_COMP_SUPER_MAGIC) {
+ if (le64_to_cpu(super->meta_blocks) != meta_blocks ||
+ le64_to_cpu(super->data_blocks) != data_blocks) {
+ info->ti->error = "Super is invalid";
+ ret = -EINVAL;
+ goto out;
+ }
+ if (!crypto_has_comp(compressors[super->comp_alg].name, 0, 0)) {
+ info->ti->error =
+ "Compressor algorithm doesn't support";
+ ret = -EINVAL;
+ goto out;
+ }
+ } else {
+ super->magic = cpu_to_le64(INSITU_COMP_SUPER_MAGIC);
+ super->meta_blocks = cpu_to_le64(meta_blocks);
+ super->data_blocks = cpu_to_le64(data_blocks);
+ super->comp_alg = default_compressor;
+ ret = insitu_comp_access_super(info, addr, WRITE_FUA);
+ if (ret) {
+ info->ti->error = "Access super fails";
+ goto out;
+ }
+ new_super = true;
+ }
+
+ info->comp_alg = super->comp_alg;
+ if (insitu_comp_alloc_compressor(info)) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ info->meta_bitmap_bits = data_blocks * INSITU_COMP_META_BITS;
+ len = DIV_ROUND_UP_ULL(info->meta_bitmap_bits, BITS_PER_LONG);
+ len *= sizeof(unsigned long);
+ info->meta_bitmap_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ info->meta_bitmap = vmalloc(info->meta_bitmap_pages * PAGE_SIZE);
+ if (!info->meta_bitmap) {
+ ret = -ENOMEM;
+ goto bitmap_err;
+ }
+
+ ret = insitu_comp_init_meta(info, new_super);
+ if (ret)
+ goto meta_err;
+
+ return 0;
+meta_err:
+ vfree(info->meta_bitmap);
+bitmap_err:
+ insitu_comp_free_compressor(info);
+out:
+ kfree(addr);
+ return ret;
+}
+
+/*
+ * <dev> <writethough>/<writeback> <meta_commit_delay>
+ */
+static int insitu_comp_ctr(struct dm_target *ti, unsigned int argc, char **argv)
+{
+ struct insitu_comp_info *info;
+ char write_mode[15];
+ int ret, i;
+
+ if (argc < 2) {
+ ti->error = "Invalid argument count";
+ return -EINVAL;
+ }
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info) {
+ ti->error = "Cannot allocate context";
+ return -ENOMEM;
+ }
+ info->ti = ti;
+
+ if (sscanf(argv[1], "%s", write_mode) != 1) {
+ ti->error = "Invalid argument";
+ ret = -EINVAL;
+ goto err_para;
+ }
+
+ if (strcmp(write_mode, "writeback") == 0) {
+ if (argc != 3) {
+ ti->error = "Invalid argument";
+ ret = -EINVAL;
+ goto err_para;
+ }
+ info->write_mode = INSITU_COMP_WRITE_BACK;
+ if (sscanf(argv[2], "%u", &info->writeback_delay) != 1) {
+ ti->error = "Invalid argument";
+ ret = -EINVAL;
+ goto err_para;
+ }
+ } else if (strcmp(write_mode, "writethrough") == 0) {
+ info->write_mode = INSITU_COMP_WRITE_THROUGH;
+ } else {
+ ti->error = "Invalid argument";
+ ret = -EINVAL;
+ goto err_para;
+ }
+
+ if (dm_get_device(ti, argv[0], dm_table_get_mode(ti->table),
+ &info->dev)) {
+ ti->error = "Can't get device";
+ ret = -EINVAL;
+ goto err_para;
+ }
+
+ info->io_client = dm_io_client_create();
+ if (!info->io_client) {
+ ti->error = "Can't create io client";
+ ret = -EINVAL;
+ goto err_ioclient;
+ }
+
+ if (bdev_logical_block_size(info->dev->bdev) != 512) {
+ ti->error = "Can't logical block size too big";
+ ret = -EINVAL;
+ goto err_blocksize;
+ }
+
+ ret = insitu_comp_read_or_create_super(info);
+ if (ret)
+ goto err_blocksize;
+
+ for (i = 0; i < BITMAP_HASH_LEN; i++) {
+ info->bitmap_locks[i].io_running = 0;
+ spin_lock_init(&info->bitmap_locks[i].wait_lock);
+ INIT_LIST_HEAD(&info->bitmap_locks[i].wait_list);
+ }
+
+ atomic64_set(&info->compressed_write_size, 0);
+ atomic64_set(&info->uncompressed_write_size, 0);
+ atomic64_set(&info->meta_write_size, 0);
+ ti->num_flush_bios = 1;
+ /* doesn't support discard yet */
+ ti->per_bio_data_size = sizeof(struct insitu_comp_req);
+ ti->private = info;
+ return 0;
+err_blocksize:
+ dm_io_client_destroy(info->io_client);
+err_ioclient:
+ dm_put_device(ti, info->dev);
+err_para:
+ kfree(info);
+ return ret;
+}
+
+static void insitu_comp_dtr(struct dm_target *ti)
+{
+ struct insitu_comp_info *info = ti->private;
+
+ if (info->write_mode == INSITU_COMP_WRITE_BACK)
+ kthread_stop(info->writeback_tsk);
+ insitu_comp_free_compressor(info);
+ vfree(info->meta_bitmap);
+ dm_io_client_destroy(info->io_client);
+ dm_put_device(ti, info->dev);
+ kfree(info);
+}
+
+static u64 insitu_comp_sector_to_block(sector_t sect)
+{
+ return sect >> INSITU_COMP_BLOCK_SECTOR_SHIFT;
+}
+
+static struct insitu_comp_hash_lock *
+insitu_comp_block_hash_lock(struct insitu_comp_info *info, u64 block_index)
+{
+ return &info->bitmap_locks[(block_index >> HASH_LOCK_SHIFT) &
+ BITMAP_HASH_MASK];
+}
+
+static struct insitu_comp_hash_lock *
+insitu_comp_trylock_block(struct insitu_comp_info *info,
+ struct insitu_comp_req *req, u64 block_index)
+{
+ struct insitu_comp_hash_lock *hash_lock;
+
+ hash_lock = insitu_comp_block_hash_lock(req->info, block_index);
+
+ spin_lock_irq(&hash_lock->wait_lock);
+ if (!hash_lock->io_running) {
+ hash_lock->io_running = 1;
+ spin_unlock_irq(&hash_lock->wait_lock);
+ return hash_lock;
+ }
+ list_add_tail(&req->sibling, &hash_lock->wait_list);
+ spin_unlock_irq(&hash_lock->wait_lock);
+ return NULL;
+}
+
+static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
+ struct list_head *list);
+static void insitu_comp_unlock_block(struct insitu_comp_info *info,
+ struct insitu_comp_req *req, struct insitu_comp_hash_lock *hash_lock)
+{
+ LIST_HEAD(pending_list);
+ unsigned long flags;
+
+ spin_lock_irqsave(&hash_lock->wait_lock, flags);
+ /* wakeup all pending reqs to avoid live lock */
+ list_splice_init(&hash_lock->wait_list, &pending_list);
+ hash_lock->io_running = 0;
+ spin_unlock_irqrestore(&hash_lock->wait_lock, flags);
+
+ insitu_comp_queue_req_list(info, &pending_list);
+}
+
+static void insitu_comp_unlock_req_range(struct insitu_comp_req *req)
+{
+ insitu_comp_unlock_block(req->info, req, req->lock);
+}
+
+/* Check the comments on HASH_LOCK_SHIFT: each request only needs to take one lock */
+static int insitu_comp_lock_req_range(struct insitu_comp_req *req)
+{
+ u64 block_index, tmp;
+
+ block_index = insitu_comp_sector_to_block(req->bio->bi_sector);
+ tmp = insitu_comp_sector_to_block(bio_end_sector(req->bio) - 1);
+ BUG_ON(insitu_comp_block_hash_lock(req->info, block_index) !=
+ insitu_comp_block_hash_lock(req->info, tmp));
+
+ req->lock = insitu_comp_trylock_block(req->info, req, block_index);
+ if (!req->lock)
+ return 0;
+
+ return 1;
+}
+
+static void insitu_comp_queue_req(struct insitu_comp_info *info,
+ struct insitu_comp_req *req)
+{
+ unsigned long flags;
+ struct insitu_comp_io_worker *worker =
+ &insitu_comp_io_workers[req->cpu];
+
+ spin_lock_irqsave(&worker->lock, flags);
+ list_add_tail(&req->sibling, &worker->pending);
+ spin_unlock_irqrestore(&worker->lock, flags);
+
+ queue_work_on(req->cpu, insitu_comp_wq, &worker->work);
+}
+
+static void insitu_comp_queue_req_list(struct insitu_comp_info *info,
+ struct list_head *list)
+{
+ struct insitu_comp_req *req;
+ while (!list_empty(list)) {
+ req = list_first_entry(list, struct insitu_comp_req, sibling);
+ list_del_init(&req->sibling);
+ insitu_comp_queue_req(info, req);
+ }
+}
+
+static void insitu_comp_get_req(struct insitu_comp_req *req)
+{
+ atomic_inc(&req->io_pending);
+}
+
+static void insitu_comp_free_io_range(struct insitu_comp_io_range *io)
+{
+ kfree(io->decomp_data);
+ kfree(io->comp_data);
+ kmem_cache_free(insitu_comp_io_range_cachep, io);
+}
+
+static void insitu_comp_put_req(struct insitu_comp_req *req)
+{
+ struct insitu_comp_io_range *io;
+
+ if (atomic_dec_return(&req->io_pending))
+ return;
+
+ if (req->stage == STAGE_INIT) /* waiting for locking */
+ return;
+
+ if (req->stage == STAGE_READ_DECOMP ||
+ req->stage == STAGE_WRITE_COMP ||
+ req->result)
+ req->stage = STAGE_DONE;
+
+ if (req->stage != STAGE_DONE) {
+ insitu_comp_queue_req(req->info, req);
+ return;
+ }
+
+ while (!list_empty(&req->all_io)) {
+ io = list_entry(req->all_io.next, struct insitu_comp_io_range,
+ next);
+ list_del(&io->next);
+ insitu_comp_free_io_range(io);
+ }
+
+ insitu_comp_unlock_req_range(req);
+
+ bio_endio(req->bio, req->result);
+}
+
+static void insitu_comp_io_range_done(unsigned long error, void *context)
+{
+ struct insitu_comp_io_range *io = context;
+
+ if (error)
+ io->req->result = error;
+ insitu_comp_put_req(io->req);
+}
+
+static inline int insitu_comp_compressor_len(struct insitu_comp_info *info,
+ int len)
+{
+ if (compressors[info->comp_alg].comp_len)
+ return compressors[info->comp_alg].comp_len(len);
+ return len;
+}
+
+/*
+ * caller should set region.sector, region.count and bi_rw. IO is always to/from
+ * comp_data
+ */
+static struct insitu_comp_io_range *
+insitu_comp_create_io_range(struct insitu_comp_req *req, int comp_len,
+ int decomp_len)
+{
+ struct insitu_comp_io_range *io;
+
+ io = kmem_cache_alloc(insitu_comp_io_range_cachep, GFP_NOIO);
+ if (!io)
+ return NULL;
+
+ io->comp_data = kmalloc(insitu_comp_compressor_len(req->info, comp_len),
+ GFP_NOIO);
+ io->decomp_data = kmalloc(decomp_len, GFP_NOIO);
+ if (!io->decomp_data || !io->comp_data) {
+ kfree(io->decomp_data);
+ kfree(io->comp_data);
+ kmem_cache_free(insitu_comp_io_range_cachep, io);
+ return NULL;
+ }
+
+ io->io_req.notify.fn = insitu_comp_io_range_done;
+ io->io_req.notify.context = io;
+ io->io_req.client = req->info->io_client;
+ io->io_req.mem.type = DM_IO_KMEM;
+ io->io_req.mem.ptr.addr = io->comp_data;
+ io->io_req.mem.offset = 0;
+
+ io->io_region.bdev = req->info->dev->bdev;
+
+ io->decomp_len = decomp_len;
+ io->comp_len = comp_len;
+ io->req = req;
+ return io;
+}
+
+static void insitu_comp_bio_copy(struct bio *bio, off_t bio_off, void *buf,
+ ssize_t len, bool to_buf)
+{
+ struct bio_vec *bv;
+ off_t buf_off = 0;
+ ssize_t size;
+ void *addr;
+
+ if (bio_off + len > (bio_sectors(bio) << 9))
+ BUG();
+
+ bv = __BVEC_START(bio);
+ while (bio_off > bv->bv_len) {
+ bio_off -= bv->bv_len;
+ bv++;
+ }
+
+ while (len) {
+ addr = kmap_atomic(bv->bv_page);
+ size = min_t(ssize_t, len, bv->bv_len - bio_off);
+ if (to_buf)
+ memcpy(buf + buf_off, addr + bio_off + bv->bv_offset,
+ size);
+ else
+ memcpy(addr + bio_off + bv->bv_offset, buf + buf_off,
+ size);
+ kunmap_atomic(addr);
+
+ bio_off = 0;
+ buf_off += size;
+ len -= size;
+ bv++;
+ }
+}
+
+/*
+ * return value:
+ * < 0 : error
+ * == 0 : ok
+ * == 1 : ok, but comp/decomp is skipped
+ * Compressed data size is roundup of 512, which makes the payload.
+ * We store the actual compressed length in the last u32 of the payload.
+ * If there is no free space, we add 512 to the payload size.
+ */
+static int insitu_comp_io_range_comp(struct insitu_comp_info *info,
+ void *comp_data, unsigned int *comp_len, void *decomp_data,
+ unsigned int decomp_len, bool do_comp)
+{
+ struct crypto_comp *tfm;
+ u32 *addr;
+ unsigned int actual_comp_len;
+ int ret;
+
+ if (do_comp) {
+ actual_comp_len = *comp_len;
+
+ tfm = info->tfm[get_cpu()];
+ ret = crypto_comp_compress(tfm, decomp_data, decomp_len,
+ comp_data, &actual_comp_len);
+ put_cpu();
+
+ atomic64_add(decomp_len, &info->uncompressed_write_size);
+ if (ret || decomp_len < actual_comp_len + sizeof(u32) + 512) {
+ *comp_len = decomp_len;
+ atomic64_add(*comp_len, &info->compressed_write_size);
+ return 1;
+ }
+
+ *comp_len = round_up(actual_comp_len, 512);
+ if (*comp_len - actual_comp_len < sizeof(u32))
+ *comp_len += 512;
+ atomic64_add(*comp_len, &info->compressed_write_size);
+ addr = comp_data + *comp_len;
+ addr--;
+ *addr = cpu_to_le32(actual_comp_len);
+ } else {
+ if (*comp_len == decomp_len)
+ return 1;
+ addr = comp_data + *comp_len;
+ addr--;
+ actual_comp_len = le32_to_cpu(*addr);
+
+ tfm = info->tfm[get_cpu()];
+ ret = crypto_comp_decompress(tfm, comp_data, actual_comp_len,
+ decomp_data, &decomp_len);
+ put_cpu();
+ if (ret)
+ return -EINVAL;
+ }
+ return 0;
+}
+
+/*
+ * compressed data is updated. We decompress it and fill bio. If there is no
+ * valid compressed data, we just zero bio
+ */
+static void insitu_comp_handle_read_decomp(struct insitu_comp_req *req)
+{
+ struct insitu_comp_io_range *io;
+ off_t bio_off = 0;
+ int ret;
+
+ req->stage = STAGE_READ_DECOMP;
+
+ if (req->result)
+ return;
+
+ list_for_each_entry(io, &req->all_io, next) {
+ ssize_t dst_off = 0, src_off = 0, len;
+
+ io->io_region.sector -= req->info->data_start;
+
+ /* Do decomp here */
+ ret = insitu_comp_io_range_comp(req->info, io->comp_data,
+ &io->comp_len, io->decomp_data, io->decomp_len, false);
+ if (ret < 0) {
+ req->result = -EIO;
+ return;
+ }
+
+ if (io->io_region.sector >= req->bio->bi_sector)
+ dst_off = (io->io_region.sector - req->bio->bi_sector)
+ << 9;
+ else
+ src_off = (req->bio->bi_sector - io->io_region.sector)
+ << 9;
+ len = min_t(ssize_t, io->decomp_len - src_off,
+ (bio_sectors(req->bio) << 9) - dst_off);
+
+ /* io range in all_io list is ordered for read IO */
+ while (bio_off != dst_off) {
+ ssize_t size = min_t(ssize_t, PAGE_SIZE,
+ dst_off - bio_off);
+ insitu_comp_bio_copy(req->bio, bio_off,
+ empty_zero_page, size, false);
+ bio_off += size;
+ }
+
+ if (ret == 1) /* uncompressed, valid data is in .comp_data */
+ insitu_comp_bio_copy(req->bio, dst_off,
+ io->comp_data + src_off, len, false);
+ else
+ insitu_comp_bio_copy(req->bio, dst_off,
+ io->decomp_data + src_off, len, false);
+ bio_off = dst_off + len;
+ }
+
+ while (bio_off != (bio_sectors(req->bio) << 9)) {
+ ssize_t size = min_t(ssize_t, PAGE_SIZE,
+ (bio_sectors(req->bio) << 9) - bio_off);
+ insitu_comp_bio_copy(req->bio, bio_off, empty_zero_page,
+ size, false);
+ bio_off += size;
+ }
+}
+
+/*
+ * read one extent data from disk. The extent starts from block @block and has
+ * @data_sectors data
+ */
+static void insitu_comp_read_one_extent(struct insitu_comp_req *req, u64 block,
+ u16 logical_sectors, u16 data_sectors)
+{
+ struct insitu_comp_io_range *io;
+
+ io = insitu_comp_create_io_range(req, data_sectors << 9,
+ logical_sectors << 9);
+ if (!io) {
+ req->result = -EIO;
+ return;
+ }
+
+ insitu_comp_get_req(req);
+ list_add_tail(&io->next, &req->all_io);
+
+ io->io_region.sector = (block << INSITU_COMP_BLOCK_SECTOR_SHIFT) +
+ req->info->data_start;
+ io->io_region.count = data_sectors;
+
+ io->io_req.bi_rw = READ;
+ dm_io(&io->io_req, 1, &io->io_region, NULL);
+}
+
+static void insitu_comp_handle_read_read_existing(struct insitu_comp_req *req)
+{
+ u64 block_index, first_block_index;
+ u16 logical_sectors, data_sectors;
+
+ req->stage = STAGE_READ_EXISTING;
+
+ block_index = insitu_comp_sector_to_block(req->bio->bi_sector);
+again:
+ insitu_comp_get_extent(req->info, block_index, &first_block_index,
+ &logical_sectors, &data_sectors);
+ if (data_sectors > 0)
+ insitu_comp_read_one_extent(req, first_block_index,
+ logical_sectors, data_sectors);
+
+ if (req->result)
+ return;
+
+ block_index = first_block_index + (logical_sectors >>
+ INSITU_COMP_BLOCK_SECTOR_SHIFT);
+ /* the request might cover several extents */
+ if ((block_index << INSITU_COMP_BLOCK_SECTOR_SHIFT) <
+ bio_end_sector(req->bio))
+ goto again;
+
+ /* A shortcut if all data is in already */
+ if (list_empty(&req->all_io))
+ insitu_comp_handle_read_decomp(req);
+}
+
+static void insitu_comp_handle_read_request(struct insitu_comp_req *req)
+{
+ insitu_comp_get_req(req);
+
+ if (req->stage == STAGE_INIT) {
+ if (!insitu_comp_lock_req_range(req)) {
+ insitu_comp_put_req(req);
+ return;
+ }
+
+ insitu_comp_handle_read_read_existing(req);
+ } else if (req->stage == STAGE_READ_EXISTING)
+ insitu_comp_handle_read_decomp(req);
+
+ insitu_comp_put_req(req);
+}
+
+static void insitu_comp_write_meta_done(void *context, unsigned long error)
+{
+ struct insitu_comp_req *req = context;
+ insitu_comp_put_req(req);
+}
+
+static u64 insitu_comp_block_meta_page_index(u64 block, bool end)
+{
+ u64 bits = block * INSITU_COMP_META_BITS - !!end;
+ /* (1 << 3) bits per byte */
+ return bits >> (3 + PAGE_SHIFT);
+}
+
+/*
+ * the request covers some extents partially. Decompress data of the extents,
+ * compress remaining valid data, and finally write them out
+ */
+static int insitu_comp_handle_write_modify(struct insitu_comp_io_range *io,
+ u64 *meta_start, u64 *meta_end, bool *handle_bio)
+{
+ struct insitu_comp_req *req = io->req;
+ sector_t start, count;
+ unsigned int comp_len;
+ off_t offset;
+ u64 page_index;
+ int ret;
+
+ io->io_region.sector -= req->info->data_start;
+
+ /* decompress original data */
+ ret = insitu_comp_io_range_comp(req->info, io->comp_data, &io->comp_len,
+ io->decomp_data, io->decomp_len, false);
+ if (ret < 0) {
+ req->result = -EINVAL;
+ return -EIO;
+ }
+
+ start = io->io_region.sector;
+ count = io->decomp_len >> 9;
+ if (start < req->bio->bi_sector && start + count >
+ bio_end_sector(req->bio)) {
+ /* we don't split an extent */
+ if (ret == 1) {
+ memcpy(io->decomp_data, io->comp_data, io->decomp_len);
+ insitu_comp_bio_copy(req->bio, 0,
+ io->decomp_data + ((req->bio->bi_sector - start) <<
+ 9), bio_sectors(req->bio) << 9, true);
+ } else {
+ insitu_comp_bio_copy(req->bio, 0,
+ io->decomp_data + ((req->bio->bi_sector - start) <<
+ 9), bio_sectors(req->bio) << 9, true);
+ kfree(io->comp_data);
+ /* New compressed len might be bigger */
+ io->comp_data = kmalloc(insitu_comp_compressor_len(
+ req->info, io->decomp_len), GFP_NOIO);
+ io->comp_len = io->decomp_len;
+ if (!io->comp_data) {
+ req->result = -ENOMEM;
+ return -EIO;
+ }
+ io->io_req.mem.ptr.addr = io->comp_data;
+ }
+ /* need compress data */
+ ret = 0;
+ offset = 0;
+ *handle_bio = false;
+ } else if (start < req->bio->bi_sector) {
+ count = req->bio->bi_sector - start;
+ offset = 0;
+ } else {
+ offset = bio_end_sector(req->bio) - start;
+ start = bio_end_sector(req->bio);
+ count = count - offset;
+ }
+
+ /* Original data is uncompressed, we don't need writeback */
+ if (ret == 1) {
+ comp_len = count << 9;
+ goto handle_meta;
+ }
+
+ /* assume compressing less data uses less space (at least 4k less data) */
+ comp_len = io->comp_len;
+ ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
+ io->decomp_data + (offset << 9), count << 9, true);
+ if (ret < 0) {
+ req->result = -EIO;
+ return -EIO;
+ }
+
+ insitu_comp_get_req(req);
+ if (ret == 1)
+ io->io_req.mem.ptr.addr = io->decomp_data + (offset << 9);
+ io->io_region.count = comp_len >> 9;
+ io->io_region.sector = start + req->info->data_start;
+
+ io->io_req.bi_rw = req->bio->bi_rw;
+ dm_io(&io->io_req, 1, &io->io_region, NULL);
+handle_meta:
+ insitu_comp_set_extent(req, start >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
+ count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
+
+ page_index = insitu_comp_block_meta_page_index(start >>
+ INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
+ if (*meta_start > page_index)
+ *meta_start = page_index;
+ page_index = insitu_comp_block_meta_page_index(
+ (start + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT, true);
+ if (*meta_end < page_index)
+ *meta_end = page_index;
+ return 0;
+}
+
+/* Compress data and write it out */
+static void insitu_comp_handle_write_comp(struct insitu_comp_req *req)
+{
+ struct insitu_comp_io_range *io;
+ sector_t count;
+ unsigned int comp_len;
+ u64 meta_start = -1L, meta_end = 0, page_index;
+ int ret;
+ bool handle_bio = true;
+
+ req->stage = STAGE_WRITE_COMP;
+
+ if (req->result)
+ return;
+
+ list_for_each_entry(io, &req->all_io, next) {
+ if (insitu_comp_handle_write_modify(io, &meta_start, &meta_end,
+ &handle_bio))
+ return;
+ }
+
+ if (!handle_bio)
+ goto update_meta;
+
+ count = bio_sectors(req->bio);
+ io = insitu_comp_create_io_range(req, count << 9, count << 9);
+ if (!io) {
+ req->result = -EIO;
+ return;
+ }
+ insitu_comp_bio_copy(req->bio, 0, io->decomp_data, count << 9, true);
+
+ /* compress data */
+ comp_len = io->comp_len;
+ ret = insitu_comp_io_range_comp(req->info, io->comp_data, &comp_len,
+ io->decomp_data, count << 9, true);
+ if (ret < 0) {
+ insitu_comp_free_io_range(io);
+ req->result = -EIO;
+ return;
+ }
+
+ insitu_comp_get_req(req);
+ list_add_tail(&io->next, &req->all_io);
+ io->io_region.sector = req->bio->bi_sector + req->info->data_start;
+ if (ret == 1)
+ io->io_req.mem.ptr.addr = io->decomp_data;
+ io->io_region.count = comp_len >> 9;
+ io->io_req.bi_rw = req->bio->bi_rw;
+ dm_io(&io->io_req, 1, &io->io_region, NULL);
+ insitu_comp_set_extent(req,
+ req->bio->bi_sector >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
+ count >> INSITU_COMP_BLOCK_SECTOR_SHIFT, comp_len >> 9);
+
+ page_index = insitu_comp_block_meta_page_index(
+ req->bio->bi_sector >> INSITU_COMP_BLOCK_SECTOR_SHIFT, false);
+ if (meta_start > page_index)
+ meta_start = page_index;
+ page_index = insitu_comp_block_meta_page_index(
+ (req->bio->bi_sector + count) >> INSITU_COMP_BLOCK_SECTOR_SHIFT,
+ true);
+ if (meta_end < page_index)
+ meta_end = page_index;
+update_meta:
+ if (req->info->write_mode == INSITU_COMP_WRITE_THROUGH ||
+ (req->bio->bi_rw & REQ_FUA)) {
+ insitu_comp_get_req(req);
+ insitu_comp_write_meta(req->info, meta_start, meta_end + 1, req,
+ insitu_comp_write_meta_done, req->bio->bi_rw);
+ }
+}
+
+/* The request might cover some extents only partially; read them in first */
+static void insitu_comp_handle_write_read_existing(struct insitu_comp_req *req)
+{
+ u64 block_index, first_block_index;
+ u16 logical_sectors, data_sectors;
+
+ req->stage = STAGE_READ_EXISTING;
+
+ block_index = insitu_comp_sector_to_block(req->bio->bi_sector);
+ insitu_comp_get_extent(req->info, block_index, &first_block_index,
+ &logical_sectors, &data_sectors);
+ if (data_sectors > 0 && (first_block_index < block_index ||
+ first_block_index + insitu_comp_sector_to_block(logical_sectors) >
+ insitu_comp_sector_to_block(bio_end_sector(req->bio))))
+ insitu_comp_read_one_extent(req, first_block_index,
+ logical_sectors, data_sectors);
+
+ if (req->result)
+ return;
+
+ if (first_block_index + insitu_comp_sector_to_block(logical_sectors) >=
+ insitu_comp_sector_to_block(bio_end_sector(req->bio)))
+ goto out;
+
+ block_index = insitu_comp_sector_to_block(bio_end_sector(req->bio)) - 1;
+ insitu_comp_get_extent(req->info, block_index, &first_block_index,
+ &logical_sectors, &data_sectors);
+ if (data_sectors > 0 &&
+ first_block_index + insitu_comp_sector_to_block(logical_sectors) >
+ block_index + 1)
+ insitu_comp_read_one_extent(req, first_block_index,
+ logical_sectors, data_sectors);
+
+ if (req->result)
+ return;
+out:
+ if (list_empty(&req->all_io))
+ insitu_comp_handle_write_comp(req);
+}
+
+static void insitu_comp_handle_write_request(struct insitu_comp_req *req)
+{
+ insitu_comp_get_req(req);
+
+ if (req->stage == STAGE_INIT) {
+ if (!insitu_comp_lock_req_range(req)) {
+ insitu_comp_put_req(req);
+ return;
+ }
+
+ insitu_comp_handle_write_read_existing(req);
+ } else if (req->stage == STAGE_READ_EXISTING)
+ insitu_comp_handle_write_comp(req);
+
+ insitu_comp_put_req(req);
+}
+
+/* For writeback mode */
+static void insitu_comp_handle_flush_request(struct insitu_comp_req *req)
+{
+ struct writeback_flush_data wb;
+
+ atomic_set(&wb.cnt, 1);
+ init_completion(&wb.complete);
+
+ insitu_comp_flush_dirty_meta(req->info, &wb);
+
+ writeback_flush_io_done(&wb, 0);
+ wait_for_completion(&wb.complete);
+
+ bio_endio(req->bio, 0);
+}
+
+static void insitu_comp_handle_request(struct insitu_comp_req *req)
+{
+ if (req->bio->bi_rw & REQ_FLUSH)
+ insitu_comp_handle_flush_request(req);
+ else if (req->bio->bi_rw & REQ_WRITE)
+ insitu_comp_handle_write_request(req);
+ else
+ insitu_comp_handle_read_request(req);
+}
+
+static void insitu_comp_do_request_work(struct work_struct *work)
+{
+ struct insitu_comp_io_worker *worker = container_of(work,
+ struct insitu_comp_io_worker, work);
+ LIST_HEAD(list);
+ struct insitu_comp_req *req;
+ struct blk_plug plug;
+ bool repeat;
+
+ blk_start_plug(&plug);
+again:
+ spin_lock_irq(&worker->lock);
+ list_splice_init(&worker->pending, &list);
+ spin_unlock_irq(&worker->lock);
+
+ repeat = !list_empty(&list);
+ while (!list_empty(&list)) {
+ req = list_first_entry(&list, struct insitu_comp_req, sibling);
+ list_del(&req->sibling);
+
+ insitu_comp_handle_request(req);
+ }
+ if (repeat)
+ goto again;
+ blk_finish_plug(&plug);
+}
+
+static int insitu_comp_map(struct dm_target *ti, struct bio *bio)
+{
+ struct insitu_comp_info *info = ti->private;
+ struct insitu_comp_req *req;
+
+ req = dm_per_bio_data(bio, sizeof(struct insitu_comp_req));
+
+ if ((bio->bi_rw & REQ_FLUSH) &&
+ info->write_mode == INSITU_COMP_WRITE_THROUGH) {
+ bio->bi_bdev = info->dev->bdev;
+ return DM_MAPIO_REMAPPED;
+ }
+
+ req->bio = bio;
+ req->info = info;
+ atomic_set(&req->io_pending, 0);
+ INIT_LIST_HEAD(&req->all_io);
+ req->result = 0;
+ req->stage = STAGE_INIT;
+
+ req->cpu = raw_smp_processor_id();
+ insitu_comp_queue_req(info, req);
+
+ return DM_MAPIO_SUBMITTED;
+}
+
+/*
+ * INFO: uncompressed_data_size compressed_data_size metadata_size
+ * TABLE: writethrough/writeback commit_delay
+ */
+static void insitu_comp_status(struct dm_target *ti, status_type_t type,
+ unsigned status_flags, char *result, unsigned maxlen)
+{
+ struct insitu_comp_info *info = ti->private;
+ unsigned int sz = 0;
+
+ switch (type) {
+ case STATUSTYPE_INFO:
+ DMEMIT("%llu %llu %llu",
+ (unsigned long long)atomic64_read(&info->uncompressed_write_size),
+ (unsigned long long)atomic64_read(&info->compressed_write_size),
+ (unsigned long long)atomic64_read(&info->meta_write_size));
+ break;
+ case STATUSTYPE_TABLE:
+ if (info->write_mode == INSITU_COMP_WRITE_BACK)
+ DMEMIT("%s %s %d", info->dev->name, "writeback",
+ info->writeback_delay);
+ else
+ DMEMIT("%s %s", info->dev->name, "writethrough");
+ break;
+ }
+}
+
+static int insitu_comp_iterate_devices(struct dm_target *ti,
+ iterate_devices_callout_fn fn, void *data)
+{
+ struct insitu_comp_info *info = ti->private;
+
+ return fn(ti, info->dev, info->data_start,
+ info->data_blocks << INSITU_COMP_BLOCK_SECTOR_SHIFT, data);
+}
+
+static void insitu_comp_io_hints(struct dm_target *ti,
+ struct queue_limits *limits)
+{
+ /* There is no blk_limits_logical_block_size() helper, so set the fields directly */
+ limits->logical_block_size = limits->physical_block_size =
+ limits->io_min = INSITU_COMP_BLOCK_SIZE;
+ blk_limits_max_hw_sectors(limits, INSITU_COMP_MAX_SIZE >> 9);
+}
+
+static int insitu_comp_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
+ struct bio_vec *biovec, int max_size)
+{
+ /* Guarantee request can only cover one aligned 128k range */
+ return min_t(int, max_size, INSITU_COMP_MAX_SIZE - bvm->bi_size -
+ ((bvm->bi_sector << 9) % INSITU_COMP_MAX_SIZE));
+}
+
+static struct target_type insitu_comp_target = {
+ .name = "insitu_comp",
+ .version = {1, 0, 0},
+ .module = THIS_MODULE,
+ .ctr = insitu_comp_ctr,
+ .dtr = insitu_comp_dtr,
+ .map = insitu_comp_map,
+ .status = insitu_comp_status,
+ .iterate_devices = insitu_comp_iterate_devices,
+ .io_hints = insitu_comp_io_hints,
+ .merge = insitu_comp_merge,
+};
+
+static int __init insitu_comp_init(void)
+{
+ int r;
+
+ for (r = 0; r < ARRAY_SIZE(compressors); r++)
+ if (crypto_has_comp(compressors[r].name, 0, 0))
+ break;
+ if (r >= ARRAY_SIZE(compressors)) {
+ DMWARN("No crypto compressors are supported");
+ return -EINVAL;
+ }
+
+ default_compressor = r;
+
+ r = -ENOMEM;
+ insitu_comp_io_range_cachep = kmem_cache_create("insitu_comp_io_range",
+ sizeof(struct insitu_comp_io_range), 0, 0, NULL);
+ if (!insitu_comp_io_range_cachep) {
+ DMWARN("Can't create io_range cache");
+ goto err;
+ }
+
+ insitu_comp_meta_io_cachep = kmem_cache_create("insitu_comp_meta_io",
+ sizeof(struct insitu_comp_meta_io), 0, 0, NULL);
+ if (!insitu_comp_meta_io_cachep) {
+ DMWARN("Can't create meta_io cache");
+ goto err;
+ }
+
+ insitu_comp_wq = alloc_workqueue("insitu_comp_io",
+ WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
+ if (!insitu_comp_wq) {
+ DMWARN("Can't create io workqueue");
+ goto err;
+ }
+
+ r = dm_register_target(&insitu_comp_target);
+ if (r < 0) {
+ DMWARN("target registration failed");
+ goto err;
+ }
+
+ for_each_possible_cpu(r) {
+ INIT_LIST_HEAD(&insitu_comp_io_workers[r].pending);
+ spin_lock_init(&insitu_comp_io_workers[r].lock);
+ INIT_WORK(&insitu_comp_io_workers[r].work,
+ insitu_comp_do_request_work);
+ }
+ return 0;
+err:
+ if (insitu_comp_io_range_cachep)
+ kmem_cache_destroy(insitu_comp_io_range_cachep);
+ if (insitu_comp_meta_io_cachep)
+ kmem_cache_destroy(insitu_comp_meta_io_cachep);
+ if (insitu_comp_wq)
+ destroy_workqueue(insitu_comp_wq);
+
+ return r;
+}
+
+static void __exit insitu_comp_exit(void)
+{
+ dm_unregister_target(&insitu_comp_target);
+ kmem_cache_destroy(insitu_comp_io_range_cachep);
+ kmem_cache_destroy(insitu_comp_meta_io_cachep);
+ destroy_workqueue(insitu_comp_wq);
+}
+
+module_init(insitu_comp_init);
+module_exit(insitu_comp_exit);
+
+MODULE_AUTHOR("Shaohua Li <shli@kernel.org>");
+MODULE_DESCRIPTION(DM_NAME " target with insitu data compression for SSD");
+MODULE_LICENSE("GPL");
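
For context, the cap returned by the .merge hook above (insitu_comp_merge) can be checked in isolation. The sketch below is not part of the patch; it is a minimal user-space rendering of the same arithmetic, and the helper name remaining_in_window() is hypothetical, chosen only for illustration: a bio may grow only up to the end of the aligned 128k window containing its starting sector, minus what it already holds.

#include <stdint.h>
#include <stdio.h>

#define INSITU_COMP_MAX_SIZE (128 * 1024)

/*
 * Hypothetical stand-alone version of the limit computed in
 * insitu_comp_merge(): cur_size bytes are already queued in the bio, which
 * starts at byte offset (sector << 9); it may grow only to the end of the
 * aligned 128k window containing that offset.
 */
static int remaining_in_window(uint64_t sector, unsigned int cur_size)
{
	unsigned int offset_in_window = (sector << 9) % INSITU_COMP_MAX_SIZE;

	return INSITU_COMP_MAX_SIZE - cur_size - offset_in_window;
}

int main(void)
{
	/* a bio starting 4k into its 128k window, with 8k already queued */
	printf("%d\n", remaining_in_window(8, 8192)); /* prints 118784 (116k) */
	return 0;
}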
Index: linux/drivers/md/dm-insitu-comp.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/drivers/md/dm-insitu-comp.h 2014-01-20 12:28:30.937658907 +0800
@@ -0,0 +1,146 @@
+#ifndef __DM_INSITU_COMPRESSION_H__
+#define __DM_INSITU_COMPRESSION_H__
+#include <linux/types.h>
+
+#define INSITU_COMP_SUPER_MAGIC 0x106526c206506c09
+struct insitu_comp_super_block {
+ __le64 magic;
+ __le64 meta_blocks;
+ __le64 data_blocks;
+ u8 comp_alg;
+} __attribute__((packed));
+
+#define INSITU_COMP_ALG_LZO 0
+#define INSITU_COMP_ALG_ZLIB 1
+
+#ifdef __KERNEL__
+struct insitu_comp_compressor_data {
+ char *name;
+ int (*comp_len)(int comp_len);
+};
+
+static inline int lzo_comp_len(int comp_len)
+{
+ return lzo1x_worst_compress(comp_len);
+}
+
+/*
+ * The minimum logical sector size of this target is 4096 bytes, which is one
+ * block. The data of a block is compressed, and the compressed data is rounded
+ * up to 512B; this is the payload. For each block we keep 5 bits of metadata:
+ * bits 0 - 3 store the payload length in sectors (0 - 8). If the compressed
+ * payload length is 8 sectors, we just store the uncompressed data. The actual
+ * compressed data length is stored in the last 32 bits of the payload if the
+ * data is compressed. On disk, the payload is stored at the beginning of the
+ * block's logical sector. If the IO size is bigger than one block, we store
+ * the whole data as an extent; bit 4 marks the tail of an extent. The maximum
+ * allowed extent size is 128k.
+ */
+#define INSITU_COMP_BLOCK_SIZE 4096
+#define INSITU_COMP_BLOCK_SHIFT 12
+#define INSITU_COMP_BLOCK_SECTOR_SHIFT (INSITU_COMP_BLOCK_SHIFT - 9)
+
+#define INSITU_COMP_MIN_SIZE 4096
+/* Changing this requires changing HASH_LOCK_SHIFT too */
+#define INSITU_COMP_MAX_SIZE (128 * 1024)
+
+#define INSITU_COMP_LENGTH_MASK ((1 << 4) - 1)
+#define INSITU_COMP_TAIL_MASK (1 << 4)
+#define INSITU_COMP_META_BITS 5
+
+#define INSITU_COMP_META_START_SECTOR (INSITU_COMP_BLOCK_SIZE >> 9)
+
+enum INSITU_COMP_WRITE_MODE {
+ INSITU_COMP_WRITE_BACK,
+ INSITU_COMP_WRITE_THROUGH,
+};
+
+/*
+ * A request can cover one aligned 128k (4k * (1 << 5)) range. Since the
+ * maximum request size is 128k, we only need to take one lock per request.
+ */
+#define HASH_LOCK_SHIFT 5
+
+#define BITMAP_HASH_SHIFT 9
+#define BITMAP_HASH_MASK ((1 << BITMAP_HASH_SHIFT) - 1)
+#define BITMAP_HASH_LEN (1 << BITMAP_HASH_SHIFT)
+
+struct insitu_comp_hash_lock {
+ int io_running;
+ spinlock_t wait_lock;
+ struct list_head wait_list;
+};
+
+struct insitu_comp_info {
+ struct dm_target *ti;
+ struct dm_dev *dev;
+
+ int comp_alg;
+ struct crypto_comp *tfm[NR_CPUS];
+
+ sector_t data_start;
+ u64 data_blocks;
+
+ char *meta_bitmap;
+ u64 meta_bitmap_bits;
+ u64 meta_bitmap_pages;
+ struct insitu_comp_hash_lock bitmap_locks[BITMAP_HASH_LEN];
+
+ enum INSITU_COMP_WRITE_MODE write_mode;
+ unsigned int writeback_delay; /* in seconds */
+ struct task_struct *writeback_tsk;
+ struct dm_io_client *io_client;
+
+ atomic64_t compressed_write_size;
+ atomic64_t uncompressed_write_size;
+ atomic64_t meta_write_size;
+};
+
+struct insitu_comp_meta_io {
+ struct dm_io_request io_req;
+ struct dm_io_region io_region;
+ void *data;
+ void (*fn)(void *data, unsigned long error);
+};
+
+struct insitu_comp_io_range {
+ struct dm_io_request io_req;
+ struct dm_io_region io_region;
+ void *decomp_data;
+ unsigned int decomp_len;
+ void *comp_data;
+ unsigned int comp_len; /* For write, this is estimated */
+ struct list_head next;
+ struct insitu_comp_req *req;
+};
+
+enum INSITU_COMP_REQ_STAGE {
+ STAGE_INIT,
+ STAGE_READ_EXISTING,
+ STAGE_READ_DECOMP,
+ STAGE_WRITE_COMP,
+ STAGE_DONE,
+};
+
+struct insitu_comp_req {
+ struct bio *bio;
+ struct insitu_comp_info *info;
+ struct list_head sibling;
+
+ struct list_head all_io;
+ atomic_t io_pending;
+ enum INSITU_COMP_REQ_STAGE stage;
+
+ struct insitu_comp_hash_lock *lock;
+ int result;
+
+ int cpu;
+};
+
+struct insitu_comp_io_worker {
+ struct list_head pending;
+ spinlock_t lock;
+ struct work_struct work;
+};
+
+#endif /* __KERNEL__ */
+#endif /* __DM_INSITU_COMPRESSION_H__ */
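
As a sanity check of the per-block metadata layout described in the comment above, the following stand-alone sketch (not part of the patch) packs and unpacks the 5-bit metadata using the same masks; the helper names meta_pack(), meta_payload_sectors() and meta_is_tail() are hypothetical and exist only for illustration.

#include <assert.h>
#include <stdint.h>

#define INSITU_COMP_LENGTH_MASK ((1 << 4) - 1)
#define INSITU_COMP_TAIL_MASK (1 << 4)

/* Pack the payload length (in 512B sectors, 0 - 8) and the extent-tail flag. */
static uint8_t meta_pack(unsigned int payload_sectors, int tail)
{
	uint8_t m = payload_sectors & INSITU_COMP_LENGTH_MASK;

	if (tail)
		m |= INSITU_COMP_TAIL_MASK;
	return m;
}

static unsigned int meta_payload_sectors(uint8_t m)
{
	return m & INSITU_COMP_LENGTH_MASK;
}

static int meta_is_tail(uint8_t m)
{
	return !!(m & INSITU_COMP_TAIL_MASK);
}

int main(void)
{
	/* payload of 8 sectors => block stored uncompressed; tail of an extent */
	uint8_t m = meta_pack(8, 1);

	assert(meta_payload_sectors(m) == 8);
	assert(meta_is_tail(m));
	return 0;
}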