[BUG] kmemleak in cfq_get_queue+0x213

2018-09-19 Thread Jinpu Wang
Hi,

I noticed the following kmemleak warning on 4.14.65 when running blktests:
unreferenced object 0xffff88031d34f7c0 (size 240):
  comm "kworker/0:1", pid 18344, jiffies 4301519574 (age 3716.610s)
  hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 51 f9 bc 02 88 ff ff  .Q..
d0 f7 34 1d 03 88 ff ff 00 00 00 00 00 00 00 00  ..4.
  backtrace:
[] cfq_get_queue+0x213/0x530
[] cfq_set_request+0x43f/0x550
[] get_request+0xe20/0xfc0
[] blk_get_request+0xf1/0x1a0
[] scsi_execute+0x4a/0x390 [scsi_mod]
[] scsi_test_unit_ready+0x8b/0x130 [scsi_mod]
[] sd_check_events+0x1c9/0x2e0 [sd_mod]
[] disk_check_events+0x8c/0x1f0
[] process_one_work+0x350/0x8e0
[] worker_thread+0x99/0x720
[] kthread+0x1c3/0x210
[] ret_from_fork+0x3a/0x50
[] 0x

cfq_get_queue+0x213 points to line 3838 in cfq-iosched.c: cfqq =
kmem_cache_alloc_node(cfq_pool, ...

Looks like we're leaking a cfq_queue in some cases.

Has anyone seen this leak before?

Thanks!
-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss, Matthias Steinberg, Christoph Steffens


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-16 Thread Jinpu Wang
On Fri, Apr 13, 2018 at 6:59 PM, Martin K. Petersen
 wrote:
>
> Jinpu,
>
> [CC:ed the mpt3sas maintainers]
>
> The ratelimit patch is just an attempt to treat the symptom, not the
> cause.
Agreed. If we can fix the root cause, that would be great.
>
>> Thanks for asking, we updated mpt3sas driver which enables DIX support
>> (prot_mask=0x7f), all disks are SATA SSDs, no DIF support.
>> After reboot, kernel reports the IO errors from all the drives behind
>> HBA, seems for almost every read IO, which turns the system unusable:
>> [   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
>> [   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
>> [   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
>> [   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
>> [   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
>
> That sounds like a bug in the mpt3sas driver or firmware. I guess the
> HBA could conceivably be operating a SATA device as DIX Type 0 and strip
> the PI on the drive side. But that doesn't seem to be a particularly
> useful mode of operation.
>
> Jinpu: Which firmware are you running? Also, please send us the output
> of:
>
> sg_readcap -l /dev/sda
> sg_inq -x /dev/sda
> sg_vpd /dev/sda
>
Disks are INTEL SSDSC2BX48, directly attached to HBA.
LSISAS3008: FWVersion(13.00.00.00), ChipRevision(0x02), BiosVersion(08.11.00.00)
mpt3sas_cm2: Protocol=(Initiator,Target),
Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set
Full,NCQ)

jwang@x:~$ sudo sg_vpd /dev/sdz
Supported VPD pages VPD page:
  Supported VPD pages [sv]
  Unit serial number [sn]
  Device identification [di]
  Mode page policy [mpp]
  ATA information (SAT) [ai]
  Block limits (SBC) [bl]
  Block device characteristics (SBC) [bdc]
  Logical block provisioning (SBC) [lbpv]
jwang@x:~$ sudo sg_inq -x /dev/sdz
VPD INQUIRY: extended INQUIRY data page
inquiry: field in cdb illegal (page not supported)
jwang@x:~$ sudo sg_readcap -l /dev/sdz
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=1, lbprz=1
   Last logical block address=937703087 (0x37e436af), Number of
logical blocks=937703088
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block
length=4096 bytes]
   Lowest aligned logical block address=0
Hence:
   Device size: 480103981056 bytes, 457862.8 MiB, 480.10 GB


> Broadcom: How is DIX supposed to work for SATA drives behind an mpt3sas
> controller?
>
> --
> Martin K. Petersen  Oracle Linux Engineering


Thanks!

-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin


Re: [PATCH v2] block: ratelimite pr_err on IO path

2018-04-13 Thread Jinpu Wang
On Thu, Apr 12, 2018 at 11:20 PM, Martin K. Petersen
 wrote:
>
> Jack,
>
>> + pr_err_ratelimited("%s: ref tag error at 
>> location %llu (rcvd %u)\n",
>
> I'm a bit concerned about dropping records of potential data loss.
>
> Also, what are you doing that compels all these to be logged? This
> should be a very rare occurrence.
>
> --
> Martin K. Petersen  Oracle Linux Engineering
Hi Martin,

Thanks for asking. We updated the mpt3sas driver with DIX support enabled
(prot_mask=0x7f); all disks are SATA SSDs with no DIF support.
After reboot the kernel reports IO errors from all the drives behind the
HBA, seemingly for almost every read IO, which makes the system unusable:
[   13.079375] sda: ref tag error at location 0 (rcvd 143196159)
[   13.079989] sda: ref tag error at location 937702912 (rcvd 143196159)
[   13.080233] sda: ref tag error at location 937703072 (rcvd 143196159)
[   13.080407] sda: ref tag error at location 0 (rcvd 143196159)
[   13.080594] sda: ref tag error at location 8 (rcvd 143196159)
[   13.080996] sda: ref tag error at location 0 (rcvd 143196159)
[   13.089878] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.090275] sdb: ref tag error at location 937702912 (rcvd 277413887)
[   13.090448] sdb: ref tag error at location 937703072 (rcvd 143196159)
[   13.090655] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.090823] sdb: ref tag error at location 8 (rcvd 277413887)
[   13.091218] sdb: ref tag error at location 0 (rcvd 143196159)
[   13.095412] sdc: ref tag error at location 0 (rcvd 143196159)
[   13.095859] sdc: ref tag error at location 937702912 (rcvd 143196159)
[   13.096058] sdc: ref tag error at location 937703072 (rcvd 143196159)
[   13.096228] sdc: ref tag error at location 0 (rcvd 143196159)
[   13.096445] sdc: ref tag error at location 8 (rcvd 143196159)
[   13.096833] sdc: ref tag error at location 0 (rcvd 277413887)
[   13.097187] sds: ref tag error at location 0 (rcvd 277413887)
[   13.097707] sds: ref tag error at location 937702912 (rcvd 143196159)
[   13.097855] sds: ref tag error at location 937703072 (rcvd 277413887)

Kernel versions 4.15 and 4.14.28; I scanned the upstream commits but
haven't found anything relevant.
On 4.4.112 there are no such errors.
Disabling DIX support (prot_mask=0x7) in mpt3sas fixes the problem.

Regards,
-- 
Jack Wang
Linux Kernel Developer
ProfitBricks GmbH


Re: [PATCH] block: ratelimite pr_err on IO path

2018-04-12 Thread Jinpu Wang
On Wed, Apr 11, 2018 at 7:07 PM, Elliott, Robert (Persistent Memory)
 wrote:
>> -Original Message-
>> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
>> ow...@vger.kernel.org] On Behalf Of Jack Wang
>> Sent: Wednesday, April 11, 2018 8:21 AM
>> Subject: [PATCH] block: ratelimite pr_err on IO path
>>
>> From: Jack Wang 
> ...
>> - pr_err("%s: ref tag error at location %llu " \
>> -"(rcvd %u)\n", iter->disk_name,
>> -(unsigned long long)
>> -iter->seed, be32_to_cpu(pi->ref_tag));
>> + pr_err_ratelimited("%s: ref tag error at "
>> +"location %llu (rcvd %u)\n",
>
> Per process/coding-style.rst, you should keep a string like that on
> one line even if that exceeds 80 columns:
>
>   Statements longer than 80 columns will be broken into sensible chunks, 
> unless
>   exceeding 80 columns significantly increases readability and does not hide
>   information. ... However, never break user-visible strings such as
>   printk messages, because that breaks the ability to grep for them.
>
>
Thanks Robert. As the original code kept to 80 columns I just followed it;
I will fix this in v2.
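
For v2 I plan something along these lines (just a sketch, using the same
variables as the existing call site), keeping the user-visible string on a
single line:

	pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
			   iter->disk_name, (unsigned long long)iter->seed,
			   be32_to_cpu(pi->ref_tag));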


-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
On Mon, Feb 5, 2018 at 5:16 PM, Bart Van Assche <bart.vanass...@wdc.com> wrote:
> On Mon, 2018-02-05 at 09:56 +0100, Jinpu Wang wrote:
>> Hi Bart,
>>
>> My another 2 cents:)
>> On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche <bart.vanass...@wdc.com> 
>> wrote:
>> > On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> > > o Simple configuration of IBNBD:
>> > >- Server side is completely passive: volumes do not need to be
>> > >  explicitly exported.
>> >
>> > That sounds like a security hole? I think the ability to configure whether 
>> > or
>> > not an initiator is allowed to log in is essential and also which volumes 
>> > an
>> > initiator has access to.
>>
>> Our design target for well controlled production environment, so security is
>> handle in other layer. On server side, admin can set the dev_search_path in
>> module parameter to set parent directory, this will concatenate with the path
>> client send in open message to open a block device.
>
> Hello Jack,
>
> That approach may work well for your employer but sorry I don't think this is
> sufficient for an upstream driver. I think that most users who configure a
> network storage target expect full control over which storage devices are 
> exported
> and also over which clients do have and do not have access.
>
> Bart.
Hello Bart,

I agree that for general-purpose use it would be good to have better access control.

Thanks,
-- 
Jack Wang
Linux Kernel Developer


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
Hi Bart,

My another 2 cents:)
On Fri, Feb 2, 2018 at 6:05 PM, Bart Van Assche  wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> o Simple configuration of IBNBD:
>>- Server side is completely passive: volumes do not need to be
>>  explicitly exported.
>
> That sounds like a security hole? I think the ability to configure whether or
> not an initiator is allowed to log in is essential and also which volumes an
> initiator has access to.
Our design targets a well-controlled production environment, so security is
handled in another layer.
On the server side, the admin can set the dev_search_path module parameter
to define a parent directory; this is concatenated with the path the client
sends in the open message to open a block device.
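
To illustrate (a minimal sketch only, with assumed names, not the actual
IBNBD server code), the open path looks roughly like this:

	/* Sketch with assumed names: combine the dev_search_path module
	 * parameter with the path received in the client's open message.
	 */
	static char dev_search_path[256] = "/dev/ibnbd-devs";
	module_param_string(dev_search_path, dev_search_path,
			    sizeof(dev_search_path), 0444);

	static struct block_device *srv_open_device(const char *client_path)
	{
		char full_path[512];

		snprintf(full_path, sizeof(full_path), "%s/%s",
			 dev_search_path, client_path);
		return blkdev_get_by_path(full_path, FMODE_READ | FMODE_WRITE,
					  THIS_MODULE);
	}

So the client can only reach block devices under the directory the admin
configured.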


>
>>- Only IB port GID and device path needed on client side to map
>>  a block device.
>
> I think IP addressing is preferred over GID addressing in RoCE networks.
> Additionally, have you noticed that GUID configuration support has been added
> to the upstream ib_srpt driver? Using GIDs has a very important disadvantage,
> namely that at least in IB networks the prefix will change if the subnet
> manager is reconfigured. Additionally, in IB networks it may happen that the
> target driver is loaded and configured before the GID has been assigned to
> all RDMA ports.
>
> Thanks,
>
> Bart.

Sorry, the above description is not accurate: IBNBD/IBTRS supports
GID/IPv4/IPv6 addressing.
We will adjust this in the next post.

Thanks,
-- 
Jack Wang
Linux Kernel Developer


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-05 Thread Jinpu Wang
On Fri, Feb 2, 2018 at 5:40 PM, Doug Ledford  wrote:
> On Fri, 2018-02-02 at 16:07 +, Bart Van Assche wrote:
>> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
>> > Since the first version the following was changed:
>> >
>> >- Load-balancing and IO fail-over using multipath features were added.
>> >- Major parts of the code were rewritten, simplified and overall code
>> >  size was reduced by a quarter.
>>
>> That is interesting to know, but what happened to the feedback that Sagi and
>> I provided on v1? Has that feedback been addressed? See also
>> https://www.spinics.net/lists/linux-rdma/msg47819.html and
>> https://www.spinics.net/lists/linux-rdma/msg47879.html.
>>
>> Regarding multipath support: there are already two multipath implementations
>> upstream (dm-mpath and the multipath implementation in the NVMe initiator).
>> I'm not sure we want a third multipath implementation in the Linux kernel.
>
> There's more than that.  There was also md-multipath, and smc-r includes
> another version of multipath, plus I assume we support mptcp as well.
>
> But, to be fair, the different multipaths in this list serve different
> purposes and I'm not sure they could all be generalized out and served
> by a single multipath code.  Although, fortunately, md-multipath is
> deprecated, so no need to worry about it, and it is only dm-multipath
> and nvme multipath that deal directly with block devices and assume
> block semantics.  If I read the cover letter right (and I haven't dug
> into the code to confirm this), the ibtrs multipath has much more in
> common with smc-r multipath, where it doesn't really assume a block
> layer device sits on top of it, it's more of a pure network multipath,
> which the implementation of smc-r is and mptcp would be too.  I would
> like to see a core RDMA multipath implementation soon that would
> abstract out some of these multipath tasks, at least across RDMA links,
> and that didn't have the current limitations (smc-r only supports RoCE
> links, and it sounds like ibtrs only supports IB like links, but maybe
> I'm wrong there, I haven't looked at the patches yet).
Hi Doug, hi Bart,

Thanks for your valuable input, here is my 2 cents:

IBTRS multipath is indeed a network multipath, with a sysfs interface to
add/remove paths dynamically.
IBTRS is built on rdma-cm, so it is expected to support RoCE and iWARP as
well, but we mainly tested in an IB environment and only lightly on RXE.
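
To make this concrete (a sketch only, not IBTRS code; handler and context
names are assumed), the connect path is plain rdma_cm, so it takes an
IPv4/IPv6 sockaddr and works the same over IB, RoCE or iWARP devices:

	struct rdma_cm_id *cm_id;
	struct sockaddr_in dst = {
		.sin_family = AF_INET,
		.sin_port   = htons(1234),	/* assumed service port */
		/* .sin_addr filled from the configured server address */
	};

	cm_id = rdma_create_id(&init_net, con_event_handler, con,
			       RDMA_PS_TCP, IB_QPT_RC);
	if (!IS_ERR(cm_id))
		rdma_resolve_addr(cm_id, NULL, (struct sockaddr *)&dst, 2000);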


Regards,
-- 
Jack Wang
Linux Kernel Developer


Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)

2017-03-27 Thread Jinpu Wang
Hi Sagi,

On Mon, Mar 27, 2017 at 4:20 AM, Sagi Grimberg  wrote:
>
>> This series introduces IBNBD/IBTRS kernel modules.
>>
>> IBNBD (InfiniBand network block device) allows for an RDMA transfer of
>> block IO
>> over InfiniBand network. The driver presents itself as a block device on
>> client
>> side and transmits the block requests in a zero-copy fashion to the
>> server-side
>> via InfiniBand. The server part of the driver converts the incoming
>> buffers back
>> into BIOs and hands them down to the underlying block device. As soon as
>> IO
>> responses come back from the drive, they are being transmitted back to the
>> client.
>
>
> Hi Jack, Danil and Roman,
>
> I met Danil and Roman last week at Vault, and I think you guys
> are awesome, thanks a lot for open-sourcing your work! However,
> I have a couple of issues here, some are related to the code and
> some are fundamental actually.

Thanks for the comments and suggestions; replies are inline.

>
> - Is there room for this ibnbd? If we were to take every block driver
>   that was submitted without sufficient justification, it'd be very
>   hard to maintain. What advantage (if any) does this buy anyone over
>   existing rdma based protocols (srp, iser, nvmf)? I'm really (*really*)
>   not sold on this one...
>
> - To me, the fundamental design that the client side owns a pool of
>   buffers that it issues writes too, seems inferior than the
>   one taken in iser/nvmf (immediate data). IMO, the ibnbd design has
>   scalability issues both in terms of server side resources, client
>   side contention and network congestion (on infiniband the latter is
>   less severe).
>
> - I suggest that for your next post, you provide a real-life use-case
>   where each of the existing drivers can't suffice, and by can't
>   suffice I mean that it has a fundamental issue with it, not something
>   that requires a fix. With that our feedback can be much more concrete
>   and (at least on my behalf) more open to accept it.
>
> - I'm not exactly sure why you would suggest that your implementation
>   supports only infiniband if you use rdma_cm for address resolution,
>   nor I understand why you emphasize feature (2) below, nor why even
>   in the presence of rdma_cm you have ibtrs_ib_path? (confused...)
>   iWARP needs a bit more attention if you don't use the new generic
>   interfaces though...
That reminds me: we also tested with the rxe driver in the past, but not
with iWARP. It might work.
ibtrs_ib_path is a leftover from the APM feature we use in house; we will
remove it in the next round.

>
> - I honestly do not understand why you need *19057* LOC to implement
>   a rdma based block driver. Thats almost larger than all of our
>   existing block drivers combined... First glance at the code provides
>   some explanations, (1) you have some strange code that has no business
>   in a block driver like ibtrs_malloc/ibtrs_zalloc (yikes) or
>   open-coding various existing logging routines, (2) you are for some
>   reason adding a second tag allocation scheme (why?), (3) you are open
>   coding a lot of stuff that we added to the stack in the past months...
>   (4) you seem to over-layer your code for reasons that I do not
>   really understand. And I didn't really look deep at all into the
>   code, just to get the feel of it, and it seems like it needs a lot
>   of work before it can even be considered upstream ready.
Agreed, we will clean up the code further; that's why I sent it as an RFC,
to get early feedback.


>
>> We design and implement this solution based on our need for Cloud
>> Computing,
>> the key features are:
>> - High throughput and low latency due to:
>> 1) Only two rdma messages per IO
>
>
> Where exactly did you witnessed latency that was meaningful by having
> another rdma message on the wire? That's only for writes, anyway, and
> we have first data bursts for that..
Clearly, we need to benchmark on the latest kernel.

>
>> 2) Simplified client side server memory management
>> 3) Eliminated SCSI sublayer
>
>
> That's hardly an advantage given all we are losing without it...
>
> ...
>
> Cheers,
> Sagi.

Thanks,
-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss


Re: [PATCH 01/28] ibtrs: add header shared between ibtrs_client and ibtrs_server

2017-03-24 Thread Jinpu Wang
On Fri, Mar 24, 2017 at 3:31 PM, Johannes Thumshirn <jthumsh...@suse.de> wrote:
> On Fri, Mar 24, 2017 at 01:54:04PM +0100, Jinpu Wang wrote:
>> >> +
>> >> +#define XX(a) case (a): return #a
>> >
>> > please no macros with retun in them and XX isn't quite too descriptive as
>> > well.
>> >
>> > [...]
>> >
>> >> +static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
>> >> +{
>> >> + switch (opcode) {
>> >> + XX(IB_WC_SEND);
>> >> + XX(IB_WC_RDMA_WRITE);
>> >> + XX(IB_WC_RDMA_READ);
>> >> + XX(IB_WC_COMP_SWAP);
>> >> + XX(IB_WC_FETCH_ADD);
>> >> + /* recv-side); inbound completion */
>> >> + XX(IB_WC_RECV);
>> >> + XX(IB_WC_RECV_RDMA_WITH_IMM);
>> >> + default: return "IB_WC_OPCODE_UNKNOWN";
>> >> + }
>> >> +}
>> >
>> > How about:
>> >
>> > struct {
>> > char *name;
>> > enum ib_wc_opcode opcode;
>> > } ib_wc_opcode_table[] = {
>> > { stringyfy(IB_WC_SEND), IB_WC_SEND },
>> > { stringyfy(IB_WC_RDMA_WRITE), IB_WC_RDMA_WRITE },
>> > { stringyfy(IB_WC_RDMA_READ ), IB_WC_RDMA_READ }
>> > { stringyfy(IB_WC_COMP_SWAP), IB_WC_COMP_SWAP },
>> > { stringyfy(IB_WC_FETCH_ADD), IB_WC_FETCH_ADD },
>> > { stringyfy(IB_WC_RECV), IB_WC_RECV },
>> > { stringyfy(IB_WC_RECV_RDMA_WITH_IMM), IB_WC_RECV_RDMA_WITH_IMM },
>> > { NULL, 0 },
>> > };
>> >
>> > static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
>> > {
>> > int i;
>> >
>> > for (i = 0; i < ARRAY_SIZE(ib_wc_opcode_table); i++)
>> > if (ib_wc_opcode_table[i].opcode == opcode)
>> > return ib_wc_opcode_table[i].name;
>> >
>> > return "IB_WC_OPCODE_UNKNOWN";
>> > }
>> >
>> Looks nice, might be better to put it into ib_verbs.h?
>
> Probably yes, as are your kvec functions for lib/iov_iter.c
Thanks, will do in next round!

>
> [...]
>
>> > What about resolving the kernel bug instead of making workarounds?
>> I tried to send a patch upsteam, but was rejected by Sean.
>> http://www.spinics.net/lists/linux-rdma/msg22381.html
>>
>
> I don't see a NACK in this thread.
>
> From http://www.spinics.net/lists/linux-rdma/msg22410.html:
> "The port space (which maps to the service ID) needs to be included as part of
> the check that determines the format of the private data, and not simply the
> address family."
>
> After such a state I would have expected to see a v2 of the patch with above
> comment addressed.
I may have been busy with other stuff at that time; I will check again and
revisit the bug.

>
> Byte,
> Johannes
> --
> Johannes Thumshirn  Storage
> jthumsh...@suse.de+49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Regards,
-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss


Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)

2017-03-24 Thread Jinpu Wang
On Fri, Mar 24, 2017 at 2:31 PM, Bart Van Assche
<bart.vanass...@sandisk.com> wrote:
> On Fri, 2017-03-24 at 13:46 +0100, Jinpu Wang wrote:
>> Our IBNBD project was started 3 years ago based on our need for Cloud
>> Computing, NVMeOF is a bit younger.
>> - IBNBD is one of our components, part of our software defined storage 
>> solution.
>> - As I listed in features, IBNBD has it's own features
>>
>> We're planning to look more into NVMeOF, but it's not a replacement for 
>> IBNBD.
>
> Hello Jack, Danil and Roman,
>
> Thanks for having taken the time to open source this work and to travel to
> Boston to present this work at the Vault conference. However, my
> understanding of IBNBD is that this driver has several shortcomings neither
> NVMeOF nor iSER nor SRP have:
> * Doesn't scale in terms of number of CPUs submitting I/O. The graphs shown
>   during the Vault talk clearly illustrate this. This is probably the result
>   of sharing a data structure across all client CPUs, maybe the bitmap that
>   tracks which parts of the target buffer space are in use.
> * Supports IB but none of the other RDMA transports (RoCE / iWARP).
>
> We also need performance numbers that compare IBNBD against SRP and/or
> NVMeOF with memory registration disabled to see whether and how much faster
> IBNBD is compared to these two protocols.
>
> The fact that IBNBD only needs two messages per I/O is an advantage it has
> today over SRP but not over NVMeOF nor over iSER. The upstream initiator
> drivers for the latter two protocols already support inline data.
>
> Another question I have is whether integration with multipathd is supported?
> If multipathd tries to run scsi_id against an IBNBD client device that will
> fail.
>
> Thanks,
>
> Bart.
Hello Bart,

Thanks for your comments. As usual the in-house driver mainly covers the
needs of ProfitBricks: we only tested in our own hardware environment, and
we only use IB, not RoCE/iWARP. The idea behind open-sourcing it is:
- Present our design/implementation/tradeoffs; others might be interested.
- Attract more attention from developers/testers, so we can improve the
project better and faster.

We will gather performance data comparing against NVMeOF in the next
submission.

Multipathd is not supported; we're using APM for failover (a patch from
Mellanox developers).

Thanks,
-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss


Re: [PATCH 01/28] ibtrs: add header shared between ibtrs_client and ibtrs_server

2017-03-24 Thread Jinpu Wang
>> +
>> +#define XX(a) case (a): return #a
>
> please no macros with retun in them and XX isn't quite too descriptive as
> well.
>
> [...]
>
>> +static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
>> +{
>> + switch (opcode) {
>> + XX(IB_WC_SEND);
>> + XX(IB_WC_RDMA_WRITE);
>> + XX(IB_WC_RDMA_READ);
>> + XX(IB_WC_COMP_SWAP);
>> + XX(IB_WC_FETCH_ADD);
>> + /* recv-side); inbound completion */
>> + XX(IB_WC_RECV);
>> + XX(IB_WC_RECV_RDMA_WITH_IMM);
>> + default: return "IB_WC_OPCODE_UNKNOWN";
>> + }
>> +}
>
> How about:
>
> struct {
> char *name;
> enum ib_wc_opcode opcode;
> } ib_wc_opcode_table[] = {
> { stringyfy(IB_WC_SEND), IB_WC_SEND },
> { stringyfy(IB_WC_RDMA_WRITE), IB_WC_RDMA_WRITE },
> { stringyfy(IB_WC_RDMA_READ ), IB_WC_RDMA_READ }
> { stringyfy(IB_WC_COMP_SWAP), IB_WC_COMP_SWAP },
> { stringyfy(IB_WC_FETCH_ADD), IB_WC_FETCH_ADD },
> { stringyfy(IB_WC_RECV), IB_WC_RECV },
> { stringyfy(IB_WC_RECV_RDMA_WITH_IMM), IB_WC_RECV_RDMA_WITH_IMM },
> { NULL, 0 },
> };
>
> static inline const char *ib_wc_opcode_str(enum ib_wc_opcode opcode)
> {
> int i;
>
> for (i = 0; i < ARRAY_SIZE(ib_wc_opcode_table); i++)
> if (ib_wc_opcode_table[i].opcode == opcode)
> return ib_wc_opcode_table[i].name;
>
> return "IB_WC_OPCODE_UNKNOWN";
> }
>
Looks nice, might be better to put it into ib_verbs.h?

>
> [...]
>
>> +/**
>> + * struct ibtrs_msg_hdr - Common header of all IBTRS messages
>> + * @type:Message type, valid values see: enum ibtrs_msg_types
>> + * @tsize:   Total size of transferred data
>> + *
>> + * Don't move the first 8 padding bytes! It's a workaround for a kernel bug.
>> + * See IBNBD-610 for details
>
> What about resolving the kernel bug instead of making workarounds?
I tried to send a patch upstream, but it was rejected by Sean:
http://www.spinics.net/lists/linux-rdma/msg22381.html

>
>> + *
>> + * DO NOT CHANGE!
>> + */
>> +struct ibtrs_msg_hdr {
>> + u8  __padding1;
>> + u8  type;
>> + u16 __padding2;
>> + u32 tsize;
>> +};
>
> [...]
>
> --
> Johannes Thumshirn  Storage
> jthumsh...@suse.de+49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Thanks, Johannes, for the review.


-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss


Re: [RFC PATCH 00/28] INFINIBAND NETWORK BLOCK DEVICE (IBNBD)

2017-03-24 Thread Jinpu Wang
On Fri, Mar 24, 2017 at 1:15 PM, Johannes Thumshirn  wrote:
> On Fri, Mar 24, 2017 at 11:45:15AM +0100, Jack Wang wrote:
>> From: Jack Wang 
>>
>> This series introduces IBNBD/IBTRS kernel modules.
>>
>> IBNBD (InfiniBand network block device) allows for an RDMA transfer of block 
>> IO
>> over InfiniBand network. The driver presents itself as a block device on 
>> client
>> side and transmits the block requests in a zero-copy fashion to the 
>> server-side
>> via InfiniBand. The server part of the driver converts the incoming buffers 
>> back
>> into BIOs and hands them down to the underlying block device. As soon as IO
>> responses come back from the drive, they are being transmitted back to the
>> client.
>>
>> We design and implement this solution based on our need for Cloud Computing,
>> the key features are:
>> - High throughput and low latency due to:
>> 1) Only two rdma messages per IO
>> 2) Simplified client side server memory management
>> 3) Eliminated SCSI sublayer
>> - Simple configuration and handling
>> 1) Server side is completely passive: volumes do not need to be
>> explicitly exported
>> 2) Only IB port GID and device path needed on client side to map
>> a block device
>> 3) A device can be remapped automatically i.e. after storage
>> reboot
>> - Pinning of IO-related processing to the CPU of the producer
>>
>> For usage please refer to Documentation/IBNBD.txt in later patch.
>> My colleague Danil Kipnis presents IBNBD at Vault 2017 about our
>> design/feature/tradeoff/performance:
>>
>> http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf
>>
>
> Hi Jack,
>
> Sorry to ask (I haven't attented the Vault presentation) but why can't you use
> NVMe over Fabrics in your environment? From what I see in your presentation
> and cover letter, it provides all you need and is in fact a standard Linux and
> Windows already have implemented.
>
> Thanks,
> Johannes
> --
> Johannes Thumshirn  Storage
> jthumsh...@suse.de+49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

Hi Johannes,

Our IBNBD project was started 3 years ago based on our need for Cloud
Computing; NVMeOF is a bit younger.
- IBNBD is one of our components, part of our software-defined storage solution.
- As I listed in the features, IBNBD has its own features.

We're planning to look more into NVMeOF, but it's not a replacement for IBNBD.

Thanks,
-- 
Jack Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008  042
Fax:  +49 30 577 008 299
Email:jinpu.w...@profitbricks.com
URL:  https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss


Re: [BUG] Deadlock in blk_mq_register_disk error path

2016-08-16 Thread Jinpu Wang
On Mon, Aug 15, 2016 at 6:22 PM, Bart Van Assche
<bart.vanass...@sandisk.com> wrote:
> On 08/15/2016 09:01 AM, Jinpu Wang wrote:
>>
>> It's more likely you hit another bug, my colleague Roman fix that:
>>
>> http://www.spinics.net/lists/linux-block/msg04552.html
>
>
> Hello Jinpu,
>
> Interesting. However, I see that wrote the following: "Firstly this wrong
> sequence raises two kernel warnings: 1st. WARNING at
> lib/percpu-recount.c:309 percpu_ref_kill_and_confirm called more than once
> 2nd. WARNING at lib/percpu-refcount.c:331". I haven't seen any of these
> kernel warnings ...
>
> Thanks,
>
> Bart.
>

The warnings happen from time to time, but your hung tasks look similar to
ours. We injected some delay in order to reproduce the issue more easily.


-- 
Mit freundlichen Grüßen,
Best Regards,

Jack Wang

Linux Kernel Developer Storage
ProfitBricks GmbH  The IaaS-Company.

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 5770083-42
Fax: +49 30 5770085-98
Email: jinpu.w...@profitbricks.com
URL: http://www.profitbricks.de

Sitz der Gesellschaft: Berlin.
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B.
Geschäftsführer: Andreas Gauger, Achim Weiss.


Re: Re: [BUG] Deadlock in blk_mq_register_disk error path

2016-08-15 Thread Jinpu Wang
Hi Bart,

>>
>> Nope, your analysis looks correct. This should fix it:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus=6316338a94b2319abe9d3790eb9cdc56ef81ac1a
>
> Hi Jens,
>
> Will that patch be included in stable kernels? I just encountered a
> deadlock with kernel v4.7 that looks similar.
>
> Thank you,
>
> Bart.
>
> INFO: task kworker/u64:6:136 blocked for more than 480 seconds.
>   Tainted: GW   4.7.0-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/u64:6   D 88016f677bb0 0   136  2 0x
> Workqueue: events_unbound async_run_entry_fn
> Call Trace:
>  [] schedule+0x37/0x90
>  [] schedule_preempt_disabled+0x10/0x20
>  [] mutex_lock_nested+0x144/0x350
>  [] blk_mq_disable_hotplug+0x12/0x20
>  [] blk_mq_register_disk+0x29/0x120
>  [] blk_register_queue+0xb6/0x160
>  [] add_disk+0x219/0x4a0
>  [] sd_probe_async+0x100/0x1b0
>  [] async_run_entry_fn+0x45/0x140
>  [] process_one_work+0x1f9/0x6a0
>  [] worker_thread+0x49/0x490
>  [] kthread+0xea/0x100
>  [] ret_from_fork+0x1f/0x40
> 3 locks held by kworker/u64:6/136:
>  #0:  ("events_unbound"){.+.+.+}, at: [] 
> process_one_work+0x17a/0x6a0
>  #1:  ((>work)){+.+.+.}, at: [] 
> process_one_work+0x17a/0x6a0
>  #2:  (all_q_mutex){+.+.+.}, at: [] 
> blk_mq_disable_hotplug+0x12/0x20
> INFO: task 02:8101 blocked for more than 480 seconds.
>   Tainted: GW   4.7.0-dbg+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 02  D 88039b747968 0  8101  1 0x0004
> Call Trace:
>  [] schedule+0x37/0x90
>  [] blk_mq_freeze_queue_wait+0x51/0xb0
>  [] blk_mq_update_tag_set_depth+0x3a/0xb0
>  [] blk_mq_init_allocated_queue+0x432/0x450
>  [] blk_mq_init_queue+0x35/0x60
>  [] scsi_mq_alloc_queue+0x17/0x50
>  [] scsi_alloc_sdev+0x2b9/0x350
>  [] scsi_probe_and_add_lun+0x98b/0xe50
>  [] __scsi_scan_target+0x5ca/0x6b0
>  [] scsi_scan_target+0xe1/0xf0
>  [] srp_create_target+0xf06/0x13d4 [ib_srp]
>  [] dev_attr_store+0x13/0x20
>  [] sysfs_kf_write+0x40/0x50
>  [] kernfs_fop_write+0x137/0x1c0
>  [] __vfs_write+0x23/0x140
>  [] vfs_write+0xb0/0x190
>  [] SyS_write+0x44/0xa0
>  [] entry_SYSCALL_64_fastpath+0x18/0xa8
> 8 locks held by 02/8101:
>  #0:  (sb_writers#4){.+.+.+}, at: [] 
> __sb_start_write+0xb2/0xf0
>  #1:  (>mutex){+.+.+.}, at: [] 
> kernfs_fop_write+0x101/0x1c0
>  #2:  (s_active#363){.+.+.+}, at: [] 
> kernfs_fop_write+0x10a/0x1c0
>  #3:  (>add_target_mutex){+.+.+.}, at: [] 
> srp_create_target+0x134/0x13d4 [ib_srp]
>  #4:  (>scan_mutex){+.+.+.}, at: [] 
> scsi_scan_target+0x8d/0xf0
>  #5:  (cpu_hotplug.lock){++}, at: [] 
> get_online_cpus+0x2d/0x80
>  #6:  (all_q_mutex){+.+.+.}, at: [] 
> blk_mq_init_allocated_queue+0x34a/0x450
>  #7:  (>tag_list_lock){+.+...}, at: [] 
> blk_mq_init_allocated_queue+0x37a/0x450
>

It's more likely you hit another bug; my colleague Roman has a fix for it:

http://www.spinics.net/lists/linux-block/msg04552.html

It would be great if you could test it and see if it works for you!

-- 
Mit freundlichen Grüßen,
Best Regards,

Jack Wang

Linux Kernel Developer Storage
ProfitBricks GmbH  The IaaS-Company.

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin
Tel: +49 30 5770083-42
Fax: +49 30 5770085-98
Email: jinpu.w...@profitbricks.com
URL: http://www.profitbricks.de

Sitz der Gesellschaft: Berlin.
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B.
Geschäftsführer: Andreas Gauger, Achim Weiss.


[BUG] Deadlock in blk_mq_register_disk error path

2016-08-02 Thread Jinpu Wang
Hi Jens,

I found that in blk_mq_register_disk() we call blk_mq_disable_hotplug(),
which in turn takes mutex_lock(&all_q_mutex);

	queue_for_each_hw_ctx(q, hctx, i) {
		ret = blk_mq_register_hctx(hctx);
		if (ret)
			break;	/* on error we fall through to the unregister below */
	}

	if (ret)
		blk_mq_unregister_disk(disk);

In blk_mq_unregister_disk() we try to disable hotplug again, which leads to
a deadlock.

Did I miss anything?
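
One way I could imagine breaking the cycle (a sketch only, with assumed
helper names, not a tested patch) is to split the unlocked teardown out, so
the error path can call it while the hotplug lock is already held:

	/* does the hctx/ctx kobject teardown without touching all_q_mutex */
	static void __blk_mq_unregister_disk(struct gendisk *disk)
	{
		/* unregister hctx/ctx kobjects here, no locking */
	}

	void blk_mq_unregister_disk(struct gendisk *disk)
	{
		blk_mq_disable_hotplug();
		__blk_mq_unregister_disk(disk);
		blk_mq_enable_hotplug();
	}

and have the error path in blk_mq_register_disk() call
__blk_mq_unregister_disk() directly.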

-- 
Mit freundlichen Grüßen,
Best Regards,

Jack Wang

Linux Kernel Developer Storage
ProfitBricks GmbH  The IaaS-Company.