Re: [PATCH 1/2] Convert target drivers to use sbitmap

2018-05-15 Thread Jens Axboe
On 5/15/18 10:11 AM, Jens Axboe wrote:
> On 5/15/18 10:00 AM, Matthew Wilcox wrote:
>> From: Matthew Wilcox <mawil...@microsoft.com>
>>
>> The sbitmap and the percpu_ida perform essentially the same task,
>> allocating tags for commands.  Since the sbitmap is more used than
>> the percpu_ida, convert the percpu_ida users to the sbitmap API.
> 
> It should also be the same performance as percpu_ida in light use, and
> performs much better at > 50% utilization of the tag space. I think
> that's better justification than "more used than".

Had to search long and hard for the perf numbers I did for percpu_ida
on higher utilization, but here they are:

https://lkml.org/lkml/2014/4/22/553

-- 
Jens Axboe



Re: [PATCH 1/2] Convert target drivers to use sbitmap

2018-05-15 Thread Jens Axboe
On 5/15/18 10:00 AM, Matthew Wilcox wrote:
> From: Matthew Wilcox <mawil...@microsoft.com>
> 
> The sbitmap and the percpu_ida perform essentially the same task,
> allocating tags for commands.  Since the sbitmap is more used than
> the percpu_ida, convert the percpu_ida users to the sbitmap API.

It should also be the same performance as percpu_ida in light use, and
performs much better at > 50% utilization of the tag space. I think
that's better justification than "more used than".

> diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
> index 4435bf374d2d..28bcffae609f 100644
> --- a/drivers/target/iscsi/iscsi_target_util.c
> +++ b/drivers/target/iscsi/iscsi_target_util.c
> @@ -17,7 +17,7 @@
>   ******************************************************************************/
>  
>  #include 
> -#include <linux/percpu_ida.h>
> +#include <linux/sbitmap.h>
>  #include <net/ipv6.h> /* ipv6_addr_equal() */
>  #include 
>  #include 
> @@ -147,6 +147,28 @@ void iscsit_free_r2ts_from_list(struct iscsi_cmd *cmd)
>  	spin_unlock_bh(&cmd->r2t_lock);
>  }
>  
> +int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
> +{
> +	int tag = -1;
> +	DEFINE_WAIT(wait);
> +	struct sbq_wait_state *ws;
> +
> +	if (state == TASK_RUNNING)
> +		return tag;
> +
> +	ws = &se_sess->sess_tag_pool.ws[0];
> +	for (;;) {
> +		prepare_to_wait_exclusive(&ws->wait, &wait, state);
> +		if (signal_pending_state(state, current))
> +			break;
> +		schedule();
> +		tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
> +	}
> +
> +	finish_wait(&ws->wait, &wait);
> +	return tag;
> +}

Seems like that should be:


	ws = &se_sess->sess_tag_pool.ws[0];
	for (;;) {
		prepare_to_wait_exclusive(&ws->wait, &wait, state);
		if (signal_pending_state(state, current))
			break;
		tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
		if (tag != -1)
			break;
		schedule();
	}

	finish_wait(&ws->wait, &wait);
	return tag;

>  /*
>   * May be called from software interrupt (timer) context for allocating
>   * iSCSI NopINs.
> @@ -155,9 +177,11 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, int state)
>  {
>  	struct iscsi_cmd *cmd;
>  	struct se_session *se_sess = conn->sess->se_sess;
> -	int size, tag;
> +	int size, tag, cpu;
>  
> -	tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
> +	tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
> +	if (tag < 0)
> +		tag = iscsit_wait_for_tag(se_sess, state, &cpu);
>  	if (tag < 0)
>  		return NULL;

Might make sense to just roll the whole thing into iscsi_get_tag(), that
would be cleaner.

sbitmap should provide a helper for that, but we can do that cleanup
later. That would encapsulate things like the per-cpu caching hint too,
for instance.
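
Purely as an illustration of that suggestion (the helper name and shape are
hypothetical, not part of the posted patch):

	/*
	 * Sketch of the suggested consolidation: try the fast path first,
	 * fall back to the sleeping path only when the pool is exhausted.
	 */
	static int iscsit_get_tag(struct se_session *se_sess, int state, int *cpup)
	{
		int tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);

		if (tag < 0)
			tag = iscsit_wait_for_tag(se_sess, state, cpup);
		return tag;
	}

iscsit_allocate_cmd() would then be left with a single call.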

Rest looks fine to me.

-- 
Jens Axboe



Re: [PATCH 2/5] ide: kill ide_toggle_bounce

2018-04-26 Thread Jens Axboe
On 4/26/18 1:20 AM, Christoph Hellwig wrote:
> On Tue, Apr 24, 2018 at 08:02:56PM -0600, Jens Axboe wrote:
>> On 4/24/18 12:16 PM, Christoph Hellwig wrote:
>>> ide_toggle_bounce did select various strange block bounce limits, including
>>> not bouncing at all as soon as an iommu is present in the system.  Given
>>> that the dma_map routines now handle any required bounce buffering except
>>> for ISA DMA, and the ide code already must handle either ISA DMA or highmem
>>> at least for iommu equipped systems we can get rid of the block layer
>>> bounce limit setting entirely.
>>
>> Pretty sure I was the one to add this code, when highmem page IO was
>> enabled about two decades ago...
>>
>> Outside of DMA, the issue was that the PIO code could not handle
>> highmem. That's not the case anymore, so this should be fine.
> 
> Yes, that is the rationale.  Any chance to you could look over the
> other patches as well?  Except for the networking one for which I'd
> really like to see a review from Dave all the users of the interface
> are block related.

You can add my reviewed-by to 1-3, and 5. Looks good to me.

-- 
Jens Axboe



Re: [PATCH 2/5] ide: kill ide_toggle_bounce

2018-04-24 Thread Jens Axboe
On 4/24/18 12:16 PM, Christoph Hellwig wrote:
> ide_toggle_bounce did select various strange block bounce limits, including
> not bouncing at all as soon as an iommu is present in the system.  Given
> that the dma_map routines now handle any required bounce buffering except
> for ISA DMA, and the ide code already must handle either ISA DMA or highmem
> at least for iommu equipped systems we can get rid of the block layer
> bounce limit setting entirely.

Pretty sure I was the one to add this code, when highmem page IO was
enabled about two decades ago...

Outside of DMA, the issue was that the PIO code could not handle
highmem. That's not the case anymore, so this should be fine.

Reviewed-by: Jens Axboe <ax...@kernel.dk>

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 02:23 PM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
>>> On Thu, 9 Nov 2017, Jens Axboe wrote:
>>>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>>>> If that's the attitude at your end, then I do suggest we just revert the
>>>> driver changes. Clearly this isn't going to be productive going forward.
>>>>
>>>> The better solution was to make the managed setup more flexible, but
>>>> it doesn't sound like that is going to be viable at all.
>>>
>>> That's not true. I indicated several times, that we can do that, but not
>>> just by breaking the managed facility.
>>>
>>> What I'm arguing against is that the blame is focused on those who
>>> implemented the managed facility with the existing semantics.
>>>
>>> I'm still waiting for a proper description of what needs to be changed in
>>> order to make these drivers work again. All I have seen so far is to break
>>> managed interrupts completely and that's not going to happen.
>>
>> There's no blame as far as I'm concerned, just frustration that we broke
>> this and folks apparently not acknowledging that it's a concern.
>>
>> What used to work was that you could move IRQs around as you wanted to.
>> That was very useful for custom setups, for tuning, or for isolation
>> purposes. None of that works now, which is unfortunate.
> 
> Well, its unfortunate and I can understand your frustration, but I really
> have a hard time to understand that these concerns have not been brought up
> when the whole thing was discussed and in the early stages of
> implementation way before it got merged.
> 
> It was not my decision to make it the way it is and I merily try to prevent
> hasty and damaging hackery right now.

There's no rush, we can take our time to get it right. We won't get this
fixed for 4.14 anyway.

On my end, I don't think the limitation of removing user tweakability was
made clear enough. I'm all for moving common code into helpers and
frameworks, but this currently has limitations that aren't acceptable
imho. This is partly my fault, I should have realized this was the case.

As long as we can move forward in a productive fashion, then that's fine
with me.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 10:07 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
> 
>> On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>>>> Now you try to blame the people who implemented the managed affinity stuff
>>>> for the wreckage, which was created by people who changed drivers to use
>>>> it. Nice try.
>>>
>>> I'm not trying to blame anyone, really. I was just trying to understand
>>> how to move forward with making users happy and still enjoy subsystem
>>> services instead of doing lots of similar things inside mlx5 driver.
>>
>> Exactly. The key here is how we make it work for both cases. But there
>> has to be a willingness to make the infrastructure work for that.
> 
> I say it one last time: It can be done and I'm willing to help.

It didn't sound like it earlier, but that's good news.

> Please tell me what exactly you expect and I can have a look what needs to
> be done w/o creating an utter mess.

See the previous email, should have all the details.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Jens Axboe wrote:
>> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
>> If that's the attitude at your end, then I do suggest we just revert the
>> driver changes. Clearly this isn't going to be productive going forward.
>>
>> The better solution was to make the managed setup more flexible, but
>> it doesn't sound like that is going to be viable at all.
> 
> That's not true. I indicated several times, that we can do that, but not
> just by breaking the managed facility.
> 
> What I'm arguing against is that the blame is focused on those who
> implemented the managed facility with the existing semantics.
> 
> I'm still waiting for a proper description of what needs to be changed in
> order to make these drivers work again. All I have seen so far is to break
> managed interrupts completely and that's not going to happen.

There's no blame as far as I'm concerned, just frustration that we broke
this and folks apparently not acknowledging that it's a concern.

What used to work was that you could move IRQs around as you wanted to.
That was very useful for custom setups, for tuning, or for isolation
purposes. None of that works now, which is unfortunate.

As a bare minimum, being able to specify a static mask at load time
would be useful. It's still not nearly as useful as it was before
where you could move them at runtime, but at least it's better than
what we have right now.
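
To make the static-mask-at-load-time idea concrete, a rough sketch of a
driver-private knob (all names hypothetical; cpulist_parse(), pci_irq_vector()
and irq_set_affinity_hint() are the existing interfaces it leans on, and it
only applies to non-managed vectors):

	#include <linux/cpumask.h>
	#include <linux/gfp.h>
	#include <linux/interrupt.h>
	#include <linux/module.h>
	#include <linux/pci.h>

	/* Hypothetical module parameter, e.g. irq_cpus="2-5,8". */
	static char *irq_cpus;
	module_param(irq_cpus, charp, 0444);
	MODULE_PARM_DESC(irq_cpus, "static cpulist to pin this driver's vectors to");

	static void foo_apply_static_affinity(struct pci_dev *pdev, int nr_vecs)
	{
		cpumask_var_t mask;
		int i;

		if (!irq_cpus || !alloc_cpumask_var(&mask, GFP_KERNEL))
			return;

		/* cpulist_parse() returns 0 on success. */
		if (!cpulist_parse(irq_cpus, mask)) {
			/*
			 * Publish the mask as the affinity hint; the core also
			 * applies it as the initial affinity for each vector.
			 */
			for (i = 0; i < nr_vecs; i++)
				irq_set_affinity_hint(pci_irq_vector(pdev, i), mask);
		}

		free_cpumask_var(mask);
	}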

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
>> Now you try to blame the people who implemented the managed affinity stuff
>> for the wreckage, which was created by people who changed drivers to use
>> it. Nice try.
> 
> I'm not trying to blame anyone, really. I was just trying to understand
> how to move forward with making users happy and still enjoy subsystem
> services instead of doing lots of similar things inside mlx5 driver.

Exactly. The key here is how we make it work for both cases. But there
has to be a willingness to make the infrastructure work for that.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> On Thu, 9 Nov 2017, Sagi Grimberg wrote:
>> Thomas,
>>
>>>> Because the user sometimes knows better based on statically assigned
>>>> loads, or the user wants consistency across kernels. It's great that the
>>>> system is better at allocating this now, but we also need to allow for a
>>>> user to change it. Like anything on Linux, a user wanting to blow off
>>>> his/her own foot, should be allowed to do so.
>>>
>>> That's fine, but that's not what the managed affinity facility provides. If
>>> you want to leverage the spread mechanism, but avoid the managed part, then
>>> this is a different story and we need to figure out how to provide that
>>> without breaking the managed side of it.
>>>
>>> As I said it's possible, but I vehemently disagree, that this is a bug in
>>> the core code, as it was claimed several times in this thread.
>>>
>>> The real issue is that the driver was converted to something which was
>>> expected to behave differently. That's hardly a bug in the core code, at
>>> most it's a documentation problem.
>>
>> I disagree here, this is not a device specific discussion. The question
>> of exposing IRQ affinity assignment knob to user-space holds for every
>> device (NICs, HBAs and alike). The same issue could have been raised on
>> any other device using managed affinitization (and we have quite a few
>> of those now). I can't see any reason why its OK for device X to use it
>> while its not OK for device Y to use it.
>>
>> If the resolution is "YES we must expose a knob to user-space" then the
>> managed facility is unusable in its current form for *any* device. If the
>> answer is "NO, user-space can't handle all the stuff the kernel can" then
>> we should document it. This is really device independent.
> 
> The early discussion of the managed facility came to the conclusion that it
> will manage this stuff completely to allow fixed association of 'queue /
> interrupt / corresponding memory' to a single CPU or a set of CPUs. That
> removes a lot of 'affinity' handling magic from the driver and utilizes the
> hardware in a sensible way. That was not my decision, really. It surely
> made sense to me and I helped Christoph to implement it.
> 
> The real question is whether you want to have the fixed 'queue / interrupt/
> corresponding memory association' and get rid of all the 'affinity' dance
> in your driver or not.
> 
> If you answer that question with 'yes' then the consequence is that there
> is no knob.
> 
> If you answer that question with 'no' then you should not use
> the managed facility in the first place and if you need parts of that
> functionality then this needs to be added to the core code _before_ a
> driver gets converted and not afterwards.
> 
> It's not my problem if people decide, to use this and then trip over the
> limitations after the changes hit the tree. This could have been figured
> out before even a single patch was posted.
> 
> Now you try to blame the people who implemented the managed affinity stuff
> for the wreckage, which was created by people who changed drivers to use
> it. Nice try.

If that's the attitude at your end, then I do suggest we just revert the
driver changes. Clearly this isn't going to be productive going forward.

The better solution was to make the managed setup more flexible, but
it doesn't sound like that is going to be viable at all.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 03:50 AM, Sagi Grimberg wrote:
> Thomas,
> 
>>> Because the user sometimes knows better based on statically assigned
>>> loads, or the user wants consistency across kernels. It's great that the
>>> system is better at allocating this now, but we also need to allow for a
>>> user to change it. Like anything on Linux, a user wanting to blow off
>>> his/her own foot, should be allowed to do so.
>>
>> That's fine, but that's not what the managed affinity facility provides. If
>> you want to leverage the spread mechanism, but avoid the managed part, then
>> this is a different story and we need to figure out how to provide that
>> without breaking the managed side of it.
>>
>> As I said it's possible, but I vehemently disagree, that this is a bug in
>> the core code, as it was claimed several times in this thread.
>>
>> The real issue is that the driver was converted to something which was
>> expected to behave differently. That's hardly a bug in the core code, at
>> most it's a documentation problem.
> 
> I disagree here, this is not a device specific discussion. The question
> of exposing IRQ affinity assignment knob to user-space holds for every
> device (NICs, HBAs and alike). The same issue could have been raised on
> any other device using managed affinitization (and we have quite a few
> of those now). I can't see any reason why its OK for device X to use it
> while its not OK for device Y to use it.
> 
> If the resolution is "YES we must expose a knob to user-space" then the
> managed facility is unusable in its current form for *any* device. If
> the answer is "NO, user-space can't handle all the stuff the kernel can"
> then we should document it. This is really device independent.

Completely agree, and honestly I'm pretty baffled we're even having
to argue this point.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-09 Thread Jens Axboe
On 11/09/2017 03:09 AM, Christoph Hellwig wrote:
> On Wed, Nov 08, 2017 at 09:13:59AM -0700, Jens Axboe wrote:
>> There are numerous valid reasons to be able to set the affinity, for
>> both nics and block drivers. It's great that the kernel has a predefined
>> layout that works well, but users do need the flexibility to be able to
>> reconfigure affinities, to suit their needs.
> 
> Managed interrupts are about more than just setting affinities - they
> bind a vector (which means a queue) to a specific cpu or set of cpus.
> This is extremely useful to deal with hotplug issues and also makes
> life a lot easier in general.

I know why it was done.
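
(For reference, the binding being described is what a driver opts into when it
allocates its vectors; a minimal sketch with hypothetical driver names:)

	#include <linux/pci.h>

	/*
	 * Sketch only: with PCI_IRQ_AFFINITY the core spreads the MSI-X
	 * vectors across the CPUs and fixes each vector's affinity, which
	 * is exactly what makes them non-movable from userspace afterwards.
	 */
	static int foo_alloc_vectors(struct pci_dev *pdev, unsigned int nr_queues)
	{
		return pci_alloc_irq_vectors(pdev, 1, nr_queues,
					     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
	}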

>> But that particular case is completely orthogonal to whether or not we
>> should allow the user to reconfigure. The answer to that is clearly YES,
>> and we should ensure that this is possible.
> 
> And why is the answer yes?  If the anser is YES it means we need explicit
> boilerplate code to deal with  cpu hotplug in every driver not using
> managed interrupts (even if optionally), so it is a non-trivial tradeoff. 

The answer is yes because as with any other kernel level policy, there
are situations where it isn't the optimal solution. I ran into this myself
with nvme recently, and I started cursing the managed setup for that
very reason. You can even argue this is a regression, since we used
to be able to move things around as we saw fit. If the answer is
boilerplate code to make it feasible, then we need that boilerplate
code.

-- 
Jens Axboe



Re: mlx5 broken affinity

2017-11-08 Thread Jens Axboe
On 11/08/2017 05:21 AM, David Laight wrote:
> From: Sagi Grimberg
>> Sent: 08 November 2017 07:28
> ...
>>> Why would you give the user a knob to destroy what you carefully optimized?
>>
>> Well, looks like someone relies on this knob, the question is if he is
>> doing something better for his workload. I don't know, its really up to
>> the user to say.
> 
> Maybe the user wants to ensure that nothing except some very specific
> processing happens on some (or most) of the cpu cores.
> 
> If the expected overall ethernet data rate isn't exceptionally large
> is there any reason to allocate a queue (etc) for every cpu.

There are numerous valid reasons to be able to set the affinity, for
both nics and block drivers. It's great that the kernel has a predefined
layout that works well, but users do need the flexibility to be able to
reconfigure affinities, to suit their needs.

For the particular mlx5 case, I'm actually not sure how the FB
configuration differs from the in-kernel stuff. I'll take a look at
that. It may just be that the configuration exists because the code used
to be driver private and frequently changed; setting it at bootup to a
known good configuration helped eliminate problems when upgrading
kernels. I also remember some cases of removing CPU0 from the mask.

But that particular case is completely orthogonal to whether or not we
should allow the user to reconfigure. The answer to that is clearly YES,
and we should ensure that this is possible.

-- 
Jens Axboe



Re: [PATCH 13/31] timer: Remove meaningless .data/.function assignments

2017-09-01 Thread Jens Axboe
On 08/31/2017 05:29 PM, Kees Cook wrote:
> Several timer users needlessly reset their .function/.data fields during
> their timer callback, but nothing else changes them. Some users do not
> use their .data field at all. Each instance is removed here.

For amiflop:

Acked-by: Jens Axboe <ax...@kernel.dk>

-- 
Jens Axboe



Re: [PATCH 2/6] nbd: make device_attribute const

2017-08-28 Thread Jens Axboe
On 08/21/2017 05:43 AM, Bhumika Goyal wrote:
> Make this const as it is only passed as an argument to the
> functions device_create_file and device_remove_file, and the corresponding
> arguments are of type const.
> Done using Coccinelle.

Added for 4.14, thanks.
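
For illustration, the resulting declaration has this shape (field values
assumed from nbd, since the diff isn't quoted here; device_create_file() and
device_remove_file() already take a const struct device_attribute pointer):

	static const struct device_attribute pid_attr = {
		.attr	= { .name = "pid", .mode = 0444 },
		.show	= pid_show,
	};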

-- 
Jens Axboe



Re: [PATCH 19/22] block: DAC960: shut up format-overflow warning

2017-07-14 Thread Jens Axboe
On 07/14/2017 06:07 AM, Arnd Bergmann wrote:
> gcc-7 points out that a large controller number would overflow the
> string length for the procfs name and the firmware version string:
> 
> drivers/block/DAC960.c: In function 'DAC960_Probe':
> drivers/block/DAC960.c:6591:38: warning: 'sprintf' may write a terminating 
> nul past the end of the destination [-Wformat-overflow=]
> drivers/block/DAC960.c: In function 'DAC960_V1_ReadControllerConfiguration':
> drivers/block/DAC960.c:1681:40: error: '%02d' directive writing between 2 and 
> 3 bytes into a region of size between 2 and 5 [-Werror=format-overflow=]
> drivers/block/DAC960.c:1681:40: note: directive argument in the range [0, 255]
> drivers/block/DAC960.c:1681:3: note: 'sprintf' output between 10 and 14 bytes 
> into a destination of size 12
> 
> Both of these seem appropriately sized, and using snprintf()
> instead of sprintf() improves this by ensuring that even
> incorrect data won't cause undefined behavior here.

Thanks Arnd, added for 4.14.
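
The property the fix relies on is simply that snprintf() never writes past the
destination size; a trivial stand-alone illustration (not code from the driver):

	#include <stdio.h>

	int main(void)
	{
		char name[12];	/* roughly the size class the warning is about */

		/* sprintf() could overrun here for a large controller number;
		 * snprintf() truncates to 11 characters plus the NUL instead. */
		snprintf(name, sizeof(name), "controller-%d", 255);
		printf("%s\n", name);
		return 0;
	}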

-- 
Jens Axboe



Re: WARNING: CPU: 2 PID: 682 at net/wireless/util.c:1236

2017-05-04 Thread Jens Axboe
On 05/04/2017 09:27 AM, Jens Axboe wrote:
> On 05/04/2017 09:25 AM, David Miller wrote:
>> From: Jens Axboe <ax...@kernel.dk>
>> Date: Thu, 4 May 2017 09:23:36 -0600
>>
>>> Running current -git on my laptop, and I see this spew every once
>>> in a while. This is new, haven't seen it in 4.11 or previously. Driver
>>> is iwlwifi:
>>
>> Please try:
>>
>> http://patchwork.ozlabs.org/patch/758364/
> 
> Will do, compiling...

Yep works, thanks.

-- 
Jens Axboe



Re: WARNING: CPU: 2 PID: 682 at net/wireless/util.c:1236

2017-05-04 Thread Jens Axboe
On 05/04/2017 09:25 AM, David Miller wrote:
> From: Jens Axboe <ax...@kernel.dk>
> Date: Thu, 4 May 2017 09:23:36 -0600
> 
>> Running current -git on my laptop, and I see this spew every once
>> in a while. This is new, haven't seen it in 4.11 or previously. Driver
>> is iwlwifi:
> 
> Please try:
> 
> http://patchwork.ozlabs.org/patch/758364/

Will do, compiling...

-- 
Jens Axboe



WARNING: CPU: 2 PID: 682 at net/wireless/util.c:1236

2017-05-04 Thread Jens Axboe
Hi,

Running current -git on my laptop, and I see this spew every once
in a while. This is new, haven't seen it in 4.11 or previously. Driver
is iwlwifi:

04:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)


[ 3285.862640] IPv6: ADDRCONF(NETDEV_CHANGE): wlp4s0: link becomes ready
[ 3285.978482] [ cut here ]
[ 3285.978501] WARNING: CPU: 2 PID: 682 at net/wireless/util.c:1236 
cfg80211_calculate_bitrate+0x130/0x160 [cfg80211]
[ 3285.978502] Modules linked in: ctr ccm rfcomm fuse bnep arc4 
snd_hda_codec_hdmi iwlmvm snd_hda_codec_conexant snd_hda_codec_generic mac80211 
snd_hda_intel snd_hda_codec binfmt_misc snd_hwdep snd_hda_core snd_pcm 
x86_pkg_temp_thermal nls_iso8859_1 intel_powerclamp nls_cp437 kvm_intel vfat 
snd_seq_midi fat kvm iwlwifi snd_seq_midi_event btusb snd_rawmidi btintel 
bluetooth snd_seq uvcvideo irqbypass aesni_intel videobuf2_vmalloc 
videobuf2_memops snd_seq_device aes_x86_64 videobuf2_v4l2 crypto_simd snd_timer 
videobuf2_core cryptd glue_helper videodev cfg80211 ecdh_generic snd soundcore 
hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd 
intel_gtt
[ 3285.978536] CPU: 2 PID: 682 Comm: NetworkManager Tainted: G U  
4.11.0+ #59
[ 3285.978537] Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET50W (1.24 
) 03/08/2017
[ 3285.978538] task: 880421de2880 task.stack: 8804234a8000
[ 3285.978551] RIP: 0010:cfg80211_calculate_bitrate+0x130/0x160 [cfg80211]
[ 3285.978552] RSP: 0018:8804234ab638 EFLAGS: 00010293
[ 3285.978553] RAX:  RBX: 88042df2cf00 RCX: 0002
[ 3285.978554] RDX:  RSI:  RDI: 8804234ab7d6
[ 3285.978555] RBP: 8804234ab638 R08: 0004 R09: 8804252c20d4
[ 3285.978556] R10: 088002aa R11: a02a6000 R12: 8804234ab7d6
[ 3285.978557] R13: 8804252c20d0 R14: 8804252c2030 R15: 880424bf
[ 3285.978559] FS:  7f4e016a8940() GS:88044150() 
knlGS:
[ 3285.978560] CS:  0010 DS:  ES:  CR0: 80050033
[ 3285.978561] CR2: 0c5e5d540010 CR3: 00042b626000 CR4: 001406e0
[ 3285.978561] DR0:  DR1:  DR2: 
[ 3285.978562] DR3:  DR6: fffe0ff0 DR7: 0400
[ 3285.978563] Call Trace:
[ 3285.978577]  nl80211_put_sta_rate+0x4e/0x1f0 [cfg80211]
[ 3285.978591]  nl80211_send_station.isra.66+0x344/0xcc0 [cfg80211]
[ 3285.978595]  ? __kmalloc_track_caller+0x31/0x1a0
[ 3285.978608]  nl80211_get_station+0x1d9/0x250 [cfg80211]
[ 3285.978620]  genl_family_rcv_msg+0x1bf/0x380
[ 3285.978622]  genl_rcv_msg+0x4c/0x90
[ 3285.978623]  ? genl_family_rcv_msg+0x380/0x380
[ 3285.978629]  netlink_rcv_skb+0x7f/0x110
[ 3285.978630]  genl_rcv+0x28/0x40
[ 3285.978633]  netlink_unicast+0x183/0x220
[ 3285.978635]  netlink_sendmsg+0x293/0x390
[ 3285.978637]  sock_sendmsg+0x38/0x50
[ 3285.978638]  ___sys_sendmsg+0x256/0x2b0
[ 3285.978641]  ? __slab_free+0x1d1/0x2c0
[ 3285.978643]  ? destroy_inode+0x3b/0x60
[ 3285.978645]  ? do_wp_page+0x103/0x440
[ 3285.978646]  ? __handle_mm_fault+0x34e/0xa10
[ 3285.978648]  ? __fget+0x6e/0x90
[ 3285.978650]  __sys_sendmsg+0x45/0x80
[ 3285.978652]  ? __sys_sendmsg+0x45/0x80
[ 3285.978654]  SyS_sendmsg+0x12/0x20
[ 3285.978656]  entry_SYSCALL_64_fastpath+0x13/0x94
[ 3285.978657] RIP: 0033:0x7f4dff521a6d
[ 3285.978658] RSP: 002b:7ffc7507aec0 EFLAGS: 0293 ORIG_RAX: 
002e
[ 3285.978660] RAX: ffda RBX: 02044b60 RCX: 7f4dff521a6d
[ 3285.978661] RDX:  RSI: 7ffc7507af50 RDI: 0010
[ 3285.978661] RBP: 7f4dff50bb20 R08:  R09: 
[ 3285.978662] R10: 01fd8df0 R11: 0293 R12: 1010
[ 3285.978663] R13: 7f4dff50bb78 R14: 26fc R15: 01ffc570
[ 3285.978664] Code: d0 f7 e1 d1 ea 8d 14 92 01 d2 81 c2 50 c3 00 00 b9 c5 5a 
7c 0a 5d c1 ea 05 89 d0 f7 e1 89 d0 c1 e8 07 c3 31 c0 40 80 fe 02 74 b6 <0f> ff 
31 c0 eb b0 8d 14 52 01 d2 e9 41 ff ff ff 0f ff 31 c0 5d 
[ 3285.978693] ---[ end trace ae08dfa326cc120a ]---

-- 
Jens Axboe




Re: linux-next: build failure after merge of the block tree

2017-05-01 Thread Jens Axboe
On May 1, 2017, at 7:37 PM, Stephen Rothwell <s...@canb.auug.org.au> wrote:
> 
> Hi Jens,
> 
>> On Mon, 1 May 2017 19:09:34 -0600 Jens Axboe <ax...@kernel.dk> wrote:
>> 
>> Indeed, I have warned Linus about it. Thanks Stephen. 
> 
> Thanks.
> 
> BTW, (unusually) I did not see your pull request(s) ...

I CC'ed linux-block, so they showed up there at least. 



Re: linux-next: build failure after merge of the block tree

2017-05-01 Thread Jens Axboe

> On May 1, 2017, at 7:07 PM, Stephen Rothwell  wrote:
> 
> Hi all,
> 
>> On Tue, 18 Apr 2017 13:02:29 +1000 Stephen Rothwell  
>> wrote:
>> 
>> After merging the block tree, today's linux-next build (powerpc
>> ppc64_defconfig) failed like this:
>> 
>> drivers/block/nbd.c: In function 'nbd_genl_connect':
>> drivers/block/nbd.c:1662:10: error: too few arguments to function 
>> 'nla_parse_nested'
>>ret = nla_parse_nested(socks, NBD_SOCK_MAX, attr,
>>  ^
>> In file included from include/net/rtnetlink.h:5:0,
>> from include/net/sch_generic.h:12,
>> from include/linux/filter.h:20,
>> from include/net/sock.h:64,
>> from drivers/block/nbd.c:32:
>> include/net/netlink.h:754:19: note: declared here
>> static inline int nla_parse_nested(struct nlattr *tb[], int maxtype,
>>   ^
>> drivers/block/nbd.c: In function 'nbd_genl_reconfigure':
>> drivers/block/nbd.c:1818:10: error: too few arguments to function 
>> 'nla_parse_nested'
>>ret = nla_parse_nested(socks, NBD_SOCK_MAX, attr,
>>  ^
>> In file included from include/net/rtnetlink.h:5:0,
>> from include/net/sch_generic.h:12,
>> from include/linux/filter.h:20,
>> from include/net/sock.h:64,
>> from drivers/block/nbd.c:32:
>> include/net/netlink.h:754:19: note: declared here
>> static inline int nla_parse_nested(struct nlattr *tb[], int maxtype,
>>   ^
>> 
>> Caused by commits
>> 
>>  e46c7287b1c2 ("nbd: add a basic netlink interface")
>>  b7aa3d39385d ("nbd: add a reconfigure netlink command")
>> 
>> interacting with commit
>> 
>>  fceb6435e852 ("netlink: pass extended ACK struct to parsing functions")
>> 
>> from the net-next tree.
>> 
>> I have applied the following merge fix patch:
>> 
>> From: Stephen Rothwell 
>> Date: Tue, 18 Apr 2017 12:59:05 +1000
>> Subject: [PATCH] nbd: fix up for nla_parse_nested() API change
>> 
>> Signed-off-by: Stephen Rothwell 
>> ---
>> drivers/block/nbd.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
>> index b78f23ce2395..5049d19f3940 100644
>> --- a/drivers/block/nbd.c
>> +++ b/drivers/block/nbd.c
>> @@ -1660,7 +1660,7 @@ static int nbd_genl_connect(struct sk_buff *skb, 
>> struct genl_info *info)
>>goto out;
>>}
>>ret = nla_parse_nested(socks, NBD_SOCK_MAX, attr,
>> -   nbd_sock_policy);
>> +   nbd_sock_policy, NULL);
>>if (ret != 0) {
>>printk(KERN_ERR "nbd: error processing sock list\n");
>>ret = -EINVAL;
>> @@ -1816,7 +1816,7 @@ static int nbd_genl_reconfigure(struct sk_buff *skb, 
>> struct genl_info *info)
>>goto out;
>>}
>>ret = nla_parse_nested(socks, NBD_SOCK_MAX, attr,
>> -   nbd_sock_policy);
>> +   nbd_sock_policy, NULL);
>>if (ret != 0) {
>>printk(KERN_ERR "nbd: error processing sock list\n");
>>ret = -EINVAL;
>> -- 
>> 2.11.0
> 
> So, this merge fix is now needed when the net-next tree is merged (as
> Linus has merged the block tree).

Indeed, I have warned Linus about it. Thanks Stephen. 



Re: [PATCH block-tree] net: off by one in inet6_pton()

2017-04-14 Thread Jens Axboe
On 04/13/2017 01:42 PM, Dan Carpenter wrote:
> If "scope_len" is sizeof(scope_id) then we would put the NUL terminator
> one space beyond the end of the buffer.

Added, thanks Dan.

-- 
Jens Axboe



Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper

2017-04-05 Thread Jens Axboe
On 04/02/2017 07:41 AM, Sagi Grimberg wrote:
> Like pci and virtio, we add a rdma helper for affinity
> spreading. This achieves optimal mq affinity assignments
> according to the underlying rdma device affinity maps.

Reviewed-by: Jens Axboe <ax...@fb.com>
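
The intended consumer is a driver's blk-mq ->map_queues callback, roughly like
this (driver names are hypothetical, and the helper's exact signature is
assumed from the series rather than quoted here):

	/*
	 * Sketch only: derive the hctx <-> CPU mapping from the RDMA device's
	 * completion vector affinity instead of a naive round-robin spread.
	 */
	static int foo_rdma_map_queues(struct blk_mq_tag_set *set)
	{
		struct foo_ctrl *ctrl = set->driver_data;	/* hypothetical */

		return blk_mq_rdma_map_queues(set, ctrl->ib_dev, 0);
	}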

-- 
Jens Axboe



Re: codel: split into multiple files

2016-04-26 Thread Jens Axboe

On 04/26/2016 06:36 AM, Michal Kazior wrote:

On 26 April 2016 at 08:43, Sedat Dilek <sedat.di...@gmail.com> wrote:

On 4/26/16, Michal Kazior <michal.kaz...@tieto.com> wrote:

On 26 April 2016 at 08:09, Sedat Dilek <sedat.di...@gmail.com> wrote:

Hi,

I had a very quick view on net-next.git#master (up to commit
fab7b629a82da1b59620470d13152aff975239f6).

Commit in [1] aka "codel: split into multiple files" removed codel.h
but [2] and [3] have relicts to it.
Forgot to remove?


codel.h was not removed. diffstat for codel.h is all red which I
presume is why you thought of it as removed, see:

http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/tree/include/net/codel.h?id=d068ca2ae2e614b9a418fb3b5f1fd4cf996ff032



[ CC Jens ]

OK.
So what are the plans in the future?
Keep a "generic" codel.h (compatibility reasons?) for net or is it your split?


I'm interested in re-using codel in mac80211 for wireless. cfg80211
drivers may want to do that as well later. Even vendor drivers could
start to use it (I can dream :).

I plan to re-spin my patches soonish re-based on the new codel.h/fq.h
approach. There's quite a few spins already[1].



AFAICS I have seen a codel-implementation in block.git#wb-buf-throttle.
Does it make sense to have a more "super-generic" codel.h for re-use
(not only for net and block)?
Just a thought.


Oh, I'm not really familiar with block and problems around it but it
sounds reasonable and interesting. It doesn't look like it blatantly
copies codel though (I did that in my initial mac80211 patches with
some adjustments, you can check that in the link[1] which you can
lookup via my patchset's cover letter[2]; I've based off of codel5[3]
back then).


The block version is an adaptation, I guess you can say it pays homage 
to CoDel. But there are a sufficient amount of differences between 
networking and storage that I don't think a fully generic version is 
really feasible. My favorite thing to bring up is the fact that we don't 
have the luxury of dropping packets on the storage side...


--
Jens Axboe



Re: [PATCH 1/4] list: introduce list_is_first()

2015-12-10 Thread Jens Axboe

On 12/10/2015 07:17 AM, Geliang Tang wrote:

We already have list_is_last(); it makes sense to also add
list_is_first() for consistency. This list utility function
checks for the first element in a list.


Honestly, I think we already have way too many of these kind of helpers. 
IMHO they don't really help, they hurt readability. You should know how 
the list works anyway, and if you do, then it's a no-brainer what's 
first and last. If you don't, then you are bound to screw up in other ways.


Just my 2 cents.

--
Jens Axboe



Re: [PATCH 1/4] list: introduce list_is_first()

2015-12-10 Thread Jens Axboe

On 12/10/2015 08:23 AM, Josh Poimboeuf wrote:

On Thu, Dec 10, 2015 at 08:10:34AM -0700, Jens Axboe wrote:

On 12/10/2015 07:17 AM, Geliang Tang wrote:

We already have list_is_last(); it makes sense to also add
list_is_first() for consistency. This list utility function
checks for the first element in a list.


Honestly, I think we already have way too many of these kind of helpers.
IMHO they don't really help, they hurt readability. You should know how the
list works anyway, and if you do, then it's a no-brainer what's first and
last. If you don't, then you are bound to screw up in other ways.

Just my 2 cents.


Personally I would disagree.  Something like:

   if (list_is_first(&rq->queuelist, &nd->queue))

is much more readable to me than:

   if (rq->queuelist.prev == &nd->queue)


Both the function and your example are backwards, and hence a lot harder 
to comprehend than they should be. It'd be much clearer as:


if (nd->queue.next == &rq->queuelist)

which is a lot easier to read. Nobody should open-code a 'is this the 
first entry in the list' by asking 'is the previous link to my node the 
head', asking 'is the next entry in the list X' makes a lot more sense. 
I'm assuming this happened because the list_is_last was just copied and 
modified, instead of thinking about this for a second.



The first one takes no effort for me -- it's almost English.  While the
second one takes me a few seconds (and some precious brain cycles) to
decipher.

Maybe whether it's readable depends on how many years you've been
looking at the pattern.  But IMHO we shouldn't make "having x # of years
staring at kernel code" a prerequisite for being able to read kernel
code.


It's a balance, as we also should not make APIs out of everything. As I 
said, purely my opinion, but I think the is_last/is_first have jumped 
the shark.


--
Jens Axboe



Re: 4.4-rc3, KVM, br0 and instant hang

2015-12-05 Thread Jens Axboe

On 12/05/2015 10:31 AM, John Stoffel wrote:

"John" == John Stoffel <j...@quad.stoffel.home> writes:


John> On Fri, Dec 04, 2015 at 11:28:33PM -0500, John Stoffel wrote:


Hi all,



Anyway, if I try to boot up anything past the 4.2.6 kernel, the system
locks up pretty quickly with an oops message that scrolls off the
screen too far.  I've got some pictures which I'll attach in a bit,
maybe they'll help.  So at first I thought it was something to do with
bad kworker threads, or SCSI or SATA interactions, but as I tried to
configure Netconsole to log to my beaglebone black SBC, I found out
that if I compiled and installed 4.4-rc3, started the bridge up (br0),
even started KVM, but did NOT start my VMs, the system was stable.


I've now figured out that I can disable all my VMs from autostart, and
the system will come up properly.  Then I can setup netconsole to use
the br0 interface, do an  "echo t > sysrq" to confirm it's working,
and start up the VMs.

On my most recent bootup, I thought it was ok, since the VMs worked
for a while (10 minutes) and I was starting to re-compile the kernel
again to make more modules compiled in.  No luck, I got the following
crash dump (partial) on my netconsole box.

[ 1434.266524] [ cut here ]
[ 1434.266643] WARNING: CPU: 2 PID: 179 at block/blk-merge.c:435 
blk_rq_map_sg+0x2d9/0x2eb()


This is fixed in current -git, as of a few days ago.

--
Jens Axboe



4.3.0+ breaks software VPN

2015-11-13 Thread Jens Axboe

Hi,

Tried to connect to sw vpn today, and it isn't working. Running git 
as-of yesterday. In dmesg:


[23703.921542] vpn0: set_features() failed (-1); wanted 
0x008048c1, left 0x0080001b48c9


Reverting:

fd867d51f889
5ba3f7d61a3a
e7868a85e1b2

in reverse order makes it work again. How do we get this fixed so that 
4.4-rc1 doesn't break basic VPN support?


--
Jens Axboe



Re: 4.3.0+ breaks software VPN

2015-11-13 Thread Jens Axboe

On 11/13/2015 02:44 PM, Nikolay Aleksandrov wrote:

On 11/13/2015 10:37 PM, Jens Axboe wrote:

Hi,

Tried to connect to sw vpn today, and it isn't working. Running git as-of 
yesterday. In dmesg:

[23703.921542] vpn0: set_features() failed (-1); wanted 0x008048c1, 
left 0x0080001b48c9

Reverting:

fd867d51f889
5ba3f7d61a3a
e7868a85e1b2

in reverse order makes it work again. How do we get this fixed so that 4.4-rc1 
doesn't break basic VPN support?



Today I've posted a patch that should fix this,
http://patchwork.ozlabs.org/patch/544307/

More testing is always welcome.


Thanks, that works!

Tested-by: Jens Axboe <ax...@fb.com>


--
Jens Axboe



Re: [PATCH retry] bluetooth : add conn add/del workqueues to avoid connection fail

2008-01-31 Thread Jens Axboe
On Wed, Jan 30 2008, Dave Young wrote:
 
 The bluetooth hci_conn sysfs add/del is executed in the default workqueue.
 If del_conn is executed after a new add_conn for the same target,
 add_conn will fail with a warning about a duplicate kobject name.
 
 Here we add btaddconn and btdelconn workqueues, and
 flush the btdelconn workqueue in the add_conn function to avoid the issue.
 
 Signed-off-by: Dave Young [EMAIL PROTECTED] 
 
 ---
 diff -upr a/net/bluetooth/hci_sysfs.c b/net/bluetooth/hci_sysfs.c
 --- a/net/bluetooth/hci_sysfs.c   2008-01-30 10:14:27.0 +0800
 +++ b/net/bluetooth/hci_sysfs.c   2008-01-30 10:14:14.0 +0800
 @@ -12,6 +12,8 @@
  #undef  BT_DBG
  #define BT_DBG(D...)
  #endif
 +static struct workqueue_struct *btaddconn;
 +static struct workqueue_struct *btdelconn;
  
  static inline char *typetostr(int type)
  {
 @@ -279,6 +281,7 @@ static void add_conn(struct work_struct *work)
   struct hci_conn *conn = container_of(work, struct hci_conn, work);
   int i;
  
 + flush_workqueue(btdelconn);
   if (device_add(&conn->dev) < 0) {
   BT_ERR("Failed to register connection device");
   return;
 @@ -313,6 +316,7 @@ void hci_conn_add_sysfs(struct hci_conn *conn)
  
   INIT_WORK(&conn->work, add_conn);
  
 + queue_work(btaddconn, &conn->work);
   schedule_work(&conn->work);
  }

So you queue conn->work on both btaddconn and keventd_wq?

-- 
Jens Axboe



Re: [BUG] New Kernel Bugs

2007-11-13 Thread Jens Axboe
On Tue, Nov 13 2007, Andrew Morton wrote:
  I/O STORAGE===
  
  kernel bug from pktcdvd
  http://bugzilla.kernel.org/show_bug.cgi?id=9294
  Kernel: 2.6.23
 
 I think we might have fixed this.

It's fixed and merged, I just forgot to close the bugzilla. Did so now.

-- 
Jens Axboe



Re: [PATCH]: Fix networking scatterlist regressions.

2007-10-31 Thread Jens Axboe
On Tue, Oct 30 2007, David Miller wrote:
 
 I just checked the following bug fix into net-2.6
 
 Rusty, have a quick look at virtio_net wrt. the changes I
 made to skb_to_sgvec()'s behavior.  I think I might have
 even fixed something :-)
 
 Jens, please review my commentary wrt. sg_mark_end() and
 it's nonintuitive behavior which led to these bugs.

I fully agree, let's just change sg_mark_end() to NOT overwrite a stored
page there. The current interface isn't nice and can't be used after
filling the sg table, which is what users would want. I've added such a
patch to the sg repo.

From 5a0347663f51850eb52b89c4dcf6a714ea8d3965 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Wed, 31 Oct 2007 08:31:23 +0100
Subject: [PATCH] [SG] Remove __sg_mark_end()

Make sg_mark_end() NOT overwrite the page link. Then it can be used
after filling the sg table, which is what users want. That means that
__sg_mark_end() is no longer useful, so kill it.

It's important the sg entries be initialized before using sg_mark_end(),
so also add a debug check to catch use-before-init.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c   |2 +-
 include/linux/scatterlist.h |   10 --
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e948407..fdc0707 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1369,7 +1369,7 @@ new_segment:
} /* segments in rq */
 
if (sg)
-   __sg_mark_end(sg);
+   sg_mark_end(sg);
 
return nsegs;
 }
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index d5e1876..aa97954 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -195,13 +195,11 @@ static inline void sg_chain(struct scatterlist *prv, 
unsigned int prv_nents,
  *   Marks the last entry as the termination point for sg_next()
  *
  **/
-static inline void sg_mark_end(struct scatterlist *sgl, unsigned int nents)
-{
-   sgl[nents - 1].page_link = 0x02;
-}
-
-static inline void __sg_mark_end(struct scatterlist *sg)
+static inline void sg_mark_end(struct scatterlist *sg)
 {
+#ifdef CONFIG_DEBUG_SG
+   BUG_ON(sg->sg_magic != SG_MAGIC);
+#endif
 sg->page_link |= 0x02;
 }
 
-- 
1.5.3.GIT
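
To make the calling convention this enables concrete, the fill-then-terminate
pattern looks roughly like this (the buffers are hypothetical):

	struct scatterlist sg[4];
	int i;

	sg_init_table(sg, 4);
	for (i = 0; i < 3; i++)
		sg_set_buf(&sg[i], bufs[i], lens[i]);
	/* Only now, with the entries filled in, terminate the list. */
	sg_mark_end(&sg[2]);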


-- 
Jens Axboe



Re: [PATCH]: Fix networking scatterlist regressions.

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, David Miller wrote:
 From: Jens Axboe [EMAIL PROTECTED]
 Date: Wed, 31 Oct 2007 08:32:07 +0100
 
  [SG] Remove __sg_mark_end()
  
  Make sg_mark_end() NOT overwrite the page link. Then it can be used
  after filling the sg table, which is what users want. That means that
  __sg_mark_end() is no longer useful, so kill it.
  
  It's important the sg entries be initialized before using sg_mark_end(),
  so also add a debug check to catch use-before-init.
  
  Signed-off-by: Jens Axboe [EMAIL PROTECTED]
 
 Ok, but I just pushed my changes to Linus and once those show up
 you'll need to extend this patch to kill the '__' prefix from
 all the rest of the calls which will be in the tree.
 
 Thanks!

No problem, I'll base further sg_mark_end() updates on top of yours.
Just need to get that last email I sent out resolved, the
gss_krb5_crypto bits.

-- 
Jens Axboe



Re: [PATCH]: Fix networking scatterlist regressions.

2007-10-31 Thread Jens Axboe
On Tue, Oct 30 2007, David Miller wrote:
 diff --git a/net/sunrpc/auth_gss/gss_krb5_crypto.c 
 b/net/sunrpc/auth_gss/gss_krb5_crypto.c
 index 91cd8f0..4a8aa94 100644
 --- a/net/sunrpc/auth_gss/gss_krb5_crypto.c
 +++ b/net/sunrpc/auth_gss/gss_krb5_crypto.c
 @@ -211,8 +211,8 @@ encryptor(struct scatterlist *sg, void *data)
   if (thislen == 0)
   return 0;
  
 - sg_mark_end(desc->infrags, desc->fragno);
 - sg_mark_end(desc->outfrags, desc->fragno);
 + __sg_mark_end(desc->infrags, desc->fragno);
 + __sg_mark_end(desc->outfrags, desc->fragno);
  
   ret = crypto_blkcipher_encrypt_iv(&desc->desc, desc->outfrags,
 desc->infrags, thislen);
 @@ -293,7 +293,7 @@ decryptor(struct scatterlist *sg, void *data)
   if (thislen == 0)
   return 0;
  
 - sg_mark_end(desc->frags, desc->fragno);
 + __sg_mark_end(desc->frags, desc->fragno);
  
   ret = crypto_blkcipher_decrypt_iv(&desc->desc, desc->frags,
 desc->frags, thislen);

Hmm? These don't seem right. It also has a weird code sequence:

...
sg_mark_end(&desc->infrags[desc->fragno - 1]);
sg_mark_end(&desc->outfrags[desc->fragno - 1]);

ret = crypto_blkcipher_encrypt_iv(&desc->desc, desc->outfrags,
  desc->infrags, thislen);
if (ret)
return ret;

sg_init_table(desc->infrags, 4);
sg_init_table(desc->outfrags, 4);
...

Did something go wrong there?

-- 
Jens Axboe



Re: [PATCH]: Fix networking scatterlist regressions.

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, David Miller wrote:
 From: Jens Axboe [EMAIL PROTECTED]
 Date: Wed, 31 Oct 2007 09:01:43 +0100
 
  On Wed, Oct 31 2007, David Miller wrote:
   From: Jens Axboe [EMAIL PROTECTED]
   Date: Wed, 31 Oct 2007 08:46:21 +0100
   
On Tue, Oct 30 2007, David Miller wrote:
 @@ -293,7 +293,7 @@ decryptor(struct scatterlist *sg, void *data)
   if (thislen == 0)
   return 0;
  
 - sg_mark_end(desc->frags, desc->fragno);
 + __sg_mark_end(desc->frags, desc->fragno);
 
  ret = crypto_blkcipher_decrypt_iv(&desc->desc, desc->frags,
desc->frags, thislen);

Hmm? These don't seem right. It also has a weird code sequence:
...
Did something go wrong there?
   
   Yes, I fixed those up after doing some allmodconfig builds.
   
   Here is the final patch I actually pushed to Linus:
  
  That fixes up the sg_mark_end() bit, but it's still calling
  sg_init_table() just a few lines further down. Is that correct?
 
 Absolutely.  It initially using the scatterlist for this
 crypto layer call:
 
   ret = crypto_blkcipher_decrypt_iv(&desc->desc, desc->frags,
 desc->frags, thislen);
   if (ret)
   return ret;
 
 then it reinits and sets the sglist to the values the caller
 wants.

Great, just wanted to double check that it was indeed correct!

-- 
Jens Axboe



Re: [PATCH]: Fix networking scatterlist regressions.

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, David Miller wrote:
 From: Jens Axboe [EMAIL PROTECTED]
 Date: Wed, 31 Oct 2007 08:46:21 +0100
 
  On Tue, Oct 30 2007, David Miller wrote:
   @@ -293,7 +293,7 @@ decryptor(struct scatterlist *sg, void *data)
 if (thislen == 0)
 return 0;

   - sg_mark_end(desc->frags, desc->fragno);
   + __sg_mark_end(desc->frags, desc->fragno);

 ret = crypto_blkcipher_decrypt_iv(&desc->desc, desc->frags,
   desc->frags, thislen);
  
  Hmm? These don't seem right. It also has a weird code sequence:
  ...
  Did something go wrong there?
 
 Yes, I fixed those up after doing some allmodconfig builds.
 
 Here is the final patch I actually pushed to Linus:

Here's the sg_mark_end() patch on top of that.

From 2f5371509d3d4d09269bf7a46868da2ac5c61d77 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Wed, 31 Oct 2007 09:11:10 +0100
Subject: [PATCH] [SG] Get rid of __sg_mark_end()

sg_mark_end() overwrites the page_link information, but all users want
__sg_mark_end() behaviour where we just set the end bit. That is the most
natural way to use the sg list, since you'll fill it in and then mark the
end point.

So change sg_mark_end() to only set the termination bit. Add a sg_magic
debug check as well, and clear a chain pointer if it is set.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 block/ll_rw_blk.c |2 +-
 drivers/scsi/scsi_lib.c   |2 +-
 include/linux/scatterlist.h   |   22 --
 net/core/skbuff.c |2 +-
 net/ipv4/tcp_ipv4.c   |2 +-
 net/ipv6/tcp_ipv6.c   |2 +-
 net/rxrpc/rxkad.c |2 +-
 net/sunrpc/auth_gss/gss_krb5_crypto.c |6 +++---
 8 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index e948407..fdc0707 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -1369,7 +1369,7 @@ new_segment:
} /* segments in rq */
 
if (sg)
-   __sg_mark_end(sg);
+   sg_mark_end(sg);
 
return nsegs;
 }
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 61fdaf0..88de771 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -785,7 +785,7 @@ struct scatterlist *scsi_alloc_sgtable(struct scsi_cmnd 
*cmd, gfp_t gfp_mask)
 * end-of-list
 */
if (!left)
-   sg_mark_end(sgl, this);
+   sg_mark_end(&sgl[this - 1]);
 
/*
 * don't allow subsequent mempool allocs to sleep, it would
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index d5e1876..b2116a1 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -188,21 +188,23 @@ static inline void sg_chain(struct scatterlist *prv, 
unsigned int prv_nents,
 
 /**
  * sg_mark_end - Mark the end of the scatterlist
- * @sgl:   Scatterlist
- * @nents: Number of entries in sgl
+ * @sg: SG entry
  *
  * Description:
- *   Marks the last entry as the termination point for sg_next()
+ *   Marks the passed in sg entry as the termination point for the sg
+ *   table. A call to sg_next() on this entry will return NULL.
  *
  **/
-static inline void sg_mark_end(struct scatterlist *sgl, unsigned int nents)
-{
-   sgl[nents - 1].page_link = 0x02;
-}
-
-static inline void __sg_mark_end(struct scatterlist *sg)
+static inline void sg_mark_end(struct scatterlist *sg)
 {
+#ifdef CONFIG_DEBUG_SG
+   BUG_ON(sg-sg_magic != SG_MAGIC);
+   BUG_ON(sg->sg_magic != SG_MAGIC);
+#endif
+   /*
+    * Set termination bit, clear potential chain bit
+    */
 sg->page_link |= 0x02;
+   sg->page_link &= ~0x01;
 
 /**
@@ -218,7 +220,7 @@ static inline void __sg_mark_end(struct scatterlist *sg)
 static inline void sg_init_table(struct scatterlist *sgl, unsigned int nents)
 {
memset(sgl, 0, sizeof(*sgl) * nents);
-   sg_mark_end(sgl, nents);
+   sg_mark_end(&sgl[nents - 1]);
 #ifdef CONFIG_DEBUG_SG
{
unsigned int i;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 64b50ff..32d5826 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2095,7 +2095,7 @@ int skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int le
 {
int nsg = __skb_to_sgvec(skb, sg, offset, len);
 
-   __sg_mark_end(&sg[nsg - 1]);
+   sg_mark_end(&sg[nsg - 1]);
 
return nsg;
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index eec02b2..d438dfb 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1083,7 +1083,7 @@ static int tcp_v4_do_calc_md5_hash(char *md5_hash, struct 
tcp_md5sig_key *key,
 sg_set_buf(&sg[block++], key->key, key->keylen);
 nbytes += key->keylen;
  
-   __sg_mark_end(&sg[block - 1]);
+   sg_mark_end(&sg[block - 1]);
 
/* Now store the Hash into the packet

Re: [PATCH]: Fix networking scatterlist regressions.

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, David Miller wrote:
 From: Jens Axboe [EMAIL PROTECTED]
 Date: Wed, 31 Oct 2007 08:46:21 +0100
 
  On Tue, Oct 30 2007, David Miller wrote:
   @@ -293,7 +293,7 @@ decryptor(struct scatterlist *sg, void *data)
 if (thislen == 0)
 return 0;

   - sg_mark_end(desc->frags, desc->fragno);
   + __sg_mark_end(desc->frags, desc->fragno);

 ret = crypto_blkcipher_decrypt_iv(&desc->desc, desc->frags,
   desc->frags, thislen);
  
  Hmm? These don't seem right. It also has a weird code sequence:
  ...
  Did something go wrong there?
 
 Yes, I fixed those up after doing some allmodconfig builds.
 
 Here is the final patch I actually pushed to Linus:

That fixes up the sg_mark_end() bit, but it's still calling
sg_init_table() just a few lines further down. Is that correct?


-- 
Jens Axboe



Re: [PATCH]: Fix networking scatterlist regressions.

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, David Miller wrote:
 From: Jens Axboe [EMAIL PROTECTED]
 Date: Wed, 31 Oct 2007 09:14:28 +0100
 
  Subject: [PATCH] [SG] Get rid of __sg_mark_end()
  
  sg_mark_end() overwrites the page_link information, but all users want
  __sg_mark_end() behaviour where we just set the end bit. That is the most
  natural way to use the sg list, since you'll fill it in and then mark the
  end point.
  
  So change sg_mark_end() to only set the termination bit. Add a sg_magic
  debug check as well, and clear a chain pointer if it is set.
  
  Signed-off-by: Jens Axboe [EMAIL PROTECTED]
 
 It doesn't build.  I suspect there is something else in your tree
 that is necessary for this patch to work on it's own.

Builds here. But yes, it's on top of other patches, it was merely for
demonstration purposes that I posted it. Locally sg_init_one() uses
sg_init_table() here, it doesn't open code the init:

static inline void sg_init_one(struct scatterlist *sg, const void *buf,
   unsigned int buflen)
{
sg_init_table(sg, 1);
sg_set_buf(sg, buf, buflen);
}


-- 
Jens Axboe



Re: [PATCH]: Fix networking scatterlist regressions.

2007-10-31 Thread Jens Axboe
On Wed, Oct 31 2007, Herbert Xu wrote:
 On Tue, Oct 30, 2007 at 08:40:02PM -0700, David Miller wrote:
  
  I just checked the following bug fix into net-2.6
 
 Thanks for getting to the bottom of this Dave! I seem to have
 mistaken the = for a |= in sg_mark_end :)

I don't blame you, that function was definitely non-intuitive!

-- 
Jens Axboe



Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops

2007-10-30 Thread Jens Axboe
On Tue, Oct 30 2007, Boaz Harrosh wrote:
 On Mon, Oct 29 2007 at 22:16 +0200, Jens Axboe [EMAIL PROTECTED] wrote:
  On Fri, Oct 26 2007, Herbert Xu wrote:
  [CRYPTO] tcrypt: Move sg_init_table out of timing loops
 
  This patch moves the sg_init_table out of the timing loops for hash
  algorithms so that it doesn't impact on the speed test results.
  
  Wouldn't it be better to just make sg_init_one() call sg_init_table?
  
  diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
  index 4571231..ccc55a6 100644
  --- a/include/linux/scatterlist.h
  +++ b/include/linux/scatterlist.h
  @@ -202,28 +202,6 @@ static inline void __sg_mark_end(struct scatterlist 
  *sg)
   }
   
   /**
  - * sg_init_one - Initialize a single entry sg list
  - * @sg: SG entry
  - * @buf:Virtual address for IO
  - * @buflen: IO length
  - *
  - * Notes:
  - *   This should not be used on a single entry that is part of a larger
  - *   table. Use sg_init_table() for that.
  - *
  - **/
  -static inline void sg_init_one(struct scatterlist *sg, const void *buf,
  -  unsigned int buflen)
  -{
  -   memset(sg, 0, sizeof(*sg));
  -#ifdef CONFIG_DEBUG_SG
-   sg->sg_magic = SG_MAGIC;
  -#endif
  -   sg_mark_end(sg, 1);
  -   sg_set_buf(sg, buf, buflen);
  -}
  -
  -/**
* sg_init_table - Initialize SG table
* @sgl:  The SG table
* @nents:Number of entries in table
  @@ -247,6 +225,24 @@ static inline void sg_init_table(struct scatterlist 
  *sgl, unsigned int nents)
   }
   
   /**
  + * sg_init_one - Initialize a single entry sg list
  + * @sg: SG entry
  + * @buf:Virtual address for IO
  + * @buflen: IO length
  + *
  + * Notes:
  + *   This should not be used on a single entry that is part of a larger
  + *   table. Use sg_init_table() for that.
  + *
  + **/
  +static inline void sg_init_one(struct scatterlist *sg, const void *buf,
  +  unsigned int buflen)
  +{
  +   sg_init_table(sg, 1);
  +   sg_set_buf(sg, buf, buflen);
  +}
  +
  +/**
* sg_phys - Return physical address of an sg entry
* @sg: SG entry
*
  
 Yes please submit this patch. scsi-ml is full of sg_init_one, specially
 on the error recovery path.

Will do.

-- 
Jens Axboe



Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops

2007-10-30 Thread Jens Axboe
On Tue, Oct 30 2007, Herbert Xu wrote:
 On Tue, Oct 30, 2007 at 06:50:58AM +0100, Jens Axboe wrote:
 
  How so? The reason you changed it to sg_init_table() + sg_set_buf() is
  exactly because sg_init_one() didn't properly init the entry (as they
  name promised).
 
 For one of the cases yes but the other one repeatedly calls
 sg_init_one on the same sg entry while we really only need
 to initialise it once and call sg_set_buf afterwards.
 
 Normally this is irrelevant but the loops in question are
 trying to estimate the speed of the algorithms so it's good
 to exclude as much noise from them as possible.

Ah OK, I was referring to the replacement mentioned above.

-- 
Jens Axboe



Re: [PATCH 1/2] [CRYPTO] tcrypt: Move sg_init_table out of timing loops

2007-10-29 Thread Jens Axboe
On Fri, Oct 26 2007, Herbert Xu wrote:
 [CRYPTO] tcrypt: Move sg_init_table out of timing loops
 
 This patch moves the sg_init_table out of the timing loops for hash
 algorithms so that it doesn't impact on the speed test results.

Wouldn't it be better to just make sg_init_one() call sg_init_table?

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 4571231..ccc55a6 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -202,28 +202,6 @@ static inline void __sg_mark_end(struct scatterlist *sg)
 }
 
 /**
- * sg_init_one - Initialize a single entry sg list
- * @sg: SG entry
- * @buf:Virtual address for IO
- * @buflen: IO length
- *
- * Notes:
- *   This should not be used on a single entry that is part of a larger
- *   table. Use sg_init_table() for that.
- *
- **/
-static inline void sg_init_one(struct scatterlist *sg, const void *buf,
-  unsigned int buflen)
-{
-   memset(sg, 0, sizeof(*sg));
-#ifdef CONFIG_DEBUG_SG
-   sg->sg_magic = SG_MAGIC;
-#endif
-   sg_mark_end(sg, 1);
-   sg_set_buf(sg, buf, buflen);
-}
-
-/**
  * sg_init_table - Initialize SG table
  * @sgl:  The SG table
  * @nents:Number of entries in table
@@ -247,6 +225,24 @@ static inline void sg_init_table(struct scatterlist *sgl, 
unsigned int nents)
 }
 
 /**
+ * sg_init_one - Initialize a single entry sg list
+ * @sg: SG entry
+ * @buf:Virtual address for IO
+ * @buflen: IO length
+ *
+ * Notes:
+ *   This should not be used on a single entry that is part of a larger
+ *   table. Use sg_init_table() for that.
+ *
+ **/
+static inline void sg_init_one(struct scatterlist *sg, const void *buf,
+  unsigned int buflen)
+{
+   sg_init_table(sg, 1);
+   sg_set_buf(sg, buf, buflen);
+}
+
+/**
  * sg_phys - Return physical address of an sg entry
  * @sg: SG entry
  *

-- 
Jens Axboe



Re: [-mm patch] make tcp_splice_data_recv() static

2007-09-12 Thread Jens Axboe
On Wed, Sep 12 2007, David Miller wrote:
 From: Adrian Bunk [EMAIL PROTECTED]
 Date: Sun, 9 Sep 2007 22:25:58 +0200
 
  On Fri, Aug 31, 2007 at 09:58:22PM -0700, Andrew Morton wrote:
  ...
   Changes since 2.6.23-rc3-mm1:
  ...
git-block.patch
  ...
git trees
  ...
  
  tcp_splice_data_recv() can become static.
  
  Signed-off-by: Adrian Bunk [EMAIL PROTECTED]
 
 I'll let Jens or similar pick this one up since it
 obviously won't apply to my tree.

I'll shove it in my #splice-net branch, where it originates from.

-- 
Jens Axboe



Re: Distributed storage.

2007-08-13 Thread Jens Axboe
On Sun, Aug 12 2007, Daniel Phillips wrote:
 On Tuesday 07 August 2007 13:55, Jens Axboe wrote:
  I don't like structure bloat, but I do like nice design. Overloading
  is a necessary evil sometimes, though. Even today, there isn't enough
  room to hold bi_rw and bi_flags in the same variable on 32-bit archs,
  so that concern can be scratched. If you read bio.h, that much is
  obvious.
 
 Sixteen bits in bi_rw are consumed by queue priority.  Is there a reason 
 this lives in struct bio instead of struct request?

If you don't, you have to pass them down. You can make that very
statement about basically any member of struct bio, until we end up with
a submit_bio() path and down taking 16 arguments.

  If you check up on the iommu virtual merging, you'll understand the
  front and back size members. They may smell dubious to you, but
  please take the time to understand why it looks the way it does.
 
 Virtual merging is only needed at the physical device, so why do these 
 fields live in struct bio instead of struct request?

A bio does exist outside of a struct request, and bio buildup also
happens before it gets attached to such.

  Changing the number of bvecs is integral to how bio buildup currently
  works.
 
 Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can 
 derive efficiently from BIO_POOL_IDX() provided the bio was allocated 
 in the standard way.

That would only be feasible, if we ruled that any bio in the system must
originate from the standard pools.

 This leaves a little bit of clean up to do for bios not allocated from
 a standard pool.

Please suggest how to do such a cleanup.

 Incidentally, why does the bvl need to be memset to zero on allocation?  
 bi_vcnt already tells you which bvecs are valid and the only field in a 
 bvec that can reasonably default to zero is the offset, which ought to 
 be set set every time a bvec is initialized anyway.

We could probably skip that, but that's an unrelated subject.

   bi_destructor could be combined.  I don't see a lot of users of
   bi_idx,
 
  bi_idx is integral to partial io completions.
 
 Struct request has a remaining submission sector count so what does 
 bi_idx do that is different?

Struct request has remaining IO count. You still need to know where to
start in the bio.
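
Purely as an illustration of what bi_idx buys you on a partial completion,
here's a rough sketch against the bio fields of that era (the helper name
is made up, this is not a drop-in patch):

#include <linux/bio.h>

/* advance a bio by 'done' bytes so a resubmission starts at the right spot */
static void bio_advance_sketch(struct bio *bio, unsigned int done)
{
	while (done) {
		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_idx];
		unsigned int step = min(done, bv->bv_len);

		bv->bv_offset += step;
		bv->bv_len -= step;
		bio->bi_sector += step >> 9;
		bio->bi_size -= step;
		done -= step;
		if (!bv->bv_len)
			bio->bi_idx++;	/* the next segment starts here */
	}
}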

   that looks like a soft target.  See what happened to struct page
   when a couple of folks got serious about attacking it, some really
   deep hacks were done to pare off a few bytes here and there.  But
   struct bio as a space waster is not nearly in the same ballpark.
 
  So show some concrete patches and examples, hand waving and
  assumptions is just a waste of everyones time.
 
 Average struct bio memory footprint ranks near the bottom of the list of 
 things that suck most about Linux storage.  At idle I see 8K in use 
 (reserves); during updatedb it spikes occasionally to 50K; under a 
 heavy  load generated by ddsnap on a storage box it sometimes goes to 
 100K with bio throttling in place.  Really not moving the needle.

Then, again, stop wasting time on this subject. Just because struct bio
isn't a huge bloat is absolutely no justification for adding extra
members to it. It's not just about system wide bloat.

 On the other hand, vm writeout deadlock ranks smack dab at the top of 
 the list, so that is where the patching effort must go for the 
 forseeable future.  Without bio throttling, the ddsnap load can go to 
 24 MB for struct bio alone.  That definitely moves the needle.  in 
 short, we save 3,200 times more memory by putting decent throttling in 
 place than by saving an int in struct bio.

Then fix the damn vm writeout. I always thought it was silly to depend
on the block layer for any sort of throttling. If it's not a system wide
problem, then throttle the io count in the make_request_fn handler of
that problematic driver.
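
For what it's worth, a rough sketch of what such driver-private throttling
in a make_request_fn could look like - mydev, MYDEV_MAX_INFLIGHT and the
wait queue are all hypothetical, and the 2.6.2x make_request_fn signature
is assumed:

static int mydev_make_request(struct request_queue *q, struct bio *bio)
{
	struct mydev *dev = q->queuedata;

	/* block new submissions while too much io is in flight for this device */
	wait_event(dev->throttle_wait,
		   atomic_read(&dev->inflight) < MYDEV_MAX_INFLIGHT);
	atomic_inc(&dev->inflight);

	/* hand the bio to the driver; the completion path must then do
	 * atomic_dec(&dev->inflight) and wake_up(&dev->throttle_wait) */
	mydev_queue_bio(dev, bio);
	return 0;
}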

 That said, I did a little analysis to get an idea of where the soft 
 targets are in struct bio, and to get to know the bio layer a little 
 better.  Maybe these few hints will get somebody interested enough to 
 look further.
 
   It would be interesting to see if bi_bdev could be made read only.
   Generally, each stage in the block device stack knows what the next
   stage is going to be, so why do we have to write that in the bio? 
   For error reporting from interrupt context?  Anyway, if Evgeniy
   wants to do the patch, I will happily unload the task of convincing
   you that random fields are/are not needed in struct bio :-)
 
  It's a trade off, otherwise you'd have to pass the block device
  around a lot.
 
 Which costs very little, probably less than trashing an extra field's 
 worth of cache.

Again, you can make that argument for most of the members. It's a
non-starter.

  And it's, again, a design issue. A bio contains 
  destination information, that means device/offset/size information.
  I'm all for shaving structure bytes where it matters, but not for the
  sake of sacrificing code

Re: Distributed storage.

2007-08-13 Thread Jens Axboe
On Mon, Aug 13 2007, Jens Axboe wrote:
  You did not comment on the one about putting the bio destructor in 
  the ->endio handler, which looks dead simple.  The majority of cases 
  just use the default endio handler and the default destructor.  Of the 
  remaining cases, where a specialized destructor is needed, typically a 
  specialized endio handler is too, so combining is free.  There are few 
  if any cases where a new specialized endio handler would need to be 
  written.
 
 We could do that without too much work, I agree.

But that idea fails as well, since reference counts and IO completion
are two completely seperate entities. So unless end IO just happens to
be the last user holding a reference to the bio, you cannot free it.
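
The point being that completion notification and lifetime are decoupled -
roughly, and using the bio_endio() signature of that era:

	bio_endio(bio, bio->bi_size, error);	/* completion notification */
	...
	bio_put(bio);				/* each holder drops its reference,
						 * the last put frees the bio */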

-- 
Jens Axboe



Re: Distributed storage.

2007-08-13 Thread Jens Axboe
On Mon, Aug 13 2007, Daniel Phillips wrote:
 On Monday 13 August 2007 00:28, Jens Axboe wrote:
  On Sun, Aug 12 2007, Daniel Phillips wrote:
   Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can
   derive efficiently from BIO_POOL_IDX() provided the bio was
   allocated in the standard way.
 
  That would only be feasible, if we ruled that any bio in the system
  must originate from the standard pools.
 
 Not at all.
 
   This leaves a little bit of clean up to do for bios not allocated
   from a standard pool.
 
  Please suggest how to do such a cleanup.
 
 Easy, use the BIO_POOL bits to know the bi_max_size, the same as for a 
 bio from the standard pool.  Just put the power of two size in the bits 
 and map that number to the standard pool arrangement with a table 
 lookup.

So reserve a bit that tells you how to interpret the (now) 3 remaining
bits. Doesn't sound very pretty, does it?

   On the other hand, vm writeout deadlock ranks smack dab at the top
   of the list, so that is where the patching effort must go for the
   forseeable future.  Without bio throttling, the ddsnap load can go
   to 24 MB for struct bio alone.  That definitely moves the needle. 
   in short, we save 3,200 times more memory by putting decent
   throttling in place than by saving an int in struct bio.
 
  Then fix the damn vm writeout. I always thought it was silly to
  depend on the block layer for any sort of throttling. If it's not a
  system wide problem, then throttle the io count in the
  make_request_fn handler of that problematic driver.
 
 It is a system wide problem.  Every block device needs throttling, 
 otherwise queues expand without limit.  Currently, block devices that 
 use the standard request library get a slipshod form of throttling for 
 free in the form of limiting in-flight request structs.  Because the 
 amount of IO carried by a single request can vary by two orders of 
 magnitude, the system behavior of this approach is far from 
 predictable.

Is it? Consider just 10 standard sata disks. The next kernel revision
will have sg chaining support, so that allows 32MiB per request. Even if
we disregard reads (not so interesting in this discussion) and just look
at potentially pinned dirty data in a single queue (128 in-flight requests
of 32MiB each), that number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch.

So I still think that this throttling needs to happen elsewhere, you
cannot rely the block layer throttling globally or for a single device.
It just doesn't make sense.

   You did not comment on the one about putting the bio destructor in
   the ->endio handler, which looks dead simple.  The majority of
   cases just use the default endio handler and the default
   destructor.  Of the remaining cases, where a specialized destructor
   is needed, typically a specialized endio handler is too, so
   combining is free.  There are few if any cases where a new
   specialized endio handler would need to be written.
 
  We could do that without too much work, I agree.
 
 OK, we got one and another is close to cracking, enough of that.

No we did not, I already failed this one in the next mail.

   As far as code stability goes, current kernels are horribly
   unstable in a variety of contexts because of memory deadlock and
   slowdowns related to the attempt to fix the problem via dirty
   memory limits.  Accurate throttling of bio traffic is one of the
   two key requirements to fix this instability, the other other is
   accurate writeout path reserve management, which is only partially
   addressed by BIO_POOL.
 
  Which, as written above and stated many times over the years on lkml,
  is not a block layer issue imho.
 
 Whoever stated that was wrong, but this should be no surprise.  There 
 have been many wrong things said about this particular bug over the 
 years.  The one thing that remains constant is, Linux continues to 
 deadlock under a variety of loads both with and without network 
 involvement, making it effectively useless as a storage platform.
 
 These deadlocks are first and foremost, block layer deficiencies.  Even 
 the network becomes part of the problem only because it lies in the 
 block IO path.

The block layer has NEVER guaranteed throttling, so it can - by
definition - not be a block layer deficiency.

-- 
Jens Axboe



Re: Distributed storage.

2007-08-13 Thread Jens Axboe
On Mon, Aug 13 2007, Daniel Phillips wrote:
 On Monday 13 August 2007 00:45, Jens Axboe wrote:
  On Mon, Aug 13 2007, Jens Axboe wrote:
You did not comment on the one about putting the bio destructor
in the ->endio handler, which looks dead simple.  The majority of
cases just use the default endio handler and the default
destructor.  Of the remaining cases, where a specialized
destructor is needed, typically a specialized endio handler is
too, so combining is free.  There are few if any cases where a
new specialized endio handler would need to be written.
  
   We could do that without too much work, I agree.
 
  But that idea fails as well, since reference counts and IO completion
  are two completely seperate entities. So unless end IO just happens
  to be the last user holding a reference to the bio, you cannot free
  it.
 
 That is not a problem.  When bio_put hits zero it calls ->endio instead 
 of the destructor.  The ->endio sees that the count is zero and 
 destroys the bio.

You can't be serious? You'd stall end io completion notification because
someone holds a reference to a bio. Surely you jest.

Needless to say, that will never go in.

-- 
Jens Axboe



Re: Distributed storage.

2007-08-13 Thread Jens Axboe
On Mon, Aug 13 2007, Daniel Phillips wrote:
 On Monday 13 August 2007 02:13, Jens Axboe wrote:
  On Mon, Aug 13 2007, Daniel Phillips wrote:
   On Monday 13 August 2007 00:45, Jens Axboe wrote:
On Mon, Aug 13 2007, Jens Axboe wrote:
  You did not comment on the one about putting the bio
   destructor in the ->endio handler, which looks dead simple. 
  The majority of cases just use the default endio handler and
  the default destructor.  Of the remaining cases, where a
  specialized destructor is needed, typically a specialized
  endio handler is too, so combining is free.  There are few if
  any cases where a new specialized endio handler would need to
  be written.

 We could do that without too much work, I agree.
   
But that idea fails as well, since reference counts and IO
completion are two completely seperate entities. So unless end IO
just happens to be the last user holding a reference to the bio,
you cannot free it.
  
   That is not a problem.  When bio_put hits zero it calls ->endio
   instead of the destructor.  The ->endio sees that the count is zero
   and destroys the bio.
 
  You can't be serious? You'd stall end io completion notification
  because someone holds a reference to a bio.
 
 Of course not.  Nothing I said stops endio from being called in the 
 usual way as well.  For this to work, endio just needs to know that one 
 call means end and the other means destroy, this is trivial.

Sorry Daniel, but your suggestions would do nothing more than uglify the
code and design.

-- 
Jens Axboe



Re: Distributed storage.

2007-08-13 Thread Jens Axboe
On Mon, Aug 13 2007, Daniel Phillips wrote:
 On Monday 13 August 2007 03:06, Jens Axboe wrote:
  On Mon, Aug 13 2007, Daniel Phillips wrote:
   Of course not.  Nothing I said stops endio from being called in the
   usual way as well.  For this to work, endio just needs to know that
   one call means end and the other means destroy, this is
   trivial.
 
  Sorry Daniel, but your suggestions would do nothing more than uglify
  the code and design.
 
 Pretty much exactly what was said about shrinking struct page, ask Bill.  
 The difference was, shrinking struct page actually mattered whereas 
 shrinking struct bio does not, and neither does expanding it by a few 
 bytes.

Lets back this up a bit - this whole thing began with you saying that
struct bio was bloated already, which I said wasn't true. You then
continued to hand-wave your way through various suggestions to trim the
obvious fat from that structure, none of which were nice or feasible.

I never compared the bio to struct page, I'd obviously agree that
shrinking struct page was a worthy goal and that it'd be ok to uglify
some code to do that. The same isn't true for struct bio.

And we can expand struct bio if we have to, naturally. And I've done it
before, which I wrote in the initial mail. I just don't want to do it
casually, then it WILL be bloated all of a sudden. Your laissez faire
attitude towards adding members to struct bio ("oh I'll just add it and
someone less lazy than me will fix it up in the future") makes me happy
that you are not maintaining anything that I use.

I'll stop replying to your mails until something interesting surfaces.
I've already made my points clear about both the above and the
throttling. And I'd advise you to let Evgeniy take this forward, he
seems a lot more adept to actually getting CODE done and - at least from
my current and past perspective - is someone you can actually have a
fruitful conversation with.

-- 
Jens Axboe



Re: Distributed storage.

2007-08-07 Thread Jens Axboe
On Sun, Aug 05 2007, Daniel Phillips wrote:
 A simple way to solve the stable accounting field issue is to add a new 
 pointer to struct bio that is owned by the top level submitter 
 (normally generic_make_request but not always) and is not affected by 
 any recursive resubmission.  Then getting rid of that field later 
 becomes somebody's summer project, which is not all that urgent because 
 struct bio is already bloated up with a bunch of dubious fields and is 
 a transient structure anyway.

Thanks for your insights. Care to detail what bloat and dubious fields
struct bio has?

And we don't add temporary fields out of laziness, hoping that someone
will later kill it again and rewrite it in a nicer fashion. Hint: that
never happens, bloat sticks.

-- 
Jens Axboe



Re: Distributed storage.

2007-08-07 Thread Jens Axboe
On Tue, Aug 07 2007, Daniel Phillips wrote:
 On Tuesday 07 August 2007 05:05, Jens Axboe wrote:
  On Sun, Aug 05 2007, Daniel Phillips wrote:
   A simple way to solve the stable accounting field issue is to add a
   new pointer to struct bio that is owned by the top level submitter
   (normally generic_make_request but not always) and is not affected
   by any recursive resubmission.  Then getting rid of that field
   later becomes somebody's summer project, which is not all that
   urgent because struct bio is already bloated up with a bunch of
   dubious fields and is a transient structure anyway.
 
  Thanks for your insights. Care to detail what bloat and dubious
  fields struct bio has?
 
 First obvious one I see is bi_rw separate from bi_flags.  Front_size and 
 back_size smell dubious.  Is max_vecs really necessary?  You could 

I don't like structure bloat, but I do like nice design. Overloading is
a necessary evil sometimes, though. Even today, there isn't enough room
to hold bi_rw and bi_flags in the same variable on 32-bit archs, so that
concern can be scratched. If you read bio.h, that much is obvious.

If you check up on the iommu virtual merging, you'll understand the
front and back size members. They may smell dubious to you, but please
take the time to understand why it looks the way it does.

 reasonably assume bi_vcnt rounded up to a power of two and bury the 
 details of making that work behind wrapper functions to change the 
 number of bvecs, if anybody actually needs that.  Bi_endio and 

Changing the number of bvecs is integral to how bio buildup currently
works.

 bi_destructor could be combined.  I don't see a lot of users of bi_idx, 

bi_idx is integral to partial io completions.

 that looks like a soft target.  See what happened to struct page when a 
 couple of folks got serious about attacking it, some really deep hacks 
 were done to pare off a few bytes here and there.  But struct bio as a 
 space waster is not nearly in the same ballpark.

So show some concrete patches and examples, hand waving and assumptions
is just a waste of everyones time.

 It would be interesting to see if bi_bdev could be made read only.  
 Generally, each stage in the block device stack knows what the next 
 stage is going to be, so why do we have to write that in the bio?  For 
 error reporting from interrupt context?  Anyway, if Evgeniy wants to do 
 the patch, I will happily unload the task of convincing you that random 
 fields are/are not needed in struct bio :-)

It's a trade off, otherwise you'd have to pass the block device around a
lot. And it's, again, a design issue. A bio contains destination
information, that means device/offset/size information. I'm all for
shaving structure bytes where it matters, but not for the sake of
sacrificing code stability or design. I consider struct bio quite lean
and have worked hard to keep it that way. In fact, iirc, the only
addition to struct bio since 2001 is the iommu front/back size members.
And I resisted those for quite a while.

-- 
Jens Axboe



Re: 2.6.23-rc1: BUG_ON in kmap_atomic_prot()

2007-07-24 Thread Jens Axboe
On Mon, Jul 23 2007, Andrew Morton wrote:
 I worked out that the crash I saw was in
 
 BUG_ON(!pte_none(*(kmap_pte-idx)));
 
 in the read of kmap_pte[idx].  Which would be weird as the caller is using 
 a literal KM_USER0.
 
 So maybe I goofed, and that BUG_ON is triggering (it scrolled off, and I am
 unable to reproduce it now).
 
 If that BUG_ON _is_ triggering then it might indicate that someone is doing
 a __GFP_HIGHMEM|__GFP_ZERO allocation while holding KM_USER0.

Or doing double kunmaps, or doing a kunmap_atomic() on the page, not the
address. I've seen both of those end up triggering that BUG_ON() in a
later kmap.

Looking over the 2.6.22..2.6.23-rc1 diff, I found one such error in
ocfs2 at least. But you are probably not using that, so I'll keep
looking...

---

[PATCH] ocfs2: bad kunmap_atomic()

kunmap_atomic() takes the virtual address, not the mapped page as
argument.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 5727cd1..c4034f6 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2153,7 +2153,7 @@ static int ocfs2_splice_write_actor(struct 
pipe_inode_info *pipe,
 src = buf->ops->map(pipe, buf, 1);
 dst = kmap_atomic(page, KM_USER1);
 memcpy(dst + offset, src + buf->offset, count);
-   kunmap_atomic(page, KM_USER1);
+   kunmap_atomic(dst, KM_USER1);
 buf->ops->unmap(pipe, buf, src);
 
 copied = ocfs2_write_end(file, file->f_mapping, sd->pos, count, count,

-- 
Jens Axboe



Re: 2.6.23-rc1: BUG_ON in kmap_atomic_prot()

2007-07-24 Thread Jens Axboe
On Tue, Jul 24 2007, Jens Axboe wrote:
 On Mon, Jul 23 2007, Andrew Morton wrote:
  I worked out that the crash I saw was in
  
  BUG_ON(!pte_none(*(kmap_pte-idx)));
  
  in the read of kmap_pte[idx].  Which would be weird as the caller is using 
  a literal KM_USER0.
  
  So maybe I goofed, and that BUG_ON is triggering (it scrolled off, and I am
  unable to reproduce it now).
  
  If that BUG_ON _is_ triggering then it might indicate that someone is doing
  a __GFP_HIGHMEM|__GFP_ZERO allocation while holding KM_USER0.
 
 Or doing double kunmaps, or doing a kunmap_atomic() on the page, not the
 address. I've seen both of those end up triggering that BUG_ON() in a
 later kmap.
 
 Looking over the 2.6.22..2.6.23-rc1 diff, I found one such error in
 ocfs2 at least. But you are probably not using that, so I'll keep
 looking...

What about the new async crypto stuff? I've been looking, but is it
guaranteed that async_memcpy() runs in process context with interrupts
enabled always? If not, there's a km type bug there.

In general, I think the highmem stuff could do with more safety checks:

- People ALWAYS get the atomic unmaps wrong, passing in the page instead
  of the address. I've seen tons of these. And since kunmap_atomic()
  takes a void pointer, nobody notices until it goes boom.
- People easily get the km type wrong - they use KM_USERx in interrupt
  context, or one of the irq variants without disabling interrupts.

If we could just catch these two types of bugs, we've got a lot of these
problems covered.
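
A sketch of the sort of check meant here, for the km type half of it - a
WARN rather than a BUG, and only under CONFIG_DEBUG_HIGHMEM. Treat it as
an illustration, not a tested patch:

#include <linux/kernel.h>
#include <linux/hardirq.h>
#include <linux/irqflags.h>
#include <asm/kmap_types.h>

static inline void debug_km_type(enum km_type type)
{
#ifdef CONFIG_DEBUG_HIGHMEM
	/* the KM_USER* slots are only valid from process context */
	if (type == KM_USER0 || type == KM_USER1)
		WARN_ON_ONCE(in_interrupt());
	/* the irq variants require interrupts to be disabled */
	if (type == KM_IRQ0 || type == KM_IRQ1 ||
	    type == KM_BIO_SRC_IRQ || type == KM_BIO_DST_IRQ)
		WARN_ON_ONCE(!irqs_disabled());
#endif
}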


-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive v3

2007-07-19 Thread Jens Axboe
On Thu, Jul 19 2007, YOSHIFUJI Hideaki wrote:
 Hello.
 
 In article [EMAIL PROTECTED] (at Wed, 11 Jul 2007 11:19:27 +0200), Jens 
 Axboe [EMAIL PROTECTED] says:
 
  @@ -835,6 +835,7 @@ const struct proto_ops inet_stream_ops = {
  .recvmsg   = sock_common_recvmsg,
  .mmap  = sock_no_mmap,
  .sendpage  = tcp_sendpage,
  +   .splice_read   = tcp_splice_read,
   #ifdef CONFIG_COMPAT
  .compat_setsockopt = compat_sock_common_setsockopt,
  .compat_getsockopt = compat_sock_common_getsockopt,
 
 Please add similar bits in net/ipv6/af_inet6.c
 unless there are any dependency on IPv4.
 (And if there are, it is not good.)

There are no specific ipv4 depedencies, it's just an oversight. So
thanks for the clue, I'll add it!

-- 
Jens Axboe



Re: TCP stalls in current git, possibly splice related

2007-07-18 Thread Jens Axboe
On Wed, Jul 18 2007, Johannes Berg wrote:
 On Mon, 2007-07-16 at 14:02 +0200, Jens Axboe wrote:
 
  Yep, it's a sender thing, so upgrading the receiver will not change
  anything.
 
 Ok, I upgraded, but that didn't help. And in fact, I don't see how it
 could have since synergy doesn't use splice or sendfile. I should've
 thought of that right away, sorry.
 
 It seems that packets are actually coming in during the time that my
 mouse hangs though (ran wireshark in parallel and saw no pauses in the
 timeline.) Hence, it actually seems to be on the receiver side, and
 running the synergy client under strace reveals that during the time my
 mouse hangs it's in poll() waiting for input on the tcp socket. sysrg-t
 doesn't show anything useful, it's just scheduling waiting for data.
 According to wireshark data is sent, but it never shows up at the
 application layer.

OK, then we can put splice off the hook at least :-)

If it's easily reproducible (and it sounds like it), then a git bisect
might be the easiest way forward.

-- 
Jens Axboe



Re: TCP stalls in current git, possibly splice related

2007-07-16 Thread Jens Axboe
On Mon, Jul 16 2007, Johannes Berg wrote:
 On Fri, 2007-07-13 at 13:05 +0200, Jens Axboe wrote:
  On Fri, Jul 13 2007, Johannes Berg wrote:
   On Thu, 2007-07-12 at 16:12 -0400, James Morris wrote:
I'm seeing TCP connection stalls with current git, and a bisect found 
the 
following as a possible cause:
   
   I'm also seeing stalls with synergy.
  
  Please try the below.
 
 I usually see stalls sending data from machine A to machine B. I just
 upgraded machine B to commit 8d9107e8c50e1c4ff43c91c8841805833f3ecfb9
 (Fri Jul 13 16:53:18 2007 -0700) and the tree includes this patch. This
 did not, however, fix the problem so I guess the problem is on the
 sender side. Due to wireless-dev lagging behind I still have
 2.6.22-something where this patch doesn't apply. I hope John will merge
 again soon and then I'll report back whether I still see the stalling
 behaviour with a current tree that includes this patch.

Yep, it's a sender thing, so upgrading the receiver will not change
anything.

-- 
Jens Axboe



Re: [2.6 patch] more ACSI removal

2007-07-16 Thread Jens Axboe
On Mon, Jul 16 2007, Geert Uytterhoeven wrote:
 On Mon, 16 Jul 2007, Jens Axboe wrote:
  On Fri, Jul 13 2007, Geert Uytterhoeven wrote:
   On Fri, 13 Jul 2007, Adrian Bunk wrote:
This patch removes some code that became dead code after the ATARI_ACSI 
removal.
   
   When was it removed? Nobody seems to have informed the m68k people...
  
  It hasn't even been compiling for 5 years, so it can't be that big of an
  issue. I thought you had been cc'ed on the discussion, but I can't seem
  to find the thread now even.
 
 OK, in that case it doesn't make much sense to keep ATARI_SLM, ATARI_BIONET,
 and ATARI_PAMSNET. Those who desperately(?) need it can resurrect ACSI
 if they want to...
 
 Acked-by: Geert Uytterhoeven [EMAIL PROTECTED]

Great thanks! I have that queued up, will add your acked-by.

-- 
Jens Axboe



Re: TCP stalls in current git, possibly splice related

2007-07-13 Thread Jens Axboe
On Fri, Jul 13 2007, Jens Axboe wrote:
 On Thu, Jul 12 2007, James Morris wrote:
  On Thu, 12 Jul 2007, David Miller wrote:
  
   From: James Morris [EMAIL PROTECTED]
   Date: Thu, 12 Jul 2007 16:12:25 -0400 (EDT)
   
I'm seeing TCP connection stalls with current git, and a bisect found 
the 
following as a possible cause:
   
   To add to this James is seeing this with distcc I believe.
  
  Correct.
 
 I'll try and reproduce.

You didn't happen to get a sysrq-t backtrace of that distcc being hung,
did you?

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive v3

2007-07-13 Thread Jens Axboe
On Thu, Jul 12 2007, Evgeniy Polyakov wrote:
 On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Hi,
 
 Hi Jens.
 
  Here's an updated implementation of tcp network splice receive support.
  It actually works for me now, no data corruption seen.
  
  For the original announcement and how to test it, see:
  
  http://marc.info/?l=linux-netdev&m=118103093400770&w=2
  
  The splice core changes needed to support this are now merged in
  2.6.22-git, so the patchset shrinks to just two patches - one for adding
  a release hook, and one for the networking changes.
  
  The code is also available in the splice-net branch here:
  
  git://git.kernel.dk/data/git/linux-2.6-block.git splice-net
  
  There's a third experimental patch in there that allows vmsplice
  directly to user memory, that still needs some work though.
  
  Comments, testing welcome!
 
 It looks like you included all bits we found in the previous runs, so
 likely it will work good, but so far I have conflicts merging todays git
 and your tree in include/linux/splice.h, fs/ext2/file.c, fs/splice.c and 
 mm/filemap_xip.c. This can be a problem with my tree though.

Hmm, the patch should apply directly to the tree as of when I posted
this original mail, or any later one. I just tried a rebase, and it
rebased fine on top of the current -git as well. So I think the issue is
with your tree, sorry!

 It really looks like the last tree we tested, so if you think an additional
 one will not hurt, feel free to ping, so I will completely rebase
 testing tree.

It would be great if you could retest! There are some minor changes in
there, and some extra testing definitely will not hurt.

-- 
Jens Axboe



Re: TCP stalls in current git, possibly splice related

2007-07-13 Thread Jens Axboe
On Fri, Jul 13 2007, Johannes Berg wrote:
 On Thu, 2007-07-12 at 16:12 -0400, James Morris wrote:
  I'm seeing TCP connection stalls with current git, and a bisect found the 
  following as a possible cause:
 
 I'm also seeing stalls with synergy.

Please try the below.

diff --git a/fs/splice.c b/fs/splice.c
index ed2ce99..92646aa 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -491,7 +491,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t 
*ppos,
 
ret = 0;
spliced = 0;
-   while (len) {
+   while (len && !spliced) {
ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
 
if (ret < 0)
@@ -1051,15 +1051,10 @@ ssize_t splice_direct_to_actor(struct file *in, struct 
splice_desc *sd,
sd->flags &= ~SPLICE_F_NONBLOCK;
 
while (len) {
-   size_t read_len, max_read_len;
-
-   /*
-* Do at most PIPE_BUFFERS pages worth of transfer:
-*/
-   max_read_len = min(len, (size_t)(PIPE_BUFFERS*PAGE_SIZE));
+   size_t read_len;
 
-   ret = do_splice_to(in, &sd->pos, pipe, max_read_len, flags);
-   if (unlikely(ret < 0))
+   ret = do_splice_to(in, &sd->pos, pipe, len, flags);
+   if (unlikely(ret <= 0))
goto out_release;
 
read_len = ret;
@@ -1071,26 +1066,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct 
splice_desc *sd,
 * could get stuck data in the internal pipe:
 */
ret = actor(pipe, sd);
-   if (unlikely(ret < 0))
+   if (unlikely(ret <= 0))
goto out_release;
 
bytes += ret;
len -= ret;
 
-   /*
-* In nonblocking mode, if we got back a short read then
-* that was due to either an IO error or due to the
-* pagecache entry not being there. In the IO error case
-* the _next_ splice attempt will produce a clean IO error
-* return value (not a short read), so in both cases it's
-* correct to break out of the loop here:
-*/
-   if ((flags & SPLICE_F_NONBLOCK) && (read_len < max_read_len))
-   break;
+   if (ret < read_len)
+   goto out_release;
}
 
pipe->nrbufs = pipe->curbuf = 0;
-
return bytes;
 
 out_release:
@@ -1152,10 +1138,12 @@ long do_splice_direct(struct file *in, loff_t *ppos, 
struct file *out,
.pos= *ppos,
.u.file = out,
};
-   size_t ret;
+   long ret;
 
ret = splice_direct_to_actor(in, &sd, direct_splice_actor);
-   *ppos = sd.pos;
+   if (ret  0)
+   *ppos += ret;
+
return ret;
 }
 

-- 
Jens Axboe



Re: TCP stalls in current git, possibly splice related

2007-07-13 Thread Jens Axboe
On Fri, Jul 13 2007, Jens Axboe wrote:
 On Fri, Jul 13 2007, Jens Axboe wrote:
  On Thu, Jul 12 2007, James Morris wrote:
   On Thu, 12 Jul 2007, David Miller wrote:
   
From: James Morris [EMAIL PROTECTED]
Date: Thu, 12 Jul 2007 16:12:25 -0400 (EDT)

 I'm seeing TCP connection stalls with current git, and a bisect found 
 the 
 following as a possible cause:

To add to this James is seeing this with distcc I believe.
   
   Correct.
  
  I'll try and reproduce.
 
 You didn't happen to get a sysrq-t backtrace of that distcc being hung,
 did you?

Does this work for you?

diff --git a/fs/splice.c b/fs/splice.c
index ed2ce99..92646aa 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -491,7 +491,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t 
*ppos,
 
ret = 0;
spliced = 0;
-   while (len) {
+   while (len && !spliced) {
ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
 
if (ret < 0)
@@ -1051,15 +1051,10 @@ ssize_t splice_direct_to_actor(struct file *in, struct 
splice_desc *sd,
sd->flags &= ~SPLICE_F_NONBLOCK;
 
while (len) {
-   size_t read_len, max_read_len;
-
-   /*
-* Do at most PIPE_BUFFERS pages worth of transfer:
-*/
-   max_read_len = min(len, (size_t)(PIPE_BUFFERS*PAGE_SIZE));
+   size_t read_len;
 
-   ret = do_splice_to(in, &sd->pos, pipe, max_read_len, flags);
-   if (unlikely(ret < 0))
+   ret = do_splice_to(in, &sd->pos, pipe, len, flags);
+   if (unlikely(ret <= 0))
goto out_release;
 
read_len = ret;
@@ -1071,26 +1066,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct 
splice_desc *sd,
 * could get stuck data in the internal pipe:
 */
ret = actor(pipe, sd);
-   if (unlikely(ret < 0))
+   if (unlikely(ret <= 0))
goto out_release;
 
bytes += ret;
len -= ret;
 
-   /*
-* In nonblocking mode, if we got back a short read then
-* that was due to either an IO error or due to the
-* pagecache entry not being there. In the IO error case
-* the _next_ splice attempt will produce a clean IO error
-* return value (not a short read), so in both cases it's
-* correct to break out of the loop here:
-*/
-   if ((flags & SPLICE_F_NONBLOCK) && (read_len < max_read_len))
-   break;
+   if (ret < read_len)
+   goto out_release;
}
 
pipe->nrbufs = pipe->curbuf = 0;
-
return bytes;
 
 out_release:
@@ -1152,10 +1138,12 @@ long do_splice_direct(struct file *in, loff_t *ppos, 
struct file *out,
.pos= *ppos,
.u.file = out,
};
-   size_t ret;
+   long ret;
 
ret = splice_direct_to_actor(in, &sd, direct_splice_actor);
-   *ppos = sd.pos;
+   if (ret > 0)
+   *ppos += ret;
+
return ret;
 }
 

-- 
Jens Axboe



Re: TCP stalls in current git, possibly splice related

2007-07-12 Thread Jens Axboe
On Thu, Jul 12 2007, James Morris wrote:
 On Thu, 12 Jul 2007, David Miller wrote:
 
  From: James Morris [EMAIL PROTECTED]
  Date: Thu, 12 Jul 2007 16:12:25 -0400 (EDT)
  
   I'm seeing TCP connection stalls with current git, and a bisect found the 
   following as a possible cause:
  
  To add to this James is seeing this with distcc I believe.
 
 Correct.

I'll try and reproduce.

-- 
Jens Axboe



Re: [2.6 patch] more ACSI removal

2007-07-12 Thread Jens Axboe
On Fri, Jul 13 2007, Adrian Bunk wrote:
 This patch removes some code that became dead code after the ATARI_ACSI 
 removal.
 
 It also indirectly fixes the following bug introduced by
 commit c2bcf3b8978c291e1b7f6499475c8403a259d4d6:
 
  config ATARI_SLM
    tristate "Atari SLM laser printer support"
 -   depends on ATARI && ATARI_ACSI!=n
 +   depends on ATARI
 
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

Thanks Adrian, applied!

-- 
Jens Axboe



[PATCH][RFC] network splice receive v3

2007-07-11 Thread Jens Axboe
Hi,

Here's an updated implementation of tcp network splice receive support.
It actually works for me now, no data corruption seen.

For the original announcement and how to test it, see:

http://marc.info/?l=linux-netdev&m=118103093400770&w=2

The splice core changes needed to support this are now merged in
2.6.22-git, so the patchset shrinks to just two patches - one for adding
a release hook, and one for the networking changes.

The code is also available in the splice-net branch here:

git://git.kernel.dk/data/git/linux-2.6-block.git splice-net

There's a third experimental patch in there that allows vmsplice
directly to user memory, that still needs some work though.

Comments, testing welcome!
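
For anyone who wants a quick feel for the receive path without digging the
test program out of the original announcement, the userspace side boils
down to roughly this (error handling trimmed, a libc splice() wrapper
assumed, fds already set up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* drain sock_fd into file_fd via a pipe */
static int splice_sock_to_file(int sock_fd, int file_fd)
{
	int pfd[2];
	ssize_t n;

	if (pipe(pfd) < 0)
		return -1;
	for (;;) {
		n = splice(sock_fd, NULL, pfd[1], NULL, 16384, SPLICE_F_MOVE);
		if (n <= 0)
			break;
		/* move what just arrived from the pipe into the file */
		splice(pfd[0], NULL, file_fd, NULL, n, SPLICE_F_MOVE);
	}
	close(pfd[0]);
	close(pfd[1]);
	return n < 0 ? -1 : 0;
}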

-- 
Jens Axboe

From e59a68f2d7d261b301960b97659910aab8e3d776 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 11 Jun 2007 13:00:32 +0200
Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe()

Allow caller to pass in a release function, there might be
other resources that need releasing as well. Needed for
network receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 fs/splice.c|9 -
 include/linux/splice.h |1 +
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3160951..4b4b501 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -254,11 +254,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	}
 
 	while (page_nr < spd_pages)
-		page_cache_release(spd->pages[page_nr++]);
+		spd->spd_release(spd, page_nr++);
 
 	return ret;
 }
 
+static void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
+{
+	page_cache_release(spd->pages[i]);
+}
+
 static int
 __generic_file_splice_read(struct file *in, loff_t *ppos,
 			   struct pipe_inode_info *pipe, size_t len,
@@ -277,6 +282,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		.partial = partial,
 		.flags = flags,
 		.ops = &page_cache_pipe_buf_ops,
+		.spd_release = spd_release_page,
 	};
 
 	index = *ppos >> PAGE_CACHE_SHIFT;
@@ -1674,6 +1680,7 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
 		.partial = partial,
 		.flags = flags,
 		.ops = &user_page_pipe_buf_ops,
+		.spd_release = spd_release_page,
 	};
 
 	pipe = pipe_info(file->f_path.dentry->d_inode);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 2c08456..b8fa41e 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -54,6 +54,7 @@ struct splice_pipe_desc {
 	int nr_pages;			/* number of pages in map */
 	unsigned int flags;		/* splice flags */
 	const struct pipe_buf_operations *ops;/* ops associated with output pipe */
+	void (*spd_release)(struct splice_pipe_desc *, unsigned int);
 };
 
 typedef int (splice_actor)(struct pipe_inode_info *, struct pipe_buffer *,
-- 
1.5.3.rc0.90.gbaa79

From b62e4a5a3e3220702e837e556427972dc591ff59 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Wed, 20 Jun 2007 09:54:14 +0200
Subject: [PATCH] TCP splice receive support

Support for network splice receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 include/linux/net.h|3 +
 include/linux/skbuff.h |5 +
 include/net/tcp.h  |3 +
 net/core/skbuff.c  |  246 
 net/ipv4/af_inet.c |1 +
 net/ipv4/tcp.c |  129 +
 net/socket.c   |   13 +++
 7 files changed, 400 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index efc4517..472ee12 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -19,6 +19,7 @@
 #define _LINUX_NET_H
 
 #include <linux/wait.h>
+#include <linux/splice.h>
 #include <asm/socket.h>
 
 struct poll_table_struct;
@@ -165,6 +166,8 @@ struct proto_ops {
   struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
   int offset, size_t size, int flags);
+	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
+   struct pipe_inode_info *pipe, size_t len, unsigned int flags);
 };
 
 struct net_proto_family {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6f0b2f7..177bffc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,6 +1504,11 @@ extern int	   skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum	   skb_copy_and_csum_bits(const struct sk_buff *skb,
 	  int offset, u8 *to, int len,
 	  __wsum csum);
+extern int skb_splice_bits(struct sk_buff *skb,
+		unsigned int offset,
+		struct pipe_inode_info *pipe,
+		unsigned int len,
+		unsigned int flags);
 extern void	   skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 extern void	   skb_split(struct sk_buff *skb,
  struct sk_buff *skb1, const u32 len);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8af9ae..8e86697 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -308,6 +308,9 @@ extern int

Re: [PATCH][RFC] network splice receive v3

2007-07-11 Thread Jens Axboe
On Wed, Jul 11 2007, Joel Becker wrote:
 On Wed, Jul 11, 2007 at 11:19:27AM +0200, Jens Axboe wrote:
  Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe()
  
  Allow caller to pass in a release function, there might be
  other resources that need releasing as well. Needed for
  network receive.
  
  diff --git a/fs/splice.c b/fs/splice.c
  index 3160951..4b4b501 100644
  --- a/fs/splice.c
  +++ b/fs/splice.c
  @@ -254,11 +254,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
  }
   
  while (page_nr  spd_pages)
  -   page_cache_release(spd-pages[page_nr++]);
  +   spd-spd_release(spd, page_nr++);
 
   Rather than requiring the caller set this, shouldn't we just
 allow it?  Especially given there is only one non-page user?
 
   while (page_nr  spd_pages)
  -page_cache_release(spd-pages[page_nr++]);
  +if (spd-spd_release)
  +spd-spd_release(spd, page_nr++);
  +else
  +page_cache_release(spd-pages[page_nr++]);

Certainly possible, I think it's cleaner with it always being set
though. If it grows other out-of-splice.c users, then your change may be
a good idea.

-- 
Jens Axboe



Re: [PATCH v2.6.22-rc5] cxgb2: handle possible NULL pointer dereferencing

2007-06-21 Thread Jens Axboe
On Thu, Jun 21 2007, pradeep singh wrote:
 Hi,
 
 Chelsio's in kernel 10G driver does not check the return value from
 t1_get_board_info() in cxgb2.c.
 t1_get_board_info may return a NULL and we still go on to dereference
 it in the for loop without checking for the NULL.
 
 This patch fixes this.

Patch looks odd - bi is dereferenced a number of times after that loop
anyway, so I don't see your patch fixing much.

-- 
Jens Axboe



Re: [PATCH v2.6.22-rc5] cxgb2: handle possible NULL pointer dereferencing

2007-06-21 Thread Jens Axboe
On Thu, Jun 21 2007, pradeep singh wrote:
 Hi
 On 6/21/07, Jens Axboe [EMAIL PROTECTED] wrote:
 On Thu, Jun 21 2007, pradeep singh wrote:
  Hi,
 
  Chelsio's in kernel 10G driver does not checks the return value from
  t1_get_board_info() in cxgb2.c.
  t1_get_board_info may return a NULL and we still go on to dereference
  it in the for loop without checking for the NULL.
 
  This patch fixes this.
 
 Patch looks odd - bi is dereferenced a number of times after that loop
 anyway, so I don't see your patch fixing much.
 Thanks for pointing that out Jens.
 Sorry, I pushed it in haste :(.
 Will check again and resubmit it.

You're welcome. The first thing to do is analyze whether a NULL return
from t1_get_board_info() makes any sense. From a quick look, driver_data
should be the index into the t1 pci table. So if t1_get_board_info()
returns NULL, it must be some core bug. So I'd say either don't handle
it, or mark it with BUG_ON(), or do the !bi check and CH_ERR() a warning
and goto out_disable_pdev.
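
Something along these lines, ie the warn-and-bail variant (the CH_ERR
format string and the exact probe context are assumptions on my part):

	bi = t1_get_board_info(ent->driver_data);
	if (!bi) {
		CH_ERR("%s: no board info for device id %lu\n",
		       pci_name(pdev), ent->driver_data);
		err = -ENODEV;
		goto out_disable_pdev;
	}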

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive v2

2007-06-15 Thread Jens Axboe
On Fri, Jun 15 2007, Evgeniy Polyakov wrote:
 On Thu, Jun 14, 2007 at 01:59:02PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Thu, Jun 14 2007, Evgeniy Polyakov wrote:
   On Wed, Jun 13, 2007 at 12:01:04PM +0400, Evgeniy Polyakov ([EMAIL 
   PROTECTED]) wrote:
I will rebase my tree, likely something was not merged correctly.
   
   Ok, I've just rebased a tree from recent git and pulled from brick -
   things seems to be settled. I've ran several tests with different
   filesizes and all files were received correctly without kernel crashes.
   There is skb leak indeed, and it looks like it is in the
   __splice_from_pipe() for the last page.
  
  Uh, the leak, right - I had forgotten about that, was working on getting
  vmsplice into shape the last two days. Interesting that you mention the
  last page, I'll dig in now! Any more info on this (how did you notice
  the leak originates from there)?
 
 I first observed leak via slabinfo data (not a leak, but number of skbs
 did not dropped after quite huge files were transferred), then added
 number of allocated and freed objects in skbuff.c, they did not match
 for big files, so I started to check splice source and found that
 sometimes ->release callback is not called, but code breaks out of the
 loop. I've put some printks in __splice_from_pipe() and found following
 case, when skb is leaked:
 when the same cloned skb was shared multiple times (no more than 2 though),
 only one copy was freed.
 
 Further analysis description requires some splice background (obvious
 for you, but that clears it for me):
 pipe_buffer only contains 16 pages.
 There is a code, which copies pages (pointers) from spd to pipe_buffer
 (splice_to_pipe()).
 skb allocations happens in chunks of different size (i.e. with different
 number of skbs/pages per call), so it is possible that number of
 allocated skbs will be less than pipe_buffer size (16), and then the
 rest of the pages will be put into different (or the same) pipe_buffer later.
 Sometimes two skbs from above example happens to be on the boundary of
 the pipe buffer, so only one of them is being copied into pipe_buffer,
 which is then transferred over the pipe.
 So, we have a case, when spd has (it had more, but these two are special) 
 2 pages (actually the same page, but two references to it), but pipe_buffer 
 has a place only for one. In that case second page from spd will be missed.
 
 So, things turned down to be not in the __splice_from_pipe(), but
 splice_to_pipe(). Attached patch fixes a leak for me.
 It was tested with different data files and all were received correctly.
 
 Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
 
 diff --git a/fs/splice.c b/fs/splice.c
 index bc481f1..365bfd9 100644
 --- a/fs/splice.c
 +++ b/fs/splice.c
 @@ -211,8 +211,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
   break;
   if (pipe->nrbufs < PIPE_BUFFERS)
   continue;
 -
 - break;
   }
  
   if (spd->flags & SPLICE_F_NONBLOCK) {
 

Hmm, curious. If we hit that location, then two conditions are true:

- Pipe is full
- We transferred some data

if you remove the break, then you'll end up blocking in pipe_wait()
(unless you have SPLICE_F_NONBLOCK also set). And we don't want to block
waiting for more room, if we already transferred some data. In that case
we just want to return a short splice. Looking at pipe_write(), it'll
block as well though. Just doesn't seem optimal to me, but...

So the question is why would doing the break there cause a leak? I just
don't yet see how it can happen, I'd love to fix that condition instead.
For the case you describe, we should have page_nr == 1 and spd-nr_pages
== 2. Is the:

while (page_nr < spd->nr_pages)
	spd->spd_release(spd, page_nr++);

not dropping the right reference?

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive v2

2007-06-15 Thread Jens Axboe
On Fri, Jun 15 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 15, 2007 at 10:43:18AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
   So, things turned down to be not in the __splice_from_pipe(), but
   splice_to_pipe(). Attached patch fixes a leak for me.
   It was tested with different data files and all were received correctly.
   
   Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED]
   
   diff --git a/fs/splice.c b/fs/splice.c
   index bc481f1..365bfd9 100644
   --- a/fs/splice.c
   +++ b/fs/splice.c
   @@ -211,8 +211,6 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 break;
  if (pipe->nrbufs < PIPE_BUFFERS)
 continue;
   -
   - break;
 }

  if (spd->flags & SPLICE_F_NONBLOCK) {
   
  
  Hmm, curious. If we hit that location, then two conditions are true:
  
  - Pipe is full
  - We transferred some data
 
 Yep.
 
  if you remove the break, then you'll end up blocking in pipe_wait()
  (unless you have SPLICE_F_NONBLOCK also set). And we don't want to block
  waiting for more room, if we already transferred some data. In that case
  we just want to return a short splice. Looking at pipe_write(), it'll
  block as well though. Just doesn't seem optimal to me, but...
  
  So the question is why would doing the break there cause a leak? I just
  don't yet see how it can happen, I'd love to fix that condition instead.
  For the case you describe, we should have page_nr == 1 and spd-nr_pages
  == 2. Is the:
  
   while (page_nr < spd->nr_pages)
   	spd->spd_release(spd, page_nr++);
  
  not dropping the right reference?
 
 Both spd-nr_pages and page_nr are equal to 1. When spd-nr_pages
 was 2 there was only 1 free slot in pipe_buffer.

Ah duh, it is a dumb bug indeed. This should work.

diff --git a/fs/splice.c b/fs/splice.c
index 89871c6..5327cbf 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -172,6 +172,7 @@ static const struct pipe_buf_operations 
user_page_pipe_buf_ops = {
 ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
   struct splice_pipe_desc *spd)
 {
+   unsigned int spd_pages = spd->nr_pages;
int ret, do_wakeup, page_nr;
 
ret = 0;
@@ -252,7 +253,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
}
}
 
-   while (page_nr < spd->nr_pages)
+   while (page_nr < spd_pages)
	spd->spd_release(spd, page_nr++);
 
return ret;

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive v2

2007-06-14 Thread Jens Axboe
On Thu, Jun 14 2007, Evgeniy Polyakov wrote:
 On Wed, Jun 13, 2007 at 12:01:04PM +0400, Evgeniy Polyakov ([EMAIL 
 PROTECTED]) wrote:
  I will rebase my tree, likely something was not merged correctly.
 
 Ok, I've just rebased a tree from recent git and pulled from brick -
 things seem to be settled. I've run several tests with different
 filesizes and all files were received correctly without kernel crashes.
 There is an skb leak indeed, and it looks like it is in the
 __splice_from_pipe() for the last page.

Uh, the leak, right - I had forgotten about that, was working on getting
vmsplice into shape the last two days. Interesting that you mention the
last page, I'll dig in now! Any more info on this (how did you notice
the leak originates from there)?

 Thanks Jens, good work.

Thanks, you're welcome!

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive v2

2007-06-13 Thread Jens Axboe
On Wed, Jun 13 2007, Evgeniy Polyakov wrote:
 On Tue, Jun 12, 2007 at 08:17:32PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
   On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
   wrote:
Patches are against the #splice branch of the block repo, official url
of that is:

git://git.kernel.dk/data/git/linux-2.6-block.git/

and it's based on Linus main tree. Let me know if I should supply netdev
branch patches instead, or even just provide a rolled up patch (or patch
series) for anyone curious to test or play with it.
   
   Hi Jens.
   
   I've just pulled your tree (splice-net, but splice tree looks the
   same, git pull says 'Already up-to-date.') on top of linus git and got
   following bug trace.  I will investigate it further tomorrow.
  
  Please tell me the contents of splice-net, it looks like you didn't
  actually use the new code. That BUG_ON() is in get_page(), which
  splice-net no longer uses. So the bug report cannot be valid for the
  current code.
 
 This is the last commit in that tree:
 
 commit c90a6ce8242d108a5bc6fd0bc1b2aca72a2b5944
 Author: Jens Axboe [EMAIL PROTECTED]
 Date:   Mon Jun 11 21:59:50 2007 +0200
 
 TCP splice receive support
 
 Support for network splice receive.
 
 Signed-off-by: Jens Axboe [EMAIL PROTECTED]
 
 :100644 100644 efc4517... 472ee12... M include/linux/net.h
 :100644 100644 e7367c7... 64e3eed... M include/linux/skbuff.h
 :100644 100644 a8af9ae... 8e86697... M include/net/tcp.h
 :100644 100644 7c6a34e... daea7b0... M net/core/skbuff.c
 :100644 100644 041fba3... 0ff9f86... M net/ipv4/af_inet.c
 :100644 100644 450f44b... 63efd7a... M net/ipv4/tcp.c
 :100644 100644 f453019... 41240f5... M net/socket.c
 
 I will rebase my tree, likely something was not merged correctly.

It must have been, please let me know how the current stuf works for
you!

-- 
Jens Axboe



[PATCH 0/3] Splice network receive support

2007-06-12 Thread Jens Axboe
Hi,

This series of patches applies on top of the splice series just
posted. It implements basic network receive support, ie splicing
from a socket to a pipe.

There seems to be a skbuff_head_cache leak somewhere that I need
to track down, otherwise it works fine for me.

-- 
Jens Axboe




[PATCH 1/3] splice: don't assume regular pages in splice_to_pipe()

2007-06-12 Thread Jens Axboe
Allow caller to pass in a release function, there might be
other resources that need releasing as well. Needed for
network receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 fs/splice.c|9 -
 include/linux/splice.h |1 +
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index f24e367..25ec9c8 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -247,11 +247,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
}
 
	while (page_nr < spd->nr_pages)
-   page_cache_release(spd->pages[page_nr++]);
+   spd->spd_release(spd, page_nr++);
 
return ret;
 }
 
+static void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
+{
+   page_cache_release(spd->pages[i]);
+}
+
 static int
 __generic_file_splice_read(struct file *in, loff_t *ppos,
   struct pipe_inode_info *pipe, size_t len,
@@ -270,6 +275,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
.partial = partial,
.flags = flags,
.ops = page_cache_pipe_buf_ops,
+   .spd_release = spd_release_page,
};
 
	index = *ppos >> PAGE_CACHE_SHIFT;
@@ -1442,6 +1448,7 @@ static long vmsplice_to_pipe(struct file *file, const 
struct iovec __user *iov,
.partial = partial,
.flags = flags,
.ops = user_page_pipe_buf_ops,
+   .spd_release = spd_release_page,
};
 
	pipe = pipe_info(file->f_path.dentry->d_inode);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 1a1182b..04c1068 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -53,6 +53,7 @@ struct splice_pipe_desc {
int nr_pages;   /* number of pages in map */
unsigned int flags; /* splice flags */
const struct pipe_buf_operations *ops;/* ops associated with output 
pipe */
+   void (*spd_release)(struct splice_pipe_desc *, unsigned int);
 };
 
 typedef int (splice_actor)(struct pipe_inode_info *, struct pipe_buffer *,
-- 
1.5.2.1.174.gcd03



[PATCH 3/3] TCP splice receive support

2007-06-12 Thread Jens Axboe
Support for network splice receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 include/linux/net.h|3 +
 include/linux/skbuff.h |5 +
 include/net/tcp.h  |3 +
 net/core/skbuff.c  |  231 
 net/ipv4/af_inet.c |1 +
 net/ipv4/tcp.c |  129 +++
 net/socket.c   |   13 +++
 7 files changed, 385 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index efc4517..472ee12 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -19,6 +19,7 @@
 #define _LINUX_NET_H
 
 #include <linux/wait.h>
+#include <linux/splice.h>
 #include <asm/socket.h>
 
 struct poll_table_struct;
@@ -165,6 +166,8 @@ struct proto_ops {
  struct vm_area_struct * vma);
ssize_t (*sendpage)  (struct socket *sock, struct page *page,
  int offset, size_t size, int flags);
+   ssize_t (*splice_read)(struct socket *sock,  loff_t *ppos,
+  struct pipe_inode_info *pipe, size_t 
len, unsigned int flags);
 };
 
 struct net_proto_family {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7367c7..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,6 +1504,11 @@ extern int  skb_store_bits(struct sk_buff 
*skb, int offset,
 extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb,
  int offset, u8 *to, int len,
  __wsum csum);
+extern int skb_splice_bits(struct sk_buff *skb,
+   unsigned int offset,
+   struct pipe_inode_info *pipe,
+   unsigned int len,
+   unsigned int flags);
 extern void   skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 extern void   skb_split(struct sk_buff *skb,
 struct sk_buff *skb1, const u32 len);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8af9ae..8e86697 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -308,6 +308,9 @@ extern int  tcp_twsk_unique(struct sock *sk,
 
 extern voidtcp_twsk_destructor(struct sock *sk);
 
+extern ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos,
+   struct pipe_inode_info *pipe, 
size_t len, unsigned int flags);
+
 static inline void tcp_dec_quickack_mode(struct sock *sk,
 const unsigned int pkts)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7c6a34e..daea7b0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -52,6 +52,7 @@
 #endif
 #include <linux/string.h>
 #include <linux/skbuff.h>
+#include <linux/splice.h>
 #include <linux/cache.h>
 #include <linux/rtnetlink.h>
 #include <linux/init.h>
@@ -71,6 +72,40 @@
 static struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
+static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
+ struct pipe_buffer *buf)
+{
+   struct sk_buff *skb = (struct sk_buff *) buf->private;
+
+   kfree_skb(skb);
+}
+
+static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
+   struct pipe_buffer *buf)
+{
+   struct sk_buff *skb = (struct sk_buff *) buf->private;
+
+   skb_get(skb);
+}
+
+static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
+  struct pipe_buffer *buf)
+{
+   return 1;
+}
+
+
+/* Pipe buffer operations for a socket. */
+static struct pipe_buf_operations sock_pipe_buf_ops = {
+   .can_merge = 0,
+   .map = generic_pipe_buf_map,
+   .unmap = generic_pipe_buf_unmap,
+   .pin = generic_pipe_buf_pin,
+   .release = sock_pipe_buf_release,
+   .steal = sock_pipe_buf_steal,
+   .get = sock_pipe_buf_get,
+};
+
 /*
  * Keep out-of-line to prevent kernel bloat.
  * __builtin_return_address is not used because it is not always
@@ -1116,6 +1151,202 @@ fault:
return -EFAULT;
 }
 
+/*
+ * Callback from splice_to_pipe(), if we need to release some pages
+ * at the end of the spd in case we error'ed out in filling the pipe.
+ */
+static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
+{
+   struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+
+   kfree_skb(skb);
+}
+
+/*
+ * Fill page/offset/length into spd, if it can hold more pages.
+ */
+static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
*page,
+   unsigned int len, unsigned int offset,
+   struct sk_buff *skb

[PATCH 2/3] tcp_read_sock: allow recv_actor() to return negative error value

2007-06-12 Thread Jens Axboe
Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 net/ipv4/tcp.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cd3c7e9..450f44b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1064,7 +1064,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t 
*desc,
break;
}
used = recv_actor(desc, skb, offset, len);
-   if (used <= len) {
+   if (used < 0) {
+   if (!copied)
+   copied = used;
+   break;
+   } else if (used <= len) {
seq += used;
copied += used;
offset += used;
@@ -1086,7 +1090,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t 
*desc,
tcp_rcv_space_adjust(sk);
 
/* Clean up data we have read: This will do ACK frames. */
-   if (copied)
+   if (copied > 0)
tcp_cleanup_rbuf(sk, copied);
return copied;
 }
-- 
1.5.2.1.174.gcd03



Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
 On Sat, Jun 09, 2007 at 08:36:09AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
   On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL 
   PROTECTED]) wrote:
I will try some things for the nearest 30-60 minutes, and then will 
move to
a canoe trip until Tuesday, so I will not be able to work on this idea.
   
   Ok, replacing in fs/splice.c every page_cache_release() with
   static void splice_page_release(struct page *p)
   {
 if (!PageSlab(p))
 page_cache_release(p);
   }
  
  Ehm, I don't see why that should be necessary. Except in
  splice_to_pipe(), I have considered that we need to pass in a release
  function if mapping fails at some point. But it's probably best to do
  that in the caller, since they have the knowledge of how to release the
  pages.
  
  The rest of the PageSlab() tests are bogus.
 
 I had a crashdump, where page was released via splice_to_pipe() indeed,
 I did not investigate if it is possible to release provided page in
 other places. I think if in future there will other slab usage cases
 except networking receiving, that might be useful, but as is it is not
 needed.

Read the just posted code, it has moved way beyond this :-)

   and putting the cloned skb into the private field instead of the
   original one in spd_fill_page() ends up without the kernel hanging.
  
  Why? Seems pointless to allocate a clone just to hold on to the skb, a
  reference should be equally good. I would not be opposed to doing it
  this way, I just don't see what a clone buys us as compared to just
  holding that reference to the skb.
 
 Receiving code does not expect shared skbs - too many fields are changed
 with assumptions that it is a private copy.

Actually the main problem is that tcp_read_sock() unconditionally frees
the skb, so it wouldn't help if we grabbed a reference to it. I've yet
to receive an explanation of why it does so, seem awkward and violates
the whole principle of reference counted objects. Davem??

So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
hope we can get rid of that by fixing tcp_read_sock(), though.
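
(For readers of the archive: roughly what sk_eat_skb() of that era boils
down to -- paraphrased for illustration, not a verbatim copy of the header --
which is why simply taking a reference from the splice side would not help,
and a clone is used instead.)

/* paraphrased illustration only, not verbatim kernel code */
static inline void sk_eat_skb_illustration(struct sock *sk, struct sk_buff *skb)
{
	__skb_unlink(skb, &sk->sk_receive_queue);
	__kfree_skb(skb);	/* frees the skb no matter what skb->users says */
}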

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
 and putting the cloned skb into the private field instead of the
 original one in spd_fill_page() ends up without the kernel hanging.

Why? Seems pointless to allocate a clone just to hold on to the skb, a
reference should be equally good. I would not be opposed to doing it
this way, I just don't see what a clone buys us as compared to just
holding that reference to the skb.
   
   Receiving code does not expect shared skbs - too many fields are changed
   with assumptions that it is a private copy.
  
  Actually the main problem is that tcp_read_sock() unconditionally frees
  the skb, so it wouldn't help if we grabbed a reference to it. I've yet
  to receive an explanation of why it does so, seem awkward and violates
  the whole principle of reference counted objects. Davem??
  
  So for now, skb_splice_bits() clones the incoming skb to avoid that. I'd
  hope we can get rid of that by fixing tcp_read_sock(), though.
 
 It does that because it knows that the skb is not allowed to be shared
 there. Similar things are being done in udp for example - the code changes
 internal members of the skb, since it knows the skb is not shared.
 
 For example generic_make_request() is not allowed to change, say, 
 bio->bi_sector or bi_destructor, since it does not own the block request, 
 no matter what bi_cnt is. From the other side, ->bi_destructor() can do
 whatever it wants with the bio without any check of its reference counter.

But generic_make_request() DOES change ->bi_sector, that's how partition
remapping works :-). The destructor can of course do whatever it wants,
by definition the bio is not referenced at that point (or it would not
have been called). So while I think your analogy is quite poor, I do now
follow the code (even if I think it's ugly). There's quite a big
difference between changing parts of the elements of a structure to just
grabbing a reference to it. If the skb cannot be referenced, skb_get()
should return NULL.

But that aside, I see the issue. I'll just stick to the clone, it works
fine as-is (well almost, there's a leak there, but functionally it's
ok!).

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
  difference between changing parts of the elements of a structure to just
  grabbing a reference to it. If the skb cannot be referenced, skb_get()
  should return NULL.
  
  But that aside, I see the issue. I'll just stick to the clone, it works
  fine as-is (well almost, there's a leak there, but functionally it's
  ok!).
 
 Btw, is it allowed to use splice from network with, say, nfs?
 Since the RPC code uses sk_user_data, as does the network splice code.

It doesn't anymore, see the version posted today (or yesterday, but it
would be silly to read older code than the newest :-)

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive v2

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Evgeniy Polyakov wrote:
 On Mon, Jun 11, 2007 at 01:59:26PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Patches are against the #splice branch of the block repo, official url
  of that is:
  
  git://git.kernel.dk/data/git/linux-2.6-block.git/
  
  and it's based on Linus main tree. Let me know if I should supply netdev
  branch patches instead, or even just provide a rolled up patch (or patch
  series) for anyone curious to test or play with it.
 
 Hi Jens.
 
 I've just pulled your tree (splice-net, but splice tree looks the
 same, git pull says 'Already up-to-date.') on top of linus git and got
 following bug trace.  I will investigate it further tomorrow.

Please tell me the contents of splice-net, it looks like you didn't
actually use the new code. That BUG_ON() is in get_page(), which
splice-net no longer uses. So the bug report cannot be valid for the
current code.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-11 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 Size of the received file is bigger than file sent, file contains repeated
 blocks of data sometimes. Cloned skb usage is likely too big overhead,
 although for receiving fast clone is unused in most cases, so there
 might be some gain.

That was actually a new bug, here:

plen -= *offset;
poff += *offset;

in __skb_splice_bits(), we should only subtract the offset from plen, not
add it to poff as well. Doing both just creates some weird hole without any
meaning. So remove those two poff additions in the two loops, and the size
issue is resolved at least.

-- 
Jens Axboe



[PATCH][RFC] network splice receive v2

2007-06-11 Thread Jens Axboe
Hi,

Here's an updated implementation of tcp network splice receive support.
It actually works for me now, no data corruption seen.

For the original announcement and how to test it, see:

http://marc.info/?l=linux-netdev&m=118103093400770&w=2

I hang on to the original skb by creating a clone of it and holding
references to that. I really don't need a clone, but tcp_read_sock() is
not being very helpful by calling sk_eat_skb() which blissfully ignores
any reference counts and uncondtionally frees the skb - why is that??
The clone works around that issue.

The current code also gets rid of the data_ready hack, and it should now
handle linear/fragments/fraglist in the skb just fine. Thanks to Olaf
for providing review of that stuff!

Patches are against the #splice branch of the block repo, official url
of that is:

git://git.kernel.dk/data/git/linux-2.6-block.git/

and it's based on Linus main tree. Let me know if I should supply netdev
branch patches instead, or even just provide a rolled up patch (or patch
series) for anyone curious to test or play with it.

-- 
Jens Axboe

From fd79bf84fdeb15b72f256c342609257ae8a56235 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 11 Jun 2007 13:00:32 +0200
Subject: [PATCH] splice: don't assume regular pages in splice_to_pipe()

Allow caller to pass in a release function, there might be
other resources that need releasing as well. Needed for
network receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 fs/splice.c|9 -
 include/linux/splice.h |1 +
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index f24e367..25ec9c8 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -247,11 +247,16 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
 	}
 
 	while (page_nr < spd->nr_pages)
-		page_cache_release(spd->pages[page_nr++]);
+		spd->spd_release(spd, page_nr++);
 
 	return ret;
 }
 
+static void spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
+{
+	page_cache_release(spd->pages[i]);
+}
+
 static int
 __generic_file_splice_read(struct file *in, loff_t *ppos,
 			   struct pipe_inode_info *pipe, size_t len,
@@ -270,6 +275,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		.partial = partial,
 		.flags = flags,
 		.ops = page_cache_pipe_buf_ops,
+		.spd_release = spd_release_page,
 	};
 
  	index = *ppos >> PAGE_CACHE_SHIFT;
@@ -1442,6 +1448,7 @@ static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
 		.partial = partial,
 		.flags = flags,
 		.ops = user_page_pipe_buf_ops,
+		.spd_release = spd_release_page,
 	};
 
  	pipe = pipe_info(file->f_path.dentry->d_inode);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 1a1182b..04c1068 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -53,6 +53,7 @@ struct splice_pipe_desc {
 	int nr_pages;			/* number of pages in map */
 	unsigned int flags;		/* splice flags */
 	const struct pipe_buf_operations *ops;/* ops associated with output pipe */
+	void (*spd_release)(struct splice_pipe_desc *, unsigned int);
 };
 
 typedef int (splice_actor)(struct pipe_inode_info *, struct pipe_buffer *,
-- 
1.5.2.1.174.gcd03

From 95a1ee277f2a6df5c95d786b9229ea0ffa46850d Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 4 Jun 2007 15:06:43 +0200
Subject: [PATCH] tcp_read_sock: allow recv_actor() to return negative error value

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 net/ipv4/tcp.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cd3c7e9..450f44b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1064,7 +1064,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	break;
 			}
 			used = recv_actor(desc, skb, offset, len);
-			if (used <= len) {
+			if (used < 0) {
+				if (!copied)
+					copied = used;
+				break;
+			} else if (used <= len) {
 seq += used;
 copied += used;
 offset += used;
@@ -1086,7 +1090,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tcp_rcv_space_adjust(sk);
 
 	/* Clean up data we have read: This will do ACK frames. */
-	if (copied)
+	if (copied > 0)
 		tcp_cleanup_rbuf(sk, copied);
 	return copied;
 }
-- 
1.5.2.1.174.gcd03

From f0329226c6c1f521c2069358699bae5ed48f5a43 Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 11 Jun 2007 13:51:57 +0200
Subject: [PATCH] TCP splice receive support

Support for network splice receive.

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 include/linux/net.h|3 +
 include/linux/skbuff.h |5 ++
 include/net/tcp.h  |3 +
 net/core/skbuff.c  |  175 
 net/ipv4/af_inet.c |1 +
 net/ipv4/tcp.c |  122 +
 net/socket.c   |   13 
 7 files changed, 322 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index

Re: [PATCH][RFC] network splice receive

2007-06-09 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 06:57:25PM +0400, Evgeniy Polyakov ([EMAIL 
 PROTECTED]) wrote:
  I will try some things for the nearest 30-60 minutes, and then will move to
   a canoe trip until Tuesday, so I will not be able to work on this idea.
 
 Ok, replacing in fs/splice.c every page_cache_release() with
 static void splice_page_release(struct page *p)
 {
   if (!PageSlab(p))
   page_cache_release(p);
 }

Ehm, I don't see why that should be necessary. Except in
splice_to_pipe(), I have considered that we need to pass in a release
function if mapping fails at some point. But it's probably best to do
that in the caller, since they have the knowledge of how to release the
pages.

The rest of the PageSlab() tests are bogus.

 and putting the cloned skb into the private field instead of the
 original one in spd_fill_page() ends up without the kernel hanging.

Why? Seems pointless to allocate a clone just to hold on to the skb, a
reference should be equally good. I would not be opposed to doing it
this way, I just don't see what a clone buys us as compared to just
holding that reference to the skb.

 I'm not sure it is correct, that page can be released in fs/splice.c
 without calling any callback from network code, when network data is
 being processed.

Please explain!

 Size of the received file is bigger than file sent, file contains repeated
 blocks of data sometimes. Cloned skb usage is likely too big overhead,
 although for receiving fast clone is unused in most cases, so there
 might be some gain.
 
 Attached your patch with above changes.

Thanks, I'll fiddle with this on monday.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Thu, Jun 07 2007, Evgeniy Polyakov wrote:
 On Thu, Jun 07, 2007 at 12:51:59PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
   What about checking if the page belongs to the kmalloc cache (or any other cache
   via private pointers) and not performing any kind of reference counting
   on them? I will play with this a bit later today.
  
  That might work, but sounds a little dirty... But there's probably no
  way around. Be sure to look at the #splice-net branch if you are playing
  with this, I've updated it a number of times and fixed some bugs in
  there. Notably it now gets the offset right, and handles fragments and
  fraglist as well.
 
 I've pulled splice-net, which indeed fixed some issues, but referencing
 slab pages is still not allowed. There are at least two problems
 (although they are related):
 1. if we do not increment the reference counter for slab pages, they
 eventually get refilled and the slab explodes after it understands that its
 pages are in use (or user dies when page is moved out of his control in
 slab).
 2. get/put page does not work with slab pages, and simple
 increment/decrement of the reference counters is not allowed too.
 
 Both problems have the same root - slab does not allow anyone to 
 manipulate page's members. That should be broken/changed to allow splice
 to put its hands into network using the fastest way.
 I will think about it.

Perhaps it's possible to solve this at a different level - can we hang
on to the skb until the pipe buffer has been consumed, and prevent reuse
that way? Then we don't have to care what backing the skb has, as long
as it (and its data) isn't being reused until we drop the reference to
it in sock_pipe_buf_release().
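
(A minimal sketch of that pairing, reusing the hook names that appear in the
patches elsewhere in this thread -- illustrative only, not the final code: the
pipe buffer pins the skb while the data sits in the pipe and drops it once the
reader has consumed the buffer.)

static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
			      struct pipe_buffer *buf)
{
	struct sk_buff *skb = (struct sk_buff *) buf->private;

	skb_get(skb);		/* buffer queued: pin the skb and its data */
}

static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
				  struct pipe_buffer *buf)
{
	struct sk_buff *skb = (struct sk_buff *) buf->private;

	kfree_skb(skb);		/* buffer consumed: drop the reference */
}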

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, David Miller wrote:
 From: Jens Axboe [EMAIL PROTECTED]
 Date: Fri, 8 Jun 2007 09:48:24 +0200
 
  Perhaps it's possible to solve this at a different level - can we hang
  on to the skb until the pipe buffer has been consumed, and prevent reuse
  that way? Then we don't have to care what backing the skb has, as long
  as it (and its data) isn't being reused until we drop the reference to
  it in sock_pipe_buf_release().
 
  Depending upon whether the pipe buffer consumption is bounded or not,
 this will jam up the TCP sender because the SKB data allocation is
 charged against the socket send buffer allocation.

Forgive my network ignorance, but is that a problem? Since you bring it
up, I guess so :-)

We can grow the pipe, should we have to. So instead of blocking waiting
on reader consumption, we can extend the size of the pipe and keep
going.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 10:38:53AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  On Fri, Jun 08 2007, David Miller wrote:
   From: Jens Axboe [EMAIL PROTECTED]
   Date: Fri, 8 Jun 2007 09:48:24 +0200
   
Perhaps it's possible to solve this at a different level - can we hang
on to the skb until the pipe buffer has been consumed, and prevent reuse
that way? Then we don't have to care what backing the skb has, as long
as it (and its data) isn't being reused until we drop the reference to
it in sock_pipe_buf_release().
   
    Depending upon whether the pipe buffer consumption is bounded or not,
   this will jam up the TCP sender because the SKB data allocation is
   charged against the socket send buffer allocation.
  
  Forgive my network ignorance, but is that a problem? Since you bring it
  up, I guess so :-)
 
 David means that socket buffer allocation is limited, and delaying
 freeing can end up exhausting that limit.

OK, so a delayed empty of the pipe could end up causing packet drops
elsewhere due to allocation exhaustion?

  We can grow the pipe, should we have to. So instead of blocking waiting
  on reader consumption, we can extend the size of the pipe and keep
  going.
 
 I have code which roughly works (but I will test it some more), which
 just introduces reference counters for slab pages, so that they would not
 actually be freed via page reclaim, but only after the reference counters
 are dropped. That forced changes in mm/slab.c so it is likely an
 unacceptable solution, but it is interesting as is.

Hmm, still seems like it's working around the problem. We essentially
just need to ensure that the data doesn't get _reused_, not just freed.
It doesn't help holding a reference to the page, if someone else just
reuses it and fills it with other data before it has been consumed and
released by the pipe buffer operations.

That's why I thought the skb referencing was the better idea, then we
don't have to care about the backing of the skb either. Provided that
preventing the free of the skb before the pipe buffer has been consumed
guarantees that the contents aren't reused.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 11:04:40AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  OK, so a delayed empty of the pipe could end up causing packet drops
  elsewhere due to allocation exhaustion?
 
 Yes.
 
We can grow the pipe, should we have to. So instead of blocking waiting
on reader consumption, we can extend the size of the pipe and keep
going.
   
   I have code which roughly works (but I will test it some more), which
   just introduces reference counters for slab pages, so that they would not
   actually be freed via page reclaim, but only after the reference counters
   are dropped. That forced changes in mm/slab.c so it is likely an
   unacceptable solution, but it is interesting as is.
  
  Hmm, still seems like it's working around the problem. We essentially
  just need to ensure that the data doesn't get _reused_, not just freed.
  It doesn't help holding a reference to the page, if someone else just
  reuses it and fills it with other data before it has been consumed and
  released by the pipe buffer operations.
  
  That's why I thought the skb referencing was the better idea, then we
  don't have to care about the backing of the skb either. Provided that
  preventing the free of the skb before the pipe buffer has been consumed
  guarantees that the contents aren't reused.
 
 It is not only a better idea, it is the only correct one.
 Attached is a patch for the interested reader, which does slab page accounting,
 but it is broken. It does not fire up with a kernel bug, but it fills the
 output file with random garbage from reused and dirtied pages. And I do
 not know why, but the received file is always smaller than the file being sent
 (when the file has a reasonable size like 10mb; with a 4-40kb filesize things
 seem to be ok).
 
 I've started skb referencing work, let's see where this will end up.

Here's a start, for the splice side at least of storing a buf-private
entity with the ops.

diff --git a/fs/splice.c b/fs/splice.c
index 90588a8..f24e367 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -191,6 +191,7 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
buf->page = spd->pages[page_nr];
buf->offset = spd->partial[page_nr].offset;
buf->len = spd->partial[page_nr].len;
+   buf->private = spd->partial[page_nr].private;
buf->ops = spd->ops;
if (spd->flags & SPLICE_F_GIFT)
buf->flags |= PIPE_BUF_FLAG_GIFT;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7ba228d..4409167 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -14,6 +14,7 @@ struct pipe_buffer {
unsigned int offset, len;
const struct pipe_buf_operations *ops;
unsigned int flags;
+   unsigned long private;
 };
 
 struct pipe_inode_info {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 619dcf5..64e3eed 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,7 +1504,7 @@ extern int   skb_store_bits(struct sk_buff 
*skb, int offset,
 extern __wsum skb_copy_and_csum_bits(const struct sk_buff *skb,
  int offset, u8 *to, int len,
  __wsum csum);
-extern int skb_splice_bits(const struct sk_buff *skb,
+extern int skb_splice_bits(struct sk_buff *skb,
unsigned int offset,
struct pipe_inode_info *pipe,
unsigned int len,
diff --git a/include/linux/splice.h b/include/linux/splice.h
index b3f1528..1a1182b 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -41,6 +41,7 @@ struct splice_desc {
 struct partial_page {
unsigned int offset;
unsigned int len;
+   unsigned long private;
 };
 
 /*
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d2b2547..7d9ec9e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -78,7 +78,10 @@ static void sock_pipe_buf_release(struct pipe_inode_info 
*pipe,
 #ifdef NET_COPY_SPLICE
__free_page(buf->page);
 #else
-   put_page(buf->page);
+   struct sk_buff *skb = (struct sk_buff *) buf->private;
+
+   kfree_skb(skb);
+   //put_page(buf->page);
 #endif
 }
 
@@ -1148,7 +1151,8 @@ fault:
  * Fill page/offset/length into spd, if it can hold more pages.
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
*page,
-   unsigned int len, unsigned int offset)
+   unsigned int len, unsigned int offset,
+   struct sk_buff *skb)
 {
struct page *p;
 
@@ -1163,12 +1167,14 @@ static inline int spd_fill_page(struct splice_pipe_desc 
*spd, struct page *page,
memcpy(page_address(p

Re: [PATCH][RFC] network splice receive

2007-06-08 Thread Jens Axboe
On Fri, Jun 08 2007, Evgeniy Polyakov wrote:
 On Fri, Jun 08, 2007 at 04:14:52PM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Here's a start, for the splice side at least of storing a buf-private
  entity with the ops.
 
 :) I tested the same implementation, but I put skb pointer into
 page-private. My approach is not correct, since the same page can hold
 several objects, so if there are several splicers, this will scream.
 I've tested your patch on top of splice-net branch, here is a result:
 
 [   44.798853] Slab corruption: skbuff_head_cache start=81003b726668, 
 len=192
 [   44.806148] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
 [   44.811598] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
 [   44.818012] 0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6a 6b 6b a5
 [   44.824889] Prev obj: start=81003b726590, len=192
 [   44.829985] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
 [   44.835604] Last user: [8036a22c](__alloc_skb+0x40/0x13f)
 [   44.842010] 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 [   44.848896] 010: 20 58 7e 3b 00 81 ff ff 00 00 00 00 00 00 00 00
 [   44.855772] Next obj: start=81003b726740, len=192
 [   44.860868] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
 [   44.866314] Last user: [803699fd](kfree_skbmem+0x7a/0x7e)
 [   44.872721] 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
 [   44.879597] 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
 
 I will try some things for the nearest 30-60 minutes, and then will move to
  a canoe trip until Tuesday, so I will not be able to work on this idea.

I'm not surprised, it wasn't tested at all - just provides the basic
framework for storing the skb so we can access it on pipe buffer
release.

Lets talk more next week, I'll likely play with this approach on monday.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-07 Thread Jens Axboe
On Thu, Jun 07 2007, Evgeniy Polyakov wrote:
 On Wed, Jun 06, 2007 at 09:17:25AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
   Some pages have zero reference counter here.
  
  But it's somewhat annoying to get pages with zero reference counts
  there, I wonder how that happens. I guess if the skb->data originated
  from kmalloc() then we don't really know. The main intent there was just
  to ensure the page wasn't going away, but clearly it's not good enough
  to ensure that reuse isn't taking place.
 
 What about checking if the page belongs to the kmalloc cache (or any other cache
 via private pointers) and not performing any kind of reference counting
 on them? I will play with this a bit later today.

That might work, but sounds a little dirty... But there's probably no
way around. Be sure to look at the #splice-net branch if you are playing
with this, I've updated it a number of times and fixed some bugs in
there. Notably it now gets the offset right, and handles fragments and
fraglist as well.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-06 Thread Jens Axboe
On Tue, Jun 05 2007, jamal wrote:
 On Tue, 2007-05-06 at 14:20 +0200, Jens Axboe wrote:
 
   
    1800 4ff3 937f e000 6381 7275 0008
   
   Perhaps that hex pattern rings a bell with someone intimate with the
   networking. The remaining wrong bytes don't seem to have anything in
   common.
  
  Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is
  00:E0:81:63:75:72 which are the middle 12 bytes of the 16.
  
 
 It appears you may have endianness issues and perhaps a 16 bit
 mis-offset. the 0008 at the end looks like a swapped 0x800 which
 is the ethernet type for IPV4.

Yeah, it looks like the first part of the ethernet frame. I'm using an
e1000 on the receive side and a sky2 on the sender.
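
(For anyone puzzling over the swapped ethertype, a tiny userspace illustration,
nothing from the patch itself: dumping the frame as 16-bit words on a
little-endian host shows the wire bytes 08 00 -- ETH_P_IP -- as 0008.)

#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
	unsigned short wire = htons(0x0800);	/* ETH_P_IP in wire byte order */

	printf("%04x\n", wire);			/* prints 0008 on little-endian */
	return 0;
}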

  Hope that helps someone clue me in as to which network part is reusing
  the data. Do I need to 'pin' the sk_buff until the pipe data has been
  consumed?
 
 I would worry about the driver level first - likely thats where your
 corruption is.

I'd just be a heck-of-a-lot more inclined to suspect my code! Hence I
thought that data reuse for perhaps another packet would be the likely
explanation, especially since it looked like valid network driver data
ending up where it should not.

-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-06 Thread Jens Axboe
On Tue, Jun 05 2007, Evgeniy Polyakov wrote:
 On Tue, Jun 05, 2007 at 10:05:43AM +0200, Jens Axboe ([EMAIL PROTECTED]) 
 wrote:
  Here's an implementation of tcp network splice receive support. It's
  originally based on the patch set that Intel posted some time ago, but
  has been (close to) 100% reworked.
  
  Now, I'm not a networking guru by any stretch of the imagination, so I'd
  like some input on the direction of the main patch. Is the approach
  feasible? Glaring errors? Missing bits?
 
   263.709262] [ cut here ]
   [  263.713932] kernel BUG at include/linux/mm.h:285!
   [  263.718678] invalid opcode:  [1] PREEMPT SMP 
   [  263.723561] CPU 0 
   [  263.725665] Modules linked in: button loop snd_intel8x0
   snd_ac97_codec ac97_bus snd_pcm snd_timer snd soundcore psmouse
   snd_page_alloc k8temp i2c_nforcen
   [  263.755666] Pid: 2709, comm: splice-fromnet Not tainted
   2.6.22-rc4-splice #2
   [  263.762759] RIP: 0010:[8038c60c]  [8038c60c]
   skb_splice_bits+0xac/0x1c9
   [  263.771212] RSP: 0018:81003c79fc88  EFLAGS: 00010246
   [  263.776564] RAX:  RBX: 05a8 RCX:
   81003ff04778
   [  263.783743] RDX: 81003ff04778 RSI: 0ab2 RDI:
   0003d52d
   [  263.790925] RBP: 81003c79fdd8 R08:  R09:
   81003d927b78
   [  263.798104] R10: 803b0181 R11: 81003c79fde8 R12:
   81003d52d000
   [  263.805284] R13: 054e R14: 81003d927b78 R15:
   81003bbc6ea0
   [  263.812463] FS:  2ac4089a86d0() GS:804fb000()
   knlGS:
   [  263.820611] CS:  0010 DS:  ES:  CR0: 8005003b
   [  263.826396] CR2: 2ac4088320e0 CR3: 3c987000 CR4:
   06e0
   [  263.833578] Process splice-fromnet (pid: 2709, threadinfo
   81003c79e000, task 81003755c380)
   [  263.842591] Stack:  81003ff04720 
   81003755c380 0046
   [  263.850897]  00c6 0046 81003b0428b8
   81003d0b5b10
   [  263.858543]  00c6 81003d0b5b10 81003b0428b8
   81003d0b5b10
   [  263.865957] Call Trace:
   [  263.868683]  [803dc630] _read_unlock_irq+0x31/0x4e
   [  263.874393]  [803afb54] tcp_splice_data_recv+0x20/0x22
   [  263.880447]  [803afa2b] tcp_read_sock+0xa2/0x1ab
   [  263.885983]  [803afb34] tcp_splice_data_recv+0x0/0x22
   [  263.891951]  [803b01c1] tcp_splice_read+0xae/0x1a3
   [  263.897655]  [8038920f] sock_def_readable+0x0/0x6f
   [  263.903366]  [80384a65] sock_splice_read+0x15/0x17
   [  263.909072]  [8029e773] do_splice_to+0x76/0x88
   [  263.914432]  [8029fcc8] sys_splice+0x1a8/0x232
   [  263.919795]  [802097ce] system_call+0x7e/0x83
   [  263.925067] 
   [  263.926606] 
   [  263.926607] Code: 0f 0b eb fe 44 89 e6 81 e6 ff 0f 00 00 90 ff 42
   08 48 63 55 
   [  263.936418] RIP  [8038c60c] skb_splice_bits+0xac/0x1c9
   [  263.942516]  RSP 81003c79fc88
 
 This a vm_bug_on in get_page().
 
  +static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page 
  *page,
  +   unsigned int len, unsigned int offset)
  +{
  +   struct page *p;
  +
   +   if (unlikely(spd->nr_pages == PIPE_BUFFERS))
  +   return 1;
  +
  +#ifdef NET_COPY_SPLICE
  +   p = alloc_pages(GFP_KERNEL, 0);
  +   if (!p)
  +   return 1;
  +
  +   memcpy(page_address(p) + offset, page_address(page) + offset, len);
  +#else
  +   p = page;
  +   get_page(p);
  +#endif
 
 Some pages have zero reference counter here.
 
 Is commented NET_COPY_SPLICE part from old implementation?
 It will be always slower than existing approach due to allocation
 overhead.

NET_COPY_SPLICE is not supposed to be enabled, it's just a demonstration
of a copy operation if you wanted to test functionality without just
linking in the skb pages. At least that would allow you to test
correctness of the rest of the code, since I don't know if the skb page
linking is entirely correct yet.

But it's somewhat annoying to get pages with zero reference counts
there, I wonder how that happens. I guess if the skb->data originated
from kmalloc() then we don't really know. The main intent there was just
to ensure the page wasn't going away, but clearly it's not good enough
to ensure that reuse isn't taking place.
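
(For reference, roughly what get_page() of that era does -- paraphrased from
include/linux/mm.h for illustration, not a verbatim copy -- which is why
kmalloc-backed skb data sitting in slab pages with a zero _count trips the
VM_BUG_ON seen in the trace above.)

/* paraphrased illustration only */
static inline void get_page_illustration(struct page *page)
{
	VM_BUG_ON(atomic_read(&page->_count) == 0);	/* the BUG being hit */
	atomic_inc(&page->_count);
}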

-- 
Jens Axboe



[PATCH][RFC] network splice receive

2007-06-05 Thread Jens Axboe
Hi,

Here's an implementation of tcp network splice receive support. It's
originally based on the patch set that Intel posted some time ago, but
has been (close to) 100% reworked.

Now, I'm not a networking guru by any stretch of the imagination, so I'd
like some input on the direction of the main patch. Is the approach
feasible? Glaring errors? Missing bits?

If you want to test it, I'd suggest downloading the latest splice tools
snapshot here:

http://brick.kernel.dk/snaps/splice-git-latest.tar.gz

Examples:

- Sending a small test message over the network:

  [EMAIL PROTECTED]:~/splice $ ./splice-fromnet  | cat
  [EMAIL PROTECTED]:~ $ echo hello | netcat host1 

  should write hello on host1. Yeah, complex stuff.

- Sending a file over the network:

  [EMAIL PROTECTED]:~/splice $ ./splice-fromnet  | ./splice out outfile
  [EMAIL PROTECTED]:~ $ cat somefile | netcat host1 

  should send somefile over the network, storing it in outfile.

Seems to work reasonably well for me, sometimes I do see small ranges
inside the output file that are not correct, but I haven't been able to
reproduce today. I think it has to do with page reuse, hence the
NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
instead of referencing it.

Patches are against the #splice branch of the block repo, official url
of that is:

git://git.kernel.dk/data/git/linux-2.6-block.git/

and it's based on Linus main tree (at 2.6.22-rc4 right now). Let me know
if I should supply netdev branch patches instead.

-- 
Jens Axboe

From 592c46ea813c31c0d6b28bf543ce2f5dd884a75e Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Mon, 4 Jun 2007 15:06:43 +0200
Subject: [PATCH] [NET] tcp_read_sock: allow recv_actor() to return negative error value

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 net/ipv4/tcp.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cd3c7e9..450f44b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1064,7 +1064,11 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	break;
 			}
 			used = recv_actor(desc, skb, offset, len);
-			if (used <= len) {
+			if (used < 0) {
+				if (!copied)
+					copied = used;
+				break;
+			} else if (used <= len) {
 seq += used;
 copied += used;
 offset += used;
@@ -1086,7 +1090,7 @@ int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	tcp_rcv_space_adjust(sk);
 
 	/* Clean up data we have read: This will do ACK frames. */
-	if (copied)
+	if (copied > 0)
 		tcp_cleanup_rbuf(sk, copied);
 	return copied;
 }
-- 
1.5.2.rc1

From 10d906a9a5a16a022d5067bee3963a0e3a03ae0c Mon Sep 17 00:00:00 2001
From: Jens Axboe [EMAIL PROTECTED]
Date: Tue, 5 Jun 2007 09:54:00 +0200
Subject: [PATCH] [NET] TCP splice receive support

Loosely based on original patches from Intel, modified to actually
be zero-copy (the original patches memcpy'ed the data).

Signed-off-by: Jens Axboe [EMAIL PROTECTED]
---
 include/linux/net.h|3 +
 include/linux/skbuff.h |5 ++
 include/net/tcp.h  |3 +
 net/core/skbuff.c  |  114 +++
 net/ipv4/af_inet.c |1 +
 net/ipv4/tcp.c |  138 
 net/socket.c   |   13 +
 7 files changed, 277 insertions(+), 0 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index efc4517..472ee12 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -19,6 +19,7 @@
 #define _LINUX_NET_H
 
 #include <linux/wait.h>
+#include <linux/splice.h>
 #include <asm/socket.h>
 
 struct poll_table_struct;
@@ -165,6 +166,8 @@ struct proto_ops {
   struct vm_area_struct * vma);
 	ssize_t		(*sendpage)  (struct socket *sock, struct page *page,
   int offset, size_t size, int flags);
+	ssize_t 	(*splice_read)(struct socket *sock,  loff_t *ppos,
+   struct pipe_inode_info *pipe, size_t len, unsigned int flags);
 };
 
 struct net_proto_family {
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7367c7..619dcf5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1504,6 +1504,11 @@ extern int	   skb_store_bits(struct sk_buff *skb, int offset,
 extern __wsum	   skb_copy_and_csum_bits(const struct sk_buff *skb,
 	  int offset, u8 *to, int len,
 	  __wsum csum);
+extern int skb_splice_bits(const struct sk_buff *skb,
+		unsigned int offset,
+		struct pipe_inode_info *pipe,
+		unsigned int len,
+		unsigned int flags);
 extern void	   skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to);
 extern void	   skb_split(struct sk_buff *skb,
  struct sk_buff *skb1, const u32 len);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8af9ae..8e86697 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -308,6 +308,9 @@ extern int			tcp_twsk_unique(struct sock *sk,
 
 extern void			tcp_twsk_destructor(struct sock

Re: [PATCH][RFC] network splice receive

2007-06-05 Thread Jens Axboe
On Tue, Jun 05 2007, Jens Axboe wrote:
 Seems to work reasonably well for me, sometimes I do see small ranges
 inside the output file that are not correct, but I haven't been able to
 reproduce today. I think it has to do with page reuse, hence the
 NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
 instead of referencing it.

I managed to reproduce. It's segments of 68-80 bytes being corrupt in
the middle of the output, and there might be 1-3 such occurrences in the
30mb file I tested with. The first 16 bytes of the corruption are always
the same:

 1800 4ff3 937f e000 6381 7275 0008

Perhaps that hex pattern rings a bell with someone intimate with the
networking. The remaining wrong bytes don't seem to have anything in
common.

Slab poisoning doesn't change the pattern, so it's not use-after-free.


-- 
Jens Axboe



Re: [PATCH][RFC] network splice receive

2007-06-05 Thread Jens Axboe
On Tue, Jun 05 2007, Jens Axboe wrote:
 On Tue, Jun 05 2007, Jens Axboe wrote:
  Seems to work reasonably well for me, sometimes I do see small ranges
  inside the output file that are not correct, but I haven't been able to
  reproduce today. I think it has to do with page reuse, hence the
  NET_COPY_SPLICE ifdef that you can enable to just plain copy the data
  instead of referencing it.
 
 I managed to reproduce. It's segments of 68-80 bytes being corrupt in
 the middle of the output, and there might be 1-3 such occurrences in the
 30mb file I tested with. The first 16 bytes of the corruption are always
 the same:
 
  1800 4ff3 937f e000 6381 7275 0008
 
 Perhaps that hex pattern rings a bell with someone intimate with the
 networking. The remaining wrong bytes don't seem to have anything in
 common.

Ok, the source mac address is 00:18:F3:4F:7F:93 and the destination is
00:E0:81:63:75:72 which are the middle 12 bytes of the 16.

Hope that helps someone clue me in as to which network part is reusing
the data. Do I need to 'pin' the sk_buff until the pipe data has been
consumed?

-- 
Jens Axboe



Re: default y idiocy

2007-05-12 Thread Jens Axboe
On Sat, May 12 2007, Simon Arlott wrote:
 On 12/05/07 19:23, Jens Axboe wrote:
 Hi,
 
 This has bothered me for a long time, and it just seems to be getting
 worse. Can people please STOP defaulting non-essential stuff to 'y'?
 Grrr.
 
 Is there a reason why various 10/100/1000Mbit network cards are 'y' too?
 There's even a default SCSI 'm' that seems to be completely hidden from 
 the menu too (CONFIG_SCSI_WAIT_SCAN). It depends on SCSI but I can't 
 disable SCSI...

For the exact same (wrong) reason that the other menuconfig changes did
it, I suppose. Need fixing, too.

-- 
Jens Axboe



Re: [PATCH] Revert [NET_SCHED]: HTB: fix incorrect use of RB_EMPTY_NODE

2006-10-03 Thread Jens Axboe
On Mon, Oct 02 2006, Jarek Poplawski wrote:
 On Mon, Oct 02, 2006 at 12:24:47PM +0200, Jarek Poplawski wrote:
  On 30-09-2006 21:23, Ismail Donmez wrote:
   Hi,
   
   With commit 10fd48f2376db52f08bf0420d2c4f580e39269e1 [1] ,  RB_EMPTY_NODE 
   changed behaviour so it returns false when the node is empty as expected. 
  ...
   - if (!RB_EMPTY_NODE(rb)) {
   + if (RB_EMPTY_NODE(rb)) {
  
  Maybe you have some kind of agreement with Jens Axboe but I
  can't understand current way of kernel cooperation:
  he changes some global behavior to the opposite and fixes
  his code in three places but can't fix it in the fourth place
 ...
 
 But I see it's not so bad and net_sched isn't the last!
 It looks like deadline-iosched.c and one place in as-iosched.c
 (~ line 466) were probably also forgotten. 

I don't see any missed changes.

 Another question - is there any change planned?
 If not why in rbtree.c is:
 
 + if (rb_parent(node) == node)
 
 instead of:
 
 + if (RB_EMPTY_NODE(rb))

RB_EMPTY_NODE was (re)-introduced after that change, and it never
propagated. Changing it now would be fine and easier to read.
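
(For reference, the convention in question, paraphrased from the rbtree.h of
that time for illustration: an off-tree node is marked by pointing its parent
at itself, so the open-coded test quoted above and RB_EMPTY_NODE() mean the
same thing.)

/* paraphrased illustration, not a verbatim copy of include/linux/rbtree.h */
#define RB_EMPTY_NODE(node)	(rb_parent(node) == (node))
#define RB_CLEAR_NODE(node)	(rb_set_parent(node, node))

/* hence "if (rb_parent(node) == node)" in rbtree.c could simply read
 * "if (RB_EMPTY_NODE(node))" */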

-- 
Jens Axboe



Re: [PATCH] Revert [NET_SCHED]: HTB: fix incorrect use of RB_EMPTY_NODE

2006-10-03 Thread Jens Axboe
On Mon, Oct 02 2006, Ismail Donmez wrote:
 On Monday, 02 October 2006 13:24, Jarek Poplawski wrote: 
  On 30-09-2006 21:23, Ismail Donmez wrote:
   Hi,
  
   With commit 10fd48f2376db52f08bf0420d2c4f580e39269e1 [1] ,  RB_EMPTY_NODE
   changed behaviour so it returns false when the node is empty as expected.
 
  ...
 
   - if (!RB_EMPTY_NODE(rb)) {
   + if (RB_EMPTY_NODE(rb)) {
 
  Maybe you have some kind of agreement with Jens Axboe but I
  can't understand current way of kernel cooperation:
  he changes some global behavior to the opposite and fixes
  his code in three places but can't fix it in the fourth place
  where it's used? Isn't it both trivial and automatic kind
  of patch?
 
 I don't know if Jens is going to fix the other occurrences, but the
 sch_htb.c fix was recent and Jens' commit got my attention, hence the
 patch.

Send me a patch and I'll review / integrate it. Most of the rbtree
mixups stem from the fact that the first color change was incomplete.

-- 
Jens Axboe



Re: [PATCH 11/20] nbd: request_fn fixup

2006-09-13 Thread Jens Axboe
On Tue, Sep 12 2006, Jeff Garzik wrote:
 Jens Axboe wrote:
 Generally the block device rule is that once you are invoked due to an
 unplug (or whatever) event, it is the responsibility of the block device
 to run the queue until it's done. So if you bail out of queue handling
 for whatever reason (might be resource starvation in hard- or software),
 you must make sure to reenter queue handling since the device will not
 get replugged while it has requests pending. Unless you run into some
 software resource shortage, running of the queue is done
 deterministically when you know resources are available (ie an io
 completes). The device plugging itself is only ever done when you
 encounter a shortage outside of your control (memory shortage, for
 instance) _and_ you don't already have pending work where you can invoke
 queueing from again.
 
 Or he could employ the blk_{start,stop}_queue() functions, if that model 
 is easier for the driver (and brain).

Definitely, yes.

-- 
Jens Axboe



Re: [PATCH 11/20] nbd: request_fn fixup

2006-09-12 Thread Jens Axboe
On Tue, Sep 12 2006, Peter Zijlstra wrote:
 @@ -463,10 +465,13 @@ static void do_nbd_request(request_queue
  
  error_out:
   req->errors++;
 - spin_unlock(q->queue_lock);
 - nbd_end_request(req);
 - spin_lock(q->queue_lock);
 + __nbd_end_request(req);
   }
 + /*
 +  * q->queue_lock has been dropped, this opens up a race
 +  * plug the device to close it.
 +  */
 + blk_plug_device(q);
   return;
  }

This looks wrong; I wonder if this only fixes things for you because it
happens to reinvoke the request handler after the timeout occurs. Your
comment doesn't really describe what you think is going on; please
describe in detail what you think is happening here that the plugging
supposedly solves.

Generally the block device rule is that once you are invoked due to an
unplug (or whatever) event, it is the responsibility of the block device
to run the queue until it's done. So if you bail out of queue handling
for whatever reason (might be resource starvation in hard- or software),
you must make sure to reenter queue handling since the device will not
get replugged while it has requests pending. Unless you run into some
software resource shortage, running of the queue is done
deterministically when you know resources are available (ie an io
completes). The device plugging itself is only ever done when you
encounter a shortage outside of your control (memory shortage, for
instance) _and_ you don't already have pending work where you can invoke
queueing from again.
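
Spelled out as code, a hedged sketch of that rule for a 2.6-era request_fn
(all my_* names are hypothetical; this is not the nbd driver):

#include <linux/blkdev.h>

/* Hypothetical per-device state and helpers for the sketch. */
struct my_dev {
	struct request_queue	*queue;
	int			in_flight;
};

static int my_reserve_resources(struct my_dev *dev);	/* may fail, e.g. ENOMEM */
static void my_issue(struct my_dev *dev, struct request *req);

/* Called by the block layer with q->queue_lock held. */
static void my_request_fn(struct request_queue *q)
{
	struct my_dev *dev = q->queuedata;
	struct request *req;

	while ((req = elv_next_request(q)) != NULL) {
		if (!my_reserve_resources(dev)) {
			if (dev->in_flight) {
				/* Outstanding IO will complete, and its
				 * completion handler re-runs the queue,
				 * so there is nothing more to do here. */
				return;
			}
			/* Shortage outside our control and no pending
			 * work to re-enter from: plug the device so the
			 * block layer unplugs and calls us again. */
			blk_plug_device(q);
			return;
		}
		blkdev_dequeue_request(req);
		dev->in_flight++;
		my_issue(dev, req);
	}
}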

-- 
Jens Axboe



Re: [PATCH 10/21] block: elevator selection and pinning

2006-09-07 Thread Jens Axboe
On Thu, Sep 07 2006, Peter Zijlstra wrote:
 On Wed, 2006-09-06 at 15:46 +0200, Jens Axboe wrote:
  On Wed, Sep 06 2006, Peter Zijlstra wrote:
   Provide a block queue init function that allows setting an elevator, and
   a function to pin the current elevator.
   
   Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
   Signed-off-by: Daniel Phillips [EMAIL PROTECTED]
   CC: Jens Axboe [EMAIL PROTECTED]
   CC: Pavel Machek [EMAIL PROTECTED]
  
  Generally I don't think this is the right approach, as what you really
  want to do is let the driver say "I want intelligent scheduling" or not.
  The type of scheduler is policy that is left with the user, not the
  driver.
 
 True, and the only sane value here is NOOP; any other policy would not
 be a good value. With this in mind, would you rather prefer a 'boolean'
 argument suggesting we use NOOP over the default scheduler?

Nope, I don't think it's the right thing to do. Either we want to pass a
type down or a profile of some sort.

For the work you are doing here, just forget about it. It's not like
it's a critical piece, just list in the README (or wherever) that noop
is a good choice and leave it at that.

 Would you agree that this hint on intelligent scheduling could be used
 to set the initial policy? The user can always override when he
 disagrees.
 
 These network block devices like NBD, iSCSI and AoE often talk to
 virtual disks; any attempt to be smart is a waste of time.

I think you'll find the runtime overhead of them is pretty close and
hard to measure in most setups; it's things like the queue idling and
anticipation that AS/CFQ might do that will potentially decrease your io
performance. deadline and noop would be equally good choices.

So you don't want to pass down hints on "use noop"; you really want to
tell the block layer that you have no seek penalty, for instance. And a
range of other parameters.
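
As an aside for readers on current kernels: the block layer later grew
exactly this kind of hint. A minimal sketch of a driver setting it, assuming
a 4.x/5.x-era kernel (queue allocation and the rest of the setup omitted):

#include <linux/blkdev.h>

static void my_setup_queue_hints(struct request_queue *q)
{
	/* No seek penalty: schedulers can skip idling/anticipation. */
	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
	/* Such devices are also not a useful entropy source. */
	blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
}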

-- 
Jens Axboe



Re: [PATCH 10/21] block: elevator selection and pinning

2006-09-06 Thread Jens Axboe
On Wed, Sep 06 2006, Peter Zijlstra wrote:
 Provide a block queue init function that allows setting an elevator, and a
 function to pin the current elevator.
 
 Signed-off-by: Peter Zijlstra [EMAIL PROTECTED]
 Signed-off-by: Daniel Phillips [EMAIL PROTECTED]
 CC: Jens Axboe [EMAIL PROTECTED]
 CC: Pavel Machek [EMAIL PROTECTED]

Generally I don't think this is the right approach, as what you really
want to do is let the driver say "I want intelligent scheduling" or not.
The type of scheduler is policy that is left with the user, not the
driver.

And this patch seems to do two things, and you don't explain what the
pinning is useful for at all.

So that's 2 for 2 currently, NAK from me.

-- 
Jens Axboe



Re: [PATCH 11/21] nbd: limit blk_queue

2006-09-06 Thread Jens Axboe
On Wed, Sep 06 2006, Erik Mouw wrote:
 On Wed, Sep 06, 2006 at 03:16:41PM +0200, Peter Zijlstra wrote:
  -   disk->queue = blk_init_queue(do_nbd_request, &nbd_lock);
  +   disk->queue = blk_init_queue_node_elv(do_nbd_request,
  +   &nbd_lock, -1, "noop");
 
 So what happens if the noop scheduler isn't compiled into the kernel?

You can't de-select noop, so that cannot happen. But the point is valid
for other choices of io schedulers, which is another reason why this
_elv api addition is a bad idea.

-- 
Jens Axboe


