On Fri, Oct 18, 2013 at 10:09:54AM -0700, H. Peter Anvin wrote:
> If implemented properly adcx/adox should give additional speedup... that is
> the whole reason for their existence.
>
Ok, fair enough. Unfortunately, I'm not going to be able to get my hands on a
stepping of this CPU to test any code
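To make the adcx/adox point concrete: the two instructions carry through CF and OF independently, so two add chains can be interleaved without serializing on a single carry flag. A minimal sketch under stated assumptions (an ADX-capable CPU, a length that is a multiple of 16 bytes); this is an illustration, not the patch under discussion:

static inline unsigned long csum_two_chains(const unsigned long *p, long words)
{
        unsigned long a = 0, b = 0;
        const unsigned long zero = 0;
        long i;

        for (i = 0; i < words; i += 2) {
                asm("xorl   %%eax, %%eax\n\t"   /* clear CF and OF              */
                    "adcx   %[v0], %[a]\n\t"    /* chain 1 carries through CF   */
                    "adox   %[v1], %[b]\n\t"    /* chain 2 carries through OF   */
                    "adcx   %[z], %[a]\n\t"     /* fold chain-1 carry back in   */
                    "adox   %[z], %[b]\n\t"     /* fold chain-2 carry back in   */
                    "adcx   %[z], %[a]\n\t"     /* second fold: the carry can   */
                    "adox   %[z], %[b]"         /* ripple exactly once more     */
                    : [a] "+r" (a), [b] "+r" (b)
                    : [v0] "r" (p[i]), [v1] "r" (p[i + 1]), [z] "r" (zero)
                    : "eax", "cc");
        }
        a += b;
        if (a < b)              /* end-around carry when merging the two chains */
                a++;
        return a;
}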
On Mon, Oct 21, 2013 at 12:44:05PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
>
> >
> > Ok, so I ran the above code on a single cpu using taskset, and set irq
> > affinity
> > such that no interrupts (save for local ones), would occur on that cpu.
> >
On Mon, 2013-10-21 at 15:21 -0400, Neil Horman wrote:
>
> Ok, so I ran the above code on a single cpu using taskset, and set irq
> affinity
> such that no interrupts (save for local ones), would occur on that cpu. Note
> that I had to convert csum_partial_opt to csum_partial, as the _opt variant
On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
>
> > #define BUFSIZ_ORDER 4
> > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > static int __init csum_init_module(void)
> > {
> > int i;
> > __wsum sum = 0;
> >
On 10/19/2013 04:23 AM, Ingo Molnar wrote:
>
> * Doug Ledford wrote:
>> All prefetch operations get sent to an access queue in the memory
>> controller where they compete with both other reads and writes for the
>> available memory bandwidth. The optimal prefetch window is not a factor
>> of
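A sketch of the prefetch-window idea being debated here: issue a software prefetch a fixed number of cachelines ahead of the data currently being summed, so the fetch has time to complete before the loop needs it. The distance and the byte-wise inner loop are illustrative assumptions, not measured values and not the kernel's csum_partial():

#include <stddef.h>

#define PREFETCH_AHEAD  (5 * 64)        /* assumed distance: 5 cachelines of 64 bytes */

static unsigned long sum_with_prefetch(const unsigned char *buf, size_t len)
{
        unsigned long sum = 0;
        size_t i, j;

        for (i = 0; i < len; i += 64) {
                /* prefetch never faults on x86, so running past the end is harmless */
                __builtin_prefetch(buf + i + PREFETCH_AHEAD, 0, 3);
                for (j = 0; j < 64 && i + j < len; j++)
                        sum += buf[i + j];
        }
        return sum;
}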
On Mon, Oct 21, 2013 at 10:31:38AM -0700, Eric Dumazet wrote:
> On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> > On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> > >
> > > > #define BUFSIZ_ORDER 4
> > > > #define
On Sun, 2013-10-20 at 17:29 -0400, Neil Horman wrote:
> On Fri, Oct 18, 2013 at 02:15:52PM -0700, Eric Dumazet wrote:
> > On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> >
> > > #define BUFSIZ_ORDER 4
> > > #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> > > static int __init
* Doug Ledford wrote:
> >> Based on these, prefetching is obviously a good improvement, but
> >> not as good as parallel execution, and the winner by far is doing
> >> both.
>
> OK, this is where I have to chime in that these tests can *not* be used
> to say anything about prefetch, and
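To make the "parallel execution" point above concrete: a minimal sketch that accumulates the buffer into two independent sums so the adds are not serialized behind one carry chain, then folds to 32 bits. Plain C for illustration only, assuming a length that is a multiple of 16 bytes; it is not the kernel's csum_partial():

#include <stddef.h>

static unsigned int csum_two_accumulators(const void *buff, size_t len)
{
        const unsigned long *p = buff;
        unsigned long a = 0, b = 0, result;
        size_t i;

        for (i = 0; i < len / sizeof(unsigned long); i += 2) {
                a += p[i];
                if (a < p[i])           /* end-around carry, chain 1 */
                        a++;
                b += p[i + 1];
                if (b < p[i + 1])       /* end-around carry, chain 2 */
                        b++;
        }

        result = a + b;
        if (result < a)                 /* merge the two chains */
                result++;
        /* fold the 64-bit one's-complement sum down to 32 bits */
        result = (result & 0xffffffff) + (result >> 32);
        result = (result & 0xffffffff) + (result >> 32);
        return (unsigned int)result;
}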
On Mon, 2013-10-14 at 22:49 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
>> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
>> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
>> > > attached patch brings much better results
>> > >
>> > >
On 2013-10-17, Ingo wrote:
> * Neil Horman wrote:
>
>> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
>> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
>> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>> > >
>> > > > So, early testing results today. I wrote
On Fri, 2013-10-18 at 16:11 -0400, Neil Horman wrote:
> #define BUFSIZ_ORDER 4
> #define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))
> static int __init csum_init_module(void)
> {
> int i;
> __wsum sum = 0;
> struct timespec start, end;
> u64 time;
> struct page *page;
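A hedged sketch of what a timing harness along the lines of the module quoted above might look like; vmalloc(), the iteration count, and the sequential page walk are assumptions for illustration, not the code under test:

#include <linux/module.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <linux/time.h>
#include <net/checksum.h>

#define TEST_BUF_SZ     (64UL * 1024 * 1024)    /* 64 MB working set */

static int __init csum_bench_init(void)
{
        struct timespec start, end;
        __wsum sum = 0;
        char *buf;
        int i;

        buf = vmalloc(TEST_BUF_SZ);
        if (!buf)
                return -ENOMEM;

        getnstimeofday(&start);
        for (i = 0; i < 100000; i++)
                sum = csum_partial(buf + (i * PAGE_SIZE) % TEST_BUF_SZ,
                                   PAGE_SIZE, sum);
        getnstimeofday(&end);

        pr_info("csum loop: %lld ns, sum %08x\n",
                (long long)(timespec_to_ns(&end) - timespec_to_ns(&start)),
                (__force u32)sum);
        vfree(buf);
        return 0;
}

static void __exit csum_bench_exit(void)
{
}

module_init(csum_bench_init);
module_exit(csum_bench_exit);
MODULE_LICENSE("GPL");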
On Fri, Oct 18, 2013 at 10:20:35AM -0700, Eric Dumazet wrote:
> On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> > >
>
> > for(i=0;i<10;i++) {
> > sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> > offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE :
On Fri, 2013-10-18 at 12:50 -0400, Neil Horman wrote:
> >
> for(i=0;i<10;i++) {
> sum = csum_partial(buf+offset, PAGE_SIZE, sum);
> offset = (offset < BUFSIZ-PAGE_SIZE) ? offset+PAGE_SIZE : 0;
> }
Please replace this by random accesses, and use the
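A sketch of the random-access variant suggested here: pick a random, page-aligned offset inside the working set on every iteration, so neither the caches nor the hardware prefetcher can hide the DRAM cost. prandom_u32() and the power-of-two buffer size are assumptions for illustration:

#include <linux/random.h>
#include <linux/mm.h>
#include <net/checksum.h>

/* bufsz must be a power of two (e.g. the 64 MB BUFSIZ above) for the mask to work */
static __wsum csum_random_walk(const char *buf, unsigned long bufsz,
                               int iterations, __wsum sum)
{
        unsigned long offset;
        int i;

        for (i = 0; i < iterations; i++) {
                offset = (prandom_u32() & (bufsz - 1)) & ~(PAGE_SIZE - 1);
                sum = csum_partial(buf + offset, PAGE_SIZE, sum);
        }
        return sum;
}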
If implemented properly adcx/adox should give additional speedup... that is the
whole reason for their existence.
Neil Horman wrote:
>On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
>> On 10/11/2013 09:51 AM, Neil Horman wrote:
>> > Sébastien Dugué reported to me that devices
>
> Your benchmark uses a single 4K page, so data is _super_ hot in cpu
> caches.
> ( prefetch should give no speedups, I am surprised it makes any
> difference)
>
> Try now with 32 huge pages, to get 64 MBytes of working set.
>
> Because in reality we never csum_partial() data in cpu cache.
>
On Sat, Oct 12, 2013 at 03:29:24PM -0700, H. Peter Anvin wrote:
> On 10/11/2013 09:51 AM, Neil Horman wrote:
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't
> > have
> > checksum offload hardware were spending a significant amount of time
> > computing
> >
* H. Peter Anvin wrote:
> On 10/17/2013 01:41 AM, Ingo Molnar wrote:
> >
> > To correctly simulate the workload you'd have to:
> >
> > - allocate a buffer larger than your L2 cache.
> >
> > - to measure the effects of the prefetches you'd also have to randomize
> >the individual buffer
On Thu, 2013-10-17 at 11:19 -0700, H. Peter Anvin wrote:
> Seriously, though, how much does it matter? All the above seems likely
> to do is to drown the signal by adding noise.
I don't think so.
>
> If the parallel (threaded) checksumming is faster, which theory says it
> should and
On 10/17/2013 01:41 AM, Ingo Molnar wrote:
>
> To correctly simulate the workload you'd have to:
>
> - allocate a buffer larger than your L2 cache.
>
> - to measure the effects of the prefetches you'd also have to randomize
>the individual buffer positions. See how 'perf bench numa'
* Neil Horman wrote:
> On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> > >
> > > > So, early testing results today. I wrote a test module that, allocated
> >
On Wed, 2013-10-16 at 20:34 -0400, Neil Horman wrote:
> >
>
> So I went to reproduce these results, but was unable to (due to the fact that
> I
> only have a pretty jittery network to do testing across at the moment with
> these devices). So instead I figured that I would go back to just
On Mon, Oct 14, 2013 at 03:18:47PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> >
> > > So, early testing results today. I wrote a test module that, allocated a
> > > 4k
> > > buffer, initialized it with random
On Wed, 2013-10-16 at 08:25 +0200, Ingo Molnar wrote:
> Prefetch takes memory from L2->L1 memory
> just as much as it moves cachelines from memory to the L2 cache.
Yup, mea culpa.
I thought the prefetch was still to L1 like the Pentium.
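Background on the hint levels referred to above: GCC's __builtin_prefetch() locality argument normally maps to the x86 prefetcht0/t1/t2/nta instructions, which target different levels of the cache hierarchy. The mapping in the comments is the usual GCC/x86 behaviour, noted here as an assumption rather than as part of the patch:

static inline void prefetch_hint_examples(const void *p)
{
        __builtin_prefetch(p, 0, 3);    /* prefetcht0: pull into L1 (and L2)  */
        __builtin_prefetch(p, 0, 2);    /* prefetcht1: pull into L2 and below */
        __builtin_prefetch(p, 0, 0);    /* prefetchnta: non-temporal hint     */
}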
* Joe Perches wrote:
> On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> > * Joe Perches wrote:
> >
> > > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
On Tue, 2013-10-15 at 09:21 -0700, Joe Perches wrote:
> Ingo, Eric _showed_ that the prefetch is good here.
> How about looking at a little optimization to the minimal
> prefetch that gives that level of performance.
Wait a minute, my point was to remind that the main cost is the
memory fetching.
On Tue, 2013-10-15 at 18:02 +0200, Andi Kleen wrote:
> > I get the csum_partial() if disabling prequeue.
>
> At least in the ipoib case i would consider that a misconfiguration.
There is nothing you can do, if application is not blocked on recv(),
but using poll()/epoll()/select(), prequeue is
On Tue, 2013-10-15 at 09:41 +0200, Ingo Molnar wrote:
> * Joe Perches wrote:
>
> > On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > > attached patch brings
> I get the csum_partial() if disabling prequeue.
At least in the ipoib case i would consider that a misconfiguration.
"don't do this if it hurts"
There may be more such problems.
-Andi
On Tue, 2013-10-15 at 07:26 -0700, Eric Dumazet wrote:
> And the receiver should also do the same : (ethtool -K eth0 rx off)
>
>     10.55%  netserver  [kernel.kallsyms]  [k] csum_partial_copy_generic
I get the csum_partial() if disabling prequeue.
echo 1
On Tue, 2013-10-15 at 16:15 +0200, Sébastien Dugué wrote:
> Hi Eric,
>
> On Tue, 15 Oct 2013 07:06:25 -0700
> Eric Dumazet wrote:
> > But the csum cost is both for sender and receiver ?
>
> No, it was only on the receiver side that I noticed it.
>
Yes, as Andi said, we do the csum while
Hi Eric,
On Tue, 15 Oct 2013 07:06:25 -0700
Eric Dumazet wrote:
> On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> > On Tue, 15 Oct 2013 15:33:36 +0200
> > Andi Kleen wrote:
> >
> > > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR
> > > > hardware
> > > >
On Tue, 2013-10-15 at 15:56 +0200, Sébastien Dugué wrote:
> On Tue, 15 Oct 2013 15:33:36 +0200
> Andi Kleen wrote:
>
> > > indeed, our typical workload is connected mode IPoIB on mlx4 QDR
> > > hardware
> > > where one cannot benefit from hardware offloads.
> >
> > Is this with sendfile?
>
On Tue, 15 Oct 2013 15:33:36 +0200
Andi Kleen wrote:
> > indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> > where one cannot benefit from hardware offloads.
>
> Is this with sendfile?
Tests were done with iperf at the time without any extra funky options, and
> indeed, our typical workload is connected mode IPoIB on mlx4 QDR hardware
> where one cannot benefit from hardware offloads.
Is this with sendfile?
For normal send() the checksum is done in the user copy and for receiving it
can be also done during the copy in most cases
-Andi
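To make the copy-and-checksum point concrete: a plain-C sketch that folds the 16-bit one's-complement sum into the data copy, so the buffer is only pulled through the CPU once. The kernel's real csum_partial_copy_generic() is hand-written assembly; this assumes an even length and is for illustration only:

#include <linux/types.h>

static u32 copy_and_csum(void *dst, const void *src, size_t len, u32 sum)
{
        const u16 *s = src;
        u16 *d = dst;
        size_t i;

        for (i = 0; i < len / 2; i++) {
                d[i] = s[i];            /* the copy ...                     */
                sum += s[i];            /* ... and the sum, in one pass     */
        }
        while (sum >> 16)               /* fold end-around carries to 16 bits */
                sum = (sum & 0xffff) + (sum >> 16);
        return sum;
}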
On Mon, Oct 14, 2013 at 02:07:48PM -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> > * Andi Kleen wrote:
> >
> > > Neil Horman writes:
> > >
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > > don't have checksum offload
On Tue, Oct 15, 2013 at 09:32:48AM +0200, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> > >
> > > * Neil Horman wrote:
> > >
> > > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > > don't have
* Borislav Petkov wrote:
> On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> > Most processors have hundreds of cachelines even in their L1 cache.
> > Thousands in the L2 cache, up to hundreds of thousands.
>
> Also, I have this hazy memory of prefetch hints being harmful in some
On Tue, Oct 15, 2013 at 09:41:23AM +0200, Ingo Molnar wrote:
> Most processors have hundreds of cachelines even in their L1 cache.
> Thousands in the L2 cache, up to hundreds of thousands.
Also, I have this hazy memory of prefetch hints being harmful in some
situations:
Hi Neil, Andi,
On Mon, 14 Oct 2013 16:25:28 -0400
Neil Horman wrote:
> On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> > Neil Horman writes:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have
> > > checksum offload hardware were
* Joe Perches wrote:
> On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> > On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > > attached patch brings much better results
> > > >
> > > > lpq83:~# ./netperf -H 7.7.8.84
* Neil Horman wrote:
> On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
> >
> > * Neil Horman wrote:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have checksum offload hardware were spending a significant amount
> > > of time
On Mon, 2013-10-14 at 15:44 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> > On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > > attached patch brings much better results
> > >
> > > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > > MIGRATED TCP STREAM
On Mon, 2013-10-14 at 15:37 -0700, Joe Perches wrote:
> On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> > attached patch brings much better results
> >
> > lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84
> > () port
On Mon, 2013-10-14 at 15:18 -0700, Eric Dumazet wrote:
> attached patch brings much better results
>
> lpq83:~# ./netperf -H 7.7.8.84 -l 10 -Cc
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
> port 0 AF_INET
> Recv   Send    Send
On Mon, 2013-10-14 at 14:19 -0700, Eric Dumazet wrote:
> On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
>
> > So, early testing results today. I wrote a test module that, allocated a 4k
> buffer, initialized it with random data, and called csum_partial on it 10
> > times, recording
On Mon, 2013-10-14 at 16:28 -0400, Neil Horman wrote:
> So, early testing results today. I wrote a test module that, allocated a 4k
> buffer, initialized it with random data, and called csum_partial on it 10
> times, recording the time at the start and end of that loop. Results on a 2.4
> GHz
On Mon, 2013-10-14 at 09:49 +0200, Ingo Molnar wrote:
> * Andi Kleen wrote:
>
> > Neil Horman writes:
> >
> > > Sébastien Dugué reported to me that devices implementing ipoib (which
> > > don't have checksum offload hardware were spending a significant
> > > amount of time computing
> >
> >
On Sat, Oct 12, 2013 at 07:21:24PM +0200, Ingo Molnar wrote:
>
> * Neil Horman wrote:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which
> > don't have checksum offload hardware were spending a significant amount
> > of time computing checksums. We found that by
On Sun, Oct 13, 2013 at 09:38:33PM -0700, Andi Kleen wrote:
> Neil Horman writes:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which don't
> > have
> > checksum offload hardware were spending a significant amount of time
> > computing
>
> Must be an odd workload, most
* Andi Kleen wrote:
> Neil Horman writes:
>
> > Sébastien Dugué reported to me that devices implementing ipoib (which
> > don't have checksum offload hardware were spending a significant
> > amount of time computing
>
> Must be an odd workload, most TCP/UDP workloads do copy-checksum
>