* Pavel Machek wrote:
> > > > Yeah, so generic memcpy() replacement is only feasible I think if the
> > > > most optimistic implementation is actually correct:
> > > >
> > > > - if no preempt_disable()/enable() is required
> > > >
> > > > - if direct access to the AVX[2] registers
Hi!
> > On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > > * Thomas Gleixner wrote:
> > >
> > > > > So I do think we could do more in this area to improve driver
> > > > > performance, if the code is correct and if there's actual benchmarks
> > > > > that are showing real benefits.
> > >
On Thu, Mar 22, 2018 at 5:40 PM, Alexei Starovoitov
wrote:
> On Thu, Mar 22, 2018 at 10:33:43AM +0100, Ingo Molnar wrote:
>>
>> - I think the BPF JIT, whose byte code machine language is used by an
>> increasing number of kernel subsystems, could benefit from having vector
>> ops.
>> It wou
On Thu, Mar 22, 2018 at 10:33:43AM +0100, Ingo Molnar wrote:
>
> - I think the BPF JIT, whose byte code machine language is used by an
> increasing number of kernel subsystems, could benefit from having vector
> ops.
> It would possibly allow the handling of floating point types.
this is o
On Thu, Mar 22, 2018 at 5:48 AM, David Laight wrote:
>
> So if we needed to do PIO reads, using the AVX2 (or better AVX-512)
> registers would make a significant difference.
> Fortunately we can 'dma' most of the data we need to transfer.
I think this is the really fundamental issue.
A device tha
From: David Laight
> Sent: 22 March 2018 10:36
...
> Any code would need to be in memcpy_fromio(), not in every driver that
> might benefit.
> Then fallback code can be used if the registers aren't available.
>
> > (b) we can't guarantee that %ymm register write will show up on any
> > bus as a s
From:
> Sent: 21 March 2018 18:16
> To: Ingo Molnar
...
> All this to do a 32-byte PIO access, with absolutely zero data right
> now on what the win is?
>
> Yes, yes, I can find an Intel white-paper that talks about setting WC
> and then using xmm and ymm instructions to write a single 64-byte
> bur
* Andy Lutomirski wrote:
> On Wed, Mar 21, 2018 at 6:32 AM, Ingo Molnar wrote:
> >
> > * Linus Torvalds wrote:
> >
> >> And even if you ignore that "maintenance problems down the line" issue
> >> ("we can fix them when they happen") I don't want to see games like
> >> this, because I'm pretty
* Linus Torvalds wrote:
> And the real worry is things like AVX-512 etc, which is exactly when
> things like "save and restore one ymm register" will quite likely
> clear the upper bits of the zmm register.
Yeah, I think the only valid save/restore pattern is to 100% correctly
enumerate the w
On Wed, Mar 21, 2018 at 12:46 AM, Ingo Molnar wrote:
>
> So I added a bit of instrumentation and the current state of things is that on
> 64-bit x86 every single task has an initialized FPU, every task has the exact
> same, fully filled in xfeatures (XINUSE) value:
Bah. Your CPU is apparently som
On Wed, Mar 21, 2018 at 6:32 AM, Ingo Molnar wrote:
>
> * Linus Torvalds wrote:
>
>> And even if you ignore that "maintenance problems down the line" issue
>> ("we can fix them when they happen") I don't want to see games like
>> this, because I'm pretty sure it breaks the optimized xsave by tagg
So I poked around a bit and I'm having second thoughts:
* Linus Torvalds wrote:
> On Tue, Mar 20, 2018 at 1:26 AM, Ingo Molnar wrote:
> >
> > So assuming the target driver will only load on modern FPUs I *think* it
> > should actually be possible to do something like (pseudocode):
> >
> >
* Linus Torvalds wrote:
> And even if you ignore that "maintenance problems down the line" issue
> ("we can fix them when they happen") I don't want to see games like
> this, because I'm pretty sure it breaks the optimized xsave by tagging
> the state as being dirty.
That's true - and it would
On Tue, Mar 20, 2018 at 3:10 PM, David Laight wrote:
> From: Andy Lutomirski
>> Sent: 20 March 2018 14:57
> ...
>> I'd rather see us finally finish the work that Rik started to rework
>> this differently. I'd like kernel_fpu_begin() to look like:
>>
>> if (test_thread_flag(TIF_NEED_FPU_RESTORE))
On Tue, Mar 20, 2018 at 1:26 AM, Ingo Molnar wrote:
>
> So assuming the target driver will only load on modern FPUs I *think* it
> should actually be possible to do something like (pseudocode):
>
> vmovdqa %ymm0, 40(%rsp)
> vmovdqa %ymm1, 80(%rsp)
>
> ...
> # use
From: Andy Lutomirski
> Sent: 20 March 2018 14:57
...
> I'd rather see us finally finish the work that Rik started to rework
> this differently. I'd like kernel_fpu_begin() to look like:
>
> if (test_thread_flag(TIF_NEED_FPU_RESTORE)) {
> return; // we're already okay. maybe we need to check
>
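Fleshed out, the kernel_fpu_begin() shape Andy describes might look like the following pseudocode. Only the TIF_NEED_FPU_RESTORE test comes from the quoted mail; the rest of the body is an illustrative assumption, not the implementation that was eventually merged.

```c
/* pseudocode sketch, not the actual kernel implementation */
static inline void kernel_fpu_begin(void)
{
	preempt_disable();
	if (test_thread_flag(TIF_NEED_FPU_RESTORE))
		return;	/* user state already saved; registers free to clobber */
	/* otherwise, save the live user FPU state and mark it for
	 * restore on the way back to userspace */
	copy_fpregs_to_fpstate(&current->thread.fpu);
	set_thread_flag(TIF_NEED_FPU_RESTORE);
}
```

The attraction is that repeated kernel_fpu_begin()/kernel_fpu_end() pairs after the first one reduce to a flag test, since the expensive save happens at most once per entry.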
On Tue, Mar 20, 2018 at 8:26 AM, Ingo Molnar wrote:
>
> * Thomas Gleixner wrote:
>
>> > Useful also for code that needs AVX-like registers to do things like CRCs.
>>
>> x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger
> bl
On Monday, March 19, 2018 at 20:57:22 +0530, Christoph Hellwig wrote:
> On Mon, Mar 19, 2018 at 07:50:33PM +0530, Rahul Lakkireddy wrote:
> > This series of patches adds support for 256-bit IO reads and writes.
> > The APIs are readqq and writeqq (quad quadword, 4 x 64), which read
> > and write
From: Ingo Molnar
> Sent: 20 March 2018 10:54
...
> Note that a generic version might still be worth trying out, if and only if it's
> safe to access those vector registers directly: modern x86 CPUs will do their
> non-constant memcpy()s via the common memcpy_erms() function - which could in
> t
* Thomas Gleixner wrote:
> On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > * Thomas Gleixner wrote:
> >
> > > > So I do think we could do more in this area to improve driver
> > > > performance, if the code is correct and if there's actual benchmarks
> > > > that are showing real benefi
From: Thomas Gleixner
> Sent: 20 March 2018 09:41
> On Tue, 20 Mar 2018, Ingo Molnar wrote:
> > * Thomas Gleixner wrote:
...
> > > And if we go down that road then we want a AVX based memcpy()
> > > implementation which is runtime conditional on the feature bit(s) and
> > > length dependent. Just
On Tue, 20 Mar 2018, Ingo Molnar wrote:
> * Thomas Gleixner wrote:
>
> > > So I do think we could do more in this area to improve driver
> > > performance, if the code is correct and if there's actual benchmarks that
> > > are showing real benefits.
> >
> > If it's about hotpath perform
* Thomas Gleixner wrote:
> > So I do think we could do more in this area to improve driver performance,
> > if the code is correct and if there's actual benchmarks that are showing
> > real benefits.
>
> If it's about hotpath performance I'm all for it, but the use case here is
> a debug
On Tue, 20 Mar 2018, Ingo Molnar wrote:
> * Thomas Gleixner wrote:
>
> > > Useful also for code that needs AVX-like registers to do things like CRCs.
> >
> > x86/crypto/ has a lot of AVX optimized code.
>
> Yeah, that's true, but the crypto code is processing fundamentally bigger
> blocks o
* Thomas Gleixner wrote:
> > Useful also for code that needs AVX-like registers to do things like CRCs.
>
> x86/crypto/ has a lot of AVX optimized code.
Yeah, that's true, but the crypto code is processing fundamentally bigger
blocks of data, which amortizes the cost of using kernel_fpu_begi
On Mon, Mar 19, 2018 at 8:53 AM, David Laight wrote:
>
> The x87 and SSE registers can't be changed - they can contain callee-saved
> registers.
> But (IIRC) the AVX and AVX2 registers are all caller-saved.
No.
The kernel entry is not the usual function call.
On kernel entry, *all* registers ar
From: Thomas Gleixner
> Sent: 19 March 2018 15:37
...
> > If system call entry reset the AVX registers then any FP save/restore
> > would be faster because the AVX registers wouldn't need to be saved
> > (and the cpu won't save them).
> > I believe the instruction to reset the AVX registers is fast
On Mon, 19 Mar 2018, David Laight wrote:
> From: Thomas Gleixner
> > Sent: 19 March 2018 15:05
> >
> > On Mon, 19 Mar 2018, David Laight wrote:
> > > From: Rahul Lakkireddy
> > > In principle it ought to be possible to get access to one or two
> > > (eg) AVX registers by saving them to stack and t
On Mon, Mar 19, 2018 at 07:50:33PM +0530, Rahul Lakkireddy wrote:
> This series of patches adds support for 256-bit IO reads and writes.
> The APIs are readqq and writeqq (quad quadword, 4 x 64), which read
> and write 256 bits at a time from IO, respectively.
What a horrible name. Please encode the
From: Thomas Gleixner
> Sent: 19 March 2018 15:05
>
> On Mon, 19 Mar 2018, David Laight wrote:
> > From: Rahul Lakkireddy
> > In principle it ought to be possible to get access to one or two
> > (eg) AVX registers by saving them to stack and telling the fpu
> > save code where you've put them.
>
On Mon, 19 Mar 2018, David Laight wrote:
> From: Rahul Lakkireddy
> In principle it ought to be possible to get access to one or two
> (eg) AVX registers by saving them to stack and telling the fpu
> save code where you've put them.
No. We have functions for this and we are not adding new ad hoc m
From: Rahul Lakkireddy
> Sent: 19 March 2018 14:21
>
> This series of patches adds support for 256-bit IO reads and writes.
> The APIs are readqq and writeqq (quad quadword, 4 x 64), which read
> and write 256 bits at a time from IO, respectively.
Why not use the AVX2 registers to get 512bit accesse
This series of patches adds support for 256-bit IO reads and writes.
The APIs are readqq and writeqq (quad quadword, 4 x 64), which read
and write 256 bits at a time from IO, respectively.
Patch 1 adds a u256 type and the necessary non-atomic accessors. It also
adds byteorder conversion APIs.
Patch 2 a