Re: [PD] [OT] SSE/MMX tips?

2011-12-06 Thread Mathieu Bouchard

On Wed, 7 Sep 2011, Claude Heiland-Allen wrote:

On 07/09/11 12:17, Bill Gribble wrote:

The operation is integration.
Try calling it 'scan' and you might end up with more productive searches, at 
least to my mind integration is more about summing over continuous regions 
than something discrete like this.

Searching for data parallel algorithm scan leads to pages like:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.1876rep=rep1type=pdf


If I implemented [#scan] that way on one processor, it would be slower... 
that algorithm begins being faster (than the obvious one) if you have 
O(log n) processors.


(yes, again forgot to send a mail long ago)

 ___
| Mathieu Bouchard  tél: +1.514.383.3801  Villeray, Montréal, QC___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-08 Thread Mathieu Bouchard

On Wed, 7 Sep 2011, Charles Henry wrote:

On Wed, Sep 7, 2011 at 7:59 PM, Mathieu Bouchard ma...@artengine.ca wrote:

On Wed, 7 Sep 2011, Mathieu Bouchard wrote:

On Wed, 7 Sep 2011, Bill Gribble wrote:

So far iteration on plain floats seems to be the best I can come up with,
but HADDPS is tantalizingly close to what I want to do.  Any hints?

Sorry, what's HADDPS?


http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/HADDPS--Packed_Single-FP_Horizontal_Add.htm


This is really interesting.  Your compiler probably knows how to
optimize this kind of information.


How can you tell that ? I bet it doesn't...

What could it be doing about a scan like this, anyway ?

btw, has anyone looked at the «restrict» keyword yet ?

 ___
| Mathieu Bouchard  tél: +1.514.383.3801  Villeray, Montréal, QC___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-08 Thread Charles Henry
On Thu, Sep 8, 2011 at 1:02 AM, Mathieu Bouchard ma...@artengine.ca wrote:
 On Wed, 7 Sep 2011, Charles Henry wrote:

 On Wed, Sep 7, 2011 at 7:59 PM, Mathieu Bouchard ma...@artengine.ca
 wrote:

 On Wed, 7 Sep 2011, Mathieu Bouchard wrote:

 On Wed, 7 Sep 2011, Bill Gribble wrote:

 So far iteration on plain floats seems to be the best I can come up
 with,
 but HADDPS is tantalizingly close to what I want to do.  Any hints?

 Sorry, what's HADDPS?

 http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/HADDPS--Packed_Single-FP_Horizontal_Add.htm

 This is really interesting.  Your compiler probably knows how to
 optimize this kind of information.

 How can you tell that ? I bet it doesn't...

Yeah, I thought it over.  I was wrong.
I was also wrong about SSE4.2--AVX is the new instruction set with
256-bit wide operations.

 What could it be doing about a scan like this, anyway ?
fft-multiply by 2*pi*i*f-ifft and fall over...

I dunno, but I'm working on it a bit.

___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-08 Thread Mathieu Bouchard

On Wed, 7 Sep 2011, Bill Gribble wrote:


It's really just for fun anyway.


Well, if you wanted to really use SSE in that case, it would be 
appropriate to process 4 interleaved signals at once, or at least two.


Btw, if you want something fun, consider :

  a+b = (a^b) + ((ab)1)

that is, addition of ints can be split into a variable number of steps :

  int add (int a, int b) {return b ? add(a^b,(ab)1) : a;}

where the number of steps is the number of consecutive carries (of ones) 
it has to do. (how many average steps does that make, for random ints ?)


I've always found this formula fascinating, but I'm still waiting for the 
occasion to make use of it, 15 years later :)


 ___
| Mathieu Bouchard  tél: +1.514.383.3801  Villeray, Montréal, QC
___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


[PD] [OT] SSE/MMX tips?

2011-09-07 Thread Bill Gribble
I am trying to code a simple operation using SSE2 instructions where possible.  
I have a feeling that what I want to do is just a matter of a couple of shufps 
and haddps instructions but I can't get it. Lazyweb please help!

The operation is integration. I have a vector of 4 single floats (v4sf) and a 
carry-in float to start.  

For example

CI F0 F1 F2 F3 
5  1  0  10 -5

Yields

F0 F1 F2 F3
6  6  16  11

So far iteration on plain floats seems to be the best I can come up with, but 
HADDPS is tantalizingly close to what I want to do.  Any hints?

Thanks,
Bill Gribble
___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-07 Thread Mathieu Bouchard

On Wed, 7 Sep 2011, Bill Gribble wrote:

So far iteration on plain floats seems to be the best I can come up 
with, but HADDPS is tantalizingly close to what I want to do.  Any 
hints?


Once I thought that with some commutativity you could speed things up like 
this :


(f0+f1+f2+f3)+(f4+f5+f6+f7)+...

can be rearranged as :

(f0+f4+...)+(f1+f5+...)+(f2+f6+...)+(f3+f7+...)

But I don't remember whether I tried it or not.

That's a speedup without even using MMX/SSE... theoretically, it can 
double the speed of a summation like this, and you can apply this boost to 
sum-of-products to get a certain amount of speedup too.


 ___
| Mathieu Bouchard  tél: +1.514.383.3801  Villeray, Montréal, QC___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-07 Thread Claude Heiland-Allen

On 07/09/11 12:17, Bill Gribble wrote:

The operation is integration.


Try calling it 'scan' and you might end up with more productive 
searches, at least to my mind integration is more about summing over 
continuous regions than something discrete like this.


Searching for data parallel algorithm scan leads to pages like:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.1876rep=rep1type=pdf
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html


Claude

___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-07 Thread Bill Gribble
Ah!  Thanks.  Apparently also called prefix sum.  

Thanks,
Bill Gribble 

On Wed, 2011-09-07 at 16:21 +0100, Claude Heiland-Allen wrote:
 On 07/09/11 12:17, Bill Gribble wrote:
  The operation is integration.
 
 Try calling it 'scan' and you might end up with more productive 
 searches, at least to my mind integration is more about summing over 
 continuous regions than something discrete like this.
 
 Searching for data parallel algorithm scan leads to pages like:
 
 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.1876rep=rep1type=pdf
 http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
 
 
 Claude
 
 ___
 Pd-list@iem.at mailing list
 UNSUBSCRIBE and account-management - 
 http://lists.puredata.info/listinfo/pd-list



___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-07 Thread Mathieu Bouchard

On Wed, 7 Sep 2011, Mathieu Bouchard wrote:


On Wed, 7 Sep 2011, Bill Gribble wrote:

So far iteration on plain floats seems to be the best I can come up with, 
but HADDPS is tantalizingly close to what I want to do.  Any hints?


Once I thought that with some commutativity you could speed things up like 
this :


(f0+f1+f2+f3)+(f4+f5+f6+f7)+...

can be rearranged as :

(f0+f4+...)+(f1+f5+...)+(f2+f6+...)+(f3+f7+...)


But what I said does not apply to your case, because you want a scan, 
whether I didn't really read and assumed a fold.


I don't know how to optimise a scan.

 ___
| Mathieu Bouchard  tél: +1.514.383.3801  Villeray, Montréal, QC
___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-07 Thread Charles Henry
On Wed, Sep 7, 2011 at 7:59 PM, Mathieu Bouchard ma...@artengine.ca wrote:
 On Wed, 7 Sep 2011, Mathieu Bouchard wrote:

 On Wed, 7 Sep 2011, Bill Gribble wrote:

 So far iteration on plain floats seems to be the best I can come up with,
 but HADDPS is tantalizingly close to what I want to do.  Any hints?

Sorry, what's HADDPS?


 Once I thought that with some commutativity you could speed things up like
 this :

 (f0+f1+f2+f3)+(f4+f5+f6+f7)+...

 can be rearranged as :

 (f0+f4+...)+(f1+f5+...)+(f2+f6+...)+(f3+f7+...)

 But what I said does not apply to your case, because you want a scan,
 whether I didn't really read and assumed a fold.

 I don't know how to optimise a scan.

This is really interesting.  Your compiler probably knows how to
optimize this kind of information.  SSE3 is really about memory
allocation.  The instructions pack floats into a bigger section of
memory.  In SSE3, this means 4 floats in a 128-bit single operation.

SSE 4.2 has 256-bit wide (8 flops per clock)--the latest increase in
single-threaded computing power is in favor of single-precision float
(lucky for Pd-ers)

___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list


Re: [PD] [OT] SSE/MMX tips?

2011-09-07 Thread Bill Gribble
I noticed that your suggestion did not apply, but assumed it was a subtle 
riddle taunting me for an offtopic post!

I think the best I can do is 2 vector adds and 2 shifts in place of 4 float 
adds per 4 floats. Not much of a savings, but with the loop and fetch overhead 
it may be worth it. I'll benchmark and see!  

It's really just for fun anyway. 

Thanks,
Bill Gribble

On Sep 7, 2011, at 20:59, Mathieu Bouchard ma...@artengine.ca wrote:

 On Wed, 7 Sep 2011, Mathieu Bouchard wrote:
 
 On Wed, 7 Sep 2011, Bill Gribble wrote:
 
 So far iteration on plain floats seems to be the best I can come up with, 
 but HADDPS is tantalizingly close to what I want to do.  Any hints?
 
 Once I thought that with some commutativity you could speed things up like 
 this :
 
 (f0+f1+f2+f3)+(f4+f5+f6+f7)+...
 
 can be rearranged as :
 
 (f0+f4+...)+(f1+f5+...)+(f2+f6+...)+(f3+f7+...)
 
 But what I said does not apply to your case, because you want a scan, whether 
 I didn't really read and assumed a fold.
 
 I don't know how to optimise a scan.
 
 ___
 | Mathieu Bouchard  tél: +1.514.383.3801  Villeray, Montréal, QC

___
Pd-list@iem.at mailing list
UNSUBSCRIBE and account-management - 
http://lists.puredata.info/listinfo/pd-list