[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2019-07-03 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877934#comment-16877934
 ] 

Rinka Singh edited comment on LUCENE-7745 at 7/3/19 3:56 PM:
-

{quote}The basic idea is to compute sub-histograms in each thread block with 
each thread block accumulating into the local memory. Then, when each thread 
block finishes its workload, it atomically adds the result to global memory, 
reducing the overall amount of traffic to global memory. To increase throughput 
and reduce shared memory contention, the main contribution here is that they 
actually use R "replicated" sub-histograms in each thread block, and they 
offset them so that bin 0 of the 1st histogram falls into a different memory 
bank than bin 0 of the 2nd histogram, and so on for R histograms. Essentially, 
it improves throughput in the degenerate case where multiple threads are trying 
to accumulate the same histogram bin at the same time.
{quote}
So here's what I've done/am doing:

I have basic histogramming (including stop-word elimination) working on a 
single GPU (I have an old Quadro 2000 with 1 GB of memory). I've tested it on a 
5 MB text file and it seems to work OK.

Briefly, here's how I'm implementing it.

A Linux executable reads a file named on the command line into the GPU, then:
 * converts the stream to words and chunks them into blocks
 * eliminates the stop words
 * sorts/merges (including word counts), first within each block and then 
across blocks - I came up with my own sort and haven't had time to explore 
the parallel sorts out there
 * The result is a sorted histogram held in multiple blocks on the GPU.
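A sequential sketch of that pipeline (tokenize, drop stop words, count, emit sorted) might look like the following. The real implementation does the sort/merge block-wise on the GPU; the function name, the crude normalization, and the stop-word list here are all illustrative.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// CPU sketch of the word-histogram pipeline described above.
std::vector<std::pair<std::string, int>>
wordHistogram(const std::string& text, const std::set<std::string>& stopWords) {
    std::map<std::string, int> counts;  // std::map keeps the words sorted
    std::istringstream in(text);
    std::string w;
    while (in >> w) {
        // Crude normalization: lowercase, strip non-letters.
        std::string t;
        for (char c : w)
            if (std::isalpha(static_cast<unsigned char>(c)))
                t += char(std::tolower(static_cast<unsigned char>(c)));
        if (!t.empty() && !stopWords.count(t))  // drop stop words
            ++counts[t];
    }
    return {counts.begin(), counts.end()};
}
```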

The advantages of this approach (to my mind) are:
 * I can scale up to use the entire GPU memory.  My guess is I can create and 
manage an 8-10 GB index on a V100 (it has 32 GB) - like I said, I've only 
tested with a 5 MB text file so far.
 * It's easy to add fresh data to the existing histogram.  All I need to do is 
create new blocks and sort/merge them all.
 * I'm guessing this should make it easy to scale across GPUs, which means on a 
multi-GPU machine I can scale to almost the number of GPUs there, and then of 
course one can set up a cluster of such machines...  That's far in the future, 
though.
 * The sort is kept separate so we can experiment with various sorts and see 
which one performs best.
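The "add fresh data" advantage boils down to a merge of sorted (word, count) blocks. A minimal CPU sketch of one such pairwise merge, summing counts for words present in both blocks (names are illustrative):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Hist = std::vector<std::pair<std::string, int>>;  // sorted by word

// Merge two sorted histogram blocks, summing counts for shared words.
// The GPU version merges many blocks; this shows one pairwise step.
Hist mergeHist(const Hist& a, const Hist& b) {
    Hist out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first < b[j].first)      out.push_back(a[i++]);
        else if (b[j].first < a[i].first) out.push_back(b[j++]);
        else {  // same word in both blocks: sum the counts
            out.emplace_back(a[i].first, a[i].second + b[j].second);
            ++i; ++j;
        }
    }
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}
```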

 

The issues are:
 * It is currently horrendously slow (I use global memory all the way, with no 
optimization).  Well, OK - much too slow for my liking (I went over to NVIDIA's 
office and tested it on a K80, and it was only about twice as fast as on my 
GPU).  I'm currently implementing a shared-memory version (and a few other 
tweaks) that should speed it up.
 * I have yet to compare it with the histogramming tools out there, so I 
can't say how much better it is.  Once I have the basic inverted index in 
place, I'll reach out to you all for testing.
 * It is still a bit fragile - I'm still finding bugs as I test, but the 
basics work.

 

Currently in process:
 * The code has been modified for (some) performance.  I'm debugging/testing - 
it will take a while.  As of now I feel good about what I've done, but I won't 
know until I get it working and can test for performance.
 * I need to add the ability to handle multiple files (I think I'll postpone 
this, since one can always cat the files together and pass the result in - 
that's a pretty simple script that can be wrapped around the executable).
 * I need to create the inverted index.
 * We'll worry about searching the index later, but that should be pretty 
trivial - well, actually nothing is trivial here.
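For the inverted-index step, one plausible shape (purely a hypothetical sketch - this is planned, not implemented) is to turn sorted (word, docId) postings into word → sorted docId lists. Assumes the input pairs arrive sorted by (word, docId), as the block-wise sort/merge above would produce:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: build an inverted index from (word, docId)
// postings. Input must be sorted by (word, docId) for the dedupe
// of repeated hits within one document to be correct.
std::map<std::string, std::vector<int>>
invertedIndex(const std::vector<std::pair<std::string, int>>& postings) {
    std::map<std::string, std::vector<int>> index;
    for (const auto& [word, doc] : postings) {
        auto& docs = index[word];
        if (docs.empty() || docs.back() != doc)  // skip repeats in same doc
            docs.push_back(doc);
    }
    return index;
}
```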

 
{quote}Re: efficient histogram implementation in CUDA

If it helps, [this 
approach|https://scholar.google.com/scholar?cluster=4154868272073145366=en_sdt=0,3]
 has been good for a balance between GPU performance and ease of implementation 
for work I've done in the past. If academic paywalls block you for all those 
results, it looks to also be available (presumably by the authors) on 
[researchgate|https://www.researchgate.net/publication/256674650_An_optimized_approach_to_histogram_computation_on_GPU]
{quote}
 Took a quick look - they are all priced products.  I will take a look at 
researchgate sometime.

I apologize, but I may not be very responsive over the next month or so, as 
we're in the middle of a release at work - and then there's my night-time job 
(this).


[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2019-06-19 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16863938#comment-16863938
 ] 

Rinka Singh edited comment on LUCENE-7745 at 6/19/19 10:34 AM:
---

Hi [~mackncheesiest], All,
 A quick update. I've been going really slow (sorry 'bout that) - my day job 
has consumed a lot of my time. What I have working is histogramming (on text 
files) on GPUs. The problem is that it is horrendously slow - I use GPU global 
memory all the way (it's just about 4-5 times faster than a CPU) instead of 
sorting in shared memory. I've been trying to accelerate that before 
converting it into an inverted index. Nearly there :) - you know how it is, 
the almost-there syndrome... Once I get it done, I'll check it into my 
GitHub.

Here are the lessons I learned on my journey:
 # Do all the decision making on the CPU.  See if parallelization can 
substitute for decision making - you need to think about 
parallelization/optimization as part of the design, not as an afterthought. 
This is counter to what everyone says about optimization; the reason is that 
there can be SIGNIFICANT design changes.
 # Use as fine-grained parallelism as possible.  Don't think one CPU thread == 
one GPU thread; think as parallel as possible.
 # The best metaphor I found for working with GPUs: think of the GPU as an 
embedded board attached to your machine - you move data to and from the 
board and debug on the board.  Put all parallel processing on the board and 
all sequential processing on the CPU.
 # Read the NVIDIA manuals (they are your best bet).  I figured it's better to 
stay with CUDA (as opposed to OpenCL) given the wealth of CUDA information and 
support out there...
 # Writing code:
 ## Explicitly think about cache memory (that's your shared memory) and 
registers, and manage them yourself.  This is COMPLETELY different from 
writing CPU code, where the compiler does this for you.
 ## Try to use const, shared, and register memory as much as possible.  Avoid 
__syncthreads() if you can.
 ## Here's where the dragons lie...
 # Engineer productivity is roughly 1/10th of normal (and I mean writing 
C/C++, not Python).  I've written and thrown away code umpteen times - 
something I just wouldn't need to do when writing standard code.
 # Added on 6/19: as a metaphor, think of this as VHDL/Verilog programming 
(without the timing constructs - timing is not a major issue here, since the 
threads on the device execute in (almost) lockstep).

Having said all this :) I have a bunch of limitations that a regular software 
engineer won't have, and I've been struggling to get past them.  I've been a 
manager for way too long and find it really difficult to focus on just one 
thing (the standard ADHD that most managers eventually develop).  Also, I WAS 
a C programmer, loong ago - no, not even C++ - and I just haven't had the 
bandwidth to pick up C++.  And let's not even talk about my day-job pressures 
- I do this for an hour or two at night (sigh)...

I will put out everything I've done once I've crossed a milestone (a working 
accelerated histogram).  Then I'll modify that to do inverted indexing.

Hope this helps...  In the meantime, if you want me to review 
documents/design/thoughts/anything, please feel free to mail them to me at: 
rinka (dot) singh (at) gmail.  At least ping me - I really don't look at 
the Apache messages and would probably miss something...

Sorry, did I mention - I'm a COMPLETE noob at contributing to open source, so 
please forgive me if I get the processes and systems wrong...  I'm trying to 
learn.



[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2018-11-28 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702135#comment-16702135
 ] 

Rinka Singh edited comment on LUCENE-7745 at 11/28/18 4:57 PM:
---

Edited.  Sorry...

A few questions.
* How critical is the inverted index to the user experience?
* What happens if the inverted index is sped up?
* How many AWS instances would typically be used to search a ~140 GB 
inverted index, and are there any performance numbers around this? (I'd 
like to compare with a server with 8 GPUs costing about $135-140K - I'm not 
sure what equivalent GPU instances on Google Cloud/AWS would cost...)

Assumptions (please validate):
 * Documents are being added to the inverted index, but the index itself 
doesn't grow rapidly.
 * The maximum index size will be less than 140 GB - I assume 8 GPUs.



> Explore GPU acceleration
> 
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Assignee: Ishan Chattopadhyaya
>Priority: Major
>  Labels: gsoc2017, mentor
> Attachments: TermDisjunctionQuery.java, gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be speeded up if computations 
> were to be offloaded from CPU to the GPU(s). With commodity GPUs having as 
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to 
> speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known 
> to be a good candidate for GPU based speedup (esp. when complex polygons are 
> involved). In the past, Mike McCandless has mentioned that "both initial 
> indexing and merging are CPU/IO intensive, but they are very amenable to 
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I 
> volunteer to mentor any GSoC student willing to work on this this summer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2018-11-28 Thread Rinka Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701990#comment-16701990
 ] 

Rinka Singh edited comment on LUCENE-7745 at 11/28/18 3:08 PM:
---

[~jpountz]
{quote}(Unrelated to your comment Rinka, but seeing activity on this issue 
reminded me that I wanted to share something) There are limited use-cases for 
GPU accelelation in Lucene due to the fact that query processing is full of 
branches, especially since we added support for impacts and WAND.{quote}

While yes, branches do impact performance, well-designed (GPU) code will 
consist of a combination of CPU code (the decision-making part) and GPU code. 
For example, I wrote a histogram as a test case that saw SIGNIFICANT 
acceleration, and I also identified further areas of the code that can be 
improved.  I'm fairly sure (gut feel) I can squeeze out at least a 40-50x 
improvement on a mid-sized GPU (given the time, etc.). I think things will be 
much, much better on a high-end GPU, and better still with scale-up on a 
multi-GPU system...

My point is that thinking (GPU-)parallel is a completely different ball game 
and requires a mind-shift.  Once that happens, the value-add will be massive, 
and my gut tells me Lucene is a huge opportunity.

Incidentally, this is why I want to develop a library that I can put out there 
for integration.

{quote}That said Mike initially mentioned that BooleanScorer might be one 
scorer that could benefit from GPU acceleration as it scores large blocks of 
documents at once. I just attached a specialization of a disjunction over term 
queries that should make it easy to experiment with Cuda, see the TODO in the 
end on top of the computeScores method.
{quote}

Lucene is really new to me (and so is working with Apache - sorry, I'm a 
newbie) :). Would you please post links here...





[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2018-06-27 Thread Ishan Chattopadhyaya (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524890#comment-16524890
 ] 

Ishan Chattopadhyaya edited comment on LUCENE-7745 at 6/27/18 11:07 AM:


Here [0] are some very initial experiments that I ran, along with Kishore 
Angani, a colleague at Unbxd.

1. Generic problem: Given a result set (of document hits) and a scoring 
function, return a sorted list of documents along with the computed scores 
(which may leverage one or more indexed fields).
2. Specific problem: Given (up to 11M) points and associated docids, compute 
the distance from a given query point. Return the sorted list of documents 
based on these distances.
3. GPU implementation based on the Thrust library (a C++-based, Apache 2.0 
licensed library), called from a JNI wrapper. Timings include copying the data 
(scores and sorted docids) back from the GPU to the host system and accessing 
it from Java (via DirectByteBuffer).
4. CPU implementation was based on SpatialExample [1], which is perhaps not the 
fastest (points fields are better, I think).
5. Hardware: CPU is i7 5820k 4.3GHz (OC), 32GB RAM @ 2133MHz. GPU is Nvidia GTX 
1080, 11GB GDDR5 memory.

Results seem promising: the GPU is able to score 11M documents in ~50ms! Here, 
blue is GPU and red is CPU (Lucene). 

!gpu-benchmarks.png|width=450!


[0] - https://github.com/chatman/gpu-benchmarks
[1] - 
https://github.com/apache/lucene-solr/blob/master/lucene/spatial-extras/src/test/org/apache/lucene/spatial/SpatialExample.java
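The core of the specific problem above - compute a distance key per docid, then sort docids by that key - can be sketched on the CPU as follows. On the GPU this maps to thrust::transform plus thrust::sort_by_key; the names and 2-D points here are illustrative, not taken from the benchmark repo.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct Point { double x, y; };

// For each docid, compute squared distance to the query point, then
// return docids sorted by that distance (sort_by_key analogue).
// Squared distance preserves the ordering, so the sqrt is skipped.
std::vector<int> rankByDistance(const std::vector<int>& docIds,
                                const std::vector<Point>& pts,
                                Point query) {
    std::vector<std::pair<double, int>> keyed(docIds.size());
    for (size_t i = 0; i < docIds.size(); ++i) {
        double dx = pts[i].x - query.x, dy = pts[i].y - query.y;
        keyed[i] = {dx * dx + dy * dy, docIds[i]};
    }
    std::sort(keyed.begin(), keyed.end());  // sorts by distance key
    std::vector<int> out(docIds.size());
    for (size_t i = 0; i < keyed.size(); ++i) out[i] = keyed[i].second;
    return out;
}
```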





[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2018-06-27 Thread Ishan Chattopadhyaya (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524890#comment-16524890
 ] 

Ishan Chattopadhyaya edited comment on LUCENE-7745 at 6/27/18 10:51 AM:


Here [0] are some very initial experiments that I ran, along with Kishore 
Angani, a colleague at Unbxd.

1. Generic problem: Given a result set (of document hits) and a scoring 
function, return a sorted list of documents along with the computed scores.
2. Specific problem: Given (up to 11M) points and associated docids, compute 
the distance from a given query point. Return the sorted list of documents 
based on these distances.
3. GPU implementation based on Thrust library (C++ based Apache 2.0 licensed 
library), called from JNI wrapper. Timings include copying data (scores and 
sorted docids) back from GPU to host system and access from Java (via 
DirectByteBuffer).
4. CPU implementation was based on SpatialExample [1], which is perhaps not the 
fastest (points fields are better, I think).
5. Hardware: CPU is i7 5820k 4.3GHz (OC), 32GB RAM @ 2133MHz. GPU is Nvidia GTX 
1080, 11GB GDDR5 memory.

Results seem promising. The GPU is able to score 11M documents in ~50ms!. Here, 
blue is GPU and red is CPU (Lucene). 

!gpu-benchmarks.png|width=450!


[0] - https://github.com/chatman/gpu-benchmarks
[1] - 
https://github.com/apache/lucene-solr/blob/master/lucene/spatial-extras/src/test/org/apache/lucene/spatial/SpatialExample.java


was (Author: ichattopadhyaya):
Here [0] are some very initial experiments that I ran, along with Kishore 
Angani, a colleague at Unbxd.

1. Generic problem: Given a result set (of document hits) and a scoring 
function, return a sorted list of documents along with the computed scores.
2. Specific problem: Given (up to 11M) points and associated docids, compute 
the distance from a given query point. Return the sorted list of documents 
based on these distances.
3. GPU implementation based on Thrust library (C++ based Apache 2.0 licensed 
library), called from JNI wrapper. Timings include copying data (scores and 
sorted docids) back from GPU to host system and access from Java (via 
DirectByteBuffer).
4. CPU implementation was based on SpatialExample [1], which is perhaps not the 
fastest (points fields are better, I think).
5. Hardware: CPU is i7 5820k 4.3GHz (OC), 32GB RAM @ 2133MHz. GPU is Nvidia GTX 
1080, 11GB GDDR5 memory.

Results seem promising. The GPU is able to score 11M documents in ~50ms!. Here, 
blue is GPU and red is CPU (Lucene). 

!gpu-benchmarks.png|width=800!


[0] - https://github.com/chatman/gpu-benchmarks
[1] - 
https://github.com/apache/lucene-solr/blob/master/lucene/spatial-extras/src/test/org/apache/lucene/spatial/SpatialExample.java

> Explore GPU acceleration
> 
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Major
>  Labels: gsoc2017, mentor
> Attachments: gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be speeded up if computations 
> were to be offloaded from CPU to the GPU(s). With commodity GPUs having as 
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to 
> speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known 
> to be a good candidate for GPU based speedup (esp. when complex polygons are 
> involved). In the past, Mike McCandless has mentioned that "both initial 
> indexing and merging are CPU/IO intensive, but they are very amenable to 
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I 
> volunteer to mentor any GSoC student willing to work on this this summer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2018-06-27 Thread Ishan Chattopadhyaya (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524890#comment-16524890
 ] 

Ishan Chattopadhyaya edited comment on LUCENE-7745 at 6/27/18 10:51 AM:


Here [0] are some very initial experiments that I ran, along with Kishore 
Angani, a colleague at Unbxd.

1. Generic problem: Given a result set (of document hits) and a scoring 
function, return a sorted list of documents along with the computed scores.
2. Specific problem: Given (up to 11M) points and associated docids, compute 
the distance from a given query point. Return the sorted list of documents 
based on these distances.
3. GPU implementation based on Thrust library (C++ based Apache 2.0 licensed 
library), called from JNI wrapper. Timings include copying data (scores and 
sorted docids) back from GPU to host system and access from Java (via 
DirectByteBuffer).
4. CPU implementation was based on SpatialExample [1], which is perhaps not the 
fastest (points fields are better, I think).
5. Hardware: CPU is i7 5820k 4.3GHz (OC), 32GB RAM @ 2133MHz. GPU is Nvidia GTX 
1080, 11GB GDDR5 memory.

Results seem promising. The GPU is able to score 11M documents in ~50ms!. Here, 
blue is GPU and red is CPU (Lucene). 

!gpu-benchmarks.png|width=800!


[0] - https://github.com/chatman/gpu-benchmarks
[1] - 
https://github.com/apache/lucene-solr/blob/master/lucene/spatial-extras/src/test/org/apache/lucene/spatial/SpatialExample.java
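The score-and-sort step from points 1-3 can be sketched in plain Java. This is a CPU-side analog of the GPU pipeline (which would use Thrust primitives such as `transform` and `sort_by_key`); the method name is hypothetical, and squared distances are used since they give the same ordering without a sqrt:

```java
import java.util.Arrays;
import java.util.Comparator;

public class DistanceSort {
    /** Returns docids sorted by ascending distance of their point from the query point (qx, qy). */
    static int[] sortByDistance(float[] xs, float[] ys, int[] docids, float qx, float qy) {
        int n = docids.length;
        float[] scores = new float[n];
        for (int i = 0; i < n; i++) {
            float dx = xs[i] - qx, dy = ys[i] - qy;
            scores[i] = dx * dx + dy * dy; // squared distance: same ordering, no sqrt needed
        }
        // sort an index permutation by score, then gather docids in that order
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> scores[i]));
        int[] sorted = new int[n];
        for (int i = 0; i < n; i++) sorted[i] = docids[order[i]];
        return sorted;
    }

    public static void main(String[] args) {
        float[] xs = {0f, 3f, 1f};
        float[] ys = {0f, 4f, 1f};
        int[] docids = {10, 20, 30};
        // query at the origin: doc 10 is nearest, then 30, then 20
        System.out.println(Arrays.toString(sortByDistance(xs, ys, docids, 0f, 0f)));
    }
}
```

On the GPU, the distance computation and the key-value sort are each a single data-parallel primitive, which is where the ~50 ms figure for 11M points comes from.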



> Explore GPU acceleration
> 
>
> Key: LUCENE-7745
> URL: https://issues.apache.org/jira/browse/LUCENE-7745
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ishan Chattopadhyaya
>Priority: Major
>  Labels: gsoc2017, mentor
> Attachments: gpu-benchmarks.png
>
>
> There are parts of Lucene that can potentially be speeded up if computations 
> were to be offloaded from CPU to the GPU(s). With commodity GPUs having as 
> high as 12GB of high bandwidth RAM, we might be able to leverage GPUs to 
> speed parts of Lucene (indexing, search).
> First that comes to mind is spatial filtering, which is traditionally known 
> to be a good candidate for GPU based speedup (esp. when complex polygons are 
> involved). In the past, Mike McCandless has mentioned that "both initial 
> indexing and merging are CPU/IO intensive, but they are very amenable to 
> soaking up the hardware's concurrency."
> I'm opening this issue as an exploratory task, suitable for a GSoC project. I 
> volunteer to mentor any GSoC student willing to work on this over the summer.












[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2017-04-03 Thread vikash (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953766#comment-15953766
 ] 

vikash edited comment on LUCENE-7745 at 4/3/17 4:46 PM:


Oops, I could not do that. I submitted my proposal, and if you check it now, the latest 
edited version is the submitted one; I made some changes to it again before submitting. 
Sadly, I could not change the GitHub link, so it only points to my home directory on 
GitHub. Can I still start working now? I will give you the link to my work, and if it 
would be possible for you, you could show my work to the Apache Software Foundation; 
would that be OK? As I said in my proposal, I will work from April itself, so I will 
get some work done. Will the repository I build for Lucene and the work I store there 
be checked by the ASF, by visiting my profile and navigating to that Lucene repository? 
Could that improve my chances?
And by whom will my proposal be reviewed?












[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2017-03-28 Thread Ishan Chattopadhyaya (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15945632#comment-15945632
 ] 

Ishan Chattopadhyaya edited comment on LUCENE-7745 at 3/28/17 5:58 PM:
---

Hi Vikash,

Regarding licensing issue:
The work done in this project would be exploratory. That code won't necessarily 
go into Lucene. When we are at a point where we see clear benefits from the 
work done here, we would then have to explore all aspects of productionizing it 
(including licensing).

Regarding next steps:
{quote}
BooleanScorer calls a lot of classes, e.g. the BM25 similarity or TF-IDF to do 
the calculation that could possibly be parallelized.
{quote}
# First, understand how BooleanScorer calls these similarity classes and does 
the scoring. There are unit tests in Lucene that can help you get there. This 
might help: https://wiki.apache.org/lucene-java/HowToContribute
# Write a standalone CUDA/OpenCL project that does the same processing on the 
GPU.
# Benchmark the speed of doing so on GPU vs. speed observed when doing the same 
through the BooleanScorer. Preferably, on a large resultset. Include time for 
copying results and scores in and out of the device memory from/to the main 
memory.
# Optimize step 2, if possible.

Once this is achieved (which in itself could be a sufficient GSoC project), one 
can have stretch goals to try out other parts of Lucene to optimize (e.g. 
spatial search).

Another stretch goal, if the optimization results are positive, could be to integrate 
the solution into Lucene. The most suitable way to do so would be to create hooks in 
Lucene so that plugins can be built that delegate parts of the processing to external 
code, and then to write a plugin (using jCuda, for example) and do an integration test.
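The measurement protocol in step 3 might look like this minimal sketch; `score` here is a stand-in for the JNI-backed GPU call (a hypothetical name, with a toy scoring function), and a real harness would also include the host-to-device and device-to-host copies inside the timed region:

```java
import java.util.concurrent.TimeUnit;

public class ScoringBench {
    /** Stand-in for the JNI-backed GPU scorer; here just a plain CPU loop. */
    static float[] score(float[] termFreqs) {
        float[] scores = new float[termFreqs.length];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = (float) Math.log(1 + termFreqs[i]); // toy scoring function
        }
        return scores;
    }

    public static void main(String[] args) {
        float[] tf = new float[1_000_000];
        for (int i = 0; i < tf.length; i++) tf[i] = i % 10;
        score(tf); // warm-up so JIT compilation is not included in the measurement
        long t0 = System.nanoTime();
        float[] scores = score(tf); // in a real run: copy in, launch kernel, copy out
        long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
        System.out.println("scored " + scores.length + " docs in " + ms + " ms");
    }
}
```

The same harness would drive both the BooleanScorer path and the GPU path over the same large result set, so the two timings are directly comparable.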






[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2017-03-28 Thread vikash (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944884#comment-15944884
 ] 

vikash edited comment on LUCENE-7745 at 3/28/17 9:48 AM:
-

Hi all, I have been reading about GPU acceleration, and GPU-accelerated computing in 
particular. I find this project very interesting, so can anyone give me a further lead 
on what is to be done now? The ideas Ishan suggested are pretty good, but I still do 
not understand what Mr. David means by "(a) could whatever comes of this actually be 
contributed to Lucene itself". Why do you doubt that the outcome of this project could 
be contributed to Lucene?






[jira] [Comment Edited] (LUCENE-7745) Explore GPU acceleration

2017-03-20 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932714#comment-15932714
 ] 

Uwe Schindler edited comment on LUCENE-7745 at 3/20/17 2:24 PM:


Hi,
in general, including CUDA in Lucene may be a good idea, but I see no real 
possibility of doing this inside Lucene Core or any other module. My idea would be 
to add some abstraction to the relevant parts of Lucene and make it easier to 
"plug in" different implementations. This code could then also be hosted 
outside Lucene (if licensing is a problem), e.g. on GitHub.

We should still keep the following in mind: Mike's example looks "simple" 
as a quick test to see if there are gains, but making the whole thing ready to commit 
or to bundle in any project in/outside Lucene is a whole different story. Currently, 
BooleanScorer calls a lot of classes, e.g. the BM25 or TF-IDF similarity, to do 
the calculations that could possibly be parallelized. To move all of this to 
CUDA, you would have to add "plugin points" everywhere and change the APIs completely. 
It is also hard to test, because none of our Jenkins servers has a GPU! And for 
ordinary users of Lucene, it could be a huge problem if we add native code 
to Lucene that they may never use. Because of that, it MUST BE SEPARATED from 
Lucene core. Completely...

IMHO, if I were to focus solely on GPU parallelization, I'd create a whole new 
search engine in C, like CLucene. The current iterator-based approaches are 
not easy to transform or plug into CUDA...

For the GSoC project, we should make clear to the student that this is just 
a project to "explore" GPU acceleration and see whether it brings any performance 
gain - I doubt that, because the call overhead between Java and CUDA is way too high, 
in contrast to Postgres, where everything is in plain C/C++. The results would then 
be used to plan and investigate ways to include this in Lucene as "plugin points" 
(e.g., as SPI modules).
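The "plugin points ... as SPI modules" idea could be sketched with Java's `ServiceLoader`. All type names below are hypothetical illustrations, not actual Lucene APIs:

```java
import java.util.ServiceLoader;

/** Hypothetical plugin point: a bulk scorer that could be backed by native/GPU code. */
interface BulkScorerProvider {
    float[] scoreAll(float[] termFreqs);
}

/** Default pure-Java implementation; a GPU-backed JNI variant would implement the same interface. */
class JavaBulkScorerProvider implements BulkScorerProvider {
    public float[] scoreAll(float[] termFreqs) {
        float[] scores = new float[termFreqs.length];
        for (int i = 0; i < termFreqs.length; i++) scores[i] = termFreqs[i]; // trivial scoring
        return scores;
    }
}

public class PluginPointDemo {
    public static void main(String[] args) {
        // ServiceLoader discovers providers registered under META-INF/services/...;
        // fall back to the pure-Java implementation when no external plugin is on the classpath.
        BulkScorerProvider provider = ServiceLoader.load(BulkScorerProvider.class)
                .findFirst()
                .orElseGet(JavaBulkScorerProvider::new);
        System.out.println(provider.scoreAll(new float[]{1f, 2f}).length);
    }
}
```

This keeps the native/GPU code entirely outside Lucene core: core only knows the interface, and a separately hosted plugin jar can register a CUDA-backed provider.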


