Re: Running Lucene/SOR on Hadoop

2016-01-09 Thread Steve Davids
You might consider trying to get the de-duplication done at index time (https://cwiki.apache.org/confluence/display/solr/De-Duplication); that way the MapReduce job wouldn't even be necessary. When it comes to the MapReduce job, you would need to be more specific about *what* you are doing for

Re: Running Lucene/SOR on Hadoop

2016-01-09 Thread Dino Chopins
Hi Tim, Thank you for the great pointer. Will join the group. Thanks, Dino On Tue, Jan 5, 2016 at 2:10 AM, Tim Williams wrote: > Apache Blur (Incubating) has several approaches (hive, spark, m/r) > that could probably help with this ranging from very experimental to >

Re: Running Lucene/SOR on Hadoop

2016-01-09 Thread Dino Chopins
Hi Steve, I cannot deduplicate at index time; what I need is to find the duplicates of each document and then report the duplicate data back to the user. Yes, I need to run a query for each document across all 40 million rows. There will be about 10 mapper tasks at most. I will try SolrJ for this purpose. Thanks Steve.
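
For illustration only, here is a minimal sketch of the "find duplicates and report them back" step, independent of Solr, SolrJ and Hadoop. It assumes that two rows count as duplicates when a chosen set of fields matches after trimming and lower-casing, and it uses an MD5 digest as the grouping key. The field names, the sample rows and the helper names (`signature`, `find_duplicates`) are hypothetical and do not come from the thread.

```python
import hashlib
from collections import defaultdict


def signature(row, fields):
    """Return an MD5 hex digest over the normalized values of the chosen fields."""
    # Assumption: trimming and lower-casing is enough normalization for a match.
    parts = [str(row.get(f, "")).strip().lower() for f in fields]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


def find_duplicates(rows, fields, id_field="id"):
    """Group row ids by signature and return only the groups with more than one id."""
    groups = defaultdict(list)
    for row in rows:
        groups[signature(row, fields)].append(row[id_field])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical sample rows; real input would come from the 40 million row data set.
rows = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.com"},
    {"id": 2, "name": "ann lee ", "email": "ANN@example.com"},
    {"id": 3, "name": "Bob Ray", "email": "bob@example.com"},
]

duplicates = find_duplicates(rows, ["name", "email"])
print(duplicates)  # [[1, 2]]
```

Exact-match signatures only catch rows that are identical after normalization; near-duplicate detection would need a fuzzier signature or a similarity query, which this sketch does not attempt.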

Re: Running Lucene/SOR on Hadoop

2016-01-04 Thread Tim Williams
Apache Blur (Incubating) has several approaches (hive, spark, m/r) that could probably help with this ranging from very experimental to stable. If you're interested, you can ask over on blur-u...@incubator.apache.org ... Thanks, --tim On Fri, Dec 25, 2015 at 4:28 AM, Dino Chopins

Re: Running Lucene/SOR on Hadoop

2015-12-24 Thread Dino Chopins
Hi Erick, Thank you for your response and pointer. What I mean by running Lucene/SOLR on Hadoop is having a Lucene/SOLR index available to be queried from MapReduce, or by whatever best practice is recommended. I need this mechanism to do large-scale row deduplication. Let me elaborate why I need

Running Lucene/SOR on Hadoop

2015-12-13 Thread Dino Chopins
Hi, I've tried to figure out how we can run Lucene/SOLR on Hadoop, and found several sources. The last pointer is the Apache Blur project, which is an incubating project. Is there any straightforward implementation of Lucene/SOLR on Hadoop? Or a best practice for how to incorporate Lucene/SOLR on

Re: Running Lucene/SOR on Hadoop

2015-12-13 Thread Erick Erickson
First, what do you mean by "run Lucene/Solr on Hadoop"? You can use the HdfsDirectoryFactory to store Solr/Lucene indexes on Hadoop; at that point the actual filesystem that holds the index is transparent to the end user, and you just use Solr as you would if it were using indexes on the local file