Re: Lucene, Spark, HDFS question

2018-03-14 Thread Debasish Das
I have written a Spark-Lucene integration as part of the Verizon
trapezium/dal project. With it you can extract the data stored in Lucene
indices on HDFS and feed it to Spark:

https://github.com/Verizon/trapezium/tree/master/dal/src/test/scala/com/verizon/bda/trapezium/dal

I intend to publish it as a Spark package as soon as I get time.

You can also use spark-solr or spark-elastic, but to stay performant I did
not want to bring in a Solr or Elasticsearch dependency.

Thanks.
Deb



Lucene, Spark, HDFS question

2018-03-13 Thread Tom Hirschfeld
Hello!


*Background*: My team is running a machine learning pipeline, and part of
the pipeline scrapes a web-based Lucene application over HTTP. The scrape
outputs a CSV file that we then upload to HDFS and use as input to a Spark
ML job.

*Question*: Is there a way for our Spark application to read from a Lucene
index stored in HDFS? Specifically, I see that solr-core has an HDFS
directory type that seems to be compatible with our Lucene IndexReader. Is
it actually compatible? Can we store our index in HDFS and read it from a
Spark job?


Best,
Tom Hirschfeld
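
One possible way to approach the question above, as a minimal sketch only.
This is not the trapezium/dal code and it does not use solr-core's HDFS
directory type. It swaps in a simpler technique: each Spark task copies one
index directory from HDFS to local disk and opens it with plain Lucene. The
names LuceneHdfsSketch, readStoredFields, indexShardsToRdd and shardDirs are
hypothetical. It assumes every HDFS directory in shardDirs holds one complete
Lucene index, that the index fits on the executor's local disk, and a Lucene
version where IndexReader.document(int) is still available.

```scala
import java.nio.file.{Files, Path => LocalPath}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path => HdfsPath}
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.store.FSDirectory
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

import scala.collection.JavaConverters._

// Hypothetical helper object, for illustration only.
object LuceneHdfsSketch {

  /** Reads the stored fields of every live document in a local Lucene index.
    * Only fields with a string representation are kept; for multi-valued
    * fields the last value wins because the result is a Map.
    * The whole index is materialized in memory, so this suits small shards. */
  def readStoredFields(localIndexDir: LocalPath): Vector[Map[String, String]] = {
    val directory = FSDirectory.open(localIndexDir)
    val reader = DirectoryReader.open(directory)
    try {
      reader.leaves().asScala.toVector.flatMap { ctx =>
        val leaf = ctx.reader()
        val live = leaf.getLiveDocs() // null when the segment has no deletions
        (0 until leaf.maxDoc()).iterator
          .filter(i => live == null || live.get(i))
          .map { i =>
            leaf.document(i).getFields.asScala
              .filter(_.stringValue() != null)
              .map(f => f.name() -> f.stringValue())
              .toMap
          }
          .toVector
      }
    } finally {
      reader.close()
      directory.close()
    }
  }

  /** Builds an RDD with one partition per index directory. Each task copies
    * its directory from HDFS to a local temp directory, then reads it.
    * Assumption: the default Hadoop Configuration on the executors can
    * resolve the paths in shardDirs (for example full hdfs:// URIs).
    * The temp directories are not cleaned up in this sketch. */
  def indexShardsToRdd(spark: SparkSession,
                       shardDirs: Seq[String]): RDD[Map[String, String]] =
    spark.sparkContext
      .parallelize(shardDirs, math.max(shardDirs.size, 1))
      .flatMap { shardDir =>
        val local = Files.createTempDirectory("lucene-shard")
        val src = new HdfsPath(shardDir)
        val fs = src.getFileSystem(new Configuration())
        fs.listStatus(src).filter(_.isFile).foreach { status =>
          val dst = new HdfsPath(local.resolve(status.getPath.getName).toString)
          // useRawLocalFileSystem = true avoids writing .crc side files
          fs.copyToLocalFile(false, status.getPath, dst, true)
        }
        readStoredFields(local)
      }
}
```

The resulting RDD of field maps could then be converted to a DataFrame and
passed to the ML job in place of the CSV file. Copying to local disk trades
extra I/O for not needing any Solr dependency; whether that is acceptable
depends on index size.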