RE: Solr with Hadoop

2013-07-18 Thread Saikat Kanjilal
I have used the DSE cluster and am currently evaluating Cloudera Search. In 
general, Cloudera Search integrates tightly with HDFS and takes care of 
replication and sharding transparently by relying on the existing HDFS 
replication and sharding. However, Cloudera Search uses SolrCloud underneath, 
so you would need to install ZooKeeper to coordinate the Solr nodes. DataStax 
lets you talk to Solr, but its model scales around the data model and 
architecture of Cassandra. Release 3.1 adds some Solr admin functionality and 
removes the need to write Cassandra-specific code.

If you go the open source route you have a few options:

1) You can build a custom plugin inside Solr that queries HDFS internally 
and returns data. You would need to figure out how to scale this, potentially 
with a solution very similar to Cloudera Search (i.e. leverage SolrCloud), and 
if you use SolrCloud you would need to install ZooKeeper for node coordination.

2) You could create a Flume channel that accumulates specific events from 
HDFS and a sink that writes the data directly to Solr.

3) I would look at Cloudera Search if you need tight integration with Hadoop; 
it might save you some time and effort.

I don't think you want Solr to trigger MapReduce jobs if you need very fast 
throughput from your search service.
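
As a rough illustration of option 2 (writing data directly to Solr rather than 
searching HDFS), here is a minimal Python sketch of what such a sink has to do 
with CSV input: turn rows into documents and group them into JSON batches. The 
field names, the batch size, and the helper names are assumptions made up for 
this example; each payload would then be POSTed to the collection's /update 
handler as JSON, which is not shown here.

```python
import csv
import io
import json


def csv_to_solr_docs(csv_text, id_field="id"):
    """Turn CSV text (with a header row) into a list of Solr-style documents."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_number, row in enumerate(reader, start=1):
        # Drop empty cells so they are not indexed as empty strings.
        doc = {name: value for name, value in row.items() if value not in (None, "")}
        # Fall back to the row number when the (assumed) id column is missing.
        doc.setdefault(id_field, str(row_number))
        docs.append(doc)
    return docs


def batches(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def update_payloads(csv_text, batch_size=2):
    """Return one JSON array per batch, ready to be sent as an update request."""
    docs = csv_to_solr_docs(csv_text)
    return [json.dumps(batch) for batch in batches(docs, batch_size)]


# Hypothetical sample data; the real CSV layout is not given in this thread.
SAMPLE_CSV = "id,title,body\n1,first,hello\n2,second,\n3,third,world\n"

if __name__ == "__main__":
    for payload in update_payloads(SAMPLE_CSV):
        print(payload)
```

Batching keeps the number of update requests small, which usually matters more 
for indexing throughput than the per-document conversion cost.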


Hope this helps, ping me offline if you have more questions.
Regards

> From: mlie...@impetus.com
> To: solr-user@lucene.apache.org
> Subject: Re: Solr with Hadoop
> Date: Thu, 18 Jul 2013 15:41:36 +
> 
> Rajesh,
> 
> If you require to have an integration between Solr and Hadoop or NoSQL, I
> would recommend using a commercial distribution. I think most are free to
> use as long as you don't require support.
> I inquired about the Cloudera Search capability, but it seems like that
> far it is just preliminary: there is no tight integration yet between
> Hbase and Solr, for example, other than full text search on the HDFS data
> (I believe enabled in Hue). I am not too familiar with what MapR's M7 has
> to offer.
> However Datastax does a good job of tightly integrating Solr with
> Cassandra, and lets you query over the data ingested from Solr in Hive for
> example, which is pretty nice. Solr would not trigger Hadoop jobs, though.
> 
> Cheers,
> Matt
> 
> 
> On 7/17/13 7:37 PM, "Rajesh Jain"  wrote:
> 
> >I have a newbie question on integrating Solr with Hadoop.
> >
> >There are some vendors like Cloudera/MapR who have announced Solr Search
> >for Hadoop.
> >
> >If I use the Apache distro, how can I use Solr Search on docs in
> >HDFS/Hadoop
> >
> >Is there a tutorial on how to use it or getting started.
> >
> >I am using Flume to sink CSV docs into Hadoop/HDFS and I would like to use
> >Solr to provide Search.
> >
> >Does Solr Search trigger MapReduce Jobs (like Splunk-Hunk) does?
> >
> >Thanks,
> >Rajesh
> >

Re: Solr with Hadoop

2013-07-18 Thread Matt Lieber
Rajesh,

If you need integration between Solr and Hadoop or NoSQL, I
would recommend using a commercial distribution. I think most are free to
use as long as you don't require support.
I inquired about the Cloudera Search capability, but it seems that so
far it is only preliminary: there is no tight integration yet between
HBase and Solr, for example, other than full-text search on the HDFS data
(I believe enabled in Hue). I am not too familiar with what MapR's M7 has
to offer.
However, DataStax does a good job of tightly integrating Solr with
Cassandra, and lets you query the data ingested from Solr in Hive, for
example, which is pretty nice. Solr would not trigger Hadoop jobs, though.

Cheers,
Matt


On 7/17/13 7:37 PM, "Rajesh Jain"  wrote:

>I have a newbie question on integrating Solr with Hadoop.
>
>There are some vendors like Cloudera/MapR who have announced Solr Search
>for Hadoop.
>
>If I use the Apache distro, how can I use Solr Search on docs in
>HDFS/Hadoop
>
>Is there a tutorial on how to use it or getting started.
>
>I am using Flume to sink CSV docs into Hadoop/HDFS and I would like to use
>Solr to provide Search.
>
>Does Solr Search trigger MapReduce Jobs (like Splunk-Hunk) does?
>
>Thanks,
>Rajesh
>

NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


Re: solr with hadoop

2010-07-06 Thread Jason Rutherglen
> If you do distributed indexing correctly, what about updating the documents
> and what about replicating them correctly?

Yes, you can do that and it'll work great.

On Mon, Jul 5, 2010 at 7:42 AM, MitchK  wrote:
>
> I need to revive this discussion...
>
> If you do distributed indexing correctly, what about updating the documents
> and what about replicating them correctly?
>
> Does this work? Or wasn't this an issue?
>
> Kind regards
> - Mitch
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p944413.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr with hadoop

2010-07-05 Thread MitchK

I need to revive this discussion...

If you do distributed indexing correctly, what about updating the documents
and what about replicating them correctly?

Does this work? Or wasn't this an issue?

Kind regards
- Mitch
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p944413.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr with hadoop

2010-06-23 Thread Otis Gospodnetic
I don't think it's ever been discussed - your Q below is #1 hit currently: 
http://search-lucene.com/?q=%2B%28dih+OR+dataimporthandler%29+hdfs
 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Jon Baer 
> To: solr-user@lucene.apache.org
> Sent: Tue, June 22, 2010 12:47:14 PM
> Subject: Re: solr with hadoop
> 
> I was playing around w/ Sqoop the other day, its a simple Cloudera tool for 
> imports (mysql -> hdfs) @ http://www.cloudera.com/developers/downloads/sqoop/
> 
> It seems to me (it would be pretty efficient) to dump to HDFS and have 
> something like Data Import Handler be able to read from hdfs:// directly ...
> 
> Has this route been discussed / developed before (ie DIH w/ hdfs:// handler)?
> 
> - Jon



Re: solr with hadoop

2010-06-22 Thread Jon Baer
I was playing around w/ Sqoop the other day; it's a simple Cloudera tool for 
imports (mysql -> hdfs) @ http://www.cloudera.com/developers/downloads/sqoop/

It seems to me it would be pretty efficient to dump to HDFS and have 
something like Data Import Handler be able to read from hdfs:// directly ...

Has this route been discussed / developed before (ie DIH w/ hdfs:// handler)?

- Jon

On Jun 22, 2010, at 12:29 PM, MitchK wrote:

> 
> I wanted to add a Jira-issue about exactly what Otis is asking here.
> Unfortunately, I haven't time for it because of my exams.
> 
> However, I'd like to add a question to Otis' ones:
> If you distribute the indexing process this way, are you able to replicate
> the different documents correctly?
> 
> Thank you.
> - Mitch
> 
> Otis Gospodnetic-2 wrote:
>> 
>> Stu,
>> 
>> Interesting!  Can you provide more details about your setup?  By "load
>> balance the indexing stage" you mean "distribute the indexing process",
>> right?  Do you simply take your content to be indexed, split it into N
>> chunks where N matches the number of TaskNodes in your Hadoop cluster and
>> provide a map function that does the indexing?  What does the reduce
>> function do?  Does that call IndexWriter.addAllIndexes or do you do that
>> outside Hadoop?
>> 
>> Thanks,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> 
>> - Original Message 
>> From: Stu Hood 
>> To: solr-user@lucene.apache.org
>> Sent: Monday, January 7, 2008 7:14:20 PM
>> Subject: Re: solr with hadoop
>> 
>> As Mike suggested, we use Hadoop to organize our data en route to Solr.
>> Hadoop allows us to load balance the indexing stage, and then we use
>> the raw Lucene IndexWriter.addAllIndexes method to merge the data to be
>> hosted on Solr instances.
>> 
>> Thanks,
>> Stu
>> 
>> 
>> 
>> -Original Message-
>> From: Mike Klaas 
>> Sent: Friday, January 4, 2008 3:04pm
>> To: solr-user@lucene.apache.org
>> Subject: Re: solr with hadoop
>> 
>> On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:
>> 
>>> I have huge index base (about 110 millions documents, 100 fields  
>>> each). But size of the index base is reasonable, it's about 70 Gb.  
>>> All I need is increase performance, since some queries, which match  
>>> big number of documents, are running slow.
>>> So I was thinking is any benefits to use hadoop for this? And if  
>>> so, what direction should I go? Is anybody did something for  
>>> integration Solr with Hadoop? Does it give any performance boost?
>>> 
>> Hadoop might be useful for organizing your data enroute to Solr, but  
>> I don't see how it could be used to boost performance over a huge  
>> Solr index.  To accomplish that, you need to split it up over two  
>> machines (for which you might find hadoop useful).
>> 
>> -Mike
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914589.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr with hadoop

2010-06-22 Thread MitchK

I wanted to add a Jira-issue about exactly what Otis is asking here.
Unfortunately, I haven't time for it because of my exams.

However, I'd like to add a question to Otis' ones:
If you distribute the indexing process this way, are you able to replicate
the different documents correctly?

Thank you.
- Mitch

Otis Gospodnetic-2 wrote:
> 
> Stu,
> 
> Interesting!  Can you provide more details about your setup?  By "load
> balance the indexing stage" you mean "distribute the indexing process",
> right?  Do you simply take your content to be indexed, split it into N
> chunks where N matches the number of TaskNodes in your Hadoop cluster and
> provide a map function that does the indexing?  What does the reduce
> function do?  Does that call IndexWriter.addAllIndexes or do you do that
> outside Hadoop?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> - Original Message 
> From: Stu Hood 
> To: solr-user@lucene.apache.org
> Sent: Monday, January 7, 2008 7:14:20 PM
> Subject: Re: solr with hadoop
> 
> As Mike suggested, we use Hadoop to organize our data en route to Solr.
>  Hadoop allows us to load balance the indexing stage, and then we use
>  the raw Lucene IndexWriter.addAllIndexes method to merge the data to be
>  hosted on Solr instances.
> 
> Thanks,
> Stu
> 
> 
> 
> -Original Message-
> From: Mike Klaas 
> Sent: Friday, January 4, 2008 3:04pm
> To: solr-user@lucene.apache.org
> Subject: Re: solr with hadoop
> 
> On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:
> 
>> I have huge index base (about 110 millions documents, 100 fields  
>> each). But size of the index base is reasonable, it's about 70 Gb.  
>> All I need is increase performance, since some queries, which match  
>> big number of documents, are running slow.
>> So I was thinking is any benefits to use hadoop for this? And if  
>> so, what direction should I go? Is anybody did something for  
>> integration Solr with Hadoop? Does it give any performance boost?
>>
> Hadoop might be useful for organizing your data enroute to Solr, but  
> I don't see how it could be used to boost performance over a huge  
> Solr index.  To accomplish that, you need to split it up over two  
> machines (for which you might find hadoop useful).
> 
> -Mike
> 
> 
> 
> 
> 
> 
> 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914589.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr with hadoop

2010-06-22 Thread Marc Sturlese

I think a good solution could be to use hadoop with SOLR-1301 to build solr
shards and then use solr distributed search against these shards (you will
have to copy to local from HDFS to search against them)
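
For the distributed search half of this suggestion, a query is sent to one Solr
instance with a "shards" parameter listing every shard to consult. The sketch
below only builds such a request URL in Python; the host names, port, and core
path are placeholders invented for the example, not values from this thread.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def distributed_query_url(coordinator, shards, query, rows=10):
    """Build a select URL that asks `coordinator` to fan the query out to `shards`."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated list of shard addresses, without the http:// prefix.
        "shards": ",".join(shards),
    }
    return "http://%s/select?%s" % (coordinator, urlencode(params))


def shards_in(url):
    """Read the shard list back out of a URL built by distributed_query_url."""
    return parse_qs(urlparse(url).query)["shards"][0].split(",")


# Hypothetical shard hosts holding the indexes copied out of HDFS.
SHARDS = ["search1:8983/solr", "search2:8983/solr"]

if __name__ == "__main__":
    url = distributed_query_url(SHARDS[0], SHARDS, "title:hadoop")
    print(url)
    print(shards_in(url))
```

Any of the shard hosts can act as the coordinator here; it queries the listed
shards and merges their results before answering.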
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914576.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr with hadoop

2010-06-22 Thread Neeb

Hi,

We currently have a master-slave setup for Solr with two slave servers. We
are using SolrJ (stream-update-solr-server) to index into this setup, which
takes 6 hours for around 15 million documents.

I would like to explore Hadoop, in particular for running the indexing job
with a MapReduce approach.

- I have read some comments on the JIRA tickets, but it is still unclear
how this setup would work.
- I am not sure which tasks are done in the map phase and which in the
reduce phase.
- Would it merge the multiple indices into one during the reduce
phase, or is that a separate task outside MapReduce?

Any directions and guidance over this setup would be highly appreciated.

Thanks in advance,
-Ali
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914483.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr with hadoop

2008-01-07 Thread Otis Gospodnetic
Stu,

Interesting!  Can you provide more details about your setup?  By "load balance 
the indexing stage" you mean "distribute the indexing process", right?  Do you 
simply take your content to be indexed, split it into N chunks where N matches 
the number of TaskNodes in your Hadoop cluster and provide a map function that 
does the indexing?  What does the reduce function do?  Does that call 
IndexWriter.addAllIndexes or do you do that outside Hadoop?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Stu Hood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, January 7, 2008 7:14:20 PM
Subject: Re: solr with hadoop

As Mike suggested, we use Hadoop to organize our data en route to Solr.
 Hadoop allows us to load balance the indexing stage, and then we use
 the raw Lucene IndexWriter.addAllIndexes method to merge the data to be
 hosted on Solr instances.

Thanks,
Stu



-Original Message-
From: Mike Klaas <[EMAIL PROTECTED]>
Sent: Friday, January 4, 2008 3:04pm
To: solr-user@lucene.apache.org
Subject: Re: solr with hadoop

On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:

> I have huge index base (about 110 millions documents, 100 fields  
> each). But size of the index base is reasonable, it's about 70 Gb.  
> All I need is increase performance, since some queries, which match  
> big number of documents, are running slow.
> So I was thinking is any benefits to use hadoop for this? And if  
> so, what direction should I go? Is anybody did something for  
> integration Solr with Hadoop? Does it give any performance boost?
>
Hadoop might be useful for organizing your data enroute to Solr, but  
I don't see how it could be used to boost performance over a huge  
Solr index.  To accomplish that, you need to split it up over two  
machines (for which you might find hadoop useful).

-Mike







Re: solr with hadoop

2008-01-07 Thread Stu Hood
As Mike suggested, we use Hadoop to organize our data en route to Solr. Hadoop 
allows us to load balance the indexing stage, and then we use the raw Lucene 
IndexWriter.addAllIndexes method to merge the data to be hosted on Solr 
instances.
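
The pattern described above (index chunks in parallel, then merge) can be
sketched in plain Python with a toy inverted index standing in for Lucene. This
is only an illustration under assumptions: the function names and sample
documents are invented, the "map" step runs sequentially here rather than on a
Hadoop cluster, and the merge function plays the role that the IndexWriter
merge call (which Lucene exposes as addIndexes, as far as I can tell) plays in
the real setup.

```python
from collections import defaultdict


def split_into_chunks(docs, n):
    """Deal documents round-robin into n chunks, one per (hypothetical) task."""
    return [docs[i::n] for i in range(n)]


def map_build_index(chunk):
    """Map step: build a small inverted index (term -> set of doc ids) for one chunk."""
    index = defaultdict(set)
    for doc_id, text in chunk:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partial_indexes):
    """Merge step: union the postings of all partial indexes into one index."""
    merged = defaultdict(set)
    for partial in partial_indexes:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged


# Invented sample documents as (id, text) pairs.
DOCS = [
    (1, "solr with hadoop"),
    (2, "hadoop indexing"),
    (3, "solr shards"),
    (4, "lucene index merge"),
]

if __name__ == "__main__":
    partials = [map_build_index(chunk) for chunk in split_into_chunks(DOCS, 2)]
    full_index = merge_indexes(partials)
    print(sorted(full_index["solr"]))
```

Because each chunk is indexed independently, the merged result is the same as
indexing everything in one pass; only the indexing work is spread out.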

Thanks,
Stu



-Original Message-
From: Mike Klaas <[EMAIL PROTECTED]>
Sent: Friday, January 4, 2008 3:04pm
To: solr-user@lucene.apache.org
Subject: Re: solr with hadoop

On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:

> I have huge index base (about 110 millions documents, 100 fields  
> each). But size of the index base is reasonable, it's about 70 Gb.  
> All I need is increase performance, since some queries, which match  
> big number of documents, are running slow.
> So I was thinking is any benefits to use hadoop for this? And if  
> so, what direction should I go? Is anybody did something for  
> integration Solr with Hadoop? Does it give any performance boost?
>
Hadoop might be useful for organizing your data enroute to Solr, but  
I don't see how it could be used to boost performance over a huge  
Solr index.  To accomplish that, you need to split it up over two  
machines (for which you might find hadoop useful).

-Mike




Re: solr with hadoop

2008-01-05 Thread Otis Gospodnetic
Evgeniy,

Two simple options:
1) take your index, put it on N Solr search servers, and put them behind a load 
balancer
2) take your index, split it in N (or create N smaller indices from scratch) 
and put it on N Solr search servers (and see SOLR-303)

Each will help in a different way and it sounds like 2) might be more suitable 
for you.
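
To make option 2 concrete, here is a small Python sketch of the query side once
the index is split: each shard is searched on its own and the per-shard hit
lists are merged into one ranked result. Everything in it is an assumption for
illustration (the in-memory shards, the term-count scoring, the function
names); it is not how Solr scores or merges internally.

```python
import heapq
from itertools import islice


def search_shard(shard, term):
    """Return (score, doc_id) hits from one shard, best first.

    The score is just how often the term occurs, a toy stand-in for real ranking.
    """
    hits = [(text.lower().split().count(term), doc_id) for doc_id, text in shard]
    return sorted((hit for hit in hits if hit[0] > 0), reverse=True)


def distributed_search(shards, term, rows):
    """Query every shard, then merge the sorted hit lists and keep the top rows."""
    per_shard = [search_shard(shard, term) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _score, doc_id in islice(merged, rows)]


# Two invented shards, each a list of (doc_id, text) pairs.
SHARDS = [
    [("a", "hadoop hadoop solr"), ("b", "solr")],
    [("c", "hadoop"), ("d", "hadoop hadoop hadoop")],
]

if __name__ == "__main__":
    print(distributed_search(SHARDS, "hadoop", rows=2))
```

Each shard only scans its own slice of the documents, which is why splitting
helps queries that match a large number of documents, whereas option 1 mainly
helps with query volume.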

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Evgeniy Strokin <[EMAIL PROTECTED]>
To: Solr User 
Sent: Friday, January 4, 2008 2:37:41 PM
Subject: solr with hadoop

I have a huge index (about 110 million documents, 100 fields each),
but the size of the index is reasonable: about 70 GB. All I need
is to increase performance, since some queries that match a large number of
documents run slowly.
So I was wondering whether there is any benefit in using Hadoop for this. If so,
what direction should I go in? Has anybody done anything to integrate Solr
with Hadoop? Does it give any performance boost?
 
Any information is helpful for me,
Thanks,
Eugene




Re: solr with hadoop

2008-01-04 Thread Ryan McKinley

Mike Klaas wrote:

On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:

I have huge index base (about 110 millions documents, 100 fields 
each). But size of the index base is reasonable, it's about 70 Gb. All 
I need is increase performance, since some queries, which match big 
number of documents, are running slow.
So I was thinking is any benefits to use hadoop for this? And if so, 
what direction should I go? Is anybody did something for integration 
Solr with Hadoop? Does it give any performance boost?


Hadoop might be useful for organizing your data enroute to Solr, but I 
don't see how it could be used to boost performance over a huge Solr 
index.  To accomplish that, you need to split it up over two machines 
(for which you might find hadoop useful).




you may want to check out:
https://issues.apache.org/jira/browse/SOLR-303

ryan


Re: solr with hadoop

2008-01-04 Thread Mike Klaas

On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:

I have huge index base (about 110 millions documents, 100 fields  
each). But size of the index base is reasonable, it's about 70 Gb.  
All I need is increase performance, since some queries, which match  
big number of documents, are running slow.
So I was thinking is any benefits to use hadoop for this? And if  
so, what direction should I go? Is anybody did something for  
integration Solr with Hadoop? Does it give any performance boost?


Hadoop might be useful for organizing your data en route to Solr, but  
I don't see how it could be used to boost performance over a huge  
Solr index.  To accomplish that, you need to split it up over two  
machines (for which you might find Hadoop useful).


-Mike