I was playing around w/ Sqoop the other day; it's a simple Cloudera tool for imports (MySQL -> HDFS): http://www.cloudera.com/developers/downloads/sqoop/
It seems to me it would be pretty efficient to dump to HDFS and have something like Data Import Handler read from hdfs:// directly ...

Has this route been discussed / developed before (i.e. DIH w/ an hdfs:// handler)?

- Jon

On Jun 22, 2010, at 12:29 PM, MitchK wrote:

>
> I wanted to add a Jira issue about exactly what Otis is asking here.
> Unfortunately, I haven't had time for it because of my exams.
>
> However, I'd like to add a question to Otis' ones:
> If you distribute the indexing process this way, are you able to replicate
> the different documents correctly?
>
> Thank you.
> - Mitch
>
> Otis Gospodnetic-2 wrote:
>>
>> Stu,
>>
>> Interesting! Can you provide more details about your setup? By "load
>> balance the indexing stage" you mean "distribute the indexing process",
>> right? Do you simply take your content to be indexed, split it into N
>> chunks where N matches the number of TaskNodes in your Hadoop cluster and
>> provide a map function that does the indexing? What does the reduce
>> function do? Does that call IndexWriter.addAllIndexes or do you do that
>> outside Hadoop?
>>
>> Thanks,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>> From: Stu Hood <stuh...@webmail.us>
>> To: solr-user@lucene.apache.org
>> Sent: Monday, January 7, 2008 7:14:20 PM
>> Subject: Re: solr with hadoop
>>
>> As Mike suggested, we use Hadoop to organize our data en route to Solr.
>> Hadoop allows us to load balance the indexing stage, and then we use
>> the raw Lucene IndexWriter.addAllIndexes method to merge the data to be
>> hosted on Solr instances.
>>
>> Thanks,
>> Stu
>>
>> -----Original Message-----
>> From: Mike Klaas <mike.kl...@gmail.com>
>> Sent: Friday, January 4, 2008 3:04pm
>> To: solr-user@lucene.apache.org
>> Subject: Re: solr with hadoop
>>
>> On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:
>>
>>> I have a huge index base (about 110 million documents, 100 fields
>>> each). But the size of the index base is reasonable, it's about 70 GB.
>>> All I need is to increase performance, since some queries, which match
>>> a big number of documents, are running slow.
>>> So I was thinking: are there any benefits to using Hadoop for this? And if
>>> so, what direction should I go? Has anybody done something to
>>> integrate Solr with Hadoop? Does it give any performance boost?
>>>
>> Hadoop might be useful for organizing your data en route to Solr, but
>> I don't see how it could be used to boost performance over a huge
>> Solr index. To accomplish that, you need to split it up over two
>> machines (for which you might find Hadoop useful).
>>
>> -Mike
>>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p914589.html
> Sent from the Solr - User mailing list archive at Nabble.com.
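
For anyone following the thread, the pattern Stu and Otis discuss above (split the content into N chunks, index each chunk separately, then merge the partial indexes) can be sketched roughly as below. This is only a toy, stdlib-only Python illustration under stated assumptions: it is not Hadoop or Lucene code, the "index" is a plain dict of term -> document ids, worker threads stand in for map tasks, and every function name here (split_into_chunks, index_chunk, merge_indexes, build_index) is hypothetical and not taken from anyone's actual setup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(docs, n):
    """Split a list of (doc_id, text) pairs into n roughly equal chunks."""
    return [docs[i::n] for i in range(n)]


def index_chunk(chunk):
    """'Map' step: build a small inverted index (term -> doc ids) for one chunk."""
    index = defaultdict(set)
    for doc_id, text in chunk:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partial_indexes):
    """Merge step: union the postings of all partial indexes into one index."""
    merged = defaultdict(set)
    for partial in partial_indexes:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged


def build_index(docs, n_workers=2):
    """Index the chunks in parallel, then merge the partial results."""
    chunks = split_into_chunks(docs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(index_chunk, chunks))
    return merge_indexes(partials)


if __name__ == "__main__":
    docs = [
        (1, "solr with hadoop"),
        (2, "hadoop hdfs import"),
        (3, "Solr index merge"),
    ]
    index = build_index(docs, n_workers=2)
    print(sorted(index["hadoop"]))  # prints [1, 2]
```

In a real deployment the merge would be done by Lucene over on-disk index directories rather than by a dict union, and whether it runs in a reduce step or outside Hadoop is exactly the open question Otis raises.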