Thanks Markus for the useful insights. I'm planning to start with a simple single-Solr-instance system. We'll also mention SolrCloud in our publication and will look into it in the future if our workloads become much larger.
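
For our write-up, here's roughly what the single-instance path looks like from the client side: a minimal SolrJ sketch (Solr 4.x-era API) that pushes one document into a stand-alone instance. The URL, core name, and field names are placeholders, not our actual schema; in practice the Nutch solrindex job would do this for us in bulk.

// Minimal sketch, not production code: index one document into a single
// stand-alone Solr instance with SolrJ. URL, core name, and field names
// below are assumptions/placeholders.
import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SingleSolrIndexSketch {
    public static void main(String[] args) throws IOException, SolrServerException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.org/page-1");              // unique key
        doc.addField("title", "Example page");                        // placeholder field
        doc.addField("content", "Body text extracted by the crawl");  // placeholder field

        solr.add(doc);    // send the document to Solr
        solr.commit();    // make it searchable
        solr.shutdown();  // release the underlying HTTP client
    }
}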
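
And to make sure I understood the hash-range routing Markus describes below: SolrCloud hashes each document ID and maps the hash onto a shard's range. The toy illustration below only shows that idea; it is not Solr's actual hash function or router.

// Toy illustration only: map a hash of the document ID onto evenly split
// hash ranges, one per shard. SolrCloud's real router lives inside Solr
// and uses its own hashing; this just demonstrates the routing idea.
public class HashRangeRoutingSketch {

    static int shardFor(String docId, int numShards) {
        long hash = docId.hashCode() & 0xFFFFFFFFL;   // unsigned 32-bit hash (stand-in)
        long rangeSize = (1L << 32) / numShards;      // width of each shard's hash range
        return (int) Math.min(hash / rangeSize, numShards - 1); // clamp the top edge
    }

    public static void main(String[] args) {
        String[] ids = { "http://example.org/a", "http://example.org/b", "http://example.org/c" };
        for (String id : ids) {
            System.out.println(id + " -> shard " + shardFor(id, 4));
        }
    }
}

Either way, Markus's point stands: routing decides which shard gets a document, not where the indexing work runs, so co-locating SolrCloud with the HBase region servers would not buy us data locality.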
thanks,
Thilina

On Mon, Oct 22, 2012 at 6:56 PM, Markus Jelsma <[email protected]> wrote:

> Hi
>
> -----Original message-----
> > From: Thilina Gunarathne <[email protected]>
> > Sent: Tue 23-Oct-2012 00:38
> > To: [email protected]
> > Subject: Re: Best practice to index a large crawl through Solr?
> >
> > Hi Markus,
> > Thanks a lot for the info.
> >
> > > Hi - Hadoop can write more records per second than Solr can analyze and
> > > store, especially with multiple reducers (threads in Solr). SolrCloud is
> > > notoriously slow when it comes to indexing compared to a stand-alone setup.
> >
> > Can this be overcome by using the Nutch Solrindex job for indexing? In
> > other words, does Solr become a bottleneck for the SolrIndex job?
>
> Nutch trunk can only write to a single Solr URL, and if you have more than
> a few reducers, Solr is the bottleneck. But that should not be a problem
> when dealing with a few million records. It is a matter of minutes.
>
> > Out of curiosity, does SolrCloud support any data locality when loading
> > data from Nutch? For example, if I'm co-locating SolrCloud on the same
> > nodes that are running Hadoop/HBase, can SolrCloud work with the local
> > region servers to load the data? Eventually, we would have to process
> > millions of records and I'm just wondering whether the communication
> > between Nutch and Solr would be a huge bottleneck.
>
> Data locality is more a thing for distributed processing: moving the
> program to the data on the assumption that it's cheaper in terms of
> bandwidth. That does not apply to SolrCloud; it works with hash ranges
> based on your ID and then routes each document to a specific shard (see
> the SolrCloud wiki page referred to in this thread). If you want a stable
> and performant Nutch and Solr cluster you must separate them. Both have
> specific resource requirements and should not run on the same node. If you
> mix them, it is hard to provide a reliable service.
>
> We operate one Nutch cluster and several Solr clusters with a lot of
> documents and don't worry about the bottleneck. Based on my experience I
> think you should not worry too much at this point about Solr being an
> indexing bottleneck; you can scale out if it becomes a problem.
>
> A significant improvement in very large-scale indexing from a Nutch
> cluster to a SolrCloud cluster is NUTCH-1377, but it's tedious to
> implement. Right now we don't yet need it because the bottleneck is
> insignificant for now, even with many millions of documents. Unless you
> are going to work with A LOT of records this should not be a big problem
> for the next few months.
>
> https://issues.apache.org/jira/browse/NUTCH-1377
>
> > thanks,
> > Thilina
> >
> > > However, this should not be a problem at all as you're not dealing with
> > > millions of records. Trying to tie HBase as a backend to Solr is not a
> > > good idea at all. The best and fastest storage for Solr is a disk with
> > > MMapDirectory enabled (the default in recent versions) and plenty of
> > > RAM. Keep in mind that Solr keeps several parts of the index in memory,
> > > and others if it can, and it is very efficient at doing that.
> > >
> > > With only a few million records it's easy and fast enough to run Hadoop
> > > locally (or pseudo-distributed if you can) and have a single Solr node
> > > running.
> > >
> > > -----Original message-----
> > > > From: Thilina Gunarathne <[email protected]>
> > > > Sent: Mon 22-Oct-2012 22:35
> > > > To: [email protected]
> > > > Subject: Re: Best practice to index a large crawl through Solr?
> > > >
> > > > Hi Alex,
> > > > Thanks again for the information.
> > > >
> > > > My current requirement is to implement a simple search application
> > > > for a publication. Our current data sizes probably would not exceed
> > > > the number of records you mentioned, and for now we should be fine
> > > > with a single Solr instance. I'm going to check out SolrCloud for
> > > > our future needs.
> > > >
> > > > > Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> > > > > sound pretty crazy.
> > > > I agree :).. Unfortunately (or maybe luckily) I do not have much time
> > > > to invest in this and I'll probably have to rely on the existing
> > > > tools rather than trying to reinvent the wheel :)..
> > > >
> > > > thanks,
> > > > Thilina
> > > >
> > > > On Mon, Oct 22, 2012 at 4:00 PM, Alejandro Caceres
> > > > <[email protected]> wrote:
> > > >
> > > > > No problem. Wrt your first question, Solr would actually be storing
> > > > > this data locally. Solr sharding actually uses its own mechanism
> > > > > called SolrCloud. I'd recommend checking it out here:
> > > > > http://wiki.apache.org/solr/SolrCloud; it seems cool, though I have
> > > > > not used it myself.
> > > > >
> > > > > Hm, so you are thinking Nutch -> HBase -> Solr -> HBase, that does
> > > > > sound pretty crazy. You can most definitely find a more efficient
> > > > > way to do this, either by going to HBase directly from the start (I
> > > > > wouldn't do so personally) or by just using Solr. It might be good
> > > > > to know what kind of application you are looking to build, and to
> > > > > ask more specifically.
> > > > >
> > > > > Alex
> > > > >
> > > > > On Mon, Oct 22, 2012 at 3:48 PM, Thilina Gunarathne
> > > > > <[email protected]> wrote:
> > > > > > Hi Alex,
> > > > > > Thanks for the very fast response :)..
> > > > > >
> > > > > > > It sort of depends on your purpose and the amount of data. I
> > > > > > > currently have a single Solr instance (~1GB of memory, 2
> > > > > > > processors on the server) serving ~3,700,000 records from Nutch
> > > > > > > and it's still working great for me. If you have around that
> > > > > > > many, I'd say a single Solr instance is OK, depending on whether
> > > > > > > you are planning on making your data publicly available or not.
> > > > > >
> > > > > > This is very useful information. In this case, would the Solr
> > > > > > instance be retrieving and storing all the data locally, or is it
> > > > > > still using the Nutch data store to retrieve the actual content
> > > > > > while serving the queries?
> > > > > >
> > > > > > > If you're creating something larger, Solr 4.0, which supports
> > > > > > > sharding natively, would be a great option (I think it's still
> > > > > > > in beta, but if you're feeling brave...). This is especially
> > > > > > > true if you are creating a search engine of some sort, or would
> > > > > > > like easily searchable data.
> > > > > >
> > > > > > That's interesting. I'll check that out. By any chance, do you
> > > > > > know whether Solr sharding is using HDFS to store the data or is
> > > > > > it using its own infrastructure?
> > > > > >
> > > > > > > I would imagine doing this directly from HBase would not be a
> > > > > > > great option, as Nutch is storing the data in a format that is
> > > > > > > convenient for Nutch itself to use, and not so much in a format
> > > > > > > that is friendly for you to reuse for your own purposes.
> > > > > >
> > > > > > I was actually thinking of a scenario where we would use Solr to
> > > > > > index the data and store the resulting index in HBase, then use
> > > > > > HBase directly to perform simple index lookups.. Please pardon my
> > > > > > lack of knowledge of Nutch and Solr if the above sounds ludicrous
> > > > > > :)..
> > > > > >
> > > > > > thanks,
> > > > > > Thilina
> > > > > >
> > > > > > > IMO your best bet is going to be to try out Solr 4.0.
> > > > > > >
> > > > > > > Alex
> > > > > > >
> > > > > > > On Mon, Oct 22, 2012 at 3:03 PM, Thilina Gunarathne
> > > > > > > <[email protected]> wrote:
> > > > > > > > Dear All,
> > > > > > > > What would be the best practice to index a large crawl using
> > > > > > > > Solr? The crawl is performed on a multi-node Hadoop cluster
> > > > > > > > using HBase as the back end. Would Solr become a bottleneck if
> > > > > > > > we use just a single Solr instance? Is it possible to store
> > > > > > > > the indexed data on HBase and to serve it from HBase itself?
> > > > > > > >
> > > > > > > > thanks a lot,
> > > > > > > > Thilina

-- 
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org

