Erick's probably too modest to say so ;=), but he wrote a great blog entry on indexing with SolrJ - http://searchhub.org/2012/02/14/indexing-with-solrj/ . I took the guts of the code in that blog and easily customized it into a very fast indexer (content from MySQL; I excised all the Tika code as I am not using it).

You should replace StreamingUpdateSolrServer with ConcurrentUpdateSolrServer and experiment to find the optimal number of threads to configure.

-Simon
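
For concreteness, here is a minimal sketch of the kind of JDBC-to-SolrJ indexer described above, using ConcurrentUpdateSolrServer as Simon suggests. It assumes the Solr 4.x SolrJ API; the Solr URL, MySQL connection string, table name (docs), and field names (id, title) are illustrative only, not taken from the thread.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DbIndexer {
    public static void main(String[] args) throws Exception {
        // Buffer up to 10000 docs and drain them with 4 background threads;
        // both numbers are guesses to tune experimentally, per Simon's advice.
        ConcurrentUpdateSolrServer solr = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1", 10000, 4);

        // Illustrative MySQL source; table and column names are made up.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, title FROM docs");

        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("title", rs.getString("title"));
            solr.add(doc); // queued; the background threads do the HTTP work
        }

        solr.blockUntilFinished(); // wait for the queue to drain
        solr.commit();
        solr.shutdown();
        conn.close();
    }
}

Tuning the queue size and thread count is where most of the speed comes from, so treat the numbers above as starting points to experiment with.
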
On Sun, Jan 26, 2014 at 11:28 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> 1> That's what I'd do. For incremental updates you might have to
> create a trigger on the main table and insert rows into another table
> that is then used to do the incremental updates. This is particularly
> relevant for deletes. Consider the case where you've ingested all your
> data and then rows are deleted. Removing those same documents from Solr
> requires either a> re-indexing everything or b> getting all the docs
> in Solr and comparing them with the rows in the DB, etc. Both are
> expensive. Hence c> record the changes as above and just process
> deletes from the "change table".
>
> 2> SolrJ is usually the most current. I don't know how much work
> SolrNet gets. However, under the covers it's all just HTTP calls, so
> since either one lets you add HTTP parameters, you should be able to
> get the full functionality out of either. I _think_ I'd go with
> whatever you're most comfortable with.
>
> Best,
> Erick
>
> On Sun, Jan 26, 2014 at 9:54 AM, Susheel Kumar
> <susheel.ku...@thedigitalgroup.net> wrote:
> > Thank you, Erick, for your valuable inputs. Yes, we have to re-index data
> > again & again. I'll look into the possibility of tuning db access.
> >
> > On SolrJ and automating the indexing (incremental as well as one-time),
> > I want to get your opinion on the two points below. We will be indexing
> > separate sets of tables with similar data structures.
> >
> > - Should we use SolrJ and write Java programs that can be scheduled to
> > trigger indexing on demand or on a schedule?
> >
> > - Is using SolrJ a better idea even for searching than using SolrNet?
> > Our frontend is in .Net, so we started using SolrNet, but I am afraid
> > that down the road, when we scale out and support SolrCloud, SolrJ
> > would be better.
> >
> > Thanks
> > Susheel
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Sunday, January 26, 2014 8:37 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr server requirements for 100+ million documents
> >
> > Dumping the raw data would probably be a good idea. I guarantee you'll
> > be re-indexing the data several times as you change the schema to
> > accommodate different requirements...
> >
> > But it may also be worth spending some time figuring out why the DB
> > access is slow. Sometimes one can tune that.
> >
> > If you go the SolrJ route, you also have the possibility of setting up
> > N clients to work simultaneously; sometimes that'll help.
> >
> > FWIW,
> > Erick
> >
> > On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar <
> > susheel.ku...@thedigitalgroup.net> wrote:
> >> Hi Kranti,
> >>
> >> Attached are the solrconfig & schema xml for review. I did run indexing
> >> with just a few fields (5-6) in schema.xml, keeping the same db config,
> >> but indexing still takes about the same time (on average, 1 million
> >> records per hour), which confirms that the bottleneck is the data
> >> acquisition, which in our case is the Oracle database. I am thinking of
> >> not using DataImportHandler / JDBC to get the data from Oracle, but
> >> rather dumping the data from Oracle with SQL*Loader and then indexing
> >> the dump. Any thoughts?
> >>
> >> Thnx
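
One hedged way to index such a flat-file dump, if it turns out to be CSV-shaped, is Solr's CSV update handler. The sketch below assumes Solr 4.x with the stock /update/csv request handler still configured; the file name, URL, and the assumption that the first CSV line holds the field names are all illustrative.

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Stream the dump file to the CSV handler; by default Solr reads
        // the field names from the header line of the file.
        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("dump.csv"), "text/csv"); // illustrative file name
        req.setParam("commit", "true"); // single commit after the whole file

        solr.request(req);
        solr.shutdown();
    }
}
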
> >>
> >> -----Original Message-----
> >> From: Kranti Parisa [mailto:kranti.par...@gmail.com]
> >> Sent: Saturday, January 25, 2014 12:08 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Solr server requirements for 100+ million documents
> >>
> >> Can you post the complete solrconfig.xml and schema.xml files so we can
> >> review all of the settings that would impact your indexing performance?
> >>
> >> Thanks,
> >> Kranti K. Parisa
> >> http://www.linkedin.com/in/krantiparisa
> >>
> >> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
> >> susheel.ku...@thedigitalgroup.net> wrote:
> >>
> >>> Thanks, Svante. Your indexing speed using the db seems really fast.
> >>> Can you please provide some more detail on how you are indexing db
> >>> records? Is it through DataImportHandler? And what database? Is it a
> >>> local db? We are indexing around 70 fields (60 multivalued), but not
> >>> all fields are always populated. The average document size is 5-10 KB.
> >>>
> >>> -----Original Message-----
> >>> From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf
> >>> Of svante karlsson
> >>> Sent: Friday, January 24, 2014 5:05 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Solr server requirements for 100+ million documents
> >>>
> >>> I just indexed 100 million db docs (records) with 22 fields (4
> >>> multivalued) in 9524 sec using libcurl.
> >>> 11 million took 763 seconds, so the speed drops somewhat with
> >>> increasing db size.
> >>>
> >>> We write 1000 docs (just an arbitrary number) in each request from
> >>> two threads. If you will be using SolrCloud you will want more writer
> >>> threads.
> >>>
> >>> The hardware is a single cheap HP DL320e Gen8 v2 1P E3-1220v3 with
> >>> one SSD and 32 GB, and Solr runs on Ubuntu 13.10 inside an ESXi
> >>> virtual machine.
> >>>
> >>> /svante
> >>>
> >>> 2014/1/24 Susheel Kumar <susheel.ku...@thedigitalgroup.net>
> >>>
> >>> > Thanks, Erick, for the info.
> >>> >
> >>> > For indexing, I agree that more of the time is consumed in data
> >>> > acquisition, which in our case is from the database. Currently we
> >>> > are using a manual process, i.e. the Solr dashboard Data Import,
> >>> > but we are now looking to automate. How do you suggest we automate
> >>> > the indexing part? Do you recommend using SolrJ, or should we try
> >>> > to automate using curl?
> >>> >
> >>> > -----Original Message-----
> >>> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>> > Sent: Friday, January 24, 2014 2:59 PM
> >>> > To: solr-user@lucene.apache.org
> >>> > Subject: Re: Solr server requirements for 100+ million documents
> >>> >
> >>> > Can't be done with the information you provided, and can only be
> >>> > guessed at even with more comprehensive information.
> >>> >
> >>> > Here's why:
> >>> >
> >>> > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >>> >
> >>> > Also, at a guess, your indexing speed is so slow due to data
> >>> > acquisition; I rather doubt you're being limited by raw Solr indexing.
> >>> > If you're using SolrJ, try commenting out the server.add() bit and
> >>> > running again. My guess is that your indexing speed will be almost
> >>> > unchanged, in which case the data acquisition process is where you
> >>> > should concentrate your efforts. As a comparison, I can index 11M
> >>> > Wikipedia docs on my laptop in 45 minutes without any attempt at
> >>> > parallelization.
> >>> >
> >>> > Best,
> >>> > Erick
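
Erick's server.add() experiment is easy to run as a standalone timing test. Here is a sketch under the same hedged Solr 4.x / JDBC assumptions as the earlier example; the Oracle connect string, query, and field names are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AcquisitionTimer {
    public static void main(String[] args) throws Exception {
        SolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
        ResultSet rs = conn.createStatement()
                .executeQuery("SELECT id, title FROM docs"); // placeholder query

        long start = System.currentTimeMillis();
        int rows = 0;
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", rs.getString("id"));
            doc.addField("title", rs.getString("title"));
            // server.add(doc); // re-enable to time acquisition + indexing
            rows++;
        }
        System.out.println(rows + " rows in "
                + (System.currentTimeMillis() - start) + " ms");

        conn.close();
        server.shutdown();
    }
}

If the timings with and without the add() call are close, the database side is the bottleneck, exactly as Erick predicts.
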
> >>> >
> >>> > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
> >>> > susheel.ku...@thedigitalgroup.net> wrote:
> >>> > > Hi,
> >>> > >
> >>> > > Currently we are indexing 10 million documents from a database (10
> >>> > > db data entities), and the index size is around 8 GB on a Windows
> >>> > > virtual box. Indexing in one shot takes 12+ hours, while indexing
> >>> > > in parallel in separate cores and merging them together takes 4+
> >>> > > hours.
> >>> > >
> >>> > > We are looking to scale to 100+ million documents and would like
> >>> > > recommendations on the following server requirements for a
> >>> > > production environment. There can be 200+ users performing
> >>> > > searches at the same time.
> >>> > >
> >>> > > No. of physical servers (considering SolrCloud)
> >>> > > Memory requirement
> >>> > > Processor requirement (# cores)
> >>> > > Linux as OS as opposed to Windows
> >>> > >
> >>> > > Thanks in advance.
> >>> > > Susheel
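
On the "index in parallel cores, then merge" approach above: the CoreAdmin API exposes a mergeindexes action, which, if I recall the 4.x SolrJ helper correctly, can be driven as sketched below. The URL and core names are invented for illustration; the target core must already exist, and the source cores should not receive writes while the merge runs.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class MergeCores {
    public static void main(String[] args) throws Exception {
        // Core admin requests go to the Solr root URL, not to a core.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // Merge the indexes of core1 and core2 into core "target"
        // (all core names are illustrative).
        CoreAdminRequest.mergeIndexes(
                "target",
                new String[0],                     // no raw index directories
                new String[] { "core1", "core2" }, // source cores
                solr);
    }
}
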