Erick's probably too modest to say so :-) , but he wrote a great blog entry
on indexing with SolrJ -
http://searchhub.org/2012/02/14/indexing-with-solrj/ . I took the guts of
the code in that blog and easily customized it into a very fast
indexer (content from MySQL; I excised all the Tika code, as I am not
using it).

You should replace StreamingUpdateSolrServer with ConcurrentUpdateSolrServer
(its renamed successor) and experiment to find the optimal number of threads
to configure.
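As a rough illustration, here is a minimal sketch of batched indexing with
ConcurrentUpdateSolrServer (SolrJ 4.x). The URL, core name, queue size, thread
count, field names, and the loop standing in for your DB rows are all
illustrative placeholders to tune for your own setup, not recommendations:

```java
// Minimal sketch, assuming SolrJ 4.x on the classpath and a Solr core
// reachable at the URL below. Queue size (1000) and thread count (4)
// are just starting points for experimentation.
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1",
                1000 /* queue size */, 4 /* background threads */);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {      // stand-in for rows from MySQL
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "Document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {        // send in chunks, not one at a time
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) server.add(batch);

        server.commit();                       // or rely on autoCommit in solrconfig.xml
        server.shutdown();                     // flush queue, stop background threads
    }
}
```

One caveat: add() calls return immediately and errors surface asynchronously,
so watch the logs (or override handleError) rather than relying on exceptions
from add().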

-Simon


On Sun, Jan 26, 2014 at 11:28 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> 1> That's what I'd do. For incremental updates you might have to
> create a trigger on the main table and insert rows into another table
> that is then used to do the incremental updates. This is particularly
> relevant for deletes. Consider the case where you've ingested all your
> data and then rows are deleted. Removing those same documents from Solr
> requires either a> re-indexing everything or b> getting all the docs
> in Solr and comparing them with the rows in the DB, both of which are
> expensive, or c> recording the changes as above and just processing
> deletes from the "change table".
>
> 2> SolrJ is usually the most current. I don't know how much work
> SolrNet gets. However, under the covers it's all just HTTP calls, and
> since either client lets you add arbitrary HTTP parameters, you should
> be able to get the full functionality out of either. I _think_
> I'd go with whatever you're most comfortable with.
>
> Best,
> Erick
>
> On Sun, Jan 26, 2014 at 9:54 AM, Susheel Kumar
> <susheel.ku...@thedigitalgroup.net> wrote:
> > Thank you, Erick, for your valuable input. Yes, we have to re-index data
> again & again. I'll look into the possibility of tuning DB access.
> >
> > On SolrJ and automating the indexing (incremental as well as one-time), I
> want to get your opinion on the two points below. We will be indexing
> separate sets of tables with similar data structures.
> >
> > - Should we use SolrJ and write Java programs that can be scheduled to
> trigger indexing on demand or on a schedule?
> >
> > - Is using SolrJ a better idea even for searching than using SolrNet? Our
> frontend is in .NET, so we started using SolrNet, but I am afraid that down
> the road, when we scale and support SolrCloud, SolrJ may be better.
> >
> >
> > Thanks
> > Susheel
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Sunday, January 26, 2014 8:37 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr server requirements for 100+ million documents
> >
> > Dumping the raw data would probably be a good idea. I guarantee you'll
> be re-indexing the data several times as you change the schema to
> accommodate different requirements...
> >
> > But it may also be worth spending some time figuring out why the DB
> access is slow. Sometimes one can tune that.
> >
> > If you go the SolrJ route, you also have the possibility of setting up N
> clients to work simultaneously; sometimes that'll help.
> >
> > FWIW,
> > Erick
> >
> > On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar <
> susheel.ku...@thedigitalgroup.net> wrote:
> >> Hi Kranti,
> >>
> >> Attached are the solrconfig & schema XML files for review. I did run
> indexing with just a few fields (5-6) in schema.xml, keeping the same DB
> config, but indexing still takes about the same time (on average 1 million
> records per hour), which confirms that the bottleneck is data acquisition,
> which in our case is the Oracle database. I am thinking of not using the
> DataImportHandler / JDBC to get data from Oracle, but rather dumping the
> data from Oracle somehow using SQL Loader and then indexing that. Any
> thoughts?
> >>
> >> Thnx
> >>
> >> -----Original Message-----
> >> From: Kranti Parisa [mailto:kranti.par...@gmail.com]
> >> Sent: Saturday, January 25, 2014 12:08 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Solr server requirements for 100+ million documents
> >>
> >> can you post the complete solrconfig.xml file and schema.xml files to
> review all of your settings that would impact your indexing performance.
> >>
> >> Thanks,
> >> Kranti K. Parisa
> >> http://www.linkedin.com/in/krantiparisa
> >>
> >>
> >>
> >> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
> susheel.ku...@thedigitalgroup.net> wrote:
> >>
> >>> Thanks, Svante. Your indexing speed using a DB seems really fast.
> >>> Can you please provide some more detail on how you are indexing DB
> >>> records? Is it through DataImportHandler? And what database? Is it a
> >>> local DB? We are indexing around 70 fields (60 multivalued), but data
> >>> is not always populated in all fields. The average document size is
> >>> 5-10 KB.
> >>>
> >>> -----Original Message-----
> >>> From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf
> >>> Of svante karlsson
> >>> Sent: Friday, January 24, 2014 5:05 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Solr server requirements for 100+ million documents
> >>>
> >>> I just indexed 100 million DB docs (records) with 22 fields (4
> >>> multivalued) in 9524 seconds using libcurl.
> >>> 11 million took 763 seconds, so the speed drops somewhat with
> >>> increasing DB size.
> >>>
> >>> We write 1000 docs (just an arbitrary number) in each request from
> >>> two threads. If you will be using SolrCloud you will want more
> >>> writer threads.
> >>>
> >>> The hardware is a single cheap HP DL320e Gen8 v2 (1P, E3-1220 v3)
> >>> with one SSD and 32 GB, and Solr runs on Ubuntu 13.10 inside an ESXi
> >>> virtual machine.
> >>>
> >>> /svante
> >>>
> >>>
> >>>
> >>>
> >>> 2014/1/24 Susheel Kumar <susheel.ku...@thedigitalgroup.net>
> >>>
> >>> > Thanks, Erick for the info.
> >>> >
> >>> > For indexing, I agree that most of the time is consumed in data
> >>> > acquisition, which in our case is from the database. Currently we
> >>> > are using a manual process, i.e. the Solr dashboard Data Import,
> >>> > but we are now looking to automate. How do you suggest automating
> >>> > the indexing part? Do you recommend SolrJ, or should we try to
> >>> > automate using curl?
> >>> >
> >>> >
> >>> > -----Original Message-----
> >>> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>> > Sent: Friday, January 24, 2014 2:59 PM
> >>> > To: solr-user@lucene.apache.org
> >>> > Subject: Re: Solr server requirements for 100+ million documents
> >>> >
> >>> > Can't be done with the information you provided, and can only be
> >>> > guessed at even with more comprehensive information.
> >>> >
> >>> > Here's why:
> >>> >
> >>> >
> >>> > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >>> >
> >>> > Also, at a guess, your indexing speed is so slow due to data
> >>> > acquisition; I rather doubt you're being limited by raw Solr
> >>> > indexing. If you're using SolrJ, try commenting out the
> >>> > server.add() bit and running again. My guess is that your indexing
> >>> > speed will be almost unchanged, in which case the data acquisition
> >>> > process is where you should concentrate your efforts. As a
> >>> > comparison, I can index 11M Wikipedia docs on my laptop in 45
> >>> > minutes without any attempt at parallelization.
> >>> >
> >>> >
> >>> > Best,
> >>> > Erick
> >>> >
> >>> > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
> >>> > susheel.ku...@thedigitalgroup.net> wrote:
> >>> > > Hi,
> >>> > >
> >>> > > Currently we are indexing 10 million documents from a database (10
> >>> > > DB entities) & the index size is around 8 GB on a Windows virtual box.
> >>> > Indexing in one shot takes 12+ hours, while indexing in parallel into
> >>> > separate cores & merging them together takes 4+ hours.
> >>> > >
> >>> > > We are looking to scale to 100+ million documents and are looking
> >>> > for recommendations on server requirements for a production
> >>> > environment, covering the parameters below. There can be 200+ users
> >>> > searching at the same time.
> >>> > >
> >>> > > - Number of physical servers (considering SolrCloud)
> >>> > > - Memory requirements
> >>> > > - Processor requirements (# of cores)
> >>> > > - Linux as the OS as opposed to Windows
> >>> > >
> >>> > > Thanks in advance.
> >>> > > Susheel
> >>> > >
> >>> >
> >>>
>
