On Thu, Sep 30, 2010 at 10:49 PM, Sharma, Raghvendra
<sraghven...@corelogic.com> wrote:
> I have been able to load around a million rows/docs in around 5+ minutes.
> The schema contains around 250+ fields. For the moment, I have kept
> everything as string.
> I am sure there are ways to get better loading speeds than this.
A million documents with 250 fields in 5 minutes sounds fast to me. As
a comparison, we index a million documents with about 60 fields in an
hour, using multiple Solr cores. However, this is very likely an
apples-to-oranges comparison, as we are pulling large amounts of data
from a database over a network.

What indexing times are you aiming for? If you can shard your data,
using multiple cores on a single Solr instance, and/or multiple Solr
instances, will speed up your indexing. However, if you want a
complete, non-sharded index, you will need to merge the shards
afterwards.

> Will the data type matter in loading speeds ?? or anything else ?

Data type can matter if there is a lot of processing involved for that
type. E.g., the text type runs tokenizers and filters over every
value, while string is indexed verbatim.

> Can someone help me with any tips ? perhaps any best practices kind of
> document/article..
> Anything .. [...]

The Solr Wiki has many suggestions, e.g., look at the documentation on
the DataImportHandler. In our experience, XML import has been very
fast. A generic recommendation is difficult, as indexing speed depends
on many things: the data source, the number and types of fields, the
size of the data, etc. Your best bet is to try out several approaches.

Regards,
Gora
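P.S. To illustrate the data-type point, here is a minimal sketch of the
difference in schema.xml terms. The type definitions follow the
conventions of Solr's example schema; the particular tokenizer and
filter shown are illustrative, not a recommendation:

```xml
<types>
  <!-- string values are indexed verbatim: no analysis cost per document -->
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

  <!-- text values pass through an analysis chain at index time; each
       tokenizer/filter in the chain adds per-document processing cost -->
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
```

So keeping all 250+ fields as string, as you have, is likely close to
the cheapest configuration at index time; switching fields to text will
cost you some indexing speed in exchange for tokenized search.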