Mingfeng,

This issue gets worse as the number of shards grows. See Erick Erickson's post on how distributed queries work: http://grokbase.com/t/lucene/solr-user/131p75p833/how-distributed-queries-works

With 100M docs, I guess you are running into this issue.
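To make the cost concrete, here is a rough back-of-the-envelope sketch. It assumes, as I understand distributed queries to work, that every shard returns its own top `start + rows` entries and the coordinating node merges them. The function name and the sample offsets are only illustrative.

```python
def merged_entries(start: int, rows: int, num_shards: int) -> int:
    """Entries the coordinating node has to merge for a single page.

    Assumption: each shard returns its top (start + rows) entries,
    because any of them could belong in the requested page.
    """
    return num_shards * (start + rows)


if __name__ == "__main__":
    rows = 2000  # page size used in the quoted data-config.xml
    for shards in (1, 4):
        for start in (0, 1_000_000, 50_000_000):
            total = merged_entries(start, rows, shards)
            print(f"shards={shards} start={start:>11,} -> {total:,} entries")
```

Even with a single shard the work per page grows linearly with `start`, which fits an import that begins fast and slows down the deeper it pages.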
The common way to deal with this is to filter on a field that makes every query return fewer results, such as a creation_date field, and to change the range on that field for each query. For your data import use case, you might want to generate your data-config.xml with several entities, each one covering a different creation_date range. That way there is no need for deep paging.

Another option is pageDoc and pageScore: http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore
To my knowledge, this cannot be done in a multi-shard environment: all your scores are 1.0, so results are ordered within each shard by that shard's internal [docId].

Caching all the query results in each shard (by raising queryResultWindowSize) should help, shouldn't it?

Best,
Manu

On Mon, Jun 10, 2013 at 8:56 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> SolrEntityProcessor is fine for small amounts of data but not useful for
> such a large index. The problem is that deep paging in search results is
> expensive. As the "start" value for a query increases so does the cost of
> the query. You are much better off just re-indexing the data.
>
> On Mon, Jun 10, 2013 at 11:19 PM, Mingfeng Yang <mfy...@wisewindow.com> wrote:
>
> > I am trying to migrate 100M documents from a solr index (v3.6) to a
> > solrcloud index (v4.1, 4 shards) by using SolrEntityProcessor. My
> > data-config.xml is like
> >
> > <dataConfig>
> >   <document>
> >     <entity name="sep" processor="SolrEntityProcessor"
> >             url="http://10.64.35.117:8995/solr/" query="*:*" rows="2000"
> >             fl="author_class,authorlink,author_location_text,author_text,author,category,date,dimension,entity,id,language,md5_text,op_dimension,opinion_text,query_id,search_source,sentiment,source_domain_text,source_domain,text,textshingle,title,topic,topic_text,url" />
> >   </document>
> > </dataConfig>
> >
> > Initially, the data import rate is about 1K docs/second, but it
> > eventually decreases to 20 docs/second after running for tens of hours.
> >
> > Last time I tried data import with SolrEntityProcessor, the transfer
> > rate could be as high as 3K docs/second.
> >
> > Anyone has any clues what can cause the slowdown?
> >
> > Thanks,
> > Ming-
>
> --
> Regards,
> Shalin Shekhar Mangar.
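As a rough illustration of the date-range suggestion at the top of this message, the sketch below generates a data-config.xml with one SolrEntityProcessor entity per calendar month. It is only a sketch under assumptions: the field name `creation_date`, the entity naming scheme, the example URL and the monthly granularity are all hypothetical, and the `fl` attribute is left out for brevity. Ranges are inclusive at both ends, so a document stamped exactly at a month boundary is fetched twice; with a unique `id` the second copy simply overwrites the first.

```python
from datetime import date
from xml.sax.saxutils import quoteattr


def next_month(d: date) -> date:
    """Return the first day of the month following `d`."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)


def build_data_config(url: str, field: str, first: date, last: date, rows: int = 2000) -> str:
    """Build a data-config.xml string with one entity per month in [first, last]."""
    entities = []
    lo = date(first.year, first.month, 1)
    while lo <= last:
        hi = next_month(lo)
        # Inclusive range query on the (assumed) date field.
        query = f"{field}:[{lo.isoformat()}T00:00:00Z TO {hi.isoformat()}T00:00:00Z]"
        entities.append(
            f'    <entity name="sep_{lo:%Y_%m}" processor="SolrEntityProcessor"\n'
            f'            url={quoteattr(url)} query={quoteattr(query)} rows="{rows}" />'
        )
        lo = hi
    body = "\n".join(entities)
    return f"<dataConfig>\n  <document>\n{body}\n  </document>\n</dataConfig>\n"


if __name__ == "__main__":
    # Hypothetical source URL and date span; adjust to the real index.
    print(build_data_config("http://localhost:8983/solr/", "creation_date",
                            date(2012, 11, 15), date(2013, 1, 10)))
```

Each entity then pages only within its own month, so `start` stays small no matter how large the whole index is.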