Re: SolrEntityProcessor gets slower and slower

2013-07-21 Thread Manuel Le Normand
Mingfeng, this issue gets worse as the number of shards you have increases;
you can read Erick Erickson's post:
http://grokbase.com/t/lucene/solr-user/131p75p833/how-distributed-queries-works
With 100M docs, I guess you are running into this issue.

The common way to deal with this issue is to filter on a field whose range
returns fewer results per query, such as a creation_date field, and to change
that range on every query. For your data import use case, you might want to
generate your data-config.xml with several entities, each one covering a
different creation_date range. That way there is no need for deep paging.
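
As a rough illustration of the range-per-entity idea, here is a minimal Python
sketch that prints one SolrEntityProcessor entity per calendar month. It is
only a sketch under assumptions: the field name creation_date comes from the
suggestion above and may not exist in your schema, the entity names and the
helper names (month_ranges, entity_xml) are made up for this example, and the
date bounds are placeholders.

```python
from datetime import date, timedelta
from xml.sax.saxutils import quoteattr


def month_ranges(start, end):
    """Yield (lo, hi) date pairs covering [start, end), one calendar month each."""
    lo = start
    while lo < end:
        # First day of the following month, clamped to `end`.
        next_month = (lo.replace(day=1) + timedelta(days=32)).replace(day=1)
        hi = min(next_month, end)
        yield lo, hi
        lo = hi


def entity_xml(source_url, lo, hi, rows=2000, date_field="creation_date"):
    """Build one <entity> element whose query only matches dates in [lo, hi)."""
    # Inclusive upper bound ending on the last millisecond of the day before `hi`,
    # so adjacent slices do not overlap. The field name is an assumption.
    last_day = hi - timedelta(days=1)
    query = (f"{date_field}:[{lo.isoformat()}T00:00:00Z TO "
             f"{last_day.isoformat()}T23:59:59.999Z]")
    name = f"slice_{lo.isoformat()}"  # hypothetical entity name
    return (f'<entity name={quoteattr(name)} processor="SolrEntityProcessor" '
            f'url={quoteattr(source_url)} query={quoteattr(query)} rows="{rows}"/>')


if __name__ == "__main__":
    # Placeholder date bounds; the URL is the source index from this thread.
    for lo, hi in month_ranges(date(2013, 1, 1), date(2013, 7, 1)):
        print(entity_xml("http://10.64.35.117:8995/solr/", lo, hi))
```

Each generated entity pages only through its own slice, so the "start" offset
never grows beyond the size of one month of documents. Documents without a
value in the date field would be missed and would need a separate query.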

Another option is using pageDoc and pageScore:
http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore
To my knowledge, implementing it in a multi-sharded environment is not
possible: all your scores are 1.0, so results are ranked by shard (according
to the internal [docId] of each shard).

Caching all the query results in each shard (by raising the
queryResultWindowSize) should help, wouldn't it?


Best,

Manu


On Mon, Jun 10, 2013 at 8:56 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> SolrEntityProcessor is fine for small amounts of data but not useful for
> such a large index. The problem is that deep paging in search results is
> expensive. As the "start" value for a query increases so does the cost of
> the query. You are much better off just re-indexing the data.
>
>
> On Mon, Jun 10, 2013 at 11:19 PM, Mingfeng Yang wrote:
>
> > I am trying to migrate 100M documents from a Solr index (v3.6) to a
> > SolrCloud index (v4.1, 4 shards) by using SolrEntityProcessor.  My
> > data-config.xml is like
> >
> > <entity processor="SolrEntityProcessor"
> > url="http://10.64.35.117:8995/solr/" query="*:*" rows="2000" fl=
> > "author_class,authorlink,author_location_text,author_text,author,category,date,dimension,entity,id,language,md5_text,op_dimension,opinion_text,query_id,search_source,sentiment,source_domain_text,source_domain,text,textshingle,title,topic,topic_text,url"
> > />
> >
> > Initially, the data import rate is about 1K docs/second, but it
> > eventually decreases to 20 docs/second after running for tens of hours.
> >
> > Last time I tried a data import with SolrEntityProcessor, the transfer
> > rate was as high as 3K docs/second.
> >
> > Does anyone have any clue what could cause the slowdown?
> >
> > Thanks,
> > Ming-
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: SolrEntityProcessor gets slower and slower

2013-06-10 Thread Shalin Shekhar Mangar
SolrEntityProcessor is fine for small amounts of data but not useful for
such a large index. The problem is that deep paging in search results is
expensive. As the "start" value for a query increases so does the cost of
the query. You are much better off just re-indexing the data.
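
To make the cost of a growing "start" concrete, here is a simplified
back-of-the-envelope model in Python. It is a sketch, not a description of
Solr internals: it only assumes that serving a page requires ranking the top
(start + rows) hits, and that in a distributed query every shard does so
before the lists are merged. The function names are invented for this example.

```python
def docs_collected(start, rows, shards=1):
    """Rough count of ranked entries gathered to serve a single page.

    Assumption: each shard must rank its own top (start + rows) hits, because
    any of them could land on the requested page after merging.
    """
    return shards * (start + rows)


def total_work(num_docs, rows, shards=1):
    """Sum docs_collected over every page needed to walk num_docs documents."""
    return sum(docs_collected(start, rows, shards)
               for start in range(0, num_docs, rows))


if __name__ == "__main__":
    rows = 2000  # page size from the data-config.xml in this thread
    print("first page:", docs_collected(0, rows))
    print("last page: ", docs_collected(100_000_000 - rows, rows))
    print("whole walk:", total_work(100_000_000, rows))
```

Under this model the per-page cost grows linearly with "start", so walking
the whole index costs on the order of the square of its size divided by the
page size, which matches an import rate that keeps dropping as it runs.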


On Mon, Jun 10, 2013 at 11:19 PM, Mingfeng Yang wrote:

> I am trying to migrate 100M documents from a Solr index (v3.6) to a
> SolrCloud index (v4.1, 4 shards) by using SolrEntityProcessor.  My
> data-config.xml is like
>
> <entity processor="SolrEntityProcessor"
> url="http://10.64.35.117:8995/solr/" query="*:*" rows="2000" fl=
> "author_class,authorlink,author_location_text,author_text,author,category,date,dimension,entity,id,language,md5_text,op_dimension,opinion_text,query_id,search_source,sentiment,source_domain_text,source_domain,text,textshingle,title,topic,topic_text,url"
> />
>
> Initially, the data import rate is about 1K docs/second, but it eventually
> decreases to 20 docs/second after running for tens of hours.
>
> Last time I tried a data import with SolrEntityProcessor, the transfer rate
> was as high as 3K docs/second.
>
> Does anyone have any clue what could cause the slowdown?
>
> Thanks,
> Ming-
>



-- 
Regards,
Shalin Shekhar Mangar.


SolrEntityProcessor gets slower and slower

2013-06-10 Thread Mingfeng Yang
I am trying to migrate 100M documents from a Solr index (v3.6) to a SolrCloud
index (v4.1, 4 shards) by using SolrEntityProcessor.  My data-config.xml is
like

  <entity processor="SolrEntityProcessor"
  url="http://10.64.35.117:8995/solr/" query="*:*" rows="2000" fl=
"author_class,authorlink,author_location_text,author_text,author,category,date,dimension,entity,id,language,md5_text,op_dimension,opinion_text,query_id,search_source,sentiment,source_domain_text,source_domain,text,textshingle,title,topic,topic_text,url"
/>

Initially, the data import rate is about 1K docs/second, but it eventually
decreases to 20 docs/second after running for tens of hours.

Last time I tried a data import with SolrEntityProcessor, the transfer rate
was as high as 3K docs/second.

Does anyone have any clue what could cause the slowdown?

Thanks,
Ming-