Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread scott chu
 
Hello solr-user,

I keep forgetting to mention one thing in this discussion. Our data is 
Chinese news articles, and we currently use the CJK tokenizer (i.e. 2-gram). 
Indexing is quite slow compared to indexing English articles. That's why I am 
so worried about indexing performance on 10M Chinese docs and have turned to 
studying SolrCloud. It could also be the reason why indexing our 1M docs is 
rather slow. Frankly, we haven't delved into writing a better-performing 
Chinese tokenizer in past years for policy reasons (however, we do plan to 
write one next year using the MMSeg algorithm or 1-gram plus a query 
preprocessor). 
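
To illustrate why 2-gram indexing of Chinese text yields more tokens than whitespace-split English, here is a minimal sketch of overlapping bigram tokenization. It is plain Python for illustration only, not Lucene's CJK analysis code; the function name cjk_bigrams and the sample strings are made up for the example.

```python
def cjk_bigrams(text):
    """Return overlapping 2-character tokens, roughly what a 2-gram CJK analyzer emits."""
    # Drop whitespace so bigrams are built from the characters themselves.
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Zero or one character: nothing to pair up.
        return ["".join(chars)] if chars else []
    # A run of n characters produces n - 1 overlapping bigrams.
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


if __name__ == "__main__":
    print(cjk_bigrams("中文新聞"))          # four characters, three tokens
    print("Solr indexes news".split())      # three words, three tokens
```

A run of n Chinese characters gives n - 1 tokens, so a long article produces nearly one token per character. Whether that token volume is really what slows indexing down here is something only measurement can confirm.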

- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-04, 00:07:43
Subject: Re: Re: Re: concept and choice: custom sharding or auto sharding?


bq: If you switch to SolrCloud, will you still keep numShards parameter to 1

Yes. Although if you want to add more replicas you might want to specify that.

For 10M documents, I wouldn't be very fancy. Indexing them shouldn't take
very long, and I think your time would be better spent on other things than
trying to get fancy with SPLITSHARD and the like. Just create a SolrCloud
cluster with as many replicas as you want and index from scratch unless it's
prohibitively expensive.

I can index 200M docs on my local Mac Pro in a couple of hours. Is it really
worth trying to do something you'll probably never do again (i.e. SPLITSHARD)?

If you really don't want to re-index _and_ you have only one shard in the
master/slave setup, here's what I'd do to migrate:
1> create a new SolrCloud cluster with exactly one node (i.e. the "leader").
2> shut it down.
3> copy the index from your master/slave to the new node, completely
 replacing the data directory.
4> bring the node back up and check it.
5> use the Collections API ADDREPLICA command to bring up as many
 replicas as you want; they'll pull down the index and from that point on
 you should be good.
5a> In this case, it'll actually do a complete replication from the leader to
 the followers, but thereafter incremental updates will be sent to all
 the nodes in the cluster rather than the older style master/slave
 occasional replication.
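
Step 5 above can be sketched as a small helper that builds the Collections API request for ADDREPLICA. This is only a sketch under assumptions: the host names, the collection name "news" and the shard name "shard1" are hypothetical, and the helper only builds the URL; you would still issue the request with curl or a browser.

```python
from urllib.parse import urlencode


def add_replica_url(base_url, collection, shard="shard1", node=None):
    """Build a Collections API ADDREPLICA URL for one new replica."""
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    if node:
        # Optional: pin the replica to a specific node (format host:port_solr).
        params["node"] = node
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical nodes that should each receive one replica of the single shard.
    for node in ["host2:8983_solr", "host3:8983_solr"]:
        print(add_replica_url("http://host1:8983/solr", "news", node=node))
```

Each call corresponds to one replica, so run it once per node you want to add.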

Best,
Erick

On Thu, Sep 3, 2015 at 8:54 AM, scott chu <scott@udngroup.com> wrote:
>
> Hello solr-user,
>
> If you switch to SolrCloud, will you still keep numShards parameter to 1? If
> you are migrating to SolrCloud and going to split that single shard into
> multiple shards, wouldn't you have to reindex the data? Is it possible to just
> put that single shard into SolrCloud and call the SPLITSHARD API to split it?
>
> I ask this because I'd like to first try the master-slave architecture, since
> Erick suggests that 10M is not a "vast" thing. Then later, I might migrate it
> to SolrCloud, possibly because I want to take advantage of the ZooKeeper
> functionality for HA/DR.
>
> - Original Message -----
> From: Toke Eskildsen
> To: solr-user
> Date: 2015-09-03, 18:33:39
> Subject: Re: Re: concept and choice: custom sharding or auto sharding?
>
> On Thu, 2015-09-03 at 18:24 +0800, Scott Chu wrote:
>> Do you use master-slave or SolrCloud for that single shard?
>
> Due to legacy reasons we are just using 2 fully independent Solrs, each
> indexing independently, with an Apache load balancer in front for the
> searches. It does give us the occasional hiccup, so we'll be switching
> to SolrCloud at some point.
>
> - Toke Eskildsen, State and University Library, Denmark






 


Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread scott chu
 
Hello solr-user,

If you switch to SolrCloud, will you still keep numShards parameter to 1? If 
you are migrating to SolrCloud and going to split that single shard into 
multiple shards, wouldn't you have to reindex the data? Is it possible to just 
put that single shard into SolrCloud and call the SPLITSHARD API to split it?

I ask this because I'd like to first try the master-slave architecture, since 
Erick suggests that 10M is not a "vast" thing. Then later, I might migrate it 
to SolrCloud, possibly because I want to take advantage of the ZooKeeper 
functionality for HA/DR.
- Original Message - 
From: Toke Eskildsen 
To: solr-user 
Date: 2015-09-03, 18:33:39
Subject: Re: Re: concept and choice: custom sharding or auto sharding?


On Thu, 2015-09-03 at 18:24 +0800, Scott Chu wrote:
> Do you use master-slave or SolrCloud for that single shard?

Due to legacy reasons we are just using 2 fully independent Solrs, each
indexing independently, with an Apache load balancer in front for the
searches. It does give us the occasional hiccup, so we'll be switching
to SolrCloud at some point.

- Toke Eskildsen, State and University Library, Denmark







 


Re: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread scott chu
 
Hello solr-user,

No, both. But first I have to face the indexing performance problem. Where can 
I see information about concurrent/parallel indexing on Solr? Thanks in advance.
- Original Message - 
From: Toke Eskildsen 
To: solr_user lucene_apache 
Date: 2015-09-04, 00:57:51
Subject: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?


scott chu <scott@udngroup.com> wrote:
> I keep forgetting to mention one thing in this discussion.
> Our data is Chinese news articles and we currently use the CJK
> tokenizer (i.e. 2-gram). Indexing is quite slow compared to
> indexing English articles. That's why I am so worried about
> indexing performance on 10M Chinese docs and have turned to
> studying SolrCloud.

The performance problem is indexing and not searching? Solr supports concurrent 
indexing, so if you are able to send the data in parallel, just start as many 
indexing threads as you have cores. Of course that does not help if you are 
already doing that.

Also sanity check that you are not doing commits all the time.
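
A minimal sketch of the two suggestions above (send batches in parallel, commit once at the end) might look like the following. It assumes documents are posted as JSON to a core's /update handler; the URL, the batch size and the function names are hypothetical. The default worker count is the number of cores on the client machine, which is only a stand-in for the core count of the Solr machine.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def batches(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_json(url, payload):
    """POST a JSON payload to the update handler; no commit is requested here."""
    req = Request(url, data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return resp.status


def index_parallel(docs, update_url, batch_size=1000, workers=None, send=post_json):
    """Send batches from several threads, then issue a single commit."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every batch and re-raises any worker error.
        list(pool.map(lambda batch: send(update_url, batch),
                      batches(docs, batch_size)))
    # One explicit commit after all batches, instead of one per batch.
    send(update_url, {"commit": {}})
```

Usage would be something like index_parallel(docs, "http://localhost:8983/solr/news/update"), where docs is a list of dictionaries. If autoCommit is configured in solrconfig.xml, the final explicit commit may not be needed at all.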

- Toke Eskildsen






 


Re: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread Erick Erickson
Ah, that may make my suggestions unworkable re: just reindexing.

Still, how much time are we talking about here? I've very often found
that indexing performance isn't gated by the Solr processing, but by
whatever is feeding Solr. A quick test is to fire up your indexing
and see if the CPU utilization by Solr is very high. As Toke said,
though, if you're using DIH you're out of luck.

Here's an article to get you started with SolrJ:
http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Thu, Sep 3, 2015 at 10:26 AM, Toke Eskildsen wrote:
> scott chu  wrote:
>> No, both. But first I have to face the indexing performance problem.
>> Where can I see information about concurrent/parallel indexing on Solr?
>
> Depends on how you index. If you use a Java program,
> http://lucene.apache.org/solr/5_2_0/solr-solrj/index.html?org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html
> seems to do the trick (I haven't tried that one myself).
>
> If you are sending updates using curl or similar, you just need to start more 
> processes doing that.
>
> If you are using DataImportHandler, I think you are out of luck. As far as I 
> know, it does not support multiple index threads.
>
> - Toke Eskildsen


Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread Toke Eskildsen
scott chu  wrote:
 
> I keep forgetting to mention one thing in this discussion.
> Our data is Chinese news articles and we currently use the CJK
> tokenizer (i.e. 2-gram). Indexing is quite slow compared to
> indexing English articles. That's why I am so worried about
> indexing performance on 10M Chinese docs and have turned to
> studying SolrCloud.

The performance problem is indexing and not searching? Solr supports concurrent 
indexing, so if you are able to send the data in parallel, just start as many 
indexing threads as you have cores. Of course that does not help if you are 
already doing that.

Also sanity check that you are not doing commits all the time.

- Toke Eskildsen


Re: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread Toke Eskildsen
scott chu  wrote:
> No, both. But first I have to face the indexing performance problem.
> Where can I see information about concurrent/parallel indexing on Solr?

Depends on how you index. If you use a Java program,
http://lucene.apache.org/solr/5_2_0/solr-solrj/index.html?org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html
seems to do the trick (I haven't tried that one myself).

If you are sending updates using curl or similar, you just need to start more 
processes doing that.

If you are using DataImportHandler, I think you are out of luck. As far as I 
know, it does not support multiple index threads.

- Toke Eskildsen


Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread Erick Erickson
bq: If you switch to SolrCloud, will you still keep numShards parameter to 1

Yes. Although if you want to add more replicas you might want to specify that.

For 10M documents, I wouldn't be very fancy. Indexing them shouldn't take
very long, and I think your time would be better spent on other things than
trying to get fancy with SPLITSHARD and the like. Just create a SolrCloud
cluster with as many replicas as you want and index from scratch unless it's
prohibitively expensive.

I can index 200M docs on my local Mac Pro in a couple of hours. Is it really
worth trying to do something you'll probably never do again (i.e. SPLITSHARD)?

If you really don't want to re-index _and_ you have only one shard in the
master/slave setup, here's what I'd do to migrate:
1> create a new SolrCloud cluster with exactly one node (i.e. the "leader").
2> shut it down.
3> copy the index from your master/slave to the new node, completely
 replacing the data directory.
4> bring the node back up and check it.
5> use the Collections API ADDREPLICA command to bring up as many
 replicas as you want; they'll pull down the index and from that point on
 you should be good.
5a> In this case, it'll actually do a complete replication from the leader to
 the followers, but thereafter incremental updates will be sent to all
 the nodes in the cluster rather than the older style master/slave
 occasional replication.

Best,
Erick

On Thu, Sep 3, 2015 at 8:54 AM, scott chu <scott@udngroup.com> wrote:
>
> Hello solr-user,
>
> If you switch to SolrCloud, will you still keep numShards parameter to 1? If
> you are migrating to SolrCloud and going to split that single shard into
> multiple shards, wouldn't you have to reindex the data? Is it possible to just
> put that single shard into SolrCloud and call the SPLITSHARD API to split it?
>
> I ask this because I'd like to first try the master-slave architecture, since
> Erick suggests that 10M is not a "vast" thing. Then later, I might migrate it
> to SolrCloud, possibly because I want to take advantage of the ZooKeeper
> functionality for HA/DR.
>
> - Original Message -----
> From: Toke Eskildsen
> To: solr-user
> Date: 2015-09-03, 18:33:39
> Subject: Re: Re: concept and choice: custom sharding or auto sharding?
>
> On Thu, 2015-09-03 at 18:24 +0800, Scott Chu wrote:
>> Do you use master-slave or SolrCloud for that single shard?
>
> Due to legacy reasons we are just using 2 fully independent Solrs, each
> indexing independently, with an Apache load balancer in front for the
> searches. It does give us the occasional hiccup, so we'll be switching
> to SolrCloud at some point.
>
> - Toke Eskildsen, State and University Library, Denmark


Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
 
Hello solr-user,

Sorry, wrong again. Auto sharding is not the implicit router.
- Original Message - 
From: scott chu 
To: solr-user 
Date: 2015-09-02, 23:50:20
Subject: Re: Re: Re: concept and choice: custom sharding or auto sharding?


 
Hello solr-user,

Thanks! I'll go back to check my old environment, and that article is really 
helpful.

BTW, I think I got compositeID wrong. The reference guide says compositeID 
needs numShards. That means what I described in question 5 seems wrong, 
because I intended to put one whole year of news articles in each shard and I 
thought SolrCloud would create a new shard for me by itself when I add a new 
year's articles. But since compositeID requires numShards to be specified up 
front, there's no way I can know in advance how many years I will put in 
SolrCloud. It looks like if I want to use SolrCloud after all, I may have to 
use auto sharding (i.e. implicit router).
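
For the one-shard-per-year idea, a hedged sketch of what the implicit router allows: shards are named when the collection is created, and further shards can be added later with the Collections API CREATESHARD command. Note that this is manual (custom) sharding rather than auto sharding. The collection name, shard names and host below are hypothetical, and the helpers only build the request URLs.

```python
from urllib.parse import urlencode

ADMIN_PATH = "/admin/collections?"


def create_implicit_collection_url(base_url, name, shards, route_field=None):
    """Build a CREATE URL for a collection that uses the implicit router."""
    params = {"action": "CREATE", "name": name,
              "router.name": "implicit", "shards": ",".join(shards)}
    if route_field:
        # Optional: a document field whose value names the target shard.
        params["router.field"] = route_field
    return base_url.rstrip("/") + ADMIN_PATH + urlencode(params)


def create_shard_url(base_url, collection, shard):
    """Build a CREATESHARD URL that adds one named shard to the collection."""
    params = {"action": "CREATESHARD", "collection": collection, "shard": shard}
    return base_url.rstrip("/") + ADMIN_PATH + urlencode(params)


if __name__ == "__main__":
    base = "http://host1:8983/solr"  # hypothetical node
    print(create_implicit_collection_url(base, "news", ["y2014", "y2015"]))
    print(create_shard_url(base, "news", "y2016"))  # added when a new year starts
```

This only shows that adding a shard per year is possible; it does not change the point made elsewhere in this thread that 10M documents rarely need sharding at all.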
- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-02, 23:30:53
Subject: Re: Re: concept and choice: custom sharding or auto sharding?


bq: Why do you say: "at 10M documents there's rarely a need to shard at all?"

Because I routinely see 50M docs on a single node and I've seen over 300M docs
on a single node with sub-second responses. So if you're saying that you see
poor performance at 1M docs then I suspect there's something radically wrong
with your setup. Too little memory, very bad query patterns, whatever. If my
suspicion is true, then sharding will just mask the underlying problem.

You need to quantify your performance concerns. It's one thing to say
"my node satisfies 50 queries-per-second with 500ms response time" and
another to say "My queries take 5,000 ms".

In the first case, you do indeed need to add more servers to increase QPS if
you need 500 QPS. And adding more slaves is the best way to do that.
In the second, you need to understand the slowdown because sharding
will be a band-aid.
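
The first case is simple arithmetic, sketched here under the assumption that throughput scales roughly linearly with the number of replicas (real clusters may fall short of that). The function name is made up for the example.

```python
import math


def replicas_needed(target_qps, qps_per_node):
    """Estimate how many replicas are needed, assuming linear QPS scaling."""
    if qps_per_node <= 0:
        raise ValueError("qps_per_node must be positive")
    return math.ceil(target_qps / qps_per_node)


if __name__ == "__main__":
    # One node serves 50 QPS at 500 ms; the goal is 500 QPS.
    print(replicas_needed(500, 50))
```

Adding replicas this way raises throughput only. A single query that takes 5,000 ms will still take about 5,000 ms on every replica, which is why the second case needs diagnosis rather than more hardware.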

This might help:
https://wiki.apache.org/solr/SolrPerformanceProblems

Best,
Erick



On Wed, Sep 2, 2015 at 8:19 AM, scott chu <scott@udngroup.com> wrote:
>
> Hello solr-user,
>
> Do you mean I only have to put 10M documents in one index and copy it to
> many slaves in a classic Solr master-slave architecture to serve queries
> on the internet, and query performance won't degrade noticeably? But I
> did add 1M documents to one index on the master with 2 slaves serving
> queries on the internet, and the query performance is rather poor. Why do
> you say: "at 10M documents there's rarely a need to shard at all?" Do I
> provide too few slaves? At what number of documents does sharding in
> SolrCloud become appropriate?
>
> - Original Message -----
>
> From: Erick Erickson
> To: solr-user
> Date: 2015-09-02, 23:00:29
> Subject: Re: concept and choice: custom sharding or auto sharding?
>
> Frankly, at 10M documents there's rarely a need to shard at all.
> Why do you think you need to? This seems like adding
> complexity for no good reason. Sharding should only really
> be used when you have too many documents to fit on a single
> shard as it adds some overhead, restricts some
> possibilities (cross-core join for instance, a couple of
> grouping options don't work in distributed mode etc.).
>
> You can still run SolrCloud and have it manage multiple
> _replicas_ of a single shard for HA/DR.
>
> So this seems like an XY problem, you're asking specific
> questions about shard routing because you think it'll
> solve some problem without telling us what the problem
> is.
>
> Best,
> Erick
>
> On Wed, Sep 2, 2015 at 7:47 AM, scott chu <scott@udngroup.com> wrote:
>> I posted a question on Stack Overflow:
>> http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud
>> However, since this is a mailing list, I am reposting the question below to
>> ask for suggestions and a finer understanding of SolrCloud's document
>> routing behavior.
>> I want to establish a SolrCloud cluster for over 10 million news
>> articles. After reading the article "Shards and Indexing Data in
>> SolrCloud" in the Apache Solr Reference Guide, I have a plan as follows:
>> Add the prefix ED2001! to the document ID, where ED means some newspaper
>> source and 2001 is the year part of the article's publication date, i.e. I
>> want to put all news articles from a specific newspaper source published in
>> a specific year into one shard.
>> Create the collection with router.name set to compositeID.
>> Add documents?
>> Query the collection?
>> Practically, I have some questions:
>> How do I add documents based on this plan? Do I have to specify special
>> parameters when updating the collection/core?
>> Is this called "cust

Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
 
Hello solr-user,

Thanks! I'll go back to check my old environment, and that article is really 
helpful.

BTW, I think I got compositeID wrong. The reference guide says compositeID 
needs numShards. That means what I described in question 5 seems wrong, 
because I intended to put one whole year of news articles in each shard and I 
thought SolrCloud would create a new shard for me by itself when I add a new 
year's articles. But since compositeID requires numShards to be specified up 
front, there's no way I can know in advance how many years I will put in 
SolrCloud. It looks like if I want to use SolrCloud after all, I may have to 
use auto sharding (i.e. implicit router).
- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-02, 23:30:53
Subject: Re: Re: concept and choice: custom sharding or auto sharding?


bq: Why do you say: "at 10M documents there's rarely a need to shard at all?"

Because I routinely see 50M docs on a single node and I've seen over 300M docs
on a single node with sub-second responses. So if you're saying that you see
poor performance at 1M docs then I suspect there's something radically wrong
with your setup. Too little memory, very bad query patterns, whatever. If my
suspicion is true, then sharding will just mask the underlying problem.

You need to quantify your performance concerns. It's one thing to say
"my node satisfies 50 queries-per-second with 500ms response time" and
another to say "My queries take 5,000 ms".

In the first case, you do indeed need to add more servers to increase QPS if
you need 500 QPS. And adding more slaves is the best way to do that.
In the second, you need to understand the slowdown because sharding
will be a band-aid.

This might help:
https://wiki.apache.org/solr/SolrPerformanceProblems

Best,
Erick



On Wed, Sep 2, 2015 at 8:19 AM, scott chu <scott@udngroup.com> wrote:
>
> Hello solr-user,
>
> Do you mean I only have to put 10M documents in one index and copy it to
> many slaves in a classic Solr master-slave architecture to serve queries
> on the internet, and query performance won't degrade noticeably? But I
> did add 1M documents to one index on the master with 2 slaves serving
> queries on the internet, and the query performance is rather poor. Why do
> you say: "at 10M documents there's rarely a need to shard at all?" Do I
> provide too few slaves? At what number of documents does sharding in
> SolrCloud become appropriate?
>
> - Original Message -----
>
> From: Erick Erickson
> To: solr-user
> Date: 2015-09-02, 23:00:29
> Subject: Re: concept and choice: custom sharding or auto sharding?
>
> Frankly, at 10M documents there's rarely a need to shard at all.
> Why do you think you need to? This seems like adding
> complexity for no good reason. Sharding should only really
> be used when you have too many documents to fit on a single
> shard as it adds some overhead, restricts some
> possibilities (cross-core join for instance, a couple of
> grouping options don't work in distributed mode etc.).
>
> You can still run SolrCloud and have it manage multiple
> _replicas_ of a single shard for HA/DR.
>
> So this seems like an XY problem, you're asking specific
> questions about shard routing because you think it'll
> solve some problem without telling us what the problem
> is.
>
> Best,
> Erick
>
> On Wed, Sep 2, 2015 at 7:47 AM, scott chu <scott@udngroup.com> wrote:
>> I posted a question on Stack Overflow:
>> http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud
>> However, since this is a mailing list, I am reposting the question below to
>> ask for suggestions and a finer understanding of SolrCloud's document
>> routing behavior.
>> I want to establish a SolrCloud cluster for over 10 million news
>> articles. After reading the article "Shards and Indexing Data in
>> SolrCloud" in the Apache Solr Reference Guide, I have a plan as follows:
>> Add the prefix ED2001! to the document ID, where ED means some newspaper
>> source and 2001 is the year part of the article's publication date, i.e. I
>> want to put all news articles from a specific newspaper source published in
>> a specific year into one shard.
>> Create the collection with router.name set to compositeID.
>> Add documents?
>> Query the collection?
>> Practically, I have some questions:
>> How do I add documents based on this plan? Do I have to specify special
>> parameters when updating the collection/core?
>> Is this called "custom sharding"? If not, what is "custom sharding"?
>> Is auto sharding a better choice for my case since there's a
>> shard-splitting feature for auto sharding when the shard is too big?
>> Can I query without