Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?
Hello solr-user,

I keep forgetting to mention one thing in this discussion. Our data is Chinese news articles, and we currently use the CJK tokenizer (i.e. 2-gram). Indexing is quite slow compared with indexing English articles. That is why I am so worried about indexing performance on 10M Chinese docs and have turned to studying SolrCloud. It could also be the reason why indexing 1M docs is rather slow for us.

Frankly, we have not looked into writing a better-performing Chinese tokenizer in past years for policy reasons. (However, we do plan to write one next year, using the MMSeg algorithm or 1-gram plus a query preprocessor.)

----- Original Message -----
From: Erick Erickson
To: solr-user
Date: 2015-09-04, 00:07:43
Subject: Re: Re: Re: concept and choice: custom sharding or auto sharding?

bq: If you switch to SolrCloud, will you still keep numShards parameter to 1

Yes. Although if you want to add more replicas you might want to specify that.

For 10M documents, I wouldn't be very fancy. Indexing them shouldn't take very long, and I think your time would be better spent on other things than trying to get fancy with SPLITSHARD and the like. Just create a SolrCloud cluster with as many replicas as you want and index from scratch unless it's prohibitively expensive. I can index 200M docs on my local Mac Pro in a couple of hours. Is it really worth trying to do something you'll probably never do again (i.e. SPLITSHARD)?

If you really don't want to re-index _and_ you have only one shard in the master/slave setup, here's what I'd do to migrate:

1> Create a new SolrCloud cluster with exactly one node (i.e. the "leader").
2> Shut it down.
3> Copy the index from your master/slave to the new node, completely replacing the data directory.
4> Bring the node back up and check it.
5> Use the Collections API ADDREPLICA command to bring up as many replicas as you want. They'll pull down the index and from that point on you should be good.
5a> In this case, it'll actually do a complete replication from the leader to the followers, but thereafter incremental updates will be sent to all the nodes in the cluster rather than the older-style master/slave occasional replication.

Best,
Erick
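As a rough illustration of the 2-gram tokenization mentioned at the start of this message: a bigram tokenizer emits every overlapping pair of adjacent characters, so a run of n CJK characters becomes n - 1 index terms. The sketch below is plain Python and is not Solr's actual CJK tokenizer code; the function name `cjk_bigrams` is made up for the example, and it ignores punctuation and mixed-script handling.

```python
def cjk_bigrams(text):
    """Return the overlapping 2-character tokens of a run of CJK text.

    Illustration only: this is not Solr's tokenizer implementation.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character (or nothing) cannot form a pair.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


tokens = cjk_bigrams("新聞文章")
print(tokens)  # 3 overlapping tokens for 4 characters
```

A dictionary-based segmenter such as MMSeg would typically emit fewer, word-level tokens for the same text. Whether that makes indexing noticeably faster is something to measure rather than assume.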
Re: Re: Re: concept and choice: custom sharding or auto sharding?
Hello solr-user,

If you switch to SolrCloud, will you still keep the numShards parameter at 1? If you are migrating to SolrCloud and are going to split that single shard into multiple shards, wouldn't you have to reindex the data? Is it possible to just put that single shard into SolrCloud and call the SPLITSHARD API to split it?

I ask because I'd like to try a master-slave architecture first since, as Erick suggested, 10M is not a "vast" amount. Later, I might migrate to SolrCloud, possibly because I want to take advantage of the ZooKeeper functionality for HA/DR.

----- Original Message -----
From: Toke Eskildsen
To: solr-user
Date: 2015-09-03, 18:33:39
Subject: Re: Re: concept and choice: custom sharding or auto sharding?

On Thu, 2015-09-03 at 18:24 +0800, Scott Chu wrote:
> Do you use master-slave or SolrCloud for that single shard?

Due to legacy reasons we are just using 2 fully independent Solrs, each indexing independently, with an Apache load balancer in front for the searches. It does give us the occasional hiccup, so we'll be switching to SolrCloud at some point.

- Toke Eskildsen, State and University Library, Denmark
Re: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?
Hello solr-user,

No, both. But first I have to deal with the indexing performance problem. Where can I find information about concurrent/parallel indexing in Solr? Thanks in advance.

----- Original Message -----
From: Toke Eskildsen
To: solr_user lucene_apache
Date: 2015-09-04, 00:57:51
Subject: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

The performance problem is indexing and not searching?

Solr supports concurrent indexing, so if you are able to send the data in parallel, just start as many indexing threads as you have cores. Of course that does not help if you are already doing that.

Also sanity check that you are not doing commits all the time.

- Toke Eskildsen
Re: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?
Ah, that may make my suggestion to just reindex unworkable. Still, how much time are we talking about here? I've very often found that indexing performance isn't gated by the Solr processing, but by whatever is feeding Solr. A quick test is to fire up your indexing and see if the CPU utilization by Solr is very high.

As Toke said, though, if you're using DIH you're out of luck. Here's an article to get you started with SolrJ:
http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Thu, Sep 3, 2015 at 10:26 AM, Toke Eskildsen wrote:
> scott chu wrote:
>> No, both. But first I have to face the indexing performance problem.
>> Where can I see information about concurrent/parallel indexing on Solr?
>
> Depends on how you index. If you use a Java program,
> http://lucene.apache.org/solr/5_2_0/solr-solrj/index.html?org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html
> seems to do the trick (I haven't tried that one myself).
>
> If you are sending updates using curl or similar, you just need to start more
> processes doing that.
>
> If you are using DataImportHandler, I think you are out of luck. As far as I
> know, it does not support multiple index threads.
>
> - Toke Eskildsen
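A complementary check to the CPU test Erick describes is to time the two halves of the indexing loop separately on the client. This is a minimal sketch, assuming the indexer can be split into a `fetch_batch` step (reading from the database or files) and a `send_batch` step (posting to Solr). Both callables and the name `profile_feed` are hypothetical placeholders, not part of Solr or SolrJ.

```python
import time


def profile_feed(fetch_batch, send_batch, batches):
    """Return (seconds spent fetching, seconds spent sending) over a run.

    fetch_batch() returns a list of documents; send_batch(docs) delivers
    them to the search server. Both are supplied by the caller.
    """
    fetch_seconds = 0.0
    send_seconds = 0.0
    for _ in range(batches):
        started = time.perf_counter()
        docs = fetch_batch()
        fetched = time.perf_counter()
        send_batch(docs)
        sent = time.perf_counter()
        fetch_seconds += fetched - started
        send_seconds += sent - fetched
    return fetch_seconds, send_seconds
```

If most of the time lands in the fetch half, tuning Solr or adding shards is unlikely to help much, which matches the point about the feeder often being the gate.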
Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?
scott chu wrote:
> I keep forgetting to mention one thing in this discussion.
> Our data is Chinese news articles and we use CJK tokenizer
> (i.e. 2-gram) currently. The time spent on indexing is quite slow,
> compared to indexing English articles. That's why I am so
> worried about indexing performance on 10M Chinese docs
> and turned to studying SolrCloud.

The performance problem is indexing and not searching?

Solr supports concurrent indexing, so if you are able to send the data in parallel, just start as many indexing threads as you have cores. Of course that does not help if you are already doing that.

Also sanity check that you are not doing commits all the time.

- Toke Eskildsen
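The two suggestions above (send in parallel, do not commit all the time) can be sketched on the client side as follows. This is an illustration under stated assumptions, not a recommended implementation: `send_batch` and `commit` are hypothetical callables that would wrap whatever client is actually used, and the default worker count uses the local machine's CPU count only as a stand-in for "as many threads as you have cores".

```python
import os
from concurrent.futures import ThreadPoolExecutor


def batched(docs, size):
    """Yield consecutive slices of docs with at most size items each."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def index_in_parallel(docs, send_batch, commit, batch_size=1000, workers=None):
    """Send batches from several threads, then commit once at the end.

    send_batch and commit are hypothetical callables supplied by the caller.
    """
    workers = workers or os.cpu_count() or 1

    def send_and_count(batch):
        send_batch(batch)
        return len(batch)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        sent = sum(pool.map(send_and_count, batched(docs, batch_size)))
    commit()  # one commit for the whole run, not one per batch
    return sent
```

The single `commit()` at the end is the design point: each batch is sent without forcing the index to be flushed and reopened.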
Re: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?
scott chu wrote:
> No, both. But first I have to face the indexing performance problem.
> Where can I see information about concurrent/parallel indexing on Solr?

Depends on how you index. If you use a Java program,
http://lucene.apache.org/solr/5_2_0/solr-solrj/index.html?org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html
seems to do the trick (I haven't tried that one myself).

If you are sending updates using curl or similar, you just need to start more processes doing that.

If you are using DataImportHandler, I think you are out of luck. As far as I know, it does not support multiple index threads.

- Toke Eskildsen
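For the "curl or similar" case, the work has to be divided before the separate processes start. The sketch below shows one way to do that: split the documents round-robin and build one JSON update request per chunk. The host `localhost:8983`, the collection name `news`, and both function names are assumptions for illustration. The requests are only constructed here, not sent; each chunk would be posted by its own process.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def partition(docs, parts):
    """Split docs round-robin into `parts` lists, one per feeding process."""
    return [docs[i::parts] for i in range(parts)]


def build_update_request(base_url, collection, docs):
    """Build (but do not send) a JSON update POST for one chunk of docs."""
    url = "{}/{}/update".format(base_url.rstrip("/"), quote(collection))
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


# Hypothetical documents and endpoint, for illustration only.
docs = [{"id": str(i), "title": "doc {}".format(i)} for i in range(10)]
pending = [build_update_request("http://localhost:8983/solr", "news", chunk)
           for chunk in partition(docs, 3)]
```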
Re: Re: Re: concept and choice: custom sharding or auto sharding?
bq: If you switch to SolrCloud, will you still keep numShards parameter to 1

Yes. Although if you want to add more replicas you might want to specify that.

For 10M documents, I wouldn't be very fancy. Indexing them shouldn't take very long, and I think your time would be better spent on other things than trying to get fancy with SPLITSHARD and the like. Just create a SolrCloud cluster with as many replicas as you want and index from scratch unless it's prohibitively expensive. I can index 200M docs on my local Mac Pro in a couple of hours. Is it really worth trying to do something you'll probably never do again (i.e. SPLITSHARD)?

If you really don't want to re-index _and_ you have only one shard in the master/slave setup, here's what I'd do to migrate:

1> Create a new SolrCloud cluster with exactly one node (i.e. the "leader").
2> Shut it down.
3> Copy the index from your master/slave to the new node, completely replacing the data directory.
4> Bring the node back up and check it.
5> Use the Collections API ADDREPLICA command to bring up as many replicas as you want. They'll pull down the index and from that point on you should be good.
5a> In this case, it'll actually do a complete replication from the leader to the followers, but thereafter incremental updates will be sent to all the nodes in the cluster rather than the older-style master/slave occasional replication.

Best,
Erick

On Thu, Sep 3, 2015 at 8:54 AM, scott chu <scott@udngroup.com> wrote:
> If you switch to SolrCloud, will you still keep numShards parameter to 1? If
> you are migrating to SolrCloud and going to split that single shard into
> multiple shards, wouldn't you have to reindex the data? Is it possible to just
> put that single shard into SolrCloud and call SPLITSHARD API to split it?
>
> I ask this because I'd like to first try a master-slave architecture, since
> Erick suggested that 10M is not a "vast" thing. Then later, I might migrate it
> to SolrCloud, possibly because I want to take advantage of the ZooKeeper
> functionality for HA/DR.
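Steps 1 and 5 of the migration above go through the Collections API, which is driven by plain HTTP requests. The sketch below only builds the request URLs. The node address, the collection name `news`, and the helper `collections_api_url` are assumptions for illustration, and `shard1` assumes the default shard name of a single-shard collection. Steps 2 to 4 (stopping the node, copying the index, restarting) are file-system operations and are not shown.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Return a Collections API URL for the given action and parameters."""
    query = urlencode(dict(action=action, **params))
    return "{}/admin/collections?{}".format(base_url.rstrip("/"), query)


BASE = "http://localhost:8983/solr"  # hypothetical node address

# Step 1: a single-shard collection on the first node.
create_url = collections_api_url(BASE, "CREATE", name="news",
                                 numShards=1, replicationFactor=1)

# Step 5: after the copied index checks out, add a replica of shard1.
add_replica_url = collections_api_url(BASE, "ADDREPLICA",
                                      collection="news", shard="shard1")

print(create_url)
print(add_replica_url)
```

Calling the ADDREPLICA URL once per extra replica would give the "as many replicas as you want" part of step 5.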
Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?
Hello solr-user,

Sorry, wrong again. Auto sharding is not the implicit router.

----- Original Message -----
From: scott chu
To: solr-user
Date: 2015-09-02, 23:50:20
Subject: Re: Re: Re: concept and choice: custom sharding or auto sharding?

Hello solr-user,

Thanks! I'll go back and check my old environment; that article is really helpful. BTW, I think I got compositeId wrong. The reference guide says compositeId needs numShards. That means what I described in question 5 seems wrong, because I planned to put one whole year of news articles in each shard, and I thought SolrCloud would create a new shard for me by itself when I add a new year's articles. But since compositeId requires numShards to be specified first, there's no way I can know in advance how many years I will put in SolrCloud. It looks like if I want to use SolrCloud after all, I may have to use auto sharding (i.e. implicit router).
Re: Re: Re: concept and choice: custom sharding or auto sharding?
Hello solr-user,

Thanks! I'll go back and check my old environment; that article is really helpful. BTW, I think I got compositeId wrong. The reference guide says compositeId needs numShards. That means what I described in question 5 seems wrong, because I planned to put one whole year of news articles in each shard, and I thought SolrCloud would create a new shard for me by itself when I add a new year's articles. But since compositeId requires numShards to be specified first, there's no way I can know in advance how many years I will put in SolrCloud. It looks like if I want to use SolrCloud after all, I may have to use auto sharding (i.e. implicit router).

----- Original Message -----
From: Erick Erickson
To: solr-user
Date: 2015-09-02, 23:30:53
Subject: Re: Re: concept and choice: custom sharding or auto sharding?

bq: Why do you say: "at 10M documents there's rarely a need to shard at all?"

Because I routinely see 50M docs on a single node and I've seen over 300M docs on a single node with sub-second responses. So if you're saying that you see poor performance at 1M docs then I suspect there's something radically wrong with your setup. Too little memory, very bad query patterns, whatever. If my suspicion is true, then sharding will just mask the underlying problem.

You need to quantify your performance concerns. It's one thing to say "my node satisfies 50 queries-per-second with 500ms response time" and another to say "My queries take 5,000 ms".

In the first case, you do indeed need to add more servers to increase QPS if you need 500 QPS. And adding more slaves is the best way to do that. In the second, you need to understand the slowdown because sharding will be a band-aid.
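The arithmetic behind the first case can be written down as a back-of-the-envelope sketch. It assumes throughput scales linearly with the number of identical nodes and that queries are spread evenly across them. Those are assumptions made for the example, not claims from the thread, and the function name `nodes_for_qps` is made up.

```python
import math


def nodes_for_qps(target_qps, qps_per_node):
    """Estimate how many identical nodes are needed, assuming linear scaling."""
    if qps_per_node <= 0:
        raise ValueError("qps_per_node must be positive")
    return math.ceil(target_qps / qps_per_node)


print(nodes_for_qps(500, 50))  # 10
```

In the second case (a single query taking 5,000 ms) this calculation does not apply, since adding nodes raises throughput but does not by itself make an individual slow query faster.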
This might help: https://wiki.apache.org/solr/SolrPerformanceProblems

Best,
Erick

On Wed, Sep 2, 2015 at 8:19 AM, scott chu <scott@udngroup.com> wrote:
> Hello solr-user,
>
> Do you mean I only have to put 10M documents in one index and copy it to
> many slaves in a classic Solr master-slave architecture to provide a query
> service on the internet, and it won't noticeably degrade query
> performance? But I did add 1M documents to one index on the master and
> provide 2 slaves to serve queries on the internet, and the query
> performance is kind of sad. Why do you say: "at 10M documents there's rarely a
> need to shard at all?" Do I provide too few slaves? What number of documents
> justifies sharding in SolrCloud?
>
> ----- Original Message -----
> From: Erick Erickson
> To: solr-user
> Date: 2015-09-02, 23:00:29
> Subject: Re: concept and choice: custom sharding or auto sharding?
>
> Frankly, at 10M documents there's rarely a need to shard at all.
> Why do you think you need to? This seems like adding
> complexity for no good reason. Sharding should only really
> be used when you have too many documents to fit on a single
> shard as it adds some overhead, restricts some
> possibilities (cross-core join for instance, a couple of
> grouping options don't work in distributed mode etc.).
>
> You can still run SolrCloud and have it manage multiple
> _replicas_ of a single shard for HA/DR.
>
> So this seems like an XY problem, you're asking specific
> questions about shard routing because you think it'll
> solve some problem without telling us what the problem
> is.
>
> Best,
> Erick
>
> On Wed, Sep 2, 2015 at 7:47 AM, scott chu <scott@udngroup.com> wrote:
>> I posted a question on Stack Overflow:
>> http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud
>> However, since this is a mailing list, I repost the question below to ask
>> for suggestions and a more detailed picture of SolrCloud's behavior on document
>> routing.
>> I want to establish a SolrCloud cluster for over 10 million news
>> articles. After reading this article in the Apache Solr Reference Guide: Shards
>> and Indexing Data in SolrCloud, I have a plan as follows:
>>
>> Add prefix ED2001! to document ID where ED means some newspaper source and
>> 2001 is the year part in published date of news article, i.e. I want to put
>> all news articles of a specific newspaper source published in a specific year
>> into one shard.
>> Create collection with router.name set to compositeId.
>> Add documents?
>> Query collection?
>>
>> Practically, I have some questions:
>>
>> How to add documents based on this plan? Do I have to specify special
>> parameters when updating the collection/core?
>> Is this called "custom sharding"? If not, what is "custom sharding"?
>> Is auto sharding a better choice for my case since there's a
>> shard-splitting feature for auto sharding when the shard is too big?
>> Can I query without