Re: Lucene to Solrcloud migration
Hi Erick, Michael, thank you both for your comments.

2014-11-11 5:05 GMT+01:00 Erick Erickson erickerick...@gmail.com:

> bq: - the documents are organized in shards according to date (integer) and language (a possibly extensible discrete set)
> bq: - the indexes are disjunct
>
> OK, I'm having a hard time getting my head around these two statements. If the indexes are disjunct in the sense that you only search one at a time, then they are different collections in SolrCloud jargon.

I just meant that every document is contained in exactly one of the indexes. I have many Lucene indexes, one per [language x timespan], but logically they constitute a single huge index. That is why I thought it would be natural to represent it as a single SolrCloud collection.

> If, on the other hand, these are a big collection and you want to search them all with a single query, I suggest that in SolrCloud land you don't want them to be discrete shards. My reasoning here is: let's say you have a bunch of documents for October 2014 in Spanish. By putting these all on a single shard, your queries all have to be serviced by that one shard. You don't get any parallelism.

That is right. Actually, parallelization is not the main issue right now. The queries are very sparse, and currently our system does not support load balancing at all; I imagined that in the future it could be achieved via SolrCloud replication. The main consideration is being able to plug the indexes in and out on demand. The total size of the data is in terabytes. We usually want to search only the latest indexes, but occasionally we need to plug in one of the older ones. Maybe (probably) I still have some misconceptions about the uses of SolrCloud...

> If it really does make sense in your case to route all the docs to a single shard, then Michael's comment is spot-on: use the compositeId router.

You confuse me here. I was not thinking about a single shard; on the contrary, each [language x timespan] index would itself be a shard. I agree that the compositeId router seems natural for what I need. I am currently searching for a way to convert my indexes so that my document IDs have the composite format. Currently these are just unique integers, so I would like to prefix all the document IDs of an index with its language and timespan. I do not know how yet, but I believe this should be possible, as it is a constant operation that would not change the structure of the index.

Best,
Michal

> Best,
> Erick

On Mon, Nov 10, 2014 at 11:50 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

> Hi Michal,
>
> Is there a particular reason to shard your collections like that? If it was mainly for ease of operations, I'd consider just using compositeId to prevent specific types of queries from hotspotting particular nodes. If your ingest rate is fast, you might also consider making each collection an alias that points to many actual collections, and periodically closing off a collection and starting a new one. This prevents cache churn and the impact of large merges.
>
> Michael
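The id conversion discussed above, prefixing each existing integer id with a [language x timespan] shard key, can be sketched as follows. The "!" separator follows Solr's compositeId convention (shardKey!docId); the "language_timespan" prefix format and the helper name are assumptions for illustration:

```java
// Sketch: building compositeId-style document ids for the
// [language x timespan] layout discussed above. The "!" separates the
// route prefix from the original id, per Solr's compositeId convention;
// the "language_timespan" key format is an assumption.
public class CompositeIdSketch {

    // Prefix an existing integer doc id with a language/timespan shard key.
    static String compositeId(String language, String timespan, long docId) {
        return language + "_" + timespan + "!" + docId;
    }

    public static void main(String[] args) {
        // All Spanish October-2014 documents share the "es_201410" prefix,
        // so Solr hashes them into the same shard range.
        System.out.println(compositeId("es", "201410", 12345L)); // es_201410!12345
    }
}
```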
Re: Lucene to Solrcloud migration
Hm. So I found that one can update stored fields with an atomic update operation, but according to http://stackoverflow.com/questions/19058795/it-is-possible-to-update-uniquekey-in-solr-4 this will not work for the uniqueKey. So I guess I am out of luck with the compositeId router.

I have also been searching for a way to implement my own routing mechanism. This seems to be a cleaner solution anyway: I would not need to modify the existing indexes, just compute the hash from stored fields other than the document id. Can you confirm that this is possible? The documentation is very modest, however (I only found that it is possible to specify a custom hash function).

Best,
Michal

2014-11-11 16:48 GMT+01:00 Michael Della Bitta michael.della.bi...@appinions.com:

> Yeah, Erick confused me a bit too, but I think what he's talking about takes for granted that you'd have your various indexes directly set up as individual collections. If instead you're considering one big collection, or a few collections based on aggregations of your individual indexes, having big, multi-sharded collections using compositeId should work, unless there's a use case we're not discussing.
>
> Michael
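On hashing a field other than the uniqueKey: the Collections API CREATE command accepts a router.field parameter for exactly this, so routing is computed from a designated field rather than the document id. A hedged sketch of the call (host, collection, config, and field names are assumptions for illustration):

```shell
# Hedged sketch: create a collection that routes documents on a separate
# field instead of the uniqueKey. Names and shard count are assumptions.
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=archive&numShards=8&collection.configName=myconf&router.name=compositeId&router.field=routeKey'
```

With router.field, the existing integer ids could stay untouched; each document would just need the routing field (e.g. language plus timespan) populated, and queries can target subsets of shards with the _route_ parameter.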
Lucene to Solrcloud migration
Hi All,

I have been working on a project that has long employed the Lucene indexer. The system implements proprietary document routing and index plugging/unplugging on top of Lucene, and of course contains a great body of indexes.

Recently the idea came up to migrate from Lucene to SolrCloud, which appears to be more powerful than our proprietary system. Could you suggest the best way to migrate the system seamlessly to SolrCloud, given that reindexing is not an option?

- all the existing indexes represent a single collection in SolrCloud terms
- the documents are organized in shards according to date (integer) and language (a possibly extensible discrete set)
- the indexes are disjunct

I have been able to convert the existing indexes to the newest Lucene version and plug them individually into SolrCloud. However, there remain the questions of routing, sharding etc. Any insight appreciated.

Best,
Michal Krajnansky
Re: Solrcloud replicas do not match
Hi Erick,

I found the issue to be related to my other question (about the shared solrconfig.xml), which you also answered. It turns out that I had set the data.dir variable in solrconfig.xml to an absolute path that coincided with a different index. The replica then tried to be created there, and something nasty probably happened. Once I removed the variable, the replica started to be created where expected (and it grows appropriately in size).

However, during this recovery process (copying 60 GB of data) the Solr Admin console is unusable. Is there anything I can do about that?

Thank you a lot,
Michal

2014-11-07 20:16 GMT+01:00 Erick Erickson erickerick...@gmail.com:

> How did you create the replica? Does the admin screen show it attached to the proper shard? What I'd do is set up my SolrCloud instance with (presumably) a single node (leader) and ensure my searches were working. Then (and only then) use the Collection API ADDREPLICA command. You should see your replica be updated and be good to go.
>
> Best,
> Erick
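For reference, the ADDREPLICA call discussed in this thread looks roughly like this (collection, shard, and node names are assumptions for illustration):

```shell
# Hedged sketch of a Collections API ADDREPLICA call; the node parameter
# is optional and lets you pin the replica to a specific instance.
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=192.168.1.2:8983_solr'
```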
Re: Solrcloud solrconfig.xml
Hi Erick,

Thank you for making this clearer (it helped me solve the replication issue I asked about in a different thread). However, I suspect I am still doing something wrong.

I am running a single Tomcat instance with two instances of Solr. The shared solrconfig.xml contains:

<dataDir>${solr.data.dir:data}</dataDir>

And the Tomcat contexts set solr/home as follows:

<Environment name="solr/home" type="java.lang.String" value=".../solrcloud/solr1" override="true" />
<Environment name="solr/home" type="java.lang.String" value=".../solrcloud/solr2" override="true" />

The directory structure is as follows:

.../solrcloud/solr1/solr.xml
.../solrcloud/solr1/core1
.../solrcloud/solr1/core1/core.properties
.../solrcloud/solr1/core1/data
.../solrcloud/solr2/solr.xml

After having issued ADDREPLICA on the collection managed by core1, I would expect to see the new data dir under .../solrcloud/solr2/core2/data. However, I have seen something like this (the core names were a little different):

...
.../solrcloud/solr2/solr.xml
.../solrcloud/solr2/core2
.../solrcloud/solr2/core2/core.properties
.../solrcloud/data (!)

I.e. the new core's data dir was created relative to the parent solrcloud folder. This makes me confused...

Best,
Michal Krajnansky

2014-11-07 19:59 GMT+01:00 Erick Erickson erickerick...@gmail.com:

> Each of those data dirs is relative to the instance in question. So if you're running on different machines, they're physically separate even though named identically. If you're running multiple nodes on a single machine a la the getting-started docs, then each one is in its own directory (e.g. solr/node1, solr/node2), and since the dirs are relative to that directory, you get things like:
>
> ../solr/node1/solr/gettingstarted_shard1_replica1/data
> ../solr2/node2/solr/gettingstarted_shard1_replica1/data
>
> etc.
>
> Best,
> Erick
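Since the dataDir in a shared solrconfig.xml resolves relative to each core's instance dir, one way to pin it explicitly per core is a core.properties entry, which overrides the shared config for that core only. A sketch assuming the directory layout above (property values are hypothetical):

```properties
# Hedged sketch of .../solrcloud/solr1/core1/core.properties.
# dataDir here is resolved relative to this core's own instance dir,
# so two cores sharing one solrconfig.xml never collide on data paths.
name=core1
collection=mycollection
dataDir=data
```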
Solrcloud solrconfig.xml
Hi Everyone,

I am quite a bit confused about managing configuration files with Zookeeper when running Solr in cloud mode. To be precise, I was able to upload the config files (schema.xml, solrconfig.xml) into Zookeeper and run SolrCloud. What confuses me are properties like data.dir, or the replication request handlers. It seems like these should be different for each of the servers in the cloud. So how does it work? (I did google the matter, unsuccessfully.)

Best,
Michal
Solrcloud replicas do not match
Hi all,

I have a SolrCloud setup with a manually created collection whose index was obtained via means other than Solr (the data come from Lucene). I created a replica for the index and expected to see the data being copied to the replica, which does not happen. In the Admin interface I see something like:

                     Version        Gen      Size
Master (Searching)   1415379668601  5853288  60.13 GB
Master (Replicable)  1415379668601  5853288  -
Slave (Searching)    1415379668601  3        1.84 KB

The versions seem to match, but obviously the replica only contains the handful of documents I indexed AFTER the replica was created. How do I replicate the documents that were already in the index? Or am I missing something?

Best,
Michal Krajnansky
Solr slow startup
Dear All,

Sorry for the possibly newbie question, as I have only recently started experimenting with Solr and SolrCloud. I am trying to import an index originally created with Lucene 2.x into Solr 4.10. What I did was:

1. upgrade the index to version 3.x with IndexUpgrader
2. upgrade the index to version 4.x with IndexUpgrader
3. create a schema for Solr and use the default solrconfig (with some path changes)
4. successfully start Solr

The sizes I am speaking about are in the tens of gigabytes, and the startup times are 5-10 minutes. I have read at https://wiki.apache.org/solr/SolrPerformanceProblems that it possibly has something to do with the updateHandler, and enabled autoCommit as suggested, but with no improvement. Such a long startup time feels odd when Lucene itself seems to load the same indexes in no time.

I would very much appreciate any help with this issue.

Best,
Michal Krajnansky
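Steps 1 and 2 above can be sketched as command-line invocations of Lucene's org.apache.lucene.index.IndexUpgrader; the jar versions and index path below are assumptions:

```shell
# Hedged sketch of the two-step upgrade: each pass rewrites the index
# in place to the format of the lucene-core jar on the classpath.
java -cp lucene-core-3.6.2.jar org.apache.lucene.index.IndexUpgrader /path/to/index
java -cp lucene-core-4.10.2.jar org.apache.lucene.index.IndexUpgrader /path/to/index
```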
Re: Solr slow startup
Hey Yonik,

That (getting rid of the suggester) solved the issue! You saved me a lot of time and nerves.

Best,
Michal

2014-11-03 17:19 GMT+01:00 Yonik Seeley yo...@heliosearch.com:

> One possible cause of a slow startup with the default configs: https://issues.apache.org/jira/browse/SOLR-6679
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data
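For anyone hitting the same problem: rather than removing the suggest component outright, the dictionary build on startup and commit can be disabled in solrconfig.xml. A sketch loosely following the stock 4.10 example config (component and dictionary names are taken from that example and should be treated as assumptions):

```xml
<!-- Hedged sketch: keep the suggester but stop it rebuilding its
     dictionary on every startup/commit, which is what made startup
     slow on large indexes (see SOLR-6679). -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">cat</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>
```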