Re: Lucene to Solrcloud migration

2014-11-11 Thread Michal Krajňanský
Hi Erick, Michael,

thank you both for your comments.

2014-11-11 5:05 GMT+01:00 Erick Erickson erickerick...@gmail.com:

 bq: - the documents are organized in shards according to date (integer) and
 language (a possibly extensible discrete set)

 bq: - the indexes are disjunct

 OK, I'm having a hard time getting my head around these two statements.

 If the indexes are disjunct in the sense that you only search one at a
 time, then they are different collections in SolrCloud jargon.


I just meant that every document is contained in exactly one of the
indexes. I have a lot of Lucene indexes for various [language X timespan]
combinations, but logically we are speaking about a single huge index. That
is why I thought it would be natural to represent it as a single SolrCloud
collection.

 If, on the other hand, these are a big collection and you want to search
 them all with a single query, I suggest that in SolrCloud land you don't
 want them to be discrete shards. My reasoning here is that let's say you
 have a bunch of documents for October, 2014 in Spanish. By putting these
 all on a single shard, your queries all have to be serviced by that one
 shard. You don't get any parallelism.


That is right. Actually, parallelization is not the main issue right now.
The queries are very sparse; currently our system does not support load
balancing at all. I imagined that in the future it could be achievable via
SolrCloud replication.

The main consideration is to be able to plug the indexes in and out on
demand. The total size of the data is in terabytes. We usually want to
search only the latest indexes, but occasionally we need to plug in one of
the older ones.

Maybe (probably) I still have some misconceptions about the uses of
SolrCloud...

 If it really does make sense in your case to route all the docs to a
 single shard, then Michael's comment is spot-on: use the compositeId
 router.


You confuse me here. I was not thinking about a single shard; on the
contrary, each [language X timespan] index would itself be a shard. I agree
that the compositeId router seems to be natural for what I need. I am
currently searching for a way to convert my indexes so that my document IDs
have the composite format. Currently these are just unique integers, so I
would like to prefix all the document IDs of an index with its language and
timespan. I do not know how yet, but I believe this should be possible, as
it is a uniform operation that would not change the structure of the index.
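
To make the intent concrete, indexing with such composite IDs would look
roughly like this in SolrJ (the ZooKeeper address and collection name here
are just placeholders, not our actual setup):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CompositeIdExample {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble and collection name.
            CloudSolrServer server = new CloudSolrServer("zkhost:2181");
            server.setDefaultCollection("archive");

            SolrInputDocument doc = new SolrInputDocument();
            // The part before '!' is the shard key: every document with the
            // "es_201410" prefix hashes to the same shard.
            doc.addField("id", "es_201410!12345");
            server.add(doc);
            server.commit();
            server.shutdown();
        }
    }

With the compositeId router, Solr hashes the prefix before '!' to pick the
shard, so all Spanish/October-2014 documents would land together.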

Best,

Michal



 Best,
 Erick

 On Mon, Nov 10, 2014 at 11:50 AM, Michael Della Bitta
 michael.della.bi...@appinions.com wrote:
  Hi Michal,
 
  Is there a particular reason to shard your collections like that? If it
  was mainly for ease of operations, I'd consider just using CompositeId to
  prevent specific types of queries hotspotting particular nodes.
 
  If your ingest rate is fast, you might also consider making each
  collection an alias that points to many actual collections, and
  periodically closing off a collection and starting a new one. This
  prevents cache churn and the impact of large merges.
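
  A sketch of that alias scheme via the Collections API (the collection and
  alias names here are hypothetical):

      http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=docs&collections=docs_201410,docs_201411

  Queries against "docs" then fan out to the listed collections, and
  re-issuing CREATEALIAS with an updated list is how you close off an old
  month or add a new one.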
 
  Michael
 
 
 
  On 11/10/14 08:03, Michal Krajňanský wrote:
 
  Hi All,
 
  I have been working on a project that has long employed the Lucene indexer.
 
  Currently, the system implements a proprietary document routing and index
  plugging/unplugging on top of Lucene and of course contains a great
  body of indexes. Recently an idea came up to migrate from Lucene to
  SolrCloud, which appears to be more powerful than our proprietary system.
 
  Could you suggest the best way to seamlessly migrate the system to use
  SolrCloud, when reindexing is not an option?
 
  - all the existing indexes represent a single collection in terms of
  SolrCloud
  - the documents are organized in shards according to date (integer) and
  language (a possibly extensible discrete set)
  - the indexes are disjunct
 
  I have been able to convert the existing indexes to the newest Lucene
  version and plug them individually into SolrCloud. However, there is still
  the question of routing, sharding, etc.
 
  Any insight appreciated.
 
  Best,
 
 
  Michal Krajnansky
 
 



Re: Lucene to Solrcloud migration

2014-11-11 Thread Michal Krajňanský
Hm. So I found that one can update stored fields with the atomic update
operation; however, according to
http://stackoverflow.com/questions/19058795/it-is-possible-to-update-uniquekey-in-solr-4
this will not work for the uniqueKey field. So I guess with the compositeId
router I am out of luck.
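
For reference, the atomic update syntax that does work on ordinary stored
fields looks like this in SolrJ (the field name and value are made up):

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "12345");
    // "set" atomically replaces the stored value of the language field;
    // the uniqueKey field itself cannot be rewritten this way.
    doc.addField("language", java.util.Collections.singletonMap("set", "es"));
    server.add(doc);
    server.commit();

(assuming "server" is an already connected SolrServer and that all other
fields are stored, which atomic updates require)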

I have also been searching for a way to implement my own routing mechanism.
Anyway, this seems to be a cleaner solution -- I would not need to modify
the existing indexes, just compute the hash from other (stored) fields
rather than the document id. Can you confirm that this is possible? The
documentation is very sparse, however (I only found that it is possible to
specify a custom hash function).
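
From what I can tell, the closest built-in option is the implicit router,
where routing is driven by a designated field instead of a hash of the
uniqueKey -- something along these lines (the collection, shard and field
names are just my placeholders):

    http://localhost:8983/solr/admin/collections?action=CREATE&name=archive&router.name=implicit&shards=es_201410,en_201410&router.field=shard_label

Each document would then need a stored shard_label field naming its target
shard, while the existing document IDs could stay untouched.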

Best,

Michal

2014-11-11 16:48 GMT+01:00 Michael Della Bitta 
michael.della.bi...@appinions.com:

 Yeah, Erick confused me a bit too, but I think what he's talking about
 takes for granted that you'd have your various indexes directly set up as
 individual collections.

 If instead you're considering one big collection, or a few collections
 based on aggregations of your individual indexes, having big, multisharded
 collections using compositeId should work, unless there's a use case we're
 not discussing.

 Michael


Lucene to Solrcloud migration

2014-11-10 Thread Michal Krajňanský
Hi All,

I have been working on a project that has long employed the Lucene indexer.

Currently, the system implements a proprietary document routing and index
plugging/unplugging on top of Lucene and of course contains a great body of
indexes. Recently an idea came up to migrate from Lucene to SolrCloud,
which appears to be more powerful than our proprietary system.

Could you suggest the best way to seamlessly migrate the system to use
SolrCloud, when reindexing is not an option?

- all the existing indexes represent a single collection in terms of
SolrCloud
- the documents are organized in shards according to date (integer) and
language (a possibly extensible discrete set)
- the indexes are disjunct

I have been able to convert the existing indexes to the newest Lucene
version and plug them individually into SolrCloud. However, there is still
the question of routing, sharding, etc.

Any insight appreciated.

Best,


Michal Krajnansky


Re: Solrcloud replicas do not match

2014-11-08 Thread Michal Krajňanský
Hi Erick,

I found the issue to be related to my other question (about shared
solrconfig.xml) which you also answered.

Turns out that I had set the data.dir variable in solrconfig.xml to an
absolute path that coincided with a different index. So the replica tried
to be created there and something nasty probably happened. When I removed
the variable value, the replica started to be created where expected (and
it is growing in size appropriately).

During this recovery process (copying 60 GB of data), the Solr Admin
console is unusable, however. Is there anything I could do about this?

Thank you a lot,


Michal

2014-11-07 20:16 GMT+01:00 Erick Erickson erickerick...@gmail.com:

 How did you create the replica? Does the admin screen show it
 attached to the proper shard?

 What I'd do is set up my SolrCloud instance with (presumably)
 a single node (leader) and ensure my searches were working.
 Then (and only then) use the Collections API ADDREPLICA
 command. You should see your replica be updated and
 be good to go.
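
 For example, something along these lines (collection and shard names are
 placeholders):

     http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1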

 Best,
 Erick

 On Fri, Nov 7, 2014 at 9:13 AM, Michal Krajňanský
 michal.krajnan...@gmail.com wrote:
  Hi all,
 
 
  I have a SolrCloud setup with a manually created collection, with the
  index obtained via means other than Solr (the data come from Lucene).

  I created a replica for the index and expected to see the data being
  copied to the replica, which does not happen. In the Admin interface I
  see something like:
 
 
                        Version        Gen      Size
  Master (Searching)    1415379668601  5853288  60.13 GB
  Master (Replicable)   1415379668601  5853288  -
  Slave (Searching)     1415379668601  3        1.84 KB
 
  The versions seem to match. But obviously the replica only contains a
  handful of documents I indexed AFTER the replica was created.
 
  How do I replicate the documents that were already in the index? Or am I
  missing something?
 
  Best,
 
 
  Michal Krajnansky



Re: Solrcloud solrconfig.xml

2014-11-08 Thread Michal Krajňanský
Hi Erick,

Thank you for making this clearer (it helped me solve the issue with
replication I asked about in a different thread). However, I suspect I am
still doing something wrong.

I am running a single Tomcat instance with two instances of Solr.

The shared solrconfig.xml contains:

<dataDir>${solr.data.dir:data}</dataDir>

And the Tomcat contexts set solr/home as follows:

<Environment name="solr/home" type="java.lang.String"
    value=".../solrcloud/solr1" override="true" />
<Environment name="solr/home" type="java.lang.String"
    value=".../solrcloud/solr2" override="true" />

The directory structure is as follows:

.../solrcloud/solr1/solr.xml
.../solrcloud/solr1/core1
.../solrcloud/solr1/core1/core.properties
.../solrcloud/solr1/core1/data

.../solrcloud/solr2/solr.xml

After having issued ADDREPLICA on the collection managed by core1, I would
expect to see the new data dir under .../solrcloud/solr2/core2/data.
However, I have seen something like this (the core names were a little
different):

...

.../solrcloud/solr2/solr.xml
.../solrcloud/solr2/core2
.../solrcloud/solr2/core2/core.properties
.../solrcloud/data(!)

I.e. the new core's data dir was created relative to the parent solrcloud
folder, which leaves me confused...
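
A per-core override would seemingly sidestep the shared setting altogether:
with core discovery, each core.properties can apparently pin its own data
dir (a sketch -- the paths are just my layout):

    # .../solrcloud/solr2/core2/core.properties
    name=core2
    dataDir=data

With a relative dataDir like this, it should resolve against the core's own
instance directory instead of a shared absolute path.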

Best,

Michal Krajnansky


2014-11-07 19:59 GMT+01:00 Erick Erickson erickerick...@gmail.com:

 Each of those data dirs is relative to the instance in question.

 So if you're running on different machines, they're physically
 separate even though named identically.

 If you're running multiple nodes on a single machine a la the
 getting started docs, then each one is in its own directory
 (e.g. solr/node1, solr/node2) and since the dirs are relative
 to that directory, you get things like
 ..solr/node1/solr/gettingstarted_shard1_replica1/data
 ..solr/node2/solr/gettingstarted_shard1_replica1/data

 etc.

 Best,
 Erick

 On Fri, Nov 7, 2014 at 5:26 AM, Michal Krajňanský
 michal.krajnan...@gmail.com wrote:
  Hi Everyone,
 
 
  I am quite a bit confused about managing configuration files with
  ZooKeeper for running Solr in cloud mode.

  To be precise, I was able to upload the config files (schema.xml,
  solrconfig.xml) into ZooKeeper and run SolrCloud.
 
  What confuses me are properties like data.dir, or replication request
  handlers. It seems like these should be different for each of the servers
  in the cloud. So how does it work?
 
  (I googled to understand the matter, without success.)
 
 
  Best,
 
  Michal



Solrcloud solrconfig.xml

2014-11-07 Thread Michal Krajňanský
Hi Everyone,


I am quite a bit confused about managing configuration files with ZooKeeper
for running Solr in cloud mode.

To be precise, I was able to upload the config files (schema.xml,
solrconfig.xml) into ZooKeeper and run SolrCloud.

What confuses me are properties like data.dir, or replication request
handlers. It seems like these should be different for each of the servers
in the cloud. So how does it work?

(I googled to understand the matter, without success.)


Best,

Michal


Solrcloud replicas do not match

2014-11-07 Thread Michal Krajňanský
Hi all,


I have a SolrCloud setup with a manually created collection, with the index
obtained via means other than Solr (the data come from Lucene).

I created a replica for the index and expected to see the data being copied
to the replica, which does not happen. In the Admin interface I see
something like:


                      Version        Gen      Size
Master (Searching)    1415379668601  5853288  60.13 GB
Master (Replicable)   1415379668601  5853288  -
Slave (Searching)     1415379668601  3        1.84 KB

The versions seem to match. But obviously the replica only contains a
handful of documents I indexed AFTER the replica was created.

How do I replicate the documents that were already in the index? Or am I
missing something?

Best,


Michal Krajnansky


Solr slow startup

2014-11-03 Thread Michal Krajňanský
Dear All,


Sorry for the possibly newbie question as I have only recently started
experimenting with Solr and Solrcloud.


I am trying to import an index originally created with Lucene 2.x into
Solr 4.10. What I did was:

1. upgrade the index to version 3.x with IndexUpgrader
2. upgrade the index to version 4.x with IndexUpgrader (the commands are
sketched below)
3. created a schema for Solr and used the default solrconfig (with some
path changes)
4. successfully started Solr
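
The upgrade invocations were along these lines (the jar versions are from
memory):

    java -cp lucene-core-3.6.2.jar org.apache.lucene.index.IndexUpgrader /path/to/index
    java -cp lucene-core-4.10.2.jar org.apache.lucene.index.IndexUpgrader /path/to/index

IndexUpgrader rewrites every segment to the format of the Lucene version on
the classpath, which is why it takes one pass per major version.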

The sizes I am speaking about are in the tens of gigabytes, and the startup
times are 5-10 minutes.


I have read here:
https://wiki.apache.org/solr/SolrPerformanceProblems
that it possibly has something to do with the updateHandler, and enabled
autoCommit as suggested, however with no improvement.
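
The autoCommit section I enabled looks roughly like this in solrconfig.xml
(the values are the commonly suggested ones, nothing tuned to my setup):

    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

As I understand it, regular hard commits keep the transaction log short so
there is little to replay at startup -- but it made no difference here.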

Such a long startup time feels odd when Lucene itself seems to load the
same indexes in no time.

I would very much appreciate any help with this issue.


Best,


Michal Krajnansky


Re: Solr slow startup

2014-11-03 Thread Michal Krajňanský
Hey Yonik,

That (getting rid of the suggester) solved the issue! You saved me a lot of
time and nerves.
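
In case someone else runs into this: for me the fix was commenting out the
suggester blocks in the stock example solrconfig.xml, roughly these:

    <!--
    <searchComponent name="suggest" class="solr.SuggestComponent">
      ...
    </searchComponent>
    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
      ...
    </requestHandler>
    -->

so that the suggester index is no longer built while the core loads.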

Best,

Michal

2014-11-03 17:19 GMT+01:00 Yonik Seeley yo...@heliosearch.com:

 One possible cause of a slow startup with the default configs:
 https://issues.apache.org/jira/browse/SOLR-6679

 -Yonik
 http://heliosearch.org - native code faceting, facet functions,
 sub-facets, off-heap data


 On Mon, Nov 3, 2014 at 11:05 AM, Michal Krajňanský
 michal.krajnan...@gmail.com wrote:
  Dear All,
 
 
  Sorry for the possibly newbie question as I have only recently started
  experimenting with Solr and Solrcloud.
 
 
  I am trying to import an index originally created with Lucene 2.x into
  Solr 4.10. What I did was:
 
  1. upgrade the index to version 3.x with IndexUpgrader
  2. upgrade the index to version 4.x with IndexUpgrader
  3. created a schema for Solr and used the default solrconfig (with some
  path changes)
  4. successfully started Solr
 
  The sizes I am speaking about are in the tens of gigabytes, and the
  startup times are 5-10 minutes.
 
 
  I have read here:
  https://wiki.apache.org/solr/SolrPerformanceProblems
  that it possibly has something to do with the updateHandler, and enabled
  autoCommit as suggested, however with no improvement.
 
  Such a long startup time feels odd when Lucene itself seems to load the
  same indexes in no time.
 
  I would very much appreciate any help with this issue.
 
 
  Best,
 
 
  Michal Krajnansky