Re: Configuring the Distributed

2011-12-05 Thread Jamie Johnson
What does the version field need to look like?  Something like this?

   <field name="_version_" type="string" indexed="true" stored="true" required="true" />

On Sun, Dec 4, 2011 at 2:00 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote:
 You always want to use the distrib-update-chain. Eventually it will
 probably be part of the default chain and auto turn in zk mode.

 I'm working on this now...

 -Yonik
 http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-05 Thread Yonik Seeley
On Mon, Dec 5, 2011 at 9:21 AM, Jamie Johnson jej2...@gmail.com wrote:
 What does the version field need to look like?

It's in the example schema:
   <field name="_version_" type="long" indexed="true" stored="true"/>

-Yonik
http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-05 Thread Jamie Johnson
Thanks Yonik, must have just missed it.

A question about adding a new shard to the index.  I am definitely not
a hashing expert, but the goal is to have a uniform distribution of
buckets based on what we're hashing.  If that happens then our shards
would reach capacity at approximately the same time.  In this
situation I don't think splitting one shard would help us; we'd need to
split every shard to reduce the load on the burdened systems, right?
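
To make that concrete, here is a minimal sketch of range-partitioned hashing
(illustrative only - the shard names are made up and String.hashCode stands in
for Solr's actual hash): each shard owns a slice of the hash space, splitting a
shard cuts its slice at the midpoint, and with a uniform hash every shard fills
at about the same rate, so a full rebalance means splitting every slice.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only -- not SolrCloud's hashing code. Each shard owns an
    // inclusive slice [min, max] of the 32-bit hash space.
    public class HashRangeSketch {

        static class Range {
            final String shard; final int min; final int max;
            Range(String shard, int min, int max) { this.shard = shard; this.min = min; this.max = max; }
            public String toString() { return shard + "[" + min + "," + max + "]"; }
        }

        // Stand-in for Solr's hash of the unique key.
        static int hash(String id) { return id.hashCode(); }

        // A doc goes to whichever shard's range covers its hash.
        static String shardFor(String id, List<Range> ranges) {
            int h = hash(id);
            for (Range r : ranges) if (h >= r.min && h <= r.max) return r.shard;
            throw new IllegalStateException("hash " + h + " not covered");
        }

        // Splitting one shard: the lower half keeps the old name,
        // the upper half becomes the new shard.
        static List<Range> split(Range r, String newShard) {
            int mid = (int) (((long) r.min + (long) r.max) / 2);
            List<Range> halves = new ArrayList<Range>();
            halves.add(new Range(r.shard, r.min, mid));
            halves.add(new Range(newShard, mid + 1, r.max));
            return halves;
        }

        public static void main(String[] args) {
            List<Range> ranges = new ArrayList<Range>();
            ranges.add(new Range("shard1", Integer.MIN_VALUE, -1));
            ranges.add(new Range("shard2", 0, Integer.MAX_VALUE));
            System.out.println(shardFor("1", ranges)); // lands in shard2
            // Splitting just shard1 as a demo; per the discussion above, relieving
            // load everywhere would mean doing this for every range.
            ranges.addAll(split(ranges.remove(0), "shard3"));
            System.out.println(ranges);
        }
    }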

On Mon, Dec 5, 2011 at 9:45 AM, Yonik Seeley yo...@lucidimagination.com wrote:
 On Mon, Dec 5, 2011 at 9:21 AM, Jamie Johnson jej2...@gmail.com wrote:
 What does the version field need to look like?

 It's in the example schema:
   <field name="_version_" type="long" indexed="true" stored="true"/>

 -Yonik
 http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-05 Thread Yonik Seeley
On Mon, Dec 5, 2011 at 1:29 PM, Jamie Johnson jej2...@gmail.com wrote:
 In this
 situation I don't think splitting one shard would help us; we'd need to
 split every shard to reduce the load on the burdened systems, right?

Sure... but if you can split one, you can split them all :-)

-Yonik
http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-05 Thread Jamie Johnson
Yes completely agree, just wanted to make sure I wasn't missing the obvious :)

On Mon, Dec 5, 2011 at 1:39 PM, Yonik Seeley yo...@lucidimagination.com wrote:
 On Mon, Dec 5, 2011 at 1:29 PM, Jamie Johnson jej2...@gmail.com wrote:
 In this
 situation I don't think splitting one shard would help us; we'd need to
 split every shard to reduce the load on the burdened systems, right?

 Sure... but if you can split one, you can split them all :-)

 -Yonik
 http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-04 Thread Yonik Seeley
On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote:
 On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote:

 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?



 Hi Jaime - take a look at solrconfig-distrib-update.xml in
 solr/core/src/test-files

 You need to enable the update log, add an empty replication handler def,
 and an update chain with solr.DistributedUpdateProcessFactory in it.

One also needs an indexed _version_ field defined in schema.xml for
versioning to work.

-Yonik
http://www.lucidimagination.com
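
Putting Mark's three solrconfig.xml items and the _version_ field together, a
rough sketch of the additions might look like the following; the element layout
and the trailing Run processor are assumptions, so treat
solrconfig-distrib-update.xml in solr/core/src/test-files and the example
schema as the authority:

    <!-- solrconfig.xml: enable the update log -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- see the test config for the exact options (e.g. the log directory) -->
      <updateLog/>
    </updateHandler>

    <!-- solrconfig.xml: empty replication handler definition, used for recovery -->
    <requestHandler name="/replication" class="solr.ReplicationHandler" />

    <!-- solrconfig.xml: update chain containing the distributed update processor -->
    <updateRequestProcessorChain name="distrib-update-chain">
      <processor class="solr.DistributedUpdateProcessFactory" />
      <!-- a Run processor is assumed here so the chain actually indexes documents -->
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <!-- schema.xml: the indexed _version_ field -->
    <field name="_version_" type="long" indexed="true" stored="true" />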


Re: Configuring the Distributed

2011-12-04 Thread Yonik Seeley
On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote:
 You always want to use the distrib-update-chain. Eventually it will
 probably be part of the default chain and auto turn in zk mode.

I'm working on this now...

-Yonik
http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-03 Thread Mark Miller
bq. A few questions if a master goes down does a replica get
promoted?

Right - if the leader goes down there is a leader election and one of the
replicas takes over.

bq.  If a new shard needs to be added is it just a matter of
starting a new solr instance with a higher numShards?

Eventually, that's the plan.

The idea is, you say something like, I want 3 shards. Now if you start up 9
instances, the first 3 end up as shard leaders - the next 6 evenly come up
as replicas for each shard.

To change the numShards, we will need some kind of micro shards / splitting
/ rebalancing.

bq. Last question, how do you change numShards?

I think this is somewhat a work in progress, but I think Sami just made it
so that numShards is stored on the collection node in zk (along with which
config set to use). So you would change it there presumably. Or perhaps
 just start up a new server with an updated numShards property and then it
 would realize that it needs to be a new leader - of course then you'd want to
rebalance probably - unless you fired up enough servers to add replicas
too...

bq. Is that right?

Yup, sounds about right.


On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson jej2...@gmail.com wrote:

 So I just tried this out, seems like it does the things I asked about.

 Really really cool stuff, it's progressed quite a bit in the time
 since I took a snapshot of the branch.

 Last question, how do you change numShards?  Is there a command you
 can use to do this now? I understand there will be implications for
 the hashing algorithm, but once the hash ranges are stored in ZK (is
 there a separate JIRA for this or does this fall under 2358) I assume
 that it would be a relatively simple index split (JIRA 2595?) and
 updating the hash ranges in solr, essentially splitting the range
 between the new and existing shard.  Is that right?

 On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson jej2...@gmail.com wrote:
  I think I see it.so if I understand this correctly you specify
  numShards as a system property, as new nodes come up they check ZK to
  see if they should be a new shard or a replica based on if numShards
  is met.  A few questions if a master goes down does a replica get
  promoted?  If a new shard needs to be added is it just a matter of
  starting a new solr instance with a higher numShards?  (understanding
  that index rebalancing does not happen automatically now, but
  presumably it could).
 
  On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote:
  How does it determine the number of shards to create?  How many
  replicas to create?
 
  On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Ah, okay - you are setting the shards in solr.xml - thats still an
 option
  to force a node to a particular shard - but if you take that out,
 shards
  will be auto assigned.
 
  By the way, because of the version code, distrib deletes don't work at
 the
  moment - will get to that next week.
 
  - Mark
 
  On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  So I'm a fool.  I did set the numShards, the issue was so trivial it's
  embarrassing.  I did indeed have it setup as a replica, the shard
  names in solr.xml were both shard1.  This worked as I expected now.
 
  On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com
 wrote:
  
   They are unused params, so removing them wouldn't help anything.
  
   You might just want to wait till we are further along before playing
  with it.
  
   Or if you submit your full self contained test, I can see what's
 going
  on (eg its still unclear if you have started setting numShards?).
  
   I can do a similar set of actions in my tests and it works fine. The
  only reason I could see things working like this is if it thinks you
 have
  one shard - a leader and a replica.
  
   - Mark
  
   On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
  
   Glad to hear I don't need to set shards/self, but removing them
 didn't
   seem to change what I'm seeing.  Doing this still results in 2
   documents 1 on 8983 and 1 on 7574.
  
   String key = 1;
  
 SolrInputDocument solrDoc = new SolrInputDocument();
 solrDoc.setField(key, key);
  
 solrDoc.addField(content_mvtxt, initial value);
  
 SolrServer server = servers.get(
  http://localhost:8983/solr/collection1;);
  
 UpdateRequest ureq = new UpdateRequest();
 ureq.setParam(update.chain,
 distrib-update-chain);
 ureq.add(solrDoc);
 ureq.setAction(ACTION.COMMIT, true, true);
 server.request(ureq);
 server.commit();
  
 solrDoc = new SolrInputDocument();
 solrDoc.addField(key, key);
 solrDoc.addField(content_mvtxt, updated value);
  
 server = servers.get(
  http://localhost:7574/solr/collection1;);
  
 ureq = new UpdateRequest();

Re: Configuring the Distributed

2011-12-03 Thread Jamie Johnson
Again great stuff.  Once distributed update/delete works (sounds like
it's not far off) I'll have to reevaluate our current stack.

You had mentioned storing the shard hash assignments in ZK, is there a
JIRA around this?

I'll keep my eyes on the JIRA tickets.  Right now the distributed
update/delete are big ones for me; the rebalancing of the cluster is
a nice-to-have, but hopefully I won't need that capability anytime
in the near future.

One other question... my current setup has replication done on a
polling basis; I didn't notice that in the updated solrconfig.  How
does this work now?

On Sat, Dec 3, 2011 at 9:00 AM, Mark Miller markrmil...@gmail.com wrote:
 bq. A few questions if a master goes down does a replica get
 promoted?

 Right - if the leader goes down there is a leader election and one of the
 replicas takes over.

 bq.  If a new shard needs to be added is it just a matter of
 starting a new solr instance with a higher numShards?

 Eventually, that's the plan.

 The idea is, you say something like, I want 3 shards. Now if you start up 9
 instances, the first 3 end up as shard leaders - the next 6 evenly come up
 as replicas for each shard.

 To change the numShards, we will need some kind of micro shards / splitting
 / rebalancing.

 bq. Last question, how do you change numShards?

 I think this is somewhat a work in progress, but I think Sami just made it
 so that numShards is stored on the collection node in zk (along with which
 config set to use). So you would change it there presumably. Or perhaps
 just start up a new server with an update numShards property and then it
 would realize that needs to be a new leader - of course then you'd want to
 rebalance probably - unless you fired up enough servers to add replicas
 too...

 bq. Is that right?

 Yup, sounds about right.


 On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson jej2...@gmail.com wrote:

 So I just tried this out, seems like it does the things I asked about.

 Really really cool stuff, it's progressed quite a bit in the time
 since I took a snapshot of the branch.

 Last question, how do you change numShards?  Is there a command you
 can use to do this now? I understand there will be implications for
 the hashing algorithm, but once the hash ranges are stored in ZK (is
 there a separate JIRA for this or does this fall under 2358) I assume
 that it would be a relatively simple index split (JIRA 2595?) and
 updating the hash ranges in solr, essentially splitting the range
 between the new and existing shard.  Is that right?

 On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson jej2...@gmail.com wrote:
  I think I see it.so if I understand this correctly you specify
  numShards as a system property, as new nodes come up they check ZK to
  see if they should be a new shard or a replica based on if numShards
  is met.  A few questions if a master goes down does a replica get
  promoted?  If a new shard needs to be added is it just a matter of
  starting a new solr instance with a higher numShards?  (understanding
  that index rebalancing does not happen automatically now, but
  presumably it could).
 
  On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote:
  How does it determine the number of shards to create?  How many
  replicas to create?
 
  On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Ah, okay - you are setting the shards in solr.xml - thats still an
 option
  to force a node to a particular shard - but if you take that out,
 shards
  will be auto assigned.
 
  By the way, because of the version code, distrib deletes don't work at
 the
  moment - will get to that next week.
 
  - Mark
 
  On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  So I'm a fool.  I did set the numShards, the issue was so trivial it's
  embarrassing.  I did indeed have it setup as a replica, the shard
  names in solr.xml were both shard1.  This worked as I expected now.
 
  On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com
 wrote:
  
   They are unused params, so removing them wouldn't help anything.
  
   You might just want to wait till we are further along before playing
  with it.
  
   Or if you submit your full self contained test, I can see what's
 going
  on (eg its still unclear if you have started setting numShards?).
  
   I can do a similar set of actions in my tests and it works fine. The
  only reason I could see things working like this is if it thinks you
 have
  one shard - a leader and a replica.
  
   - Mark
  
   On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
  
   Glad to hear I don't need to set shards/self, but removing them
 didn't
   seem to change what I'm seeing.  Doing this still results in 2
   documents 1 on 8983 and 1 on 7574.
  
   String key = 1;
  
                 SolrInputDocument solrDoc = new SolrInputDocument();
                 solrDoc.setField(key, key);
  
                 solrDoc.addField(content_mvtxt, initial 

Re: Configuring the Distributed

2011-12-03 Thread Mark Miller
On Sat, Dec 3, 2011 at 1:31 PM, Jamie Johnson jej2...@gmail.com wrote:

 Again great stuff.  Once distributed update/delete works (sounds like
 it's not far off)


Yeah, I only realized it was not working with the Version code on Friday as
I started adding tests for it - the work to fix it is not too difficult.


 I'll have to reevaluate our current stack.

 You had mentioned storing the shard hash assignments in ZK, is there a
 JIRA around this?


I don't think there is specifically one for that yet - though it could just
be part of the index splitting JIRA issue I suppose.



 I'll keep my eyes on the JIRA tickets.  Right now the distributed
 update/delete are big ones for me; the rebalancing of the cluster is
 a nice-to-have, but hopefully I won't need that capability anytime
 in the near future.

 One other question... my current setup has replication done on a
 polling basis; I didn't notice that in the updated solrconfig.  How
 does this work now?


Replication is only used for recovery, because it doesn't work with Near
Realtime. We want SolrCloud to work with NRT, so currently the leader
versions documents and forwards the docs to the replicas (these will let us
do optimistic locking as well). If you send a doc to a replica instead, it's
first forwarded to the leader to get versioned. The SolrCloud solrj client
will likely be smart enough to just send to the leader first.

You need the replication handler defined for recovery - when a replica goes
down and then comes back up, it starts buffering updates and replicates from
the leader - then it applies the buffered updates and ends up current with
the leader.

- Mark
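
As a rough mental model of the leader-assigned versioning Mark describes (a
toy sketch, not Solr's implementation): the leader stamps each update with an
increasing version, and a replica applies an update only if it is newer than
what it already has for that id - the same property that enables optimistic
locking.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicLong;

    // Toy model only -- Solr's real versioning lives in the update log and the
    // distributed update processor; names here are made up for illustration.
    public class VersioningSketch {

        // Leader side: hand out strictly increasing versions.
        static final AtomicLong clock = new AtomicLong();
        static long nextVersion() { return clock.incrementAndGet(); }

        // Replica side: highest version applied per document id.
        static final Map<String, Long> applied = new HashMap<String, Long>();

        // Apply an update only if it is newer than what we already have.
        static boolean apply(String id, long version) {
            Long current = applied.get(id);
            if (current != null && current >= version) return false; // stale, drop it
            applied.put(id, version);
            return true;
        }

        public static void main(String[] args) {
            long v1 = nextVersion();
            long v2 = nextVersion();
            System.out.println(apply("1", v2)); // true: newest version wins
            System.out.println(apply("1", v1)); // false: out-of-date update is ignored
        }
    }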



 On Sat, Dec 3, 2011 at 9:00 AM, Mark Miller markrmil...@gmail.com wrote:
  bq. A few questions if a master goes down does a replica get
  promoted?
 
  Right - if the leader goes down there is a leader election and one of the
  replicas takes over.
 
  bq.  If a new shard needs to be added is it just a matter of
  starting a new solr instance with a higher numShards?
 
  Eventually, that's the plan.
 
  The idea is, you say something like, I want 3 shards. Now if you start
 up 9
  instances, the first 3 end up as shard leaders - the next 6 evenly come
 up
  as replicas for each shard.
 
  To change the numShards, we will need some kind of micro shards /
 splitting
  / rebalancing.
 
  bq. Last question, how do you change numShards?
 
  I think this is somewhat a work in progress, but I think Sami just made
 it
  so that numShards is stored on the collection node in zk (along with
 which
  config set to use). So you would change it there presumably. Or perhaps
  just start up a new server with an update numShards property and then it
  would realize that needs to be a new leader - of course then you'd want
 to
  rebalance probably - unless you fired up enough servers to add replicas
  too...
 
  bq. Is that right?
 
  Yup, sounds about right.
 
 
  On Fri, Dec 2, 2011 at 10:59 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  So I just tried this out, seems like it does the things I asked about.
 
  Really really cool stuff, it's progressed quite a bit in the time
  since I took a snapshot of the branch.
 
  Last question, how do you change numShards?  Is there a command you
  can use to do this now? I understand there will be implications for
  the hashing algorithm, but once the hash ranges are stored in ZK (is
  there a separate JIRA for this or does this fall under 2358) I assume
  that it would be a relatively simple index split (JIRA 2595?) and
  updating the hash ranges in solr, essentially splitting the range
  between the new and existing shard.  Is that right?
 
  On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson jej2...@gmail.com
 wrote:
   I think I see it.so if I understand this correctly you specify
   numShards as a system property, as new nodes come up they check ZK to
   see if they should be a new shard or a replica based on if numShards
   is met.  A few questions if a master goes down does a replica get
   promoted?  If a new shard needs to be added is it just a matter of
   starting a new solr instance with a higher numShards?  (understanding
   that index rebalancing does not happen automatically now, but
   presumably it could).
  
   On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com
 wrote:
   How does it determine the number of shards to create?  How many
   replicas to create?
  
   On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com
  wrote:
   Ah, okay - you are setting the shards in solr.xml - thats still an
  option
   to force a node to a particular shard - but if you take that out,
  shards
   will be auto assigned.
  
   By the way, because of the version code, distrib deletes don't work
 at
  the
   moment - will get to that next week.
  
   - Mark
  
   On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com
  wrote:
  
   So I'm a fool.  I did set the numShards, the issue was so trivial
 it's
   embarrassing.  I 

Re: Configuring the Distributed

2011-12-02 Thread Mark Miller
So I dunno. You are running a zk server and running in zk mode right?

You don't need to / shouldn't set a shards or self param. The shards are
figured out from Zookeeper.

You always want to use the distrib-update-chain. Eventually it will
probably be part of the default chain and turn on automatically in zk mode.

If you are running in zk mode attached to a zk server, this should work no
problem. You can add docs to any server and they will be forwarded to the
correct shard leader and then versioned and forwarded to replicas.

You can also use the CloudSolrServer solrj client - that way you don't even
have to choose a server to send docs to (in which case, if it went down,
you would have to choose another manually) - CloudSolrServer automatically
finds one that is up through ZooKeeper. Eventually it will also be smart
and do the hashing itself so that it can send directly to the shard leader
that the doc would be forwarded to anyway.

- Mark
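
For a concrete picture, here is a minimal SolrJ sketch along the lines Mark
describes, sending a document through the distrib-update-chain via
CloudSolrServer; the ZooKeeper address and collection name are placeholders,
and setDefaultCollection is an assumption about the client API rather than
something confirmed in this thread.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest.ACTION;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch only: assumes a ZooKeeper ensemble at localhost:9983 and a
    // collection named "collection1"; check the branch for the exact API.
    public class CloudClientSketch {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("localhost:9983");
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("key", "1");
            doc.addField("content_mvtxt", "initial value");

            // Route the update through the distributed chain, as recommended above.
            UpdateRequest ureq = new UpdateRequest();
            ureq.setParam("update.chain", "distrib-update-chain");
            ureq.add(doc);
            ureq.setAction(ACTION.COMMIT, true, true);
            server.request(ureq);

            server.shutdown();
        }
    }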

On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote:

 Really just trying to do a simple add and update test, the chain
 missing is just proof of my not understanding exactly how this is
 supposed to work.  I modified the code to this

 String key = "1";

 SolrInputDocument solrDoc = new SolrInputDocument();
 solrDoc.setField("key", key);
 solrDoc.addField("content_mvtxt", "initial value");

 SolrServer server = servers.get("http://localhost:8983/solr/collection1");

 UpdateRequest ureq = new UpdateRequest();
 ureq.setParam("update.chain", "distrib-update-chain");
 ureq.add(solrDoc);
 ureq.setParam("shards",
         "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
 ureq.setParam("self", "foo");
 ureq.setAction(ACTION.COMMIT, true, true);
 server.request(ureq);
 server.commit();

 solrDoc = new SolrInputDocument();
 solrDoc.addField("key", key);
 solrDoc.addField("content_mvtxt", "updated value");

 server = servers.get("http://localhost:7574/solr/collection1");

 ureq = new UpdateRequest();
 ureq.setParam("update.chain", "distrib-update-chain");
 // ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
 ureq.add(solrDoc);
 ureq.setParam("shards",
         "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
 ureq.setParam("self", "foo");
 ureq.setAction(ACTION.COMMIT, true, true);
 server.request(ureq);
 // server.add(solrDoc);
 server.commit();

 server = servers.get("http://localhost:8983/solr/collection1");
 server.commit();
 System.out.println("done");

 but I'm still seeing the doc appear on both shards.  After the first
 commit I see the doc on 8983 with the initial value.  After the second
 commit I see the updated value on 7574 and the old one on 8983.  After the
 final commit the doc on 8983 gets updated.

 Is there something wrong with my test?

 On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Getting late - didn't really pay attention to your code I guess - why
 are you adding the first doc without specifying the distrib update chain?
 This is not really supported. It's going to just go to the server you
 specified - even with everything setup right, the update might then go to
 that same server or the other one depending on how it hashes. You really
 want to just always use the distrib update chain.  I guess I don't yet
 understand what you are trying to test.
 
  Sent from my iPad
 
  On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote:
 
  Not sure offhand - but things will be funky if you don't specify the
 correct numShards.
 
  The instance to shard assignment should be using numShards to assign.
 But then the hash to shard mapping actually goes on the number of shards it
 finds registered in ZK (it doesn't have to, but really these should be
 equal).
 
  So basically you are saying, I want 3 partitions, but you are only
 starting up 2 nodes, and the code is just not happy about that I'd guess.
 For the system to work properly, you have to fire up at least as many
 servers as numShards.
 
  What are you trying to do? 2 partitions with no replicas, or one
 partition with one replica?
 
  In either case, I think you will have better luck if you fire up at
 least as many servers as the numShards setting. Or lower the numShards
 setting.
 
  This is all a work in progress by the way - what you are trying to test
 should work if things are setup right though.
 
  - Mark
 
 
  On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
 
  Thanks for the quick response.  With that change (have not done
  numShards yet) shard1 got updated.  But now 

Re: Configuring the Distributed

2011-12-02 Thread Jamie Johnson
Glad to hear I don't need to set shards/self, but removing them didn't
seem to change what I'm seeing.  Doing this still results in 2
documents: 1 on 8983 and 1 on 7574.

String key = "1";

SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("key", key);

solrDoc.addField("content_mvtxt", "initial value");

SolrServer server = servers.get("http://localhost:8983/solr/collection1");

UpdateRequest ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField("key", key);
solrDoc.addField("content_mvtxt", "updated value");

server = servers.get("http://localhost:7574/solr/collection1");

ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

server = servers.get("http://localhost:8983/solr/collection1");

server.commit();
System.out.println("done");

On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote:
 So I dunno. You are running a zk server and running in zk mode right?

 You don't need to / shouldn't set a shards or self param. The shards are
 figured out from Zookeeper.

 You always want to use the distrib-update-chain. Eventually it will
 probably be part of the default chain and auto turn in zk mode.

 If you are running in zk mode attached to a zk server, this should work no
 problem. You can add docs to any server and they will be forwarded to the
 correct shard leader and then versioned and forwarded to replicas.

 You can also use the CloudSolrServer solrj client - that way you don't even
 have to choose a server to send docs too - in which case if it went down
 you would have to choose another manually - CloudSolrServer automatically
 finds one that is up through ZooKeeper. Eventually it will also be smart
 and do the hashing itself so that it can send directly to the shard leader
 that the doc would be forwarded to anyway.

 - Mark

 On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote:

 Really just trying to do a simple add and update test, the chain
 missing is just proof of my not understanding exactly how this is
 supposed to work.  I modified the code to this

                String key = 1;

                SolrInputDocument solrDoc = new SolrInputDocument();
                solrDoc.setField(key, key);

                 solrDoc.addField(content_mvtxt, initial value);

                SolrServer server = servers
                                .get(
 http://localhost:8983/solr/collection1;);

                 UpdateRequest ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                ureq.add(solrDoc);
                ureq.setParam(shards,

  localhost:8983/solr/collection1,localhost:7574/solr/collection1);
                ureq.setParam(self, foo);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                 server.commit();

                solrDoc = new SolrInputDocument();
                solrDoc.addField(key, key);
                 solrDoc.addField(content_mvtxt, updated value);

                server = servers.get(
 http://localhost:7574/solr/collection1;);

                 ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                 // ureq.deleteById(8060a9eb-9546-43ee-95bb-d18ea26a6285);
                 ureq.add(solrDoc);
                ureq.setParam(shards,

  localhost:8983/solr/collection1,localhost:7574/solr/collection1);
                ureq.setParam(self, foo);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                 // server.add(solrDoc);
                server.commit();
                server = servers.get(
 http://localhost:8983/solr/collection1;);


                server.commit();
                System.out.println(done);

 but I'm still seeing the doc appear on both shards.    After the first
 commit I see the doc on 8983 with initial value.  after the second
 commit I see the updated value on 7574 and the old on 8983.  After the
 final commit the doc on 8983 gets updated.

 Is there something wrong with my test?

 On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Getting late - didn't really pay attention to your code I guess - why
 are you adding the first doc without specifying the distrib update chain?
 This is not really 

Re: Configuring the Distributed

2011-12-02 Thread Mark Miller

They are unused params, so removing them wouldn't help anything.

You might just want to wait till we are further along before playing with it.

Or if you submit your full self-contained test, I can see what's going on (e.g.
it's still unclear if you have started setting numShards?).

I can do a similar set of actions in my tests and it works fine. The only 
reason I could see things working like this is if it thinks you have one shard 
- a leader and a replica.

- Mark

On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:

 Glad to hear I don't need to set shards/self, but removing them didn't
 seem to change what I'm seeing.  Doing this still results in 2
 documents 1 on 8983 and 1 on 7574.
 
 String key = 1;
 
   SolrInputDocument solrDoc = new SolrInputDocument();
   solrDoc.setField(key, key);
 
   solrDoc.addField(content_mvtxt, initial value);
 
   SolrServer server = 
 servers.get(http://localhost:8983/solr/collection1;);
 
   UpdateRequest ureq = new UpdateRequest();
   ureq.setParam(update.chain, distrib-update-chain);
   ureq.add(solrDoc);
   ureq.setAction(ACTION.COMMIT, true, true);
   server.request(ureq);
   server.commit();
 
   solrDoc = new SolrInputDocument();
   solrDoc.addField(key, key);
   solrDoc.addField(content_mvtxt, updated value);
 
   server = servers.get(http://localhost:7574/solr/collection1;);
 
   ureq = new UpdateRequest();
   ureq.setParam(update.chain, distrib-update-chain);
   ureq.add(solrDoc);
   ureq.setAction(ACTION.COMMIT, true, true);
   server.request(ureq);
   server.commit();
 
   server = servers.get(http://localhost:8983/solr/collection1;);
   
 
   server.commit();
   System.out.println(done);
 
 On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote:
 So I dunno. You are running a zk server and running in zk mode right?
 
 You don't need to / shouldn't set a shards or self param. The shards are
 figured out from Zookeeper.
 
 You always want to use the distrib-update-chain. Eventually it will
 probably be part of the default chain and auto turn in zk mode.
 
 If you are running in zk mode attached to a zk server, this should work no
 problem. You can add docs to any server and they will be forwarded to the
 correct shard leader and then versioned and forwarded to replicas.
 
 You can also use the CloudSolrServer solrj client - that way you don't even
 have to choose a server to send docs too - in which case if it went down
 you would have to choose another manually - CloudSolrServer automatically
 finds one that is up through ZooKeeper. Eventually it will also be smart
 and do the hashing itself so that it can send directly to the shard leader
 that the doc would be forwarded to anyway.
 
 - Mark
 
 On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote:
 
 Really just trying to do a simple add and update test, the chain
 missing is just proof of my not understanding exactly how this is
 supposed to work.  I modified the code to this
 
String key = 1;
 
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField(key, key);
 
 solrDoc.addField(content_mvtxt, initial value);
 
SolrServer server = servers
.get(
 http://localhost:8983/solr/collection1;);
 
 UpdateRequest ureq = new UpdateRequest();
ureq.setParam(update.chain, distrib-update-chain);
ureq.add(solrDoc);
ureq.setParam(shards,
 
  localhost:8983/solr/collection1,localhost:7574/solr/collection1);
ureq.setParam(self, foo);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
 server.commit();
 
solrDoc = new SolrInputDocument();
solrDoc.addField(key, key);
 solrDoc.addField(content_mvtxt, updated value);
 
server = servers.get(
 http://localhost:7574/solr/collection1;);
 
 ureq = new UpdateRequest();
ureq.setParam(update.chain, distrib-update-chain);
 // ureq.deleteById(8060a9eb-9546-43ee-95bb-d18ea26a6285);
 ureq.add(solrDoc);
ureq.setParam(shards,
 
  localhost:8983/solr/collection1,localhost:7574/solr/collection1);
ureq.setParam(self, foo);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
 // server.add(solrDoc);
server.commit();
server = servers.get(
 http://localhost:8983/solr/collection1;);
 
 
server.commit();
System.out.println(done);
 
 but I'm 

Re: Configuring the Distributed

2011-12-02 Thread Jamie Johnson
So I'm a fool.  I did set numShards; the issue was so trivial it's
embarrassing.  I did indeed have it set up as a replica - the shard
names in solr.xml were both shard1.  This works as I expected now.

On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote:

 They are unused params, so removing them wouldn't help anything.

 You might just want to wait till we are further along before playing with it.

 Or if you submit your full self contained test, I can see what's going on (eg 
 its still unclear if you have started setting numShards?).

 I can do a similar set of actions in my tests and it works fine. The only 
 reason I could see things working like this is if it thinks you have one 
 shard - a leader and a replica.

 - Mark

 On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:

 Glad to hear I don't need to set shards/self, but removing them didn't
 seem to change what I'm seeing.  Doing this still results in 2
 documents 1 on 8983 and 1 on 7574.

 String key = 1;

               SolrInputDocument solrDoc = new SolrInputDocument();
               solrDoc.setField(key, key);

               solrDoc.addField(content_mvtxt, initial value);

               SolrServer server = 
 servers.get(http://localhost:8983/solr/collection1;);

               UpdateRequest ureq = new UpdateRequest();
               ureq.setParam(update.chain, distrib-update-chain);
               ureq.add(solrDoc);
               ureq.setAction(ACTION.COMMIT, true, true);
               server.request(ureq);
               server.commit();

               solrDoc = new SolrInputDocument();
               solrDoc.addField(key, key);
               solrDoc.addField(content_mvtxt, updated value);

               server = servers.get(http://localhost:7574/solr/collection1;);

               ureq = new UpdateRequest();
               ureq.setParam(update.chain, distrib-update-chain);
               ureq.add(solrDoc);
               ureq.setAction(ACTION.COMMIT, true, true);
               server.request(ureq);
               server.commit();

               server = servers.get(http://localhost:8983/solr/collection1;);


               server.commit();
               System.out.println(done);

 On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com wrote:
 So I dunno. You are running a zk server and running in zk mode right?

 You don't need to / shouldn't set a shards or self param. The shards are
 figured out from Zookeeper.

 You always want to use the distrib-update-chain. Eventually it will
 probably be part of the default chain and auto turn in zk mode.

 If you are running in zk mode attached to a zk server, this should work no
 problem. You can add docs to any server and they will be forwarded to the
 correct shard leader and then versioned and forwarded to replicas.

 You can also use the CloudSolrServer solrj client - that way you don't even
 have to choose a server to send docs too - in which case if it went down
 you would have to choose another manually - CloudSolrServer automatically
 finds one that is up through ZooKeeper. Eventually it will also be smart
 and do the hashing itself so that it can send directly to the shard leader
 that the doc would be forwarded to anyway.

 - Mark

 On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com wrote:

 Really just trying to do a simple add and update test, the chain
 missing is just proof of my not understanding exactly how this is
 supposed to work.  I modified the code to this

                String key = 1;

                SolrInputDocument solrDoc = new SolrInputDocument();
                solrDoc.setField(key, key);

                 solrDoc.addField(content_mvtxt, initial value);

                SolrServer server = servers
                                .get(
 http://localhost:8983/solr/collection1;);

                 UpdateRequest ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                ureq.add(solrDoc);
                ureq.setParam(shards,

  localhost:8983/solr/collection1,localhost:7574/solr/collection1);
                ureq.setParam(self, foo);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                 server.commit();

                solrDoc = new SolrInputDocument();
                solrDoc.addField(key, key);
                 solrDoc.addField(content_mvtxt, updated value);

                server = servers.get(
 http://localhost:7574/solr/collection1;);

                 ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                 // ureq.deleteById(8060a9eb-9546-43ee-95bb-d18ea26a6285);
                 ureq.add(solrDoc);
                ureq.setParam(shards,

  localhost:8983/solr/collection1,localhost:7574/solr/collection1);
                ureq.setParam(self, foo);
                ureq.setAction(ACTION.COMMIT, true, true);
                

Re: Configuring the Distributed

2011-12-02 Thread Mark Miller
Ah, okay - you are setting the shards in solr.xml - that's still an option
to force a node to a particular shard - but if you take that out, shards
will be auto-assigned.

By the way, because of the version code, distrib deletes don't work at the
moment - will get to that next week.

- Mark

On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote:

 So I'm a fool.  I did set the numShards, the issue was so trivial it's
 embarrassing.  I did indeed have it setup as a replica, the shard
 names in solr.xml were both shard1.  This worked as I expected now.

 On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote:
 
  They are unused params, so removing them wouldn't help anything.
 
  You might just want to wait till we are further along before playing
 with it.
 
  Or if you submit your full self contained test, I can see what's going
 on (eg its still unclear if you have started setting numShards?).
 
  I can do a similar set of actions in my tests and it works fine. The
 only reason I could see things working like this is if it thinks you have
 one shard - a leader and a replica.
 
  - Mark
 
  On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
 
  Glad to hear I don't need to set shards/self, but removing them didn't
  seem to change what I'm seeing.  Doing this still results in 2
  documents 1 on 8983 and 1 on 7574.
 
  String key = 1;
 
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField(key, key);
 
solrDoc.addField(content_mvtxt, initial value);
 
SolrServer server = servers.get(
 http://localhost:8983/solr/collection1;);
 
UpdateRequest ureq = new UpdateRequest();
ureq.setParam(update.chain, distrib-update-chain);
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();
 
solrDoc = new SolrInputDocument();
solrDoc.addField(key, key);
solrDoc.addField(content_mvtxt, updated value);
 
server = servers.get(
 http://localhost:7574/solr/collection1;);
 
ureq = new UpdateRequest();
ureq.setParam(update.chain, distrib-update-chain);
ureq.add(solrDoc);
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();
 
server = servers.get(
 http://localhost:8983/solr/collection1;);
 
 
server.commit();
System.out.println(done);
 
  On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com
 wrote:
  So I dunno. You are running a zk server and running in zk mode right?
 
  You don't need to / shouldn't set a shards or self param. The shards
 are
  figured out from Zookeeper.
 
  You always want to use the distrib-update-chain. Eventually it will
  probably be part of the default chain and auto turn in zk mode.
 
  If you are running in zk mode attached to a zk server, this should
 work no
  problem. You can add docs to any server and they will be forwarded to
 the
  correct shard leader and then versioned and forwarded to replicas.
 
  You can also use the CloudSolrServer solrj client - that way you don't
 even
  have to choose a server to send docs too - in which case if it went
 down
  you would have to choose another manually - CloudSolrServer
 automatically
  finds one that is up through ZooKeeper. Eventually it will also be
 smart
  and do the hashing itself so that it can send directly to the shard
 leader
  that the doc would be forwarded to anyway.
 
  - Mark
 
  On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  Really just trying to do a simple add and update test, the chain
  missing is just proof of my not understanding exactly how this is
  supposed to work.  I modified the code to this
 
 String key = 1;
 
 SolrInputDocument solrDoc = new SolrInputDocument();
 solrDoc.setField(key, key);
 
  solrDoc.addField(content_mvtxt, initial value);
 
 SolrServer server = servers
 .get(
  http://localhost:8983/solr/collection1;);
 
  UpdateRequest ureq = new UpdateRequest();
 ureq.setParam(update.chain, distrib-update-chain);
 ureq.add(solrDoc);
 ureq.setParam(shards,
 
   localhost:8983/solr/collection1,localhost:7574/solr/collection1);
 ureq.setParam(self, foo);
 ureq.setAction(ACTION.COMMIT, true, true);
 server.request(ureq);
  server.commit();
 
 solrDoc = new SolrInputDocument();
 solrDoc.addField(key, key);
  solrDoc.addField(content_mvtxt, updated value);
 
 server = servers.get(
  

Re: Configuring the Distributed

2011-12-02 Thread Jamie Johnson
How does it determine the number of shards to create?  How many
replicas to create?

On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote:
 Ah, okay - you are setting the shards in solr.xml - thats still an option
 to force a node to a particular shard - but if you take that out, shards
 will be auto assigned.

 By the way, because of the version code, distrib deletes don't work at the
 moment - will get to that next week.

 - Mark

 On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote:

 So I'm a fool.  I did set the numShards, the issue was so trivial it's
 embarrassing.  I did indeed have it setup as a replica, the shard
 names in solr.xml were both shard1.  This worked as I expected now.

 On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote:
 
  They are unused params, so removing them wouldn't help anything.
 
  You might just want to wait till we are further along before playing
 with it.
 
  Or if you submit your full self contained test, I can see what's going
 on (eg its still unclear if you have started setting numShards?).
 
  I can do a similar set of actions in my tests and it works fine. The
 only reason I could see things working like this is if it thinks you have
 one shard - a leader and a replica.
 
  - Mark
 
  On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
 
  Glad to hear I don't need to set shards/self, but removing them didn't
  seem to change what I'm seeing.  Doing this still results in 2
  documents 1 on 8983 and 1 on 7574.
 
  String key = 1;
 
                SolrInputDocument solrDoc = new SolrInputDocument();
                solrDoc.setField(key, key);
 
                solrDoc.addField(content_mvtxt, initial value);
 
                SolrServer server = servers.get(
 http://localhost:8983/solr/collection1;);
 
                UpdateRequest ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                ureq.add(solrDoc);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                server.commit();
 
                solrDoc = new SolrInputDocument();
                solrDoc.addField(key, key);
                solrDoc.addField(content_mvtxt, updated value);
 
                server = servers.get(
 http://localhost:7574/solr/collection1;);
 
                ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                ureq.add(solrDoc);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                server.commit();
 
                server = servers.get(
 http://localhost:8983/solr/collection1;);
 
 
                server.commit();
                System.out.println(done);
 
  On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com
 wrote:
  So I dunno. You are running a zk server and running in zk mode right?
 
  You don't need to / shouldn't set a shards or self param. The shards
 are
  figured out from Zookeeper.
 
  You always want to use the distrib-update-chain. Eventually it will
  probably be part of the default chain and auto turn in zk mode.
 
  If you are running in zk mode attached to a zk server, this should
 work no
  problem. You can add docs to any server and they will be forwarded to
 the
  correct shard leader and then versioned and forwarded to replicas.
 
  You can also use the CloudSolrServer solrj client - that way you don't
 even
  have to choose a server to send docs too - in which case if it went
 down
  you would have to choose another manually - CloudSolrServer
 automatically
  finds one that is up through ZooKeeper. Eventually it will also be
 smart
  and do the hashing itself so that it can send directly to the shard
 leader
  that the doc would be forwarded to anyway.
 
  - Mark
 
  On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  Really just trying to do a simple add and update test, the chain
  missing is just proof of my not understanding exactly how this is
  supposed to work.  I modified the code to this
 
                 String key = 1;
 
                 SolrInputDocument solrDoc = new SolrInputDocument();
                 solrDoc.setField(key, key);
 
                  solrDoc.addField(content_mvtxt, initial value);
 
                 SolrServer server = servers
                                 .get(
  http://localhost:8983/solr/collection1;);
 
                  UpdateRequest ureq = new UpdateRequest();
                 ureq.setParam(update.chain, distrib-update-chain);
                 ureq.add(solrDoc);
                 ureq.setParam(shards,
 
   localhost:8983/solr/collection1,localhost:7574/solr/collection1);
                 ureq.setParam(self, foo);
                 ureq.setAction(ACTION.COMMIT, true, true);
                 server.request(ureq);
                  server.commit();
 
                 solrDoc = new SolrInputDocument();
        

Re: Configuring the Distributed

2011-12-02 Thread Jamie Johnson
I think I see it... so if I understand this correctly, you specify
numShards as a system property; as new nodes come up they check ZK to
see if they should be a new shard or a replica based on whether numShards
is met.  A few questions: if a master goes down, does a replica get
promoted?  If a new shard needs to be added, is it just a matter of
starting a new solr instance with a higher numShards?  (Understanding
that index rebalancing does not happen automatically now, but
presumably it could.)

On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote:
 How does it determine the number of shards to create?  How many
 replicas to create?

 On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote:
 Ah, okay - you are setting the shards in solr.xml - thats still an option
 to force a node to a particular shard - but if you take that out, shards
 will be auto assigned.

 By the way, because of the version code, distrib deletes don't work at the
 moment - will get to that next week.

 - Mark

 On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote:

 So I'm a fool.  I did set the numShards, the issue was so trivial it's
 embarrassing.  I did indeed have it setup as a replica, the shard
 names in solr.xml were both shard1.  This worked as I expected now.

 On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote:
 
  They are unused params, so removing them wouldn't help anything.
 
  You might just want to wait till we are further along before playing
 with it.
 
  Or if you submit your full self contained test, I can see what's going
 on (eg its still unclear if you have started setting numShards?).
 
  I can do a similar set of actions in my tests and it works fine. The
 only reason I could see things working like this is if it thinks you have
 one shard - a leader and a replica.
 
  - Mark
 
  On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
 
  Glad to hear I don't need to set shards/self, but removing them didn't
  seem to change what I'm seeing.  Doing this still results in 2
  documents 1 on 8983 and 1 on 7574.
 
  String key = 1;
 
                SolrInputDocument solrDoc = new SolrInputDocument();
                solrDoc.setField(key, key);
 
                solrDoc.addField(content_mvtxt, initial value);
 
                SolrServer server = servers.get(
 http://localhost:8983/solr/collection1;);
 
                UpdateRequest ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                ureq.add(solrDoc);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                server.commit();
 
                solrDoc = new SolrInputDocument();
                solrDoc.addField(key, key);
                solrDoc.addField(content_mvtxt, updated value);
 
                server = servers.get(
 http://localhost:7574/solr/collection1;);
 
                ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                ureq.add(solrDoc);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                server.commit();
 
                server = servers.get(
 http://localhost:8983/solr/collection1;);
 
 
                server.commit();
                System.out.println(done);
 
  On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com
 wrote:
  So I dunno. You are running a zk server and running in zk mode right?
 
  You don't need to / shouldn't set a shards or self param. The shards
 are
  figured out from Zookeeper.
 
  You always want to use the distrib-update-chain. Eventually it will
  probably be part of the default chain and auto turn in zk mode.
 
  If you are running in zk mode attached to a zk server, this should
 work no
  problem. You can add docs to any server and they will be forwarded to
 the
  correct shard leader and then versioned and forwarded to replicas.
 
  You can also use the CloudSolrServer solrj client - that way you don't
 even
  have to choose a server to send docs too - in which case if it went
 down
  you would have to choose another manually - CloudSolrServer
 automatically
  finds one that is up through ZooKeeper. Eventually it will also be
 smart
  and do the hashing itself so that it can send directly to the shard
 leader
  that the doc would be forwarded to anyway.
 
  - Mark
 
  On Fri, Dec 2, 2011 at 12:09 AM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  Really just trying to do a simple add and update test, the chain
  missing is just proof of my not understanding exactly how this is
  supposed to work.  I modified the code to this
 
                 String key = 1;
 
                 SolrInputDocument solrDoc = new SolrInputDocument();
                 solrDoc.setField(key, key);
 
                  solrDoc.addField(content_mvtxt, initial value);
 
                 SolrServer server = servers
                              

Re: Configuring the Distributed

2011-12-02 Thread Jamie Johnson
So I just tried this out, seems like it does the things I asked about.

Really really cool stuff, it's progressed quite a bit in the time
since I took a snapshot of the branch.

Last question: how do you change numShards?  Is there a command you
can use to do this now?  I understand there will be implications for
the hashing algorithm, but once the hash ranges are stored in ZK (is
there a separate JIRA for this, or does this fall under 2358?) I assume
it would be a relatively simple matter of an index split (JIRA 2595?) and
updating the hash ranges in Solr, essentially splitting the range
between the new and existing shard.  Is that right?

On Fri, Dec 2, 2011 at 10:08 PM, Jamie Johnson jej2...@gmail.com wrote:
 I think I see it.so if I understand this correctly you specify
 numShards as a system property, as new nodes come up they check ZK to
 see if they should be a new shard or a replica based on if numShards
 is met.  A few questions if a master goes down does a replica get
 promoted?  If a new shard needs to be added is it just a matter of
 starting a new solr instance with a higher numShards?  (understanding
 that index rebalancing does not happen automatically now, but
 presumably it could).

 On Fri, Dec 2, 2011 at 9:56 PM, Jamie Johnson jej2...@gmail.com wrote:
 How does it determine the number of shards to create?  How many
 replicas to create?

 On Fri, Dec 2, 2011 at 4:30 PM, Mark Miller markrmil...@gmail.com wrote:
 Ah, okay - you are setting the shards in solr.xml - thats still an option
 to force a node to a particular shard - but if you take that out, shards
 will be auto assigned.

 By the way, because of the version code, distrib deletes don't work at the
 moment - will get to that next week.

 - Mark

 On Fri, Dec 2, 2011 at 1:16 PM, Jamie Johnson jej2...@gmail.com wrote:

 So I'm a fool.  I did set the numShards, the issue was so trivial it's
 embarrassing.  I did indeed have it setup as a replica, the shard
 names in solr.xml were both shard1.  This worked as I expected now.

 On Fri, Dec 2, 2011 at 1:02 PM, Mark Miller markrmil...@gmail.com wrote:
 
  They are unused params, so removing them wouldn't help anything.
 
  You might just want to wait till we are further along before playing
 with it.
 
  Or if you submit your full self contained test, I can see what's going
 on (eg its still unclear if you have started setting numShards?).
 
  I can do a similar set of actions in my tests and it works fine. The
 only reason I could see things working like this is if it thinks you have
 one shard - a leader and a replica.
 
  - Mark
 
  On Dec 2, 2011, at 12:41 PM, Jamie Johnson wrote:
 
  Glad to hear I don't need to set shards/self, but removing them didn't
  seem to change what I'm seeing.  Doing this still results in 2
  documents 1 on 8983 and 1 on 7574.
 
  String key = 1;
 
                SolrInputDocument solrDoc = new SolrInputDocument();
                solrDoc.setField(key, key);
 
                solrDoc.addField(content_mvtxt, initial value);
 
                SolrServer server = servers.get(
 http://localhost:8983/solr/collection1;);
 
                UpdateRequest ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                ureq.add(solrDoc);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                server.commit();
 
                solrDoc = new SolrInputDocument();
                solrDoc.addField(key, key);
                solrDoc.addField(content_mvtxt, updated value);
 
                server = servers.get(
 http://localhost:7574/solr/collection1;);
 
                ureq = new UpdateRequest();
                ureq.setParam(update.chain, distrib-update-chain);
                ureq.add(solrDoc);
                ureq.setAction(ACTION.COMMIT, true, true);
                server.request(ureq);
                server.commit();
 
                server = servers.get(
 http://localhost:8983/solr/collection1;);
 
 
                server.commit();
                System.out.println(done);
 
  On Fri, Dec 2, 2011 at 10:48 AM, Mark Miller markrmil...@gmail.com
 wrote:
  So I dunno. You are running a zk server and running in zk mode right?
 
  You don't need to / shouldn't set a shards or self param. The shards
 are
  figured out from Zookeeper.
 
  You always want to use the distrib-update-chain. Eventually it will
  probably be part of the default chain and auto turn in zk mode.
 
  If you are running in zk mode attached to a zk server, this should
 work no
  problem. You can add docs to any server and they will be forwarded to
 the
  correct shard leader and then versioned and forwarded to replicas.
 
  You can also use the CloudSolrServer solrj client - that way you don't
 even
  have to choose a server to send docs too - in which case if it went
 down
  you would have to choose another manually - CloudSolrServer
 automatically
  finds one that is up through ZooKeeper. 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote:

 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?



Hi Jaime - take a look at solrconfig-distrib-update.xml in
solr/core/src/test-files

You need to enable the update log, add an empty replication handler def,
and an update chain with solr.DistributedUpdateProcessFactory in it.

-- 
- Mark

http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Thanks I will try this first thing in the morning.

On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote:
 On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote:

 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?



 Hi Jaime - take a look at solrconfig-distrib-update.xml in
 solr/core/src/test-files

 You need to enable the update log, add an empty replication handler def,
 and an update chain with solr.DistributedUpdateProcessFactory in it.

 --
 - Mark

 http://www.lucidimagination.com



Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Another question, is there any support for repartitioning of the index
if a new shard is added?  What is the recommended approach for
handling this?  It seemed that the hashing algorithm (and probably
any) would require the index to be repartitioned should a new shard be
added.

On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks I will try this first thing in the morning.

 On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com wrote:
 On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com wrote:

 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?



 Hi Jaime - take a look at solrconfig-distrib-update.xml in
 solr/core/src/test-files

 You need to enable the update log, add an empty replication handler def,
 and an update chain with solr.DistributedUpdateProcessFactory in it.

 --
 - Mark

 http://www.lucidimagination.com




Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Not yet - we don't plan on working on this until a lot of other stuff is
working solid at this point. But someone else could jump in!

There are a couple ways to go about it that I know of:

A more long-term solution may be to start using micro shards - each index
starts as multiple indexes. This makes it pretty fast to move micro shards
around as you decide to change partitions. It's also less flexible, as you
are limited by the number of micro shards you start with.

A simpler and likely first step is to use an index splitter. We
already have one in Lucene contrib - we would just need to modify it so
that it splits based on the hash of the document id. This is super
flexible, but splitting will obviously take a little while on a huge index.
The current index splitter is a multi-pass splitter - good enough to start
with, but with most files under codec control these days, we may be able to make
a single-pass splitter soon as well.

Eventually you could imagine using both options - micro shards that could
also be split as needed. Though I still wonder if micro shards will be
worth the extra complications myself...

Right now though, the idea is that you should pick a good number of
partitions to start given your expected data ;) Adding more replicas is
trivial though.

- Mark

On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson jej2...@gmail.com wrote:

 Another question, is there any support for repartitioning of the index
 if a new shard is added?  What is the recommended approach for
 handling this?  It seemed that the hashing algorithm (and probably
 any) would require the index to be repartitioned should a new shard be
 added.

 On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote:
  Thanks I will try this first thing in the morning.
 
  On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com
 wrote:
  On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  I am currently looking at the latest solrcloud branch and was
  wondering if there was any documentation on configuring the
  DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
  to be added/modified to make distributed indexing work?
 
 
 
  Hi Jaime - take a look at solrconfig-distrib-update.xml in
  solr/core/src/test-files
 
  You need to enable the update log, add an empty replication handler def,
  and an update chain with solr.DistributedUpdateProcessFactory in it.
 
  --
  - Mark
 
  http://www.lucidimagination.com
 
 




-- 
- Mark

http://www.lucidimagination.com


Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
I am not familiar with the index splitter that is in contrib, but I'll
take a look at it soon.  So the process sounds like it would be to run
this on all of the current shards' indexes based on the hash algorithm.
 Is there also an index merger in contrib which could be used to merge
indexes?  I'm assuming this would be the process?

On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com wrote:
 Not yet - we don't plan on working on this until a lot of other stuff is
 working solid at this point. But someone else could jump in!

 There are a couple ways to go about it that I know of:

 A more long term solution may be to start using micro shards - each index
 starts as multiple indexes. This makes it pretty fast to move mirco shards
 around as you decide to change partitions. It's also less flexible as you
 are limited by the number of micro shards you start with.

 A more simple and likely first step is to use an index splitter . We
 already have one in lucene contrib - we would just need to modify it so
 that it splits based on the hash of the document id. This is super
 flexible, but splitting will obviously take a little while on a huge index.
 The current index splitter is a multi pass splitter - good enough to start
 with, but most files under codec control these days, we may be able to make
 a single pass splitter soon as well.

 Eventually you could imagine using both options - micro shards that could
 also be split as needed. Though I still wonder if micro shards will be
 worth the extra complications myself...

 Right now though, the idea is that you should pick a good number of
 partitions to start given your expected data ;) Adding more replicas is
 trivial though.

 - Mark

 On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson jej2...@gmail.com wrote:

 Another question, is there any support for repartitioning of the index
 if a new shard is added?  What is the recommended approach for
 handling this?  It seemed that the hashing algorithm (and probably
 any) would require the index to be repartitioned should a new shard be
 added.

 On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote:
  Thanks I will try this first thing in the morning.
 
  On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com
 wrote:
  On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  I am currently looking at the latest solrcloud branch and was
  wondering if there was any documentation on configuring the
  DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
  to be added/modified to make distributed indexing work?
 
 
 
  Hi Jaime - take a look at solrconfig-distrib-update.xml in
  solr/core/src/test-files
 
  You need to enable the update log, add an empty replication handler def,
  and an update chain with solr.DistributedUpdateProcessFactory in it.
 
  --
  - Mark
 
  http://www.lucidimagination.com
 
 




 --
 - Mark

 http://www.lucidimagination.com



Re: Configuring the Distributed

2011-12-01 Thread Mark Miller

On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.

Not something I've thought deeply about myself yet, but I think the idea would 
be to split as many as you felt you needed to.

If you wanted to keep the full balance always, this would mean splitting every 
shard at once, yes. But this depends on how many boxes (partitions) you are 
willing/able to add at a time.

You might just split one index to start - now its hash range would be handled 
by two shards instead of one (if you have 3 replicas per shard, this would mean 
adding 3 more boxes). When you needed to expand again, you would split another 
index that was still handling its full starting range. As you grow, once you 
split every original index, you'd start again, splitting one of the now-half 
ranges.

 Is there also an index merger in contrib which could be used to merge
 indexes?  I'm assuming this would be the process?

You can merge with IndexWriter.addIndexes (Solr also has an admin command that 
can do this). But I'm not sure where this fits in?

- Mark

 
 On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com wrote:
 Not yet - we don't plan on working on this until a lot of other stuff is
 working solid at this point. But someone else could jump in!
 
 There are a couple ways to go about it that I know of:
 
 A more long term solution may be to start using micro shards - each index
 starts as multiple indexes. This makes it pretty fast to move mirco shards
 around as you decide to change partitions. It's also less flexible as you
 are limited by the number of micro shards you start with.
 
 A more simple and likely first step is to use an index splitter . We
 already have one in lucene contrib - we would just need to modify it so
 that it splits based on the hash of the document id. This is super
 flexible, but splitting will obviously take a little while on a huge index.
 The current index splitter is a multi pass splitter - good enough to start
 with, but most files under codec control these days, we may be able to make
 a single pass splitter soon as well.
 
 Eventually you could imagine using both options - micro shards that could
 also be split as needed. Though I still wonder if micro shards will be
 worth the extra complications myself...
 
 Right now though, the idea is that you should pick a good number of
 partitions to start given your expected data ;) Adding more replicas is
 trivial though.
 
 - Mark
 
 On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Another question, is there any support for repartitioning of the index
 if a new shard is added?  What is the recommended approach for
 handling this?  It seemed that the hashing algorithm (and probably
 any) would require the index to be repartitioned should a new shard be
 added.
 
 On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks I will try this first thing in the morning.
 
 On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com
 wrote:
 On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com
 wrote:
 
 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?
 
 
 
 Hi Jaime - take a look at solrconfig-distrib-update.xml in
 solr/core/src/test-files
 
 You need to enable the update log, add an empty replication handler def,
 and an update chain with solr.DistributedUpdateProcessFactory in it.
 
 --
 - Mark
 
 http://www.lucidimagination.com
 
 
 
 
 
 
 --
 - Mark
 
 http://www.lucidimagination.com
 

- Mark Miller
lucidimagination.com













Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
hmmm.  This doesn't sound like the hashing algorithm that's on the
branch, right?  The algorithm you're mentioning sounds like there is
some logic which is able to tell that a particular range should be
distributed between 2 shards instead of 1.  So it seems like a trade-off
between repartitioning the entire index (on every shard) and having a
custom hashing algorithm which is able to handle the situation where 2
or more shards map to a particular range.

On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote:

 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.

 Not something I've thought deeply about myself yet, but I think the idea 
 would be to split as many as you felt you needed to.

 If you wanted to keep the full balance always, this would mean splitting 
 every shard at once, yes. But this depends on how many boxes (partitions) you 
 are willing/able to add at a time.

 You might just split one index to start - now it's hash range would be 
 handled by two shards instead of one (if you have 3 replicas per shard, this 
 would mean adding 3 more boxes). When you needed to expand again, you would 
 split another index that was still handling its full starting range. As you 
 grow, once you split every original index, you'd start again, splitting one 
 of the now half ranges.

 Is there also an index merger in contrib which could be used to merge
 indexes?  I'm assuming this would be the process?

 You can merge with IndexWriter.addIndexes (Solr also has an admin command 
 that can do this). But I'm not sure where this fits in?

 - Mark


 On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com wrote:
 Not yet - we don't plan on working on this until a lot of other stuff is
 working solid at this point. But someone else could jump in!

 There are a couple ways to go about it that I know of:

 A more long term solution may be to start using micro shards - each index
 starts as multiple indexes. This makes it pretty fast to move mirco shards
 around as you decide to change partitions. It's also less flexible as you
 are limited by the number of micro shards you start with.

 A more simple and likely first step is to use an index splitter . We
 already have one in lucene contrib - we would just need to modify it so
 that it splits based on the hash of the document id. This is super
 flexible, but splitting will obviously take a little while on a huge index.
 The current index splitter is a multi pass splitter - good enough to start
 with, but most files under codec control these days, we may be able to make
 a single pass splitter soon as well.

 Eventually you could imagine using both options - micro shards that could
 also be split as needed. Though I still wonder if micro shards will be
 worth the extra complications myself...

 Right now though, the idea is that you should pick a good number of
 partitions to start given your expected data ;) Adding more replicas is
 trivial though.

 - Mark

 On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson jej2...@gmail.com wrote:

 Another question, is there any support for repartitioning of the index
 if a new shard is added?  What is the recommended approach for
 handling this?  It seemed that the hashing algorithm (and probably
 any) would require the index to be repartitioned should a new shard be
 added.

 On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks I will try this first thing in the morning.

 On Thu, Dec 1, 2011 at 3:39 PM, Mark Miller markrmil...@gmail.com
 wrote:
 On Thu, Dec 1, 2011 at 10:08 AM, Jamie Johnson jej2...@gmail.com
 wrote:

 I am currently looking at the latest solrcloud branch and was
 wondering if there was any documentation on configuring the
 DistributedUpdateProcessor?  What specifically in solrconfig.xml needs
 to be added/modified to make distributed indexing work?



 Hi Jaime - take a look at solrconfig-distrib-update.xml in
 solr/core/src/test-files

 You need to enable the update log, add an empty replication handler def,
 and an update chain with solr.DistributedUpdateProcessFactory in it.

 --
 - Mark

 http://www.lucidimagination.com






 --
 - Mark

 http://www.lucidimagination.com


 - Mark Miller
 lucidimagination.com














Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Right now let's say you have one shard - everything there hashes to range X.

Now you want to split that shard with an Index Splitter.

You divide range X in two - giving you two ranges - then you start splitting. 
This is where the current Splitter needs a little modification. You decide 
which doc should go into which new index by rehashing each doc id in the index 
you are splitting - if its hash is greater than X/2, it goes into index1 - if 
it's less, index2. I think there are a couple of current Splitter impls, but one 
of them does something like: give me an id - now if the ids in the index are 
above that id, go to index1, if below, index2. We need to instead do a quick 
hash comparison rather than a simple id compare.
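
As a toy illustration of that split decision (String.hashCode() below is only a
stand-in for whatever hash the distributed update chain actually uses, and the
midpoint is made up):

public class SplitDecisionSketch {
    /**
     * Decide which half of a split an existing document belongs to by
     * rehashing its id and comparing against the midpoint of the shard's
     * old hash range. Returns 1 for index1 (upper half) or 2 for index2.
     */
    static int targetIndex(String docId, int rangeMidpoint) {
        int hash = docId.hashCode(); // stand-in hash function
        return hash > rangeMidpoint ? 1 : 2;
    }

    public static void main(String[] args) {
        int midpoint = 0; // assumed midpoint of the original shard's range
        for (String id : new String[] {"1", "2", "42"}) {
            System.out.println("doc " + id + " -> index" + targetIndex(id, midpoint));
        }
    }
}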
 
Why do you need to do this on every shard?

The other part we need that we don't have is to store hash range assignments in 
ZooKeeper - we don't do that yet because it's not needed yet. Instead we 
currently just calculate that on the fly (too often at the moment - on 
every request :) I intend to fix that of course).

At the start, zk would say: for range X, go to this shard. After the split, it 
would say: for hashes less than X/2, go to the old node; for hashes greater than 
X/2, go to the new node.
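
A rough sketch of what such a range-to-shard lookup could look like once the
assignments live in ZooKeeper (the shard names, ranges, and hash below are all
made up for illustration, not the branch's actual data model):

import java.util.TreeMap;

public class RangeAssignmentSketch {
    public static void main(String[] args) {
        // Key = inclusive upper bound of a hash range, value = owning shard.
        TreeMap<Integer, String> ranges = new TreeMap<Integer, String>();

        // Before the split: one shard owns the whole range X.
        ranges.put(Integer.MAX_VALUE, "shard1");

        // After splitting shard1 at midpoint 0: the lower half stays on the
        // old node, the upper half goes to the new node.
        ranges.clear();
        ranges.put(0, "shard1");
        ranges.put(Integer.MAX_VALUE, "shard1a");

        int hash = "somedoc".hashCode();              // stand-in for the real doc-id hash
        String shard = ranges.ceilingEntry(hash).getValue();
        System.out.println("hash " + hash + " -> " + shard);
    }
}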

- Mark

On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:

 hmmm.This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.
 
 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote:
 
 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.
 
 Not something I've thought deeply about myself yet, but I think the idea 
 would be to split as many as you felt you needed to.
 
 If you wanted to keep the full balance always, this would mean splitting 
 every shard at once, yes. But this depends on how many boxes (partitions) 
 you are willing/able to add at a time.
 
 You might just split one index to start - now it's hash range would be 
 handled by two shards instead of one (if you have 3 replicas per shard, this 
 would mean adding 3 more boxes). When you needed to expand again, you would 
 split another index that was still handling its full starting range. As you 
 grow, once you split every original index, you'd start again, splitting one 
 of the now half ranges.
 
 Is there also an index merger in contrib which could be used to merge
 indexes?  I'm assuming this would be the process?
 
 You can merge with IndexWriter.addIndexes (Solr also has an admin command 
 that can do this). But I'm not sure where this fits in?
 
 - Mark
 
 
 On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com wrote:
 Not yet - we don't plan on working on this until a lot of other stuff is
 working solid at this point. But someone else could jump in!
 
 There are a couple ways to go about it that I know of:
 
 A more long term solution may be to start using micro shards - each index
 starts as multiple indexes. This makes it pretty fast to move mirco shards
 around as you decide to change partitions. It's also less flexible as you
 are limited by the number of micro shards you start with.
 
 A more simple and likely first step is to use an index splitter . We
 already have one in lucene contrib - we would just need to modify it so
 that it splits based on the hash of the document id. This is super
 flexible, but splitting will obviously take a little while on a huge index.
 The current index splitter is a multi pass splitter - good enough to start
 with, but most files under codec control these days, we may be able to make
 a single pass splitter soon as well.
 
 Eventually you could imagine using both options - micro shards that could
 also be split as needed. Though I still wonder if micro shards will be
 worth the extra complications myself...
 
 Right now though, the idea is that you should pick a good number of
 partitions to start given your expected data ;) Adding more replicas is
 trivial though.
 
 - Mark
 
 On Thu, Dec 1, 2011 at 6:35 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Another question, is there any support for repartitioning of the index
 if a new shard is added?  What is the recommended approach for
 handling this?  It seemed that the hashing algorithm (and probably
 any) would require the index to be repartitioned should a new shard be
 added.
 
 On Thu, Dec 1, 2011 at 6:32 PM, Jamie Johnson jej2...@gmail.com wrote:
 Thanks I will try this first thing in the morning.
 
 On 

Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Yes, the ZK method seems much more flexible.  Adding a new shard would
simply mean updating the range assignments in ZK.  Where is this
currently on the list of things to accomplish?  I don't have time to
work on this now, but if you (or anyone) could provide direction I'd
be willing to work on this when I have spare time.  I guess a JIRA
detailing where/how to do this could help.  Not sure if the design has
been thought out that far though.

On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range X.

 Now you want to split that shard with an Index Splitter.

 You divide range X in two - giving you two ranges - then you start splitting. 
 This is where the current Splitter needs a little modification. You decide 
 which doc should go into which new index by rehashing each doc id in the 
 index you are splitting - if its hash is greater than X/2, it goes into 
 index1 - if its less, index2. I think there are a couple current Splitter 
 impls, but one of them does something like, give me an id - now if the id's 
 in the index are above that id, goto index1, if below, index2. We need to 
 instead do a quick hash rather than simple id compare.

 Why do you need to do this on every shard?

 The other part we need that we dont have is to store hash range assignments 
 in zookeeper - we don't do that yet because it's not needed yet. Instead we 
 currently just simply calculate that on the fly (too often at the moment - on 
 every request :) I intend to fix that of course).

 At the start, zk would say, for range X, goto this shard. After the split, it 
 would say, for range less than X/2 goto the old node, for range greater than 
 X/2 goto the new node.

 - Mark

 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:

 hmmm.This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.

 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote:

 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.

 Not something I've thought deeply about myself yet, but I think the idea 
 would be to split as many as you felt you needed to.

 If you wanted to keep the full balance always, this would mean splitting 
 every shard at once, yes. But this depends on how many boxes (partitions) 
 you are willing/able to add at a time.

 You might just split one index to start - now it's hash range would be 
 handled by two shards instead of one (if you have 3 replicas per shard, 
 this would mean adding 3 more boxes). When you needed to expand again, you 
 would split another index that was still handling its full starting range. 
 As you grow, once you split every original index, you'd start again, 
 splitting one of the now half ranges.

 Is there also an index merger in contrib which could be used to merge
 indexes?  I'm assuming this would be the process?

 You can merge with IndexWriter.addIndexes (Solr also has an admin command 
 that can do this). But I'm not sure where this fits in?

 - Mark


 On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com wrote:
 Not yet - we don't plan on working on this until a lot of other stuff is
 working solid at this point. But someone else could jump in!

 There are a couple ways to go about it that I know of:

 A more long term solution may be to start using micro shards - each index
 starts as multiple indexes. This makes it pretty fast to move mirco shards
 around as you decide to change partitions. It's also less flexible as you
 are limited by the number of micro shards you start with.

 A more simple and likely first step is to use an index splitter . We
 already have one in lucene contrib - we would just need to modify it so
 that it splits based on the hash of the document id. This is super
 flexible, but splitting will obviously take a little while on a huge 
 index.
 The current index splitter is a multi pass splitter - good enough to start
 with, but most files under codec control these days, we may be able to 
 make
 a single pass splitter soon as well.

 Eventually you could imagine using both options - micro shards that could
 also be split as needed. Though I still wonder if micro shards will be
 worth the extra complications myself...

 Right now though, the idea is that you should pick a good number of
 partitions to start given your expected data ;) Adding more 

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
Of course, resharding is almost never necessary if you use micro-shards.
 Micro-shards are shards small enough that you can fit 20 or more on a
node.  If you have that many on each node, then adding a new node consists
of moving some shards to the new machine rather than moving lots of little
documents.

Much faster.  As in thousands of times faster.

On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote:

 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.

 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
  Right now lets say you have one shard - everything there hashes to range
 X.
 
  Now you want to split that shard with an Index Splitter.
 
  You divide range X in two - giving you two ranges - then you start
 splitting. This is where the current Splitter needs a little modification.
 You decide which doc should go into which new index by rehashing each doc
 id in the index you are splitting - if its hash is greater than X/2, it
 goes into index1 - if its less, index2. I think there are a couple current
 Splitter impls, but one of them does something like, give me an id - now if
 the id's in the index are above that id, goto index1, if below, index2. We
 need to instead do a quick hash rather than simple id compare.
 
  Why do you need to do this on every shard?
 
  The other part we need that we dont have is to store hash range
 assignments in zookeeper - we don't do that yet because it's not needed
 yet. Instead we currently just simply calculate that on the fly (too often
 at the moment - on every request :) I intend to fix that of course).
 
  At the start, zk would say, for range X, goto this shard. After the
 split, it would say, for range less than X/2 goto the old node, for range
 greater than X/2 goto the new node.
 
  - Mark
 
  On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
  hmmm.This doesn't sound like the hashing algorithm that's on the
  branch, right?  The algorithm you're mentioning sounds like there is
  some logic which is able to tell that a particular range should be
  distributed between 2 shards instead of 1.  So seems like a trade off
  between repartitioning the entire index (on every shard) and having a
  custom hashing algorithm which is able to handle the situation where 2
  or more shards map to a particular range.
 
  On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
  I am not familiar with the index splitter that is in contrib, but I'll
  take a look at it soon.  So the process sounds like it would be to run
  this on all of the current shards indexes based on the hash algorithm.
 
  Not something I've thought deeply about myself yet, but I think the
 idea would be to split as many as you felt you needed to.
 
  If you wanted to keep the full balance always, this would mean
 splitting every shard at once, yes. But this depends on how many boxes
 (partitions) you are willing/able to add at a time.
 
  You might just split one index to start - now it's hash range would be
 handled by two shards instead of one (if you have 3 replicas per shard,
 this would mean adding 3 more boxes). When you needed to expand again, you
 would split another index that was still handling its full starting range.
 As you grow, once you split every original index, you'd start again,
 splitting one of the now half ranges.
 
  Is there also an index merger in contrib which could be used to merge
  indexes?  I'm assuming this would be the process?
 
  You can merge with IndexWriter.addIndexes (Solr also has an admin
 command that can do this). But I'm not sure where this fits in?
 
  - Mark
 
 
  On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Not yet - we don't plan on working on this until a lot of other
 stuff is
  working solid at this point. But someone else could jump in!
 
  There are a couple ways to go about it that I know of:
 
  A more long term solution may be to start using micro shards - each
 index
  starts as multiple indexes. This makes it pretty fast to move mirco
 shards
  around as you decide to change partitions. It's also less flexible
 as you
  are limited by the number of micro shards you start with.
 
  A more simple and likely first step is to use an index splitter . We
  already have one in lucene contrib - we would just need to modify it
 so
  that it splits based on the hash of the document id. This is super
  flexible, but splitting will obviously take a little while on a huge
 index.
  The current index splitter is a multi 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
In this case we are still talking about moving a whole index at a time rather 
than lots of little documents. You split the index into two, and then ship one 
of them off.

The extra cost you can avoid with micro sharding will be the cost of splitting 
the index - which could be significant for a very large index. I have not done 
any tests though.

The cost of 20 micro-shards is that you will always have tons of segments 
unless you are very heavily merging - and even in the very unusual case of each 
micro shard being optimized, you have essentially 20 segments. That's the best 
case - the normal case is likely in the hundreds.

This can be a fairly significant % hit at search time.

You also have the added complexity of managing 20 indexes per node in Solr code.

I think that both options have their +/-'s and eventually we could perhaps 
support both.

To kick things off though, adding another partition should be a rare event if 
you plan carefully, and I think many will be able to handle the cost of 
splitting (you might even mark the replica you are splitting on so that it's 
not part of queries while it's 'busy' splitting).

- Mark

On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:

 Of course, resharding is almost never necessary if you use micro-shards.
 Micro-shards are shards small enough that you can fit 20 or more on a
 node.  If you have that many on each node, then adding a new node consists
 of moving some shards to the new machine rather than moving lots of little
 documents.
 
 Much faster.  As in thousands of times faster.
 
 On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range
 X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you two ranges - then you start
 splitting. This is where the current Splitter needs a little modification.
 You decide which doc should go into which new index by rehashing each doc
 id in the index you are splitting - if its hash is greater than X/2, it
 goes into index1 - if its less, index2. I think there are a couple current
 Splitter impls, but one of them does something like, give me an id - now if
 the id's in the index are above that id, goto index1, if below, index2. We
 need to instead do a quick hash rather than simple id compare.
 
 Why do you need to do this on every shard?
 
 The other part we need that we dont have is to store hash range
 assignments in zookeeper - we don't do that yet because it's not needed
 yet. Instead we currently just simply calculate that on the fly (too often
 at the moment - on every request :) I intend to fix that of course).
 
 At the start, zk would say, for range X, goto this shard. After the
 split, it would say, for range less than X/2 goto the old node, for range
 greater than X/2 goto the new node.
 
 - Mark
 
 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
 hmmm.This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.
 
 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.
 
 Not something I've thought deeply about myself yet, but I think the
 idea would be to split as many as you felt you needed to.
 
 If you wanted to keep the full balance always, this would mean
 splitting every shard at once, yes. But this depends on how many boxes
 (partitions) you are willing/able to add at a time.
 
 You might just split one index to start - now it's hash range would be
 handled by two shards instead of one (if you have 3 replicas per shard,
 this would mean adding 3 more boxes). When you needed to expand again, you
 would split another index that was still handling its full starting range.
 As you grow, once you split every original index, you'd start again,
 splitting one of the now half 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Sorry - missed something - you also have the added cost of shipping the new 
half index to all of the replicas of the original shard with the splitting 
method. Unless you somehow split on every replica at the same time - then of 
course you wouldn't be able to avoid the 'busy' replica, and it would probably 
be fairly hard to juggle.


On Dec 1, 2011, at 9:37 PM, Mark Miller wrote:

 In this case we are still talking about moving a whole index at a time rather 
 than lots of little documents. You split the index into two, and then ship 
 one of them off.
 
 The extra cost you can avoid with micro sharding will be the cost of 
 splitting the index - which could be significant for a very large index. I 
 have not done any tests though.
 
 The cost of 20 micro-shards is that you will always have tons of segments 
 unless you are very heavily merging - and even in the very unusual case of 
 each micro shard being optimized, you have essentially 20 segments. Thats 
 best case - normal case is likely in the hundreds.
 
 This can be a fairly significant % hit at search time.
 
 You also have the added complexity of managing 20 indexes per node in solr 
 code.
 
 I think that both options have there +/-'s and eventually we could perhaps 
 support both.
 
 To kick things off though, adding another partition should be a rare event if 
 you plan carefully, and I think many will be able to handle the cost of 
 splitting (you might even mark the replica you are splitting on so that it's 
 not part of queries while its 'busy' splitting).
 
 - Mark
 
 On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:
 
 Of course, resharding is almost never necessary if you use micro-shards.
 Micro-shards are shards small enough that you can fit 20 or more on a
 node.  If you have that many on each node, then adding a new node consists
 of moving some shards to the new machine rather than moving lots of little
 documents.
 
 Much faster.  As in thousands of times faster.
 
 On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range
 X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you two ranges - then you start
 splitting. This is where the current Splitter needs a little modification.
 You decide which doc should go into which new index by rehashing each doc
 id in the index you are splitting - if its hash is greater than X/2, it
 goes into index1 - if its less, index2. I think there are a couple current
 Splitter impls, but one of them does something like, give me an id - now if
 the id's in the index are above that id, goto index1, if below, index2. We
 need to instead do a quick hash rather than simple id compare.
 
 Why do you need to do this on every shard?
 
 The other part we need that we dont have is to store hash range
 assignments in zookeeper - we don't do that yet because it's not needed
 yet. Instead we currently just simply calculate that on the fly (too often
 at the moment - on every request :) I intend to fix that of course).
 
 At the start, zk would say, for range X, goto this shard. After the
 split, it would say, for range less than X/2 goto the old node, for range
 greater than X/2 goto the new node.
 
 - Mark
 
 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
 hmmm.This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.
 
 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.
 
 Not something I've thought deeply about myself yet, but I think the
 idea would be to split as many as you felt you needed to.
 
 If you wanted to keep the full balance always, this would mean
 splitting every shard at once, yes. But this depends on how many boxes
 (partitions) you are willing/able to 

Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
So I couldn't resist - I attempted to do this tonight.  I used the
solrconfig you mentioned (as is, no modifications), set up a 2-shard
cluster in collection1, sent 1 doc to 1 of the shards, updated it,
and sent the update to the other.  I don't see the modifications,
though; I only see the original document.  The following is the test:

public void update() throws Exception {

    String key = "1";

    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);

    solrDoc.addField("content", "initial value");

    SolrServer server = servers
            .get("http://localhost:8983/solr/collection1");
    server.add(solrDoc);

    server.commit();

    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content", "updated value");

    server = servers.get("http://localhost:7574/solr/collection1");

    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards",
            "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    System.out.println("done");
}

"key" is my unique field in schema.xml

What am I doing wrong?

On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.

 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range X.

 Now you want to split that shard with an Index Splitter.

 You divide range X in two - giving you two ranges - then you start 
 splitting. This is where the current Splitter needs a little modification. 
 You decide which doc should go into which new index by rehashing each doc id 
 in the index you are splitting - if its hash is greater than X/2, it goes 
 into index1 - if its less, index2. I think there are a couple current 
 Splitter impls, but one of them does something like, give me an id - now if 
 the id's in the index are above that id, goto index1, if below, index2. We 
 need to instead do a quick hash rather than simple id compare.

 Why do you need to do this on every shard?

 The other part we need that we dont have is to store hash range assignments 
 in zookeeper - we don't do that yet because it's not needed yet. Instead we 
 currently just simply calculate that on the fly (too often at the moment - 
 on every request :) I intend to fix that of course).

 At the start, zk would say, for range X, goto this shard. After the split, 
 it would say, for range less than X/2 goto the old node, for range greater 
 than X/2 goto the new node.

 - Mark

 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:

 hmmm.This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.

 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote:

 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:

 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.

 Not something I've thought deeply about myself yet, but I think the idea 
 would be to split as many as you felt you needed to.

 If you wanted to keep the full balance always, this would mean splitting 
 every shard at once, yes. But this depends on how many boxes (partitions) 
 you are willing/able to add at a time.

 You might just split one index to start - now it's hash range would be 
 handled by two shards instead of one (if you have 3 replicas per shard, 
 this would mean adding 3 more boxes). When you needed to expand again, you 
 would split another index that was still handling its full starting range. 
 As you grow, once you split every original index, you'd start again, 
 splitting one of the now 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
It's not full of details yet, but there is a JIRA issue here:
https://issues.apache.org/jira/browse/SOLR-2595

On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:

 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.

 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
  Right now lets say you have one shard - everything there hashes to range
 X.
 
  Now you want to split that shard with an Index Splitter.
 
  You divide range X in two - giving you two ranges - then you start
 splitting. This is where the current Splitter needs a little modification.
 You decide which doc should go into which new index by rehashing each doc
 id in the index you are splitting - if its hash is greater than X/2, it
 goes into index1 - if its less, index2. I think there are a couple current
 Splitter impls, but one of them does something like, give me an id - now if
 the id's in the index are above that id, goto index1, if below, index2. We
 need to instead do a quick hash rather than simple id compare.
 
  Why do you need to do this on every shard?
 
  The other part we need that we dont have is to store hash range
 assignments in zookeeper - we don't do that yet because it's not needed
 yet. Instead we currently just simply calculate that on the fly (too often
 at the moment - on every request :) I intend to fix that of course).
 
  At the start, zk would say, for range X, goto this shard. After the
 split, it would say, for range less than X/2 goto the old node, for range
 greater than X/2 goto the new node.
 
  - Mark
 
  On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
  hmmm.This doesn't sound like the hashing algorithm that's on the
  branch, right?  The algorithm you're mentioning sounds like there is
  some logic which is able to tell that a particular range should be
  distributed between 2 shards instead of 1.  So seems like a trade off
  between repartitioning the entire index (on every shard) and having a
  custom hashing algorithm which is able to handle the situation where 2
  or more shards map to a particular range.
 
  On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
  On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
  I am not familiar with the index splitter that is in contrib, but I'll
  take a look at it soon.  So the process sounds like it would be to run
  this on all of the current shards indexes based on the hash algorithm.
 
  Not something I've thought deeply about myself yet, but I think the
 idea would be to split as many as you felt you needed to.
 
  If you wanted to keep the full balance always, this would mean
 splitting every shard at once, yes. But this depends on how many boxes
 (partitions) you are willing/able to add at a time.
 
  You might just split one index to start - now it's hash range would be
 handled by two shards instead of one (if you have 3 replicas per shard,
 this would mean adding 3 more boxes). When you needed to expand again, you
 would split another index that was still handling its full starting range.
 As you grow, once you split every original index, you'd start again,
 splitting one of the now half ranges.
 
  Is there also an index merger in contrib which could be used to merge
  indexes?  I'm assuming this would be the process?
 
  You can merge with IndexWriter.addIndexes (Solr also has an admin
 command that can do this). But I'm not sure where this fits in?
 
  - Mark
 
 
  On Thu, Dec 1, 2011 at 7:18 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Not yet - we don't plan on working on this until a lot of other
 stuff is
  working solid at this point. But someone else could jump in!
 
  There are a couple ways to go about it that I know of:
 
  A more long term solution may be to start using micro shards - each
 index
  starts as multiple indexes. This makes it pretty fast to move mirco
 shards
  around as you decide to change partitions. It's also less flexible
 as you
  are limited by the number of micro shards you start with.
 
  A more simple and likely first step is to use an index splitter . We
  already have one in lucene contrib - we would just need to modify it
 so
  that it splits based on the hash of the document id. This is super
  flexible, but splitting will obviously take a little while on a huge
 index.
  The current index splitter is a multi pass splitter - good enough to
 start
  with, but most files under codec control these days, we may be able
 to make
  a single pass splitter soon as well.
 
  Eventually you could imagine using both options - micro shards that
 could
  also 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Hmm...sorry bout that - so my first guess is that right now we are not 
distributing a commit (easy to add, just have not done it).

Right now I explicitly commit on each server for tests.

Can you try explicitly committing on server1 after updating the doc on server 2?

I can start distributing commits tomorrow - been meaning to do it for my own 
convenience anyhow.

Also, you want to pass the sys property numShards=1 on startup. I think it 
defaults to 3. That will give you one leader and one replica.

- Mark

On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

 So I couldn't resist, I attempted to do this tonight, I used the
 solrconfig you mentioned (as is, no modifications), I setup a 2 shard
 cluster in collection1, I sent 1 doc to 1 of the shards, updated it
 and sent the update to the other.  I don't see the modifications
 though I only see the original document.  The following is the test
 
  public void update() throws Exception {

    String key = "1";

    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);

    solrDoc.addField("content", "initial value");

    SolrServer server = servers
            .get("http://localhost:8983/solr/collection1");
    server.add(solrDoc);

    server.commit();

    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content", "updated value");

    server = servers.get("http://localhost:7574/solr/collection1");

    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards",
            "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    System.out.println("done");
  }
 
  "key" is my unique field in schema.xml
 
 What am I doing wrong?
 
 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you two ranges - then you start 
 splitting. This is where the current Splitter needs a little modification. 
 You decide which doc should go into which new index by rehashing each doc 
 id in the index you are splitting - if its hash is greater than X/2, it 
 goes into index1 - if its less, index2. I think there are a couple current 
 Splitter impls, but one of them does something like, give me an id - now if 
 the id's in the index are above that id, goto index1, if below, index2. We 
 need to instead do a quick hash rather than simple id compare.
 
 Why do you need to do this on every shard?
 
 The other part we need that we dont have is to store hash range assignments 
 in zookeeper - we don't do that yet because it's not needed yet. Instead we 
 currently just simply calculate that on the fly (too often at the moment - 
 on every request :) I intend to fix that of course).
 
 At the start, zk would say, for range X, goto this shard. After the split, 
 it would say, for range less than X/2 goto the old node, for range greater 
 than X/2 goto the new node.
 
 - Mark
 
 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
 hmmm.This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.
 
 On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com wrote:
 
 On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
 I am not familiar with the index splitter that is in contrib, but I'll
 take a look at it soon.  So the process sounds like it would be to run
 this on all of the current shards indexes based on the hash algorithm.
 
 Not something I've thought deeply about myself yet, but I think the idea 
 would be to split as many as you felt you needed to.
 
 If you 

Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Thanks for the quick response.  With that change (have not done
numShards yet) shard1 got updated.  But now when executing the
following queries I get information back from both, which doesn't seem
right:

http://localhost:7574/solr/select/?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>

http://localhost:8983/solr/select?q=*:*
<doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>



On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...sorry bout that - so my first guess is that right now we are not 
 distributing a commit (easy to add, just have not done it).

 Right now I explicitly commit on each server for tests.

 Can you try explicitly committing on server1 after updating the doc on server 
 2?

 I can start distributing commits tomorrow - been meaning to do it for my own 
 convenience anyhow.

 Also, you want to pass the sys property numShards=1 on startup. I think it 
 defaults to 3. That will give you one leader and one replica.

 - Mark

 On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

 So I couldn't resist, I attempted to do this tonight, I used the
 solrconfig you mentioned (as is, no modifications), I setup a 2 shard
 cluster in collection1, I sent 1 doc to 1 of the shards, updated it
 and sent the update to the other.  I don't see the modifications
 though I only see the original document.  The following is the test

 public void update() throws Exception {

               String key = "1";

               SolrInputDocument solrDoc = new SolrInputDocument();
               solrDoc.setField("key", key);

               solrDoc.addField("content", "initial value");

               SolrServer server = servers
                               .get("http://localhost:8983/solr/collection1");
               server.add(solrDoc);

               server.commit();

               solrDoc = new SolrInputDocument();
               solrDoc.addField("key", key);
               solrDoc.addField("content", "updated value");

               server = servers.get("http://localhost:7574/solr/collection1");

               UpdateRequest ureq = new UpdateRequest();
               ureq.setParam("update.chain", "distrib-update-chain");
               ureq.add(solrDoc);
               ureq.setParam("shards",
                               "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
               ureq.setParam("self", "foo");
               ureq.setAction(ACTION.COMMIT, true, true);
               server.request(ureq);
               System.out.println("done");
       }

 key is my unique field in schema.xml

 What am I doing wrong?

 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.

 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range X.

 Now you want to split that shard with an Index Splitter.

 You divide range X in two - giving you two ranges - then you start 
 splitting. This is where the current Splitter needs a little modification. 
 You decide which doc should go into which new index by rehashing each doc 
 id in the index you are splitting - if its hash is greater than X/2, it 
 goes into index1 - if its less, index2. I think there are a couple current 
 Splitter impls, but one of them does something like, give me an id - now 
 if the id's in the index are above that id, goto index1, if below, index2. 
 We need to instead do a quick hash rather than simple id compare.

 Why do you need to do this on every shard?

 The other part we need that we dont have is to store hash range 
 assignments in zookeeper - we don't do that yet because it's not needed 
 yet. Instead we currently just simply calculate that on the fly (too often 
 at the moment - on every request :) I intend to fix that of course).

 At the start, zk would say, for range X, goto this shard. After the split, 
 it would say, for range less than X/2 goto the old node, for range greater 
 than X/2 goto the new node.

 - Mark

 On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:

 hmmm.This doesn't sound like the hashing algorithm that's on the
 branch, right?  The algorithm you're mentioning sounds like there is
 some logic which is able to tell that a particular range should be
 distributed between 2 shards instead of 1.  So seems like a trade off
 between repartitioning the entire index (on every shard) and having a
 custom hashing algorithm which is able to handle the situation where 2
 or more shards map to a particular range.

 On 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Not sure offhand - but things will be funky if you don't specify the correct 
numShards.

The instance-to-shard assignment should be using numShards. But the 
hash-to-shard mapping actually goes by the number of shards it finds 
registered in ZK (it doesn't have to, but really these should be equal).

So basically you are saying, I want 3 partitions, but you are only starting up 
2 nodes, and the code is just not happy about that I'd guess. For the system to 
work properly, you have to fire up at least as many servers as numShards.

What are you trying to do? 2 partitions with no replicas, or one partition with 
one replica?

In either case, I think you will have better luck if you fire up at least as 
many servers as the numShards setting. Or lower the numShards setting.

This is all a work in progress by the way - what you are trying to test should 
work if things are setup right though.

- Mark
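
(For what it's worth, a minimal sketch of what a hash-range-to-shard mapping
could look like if the full int hash space were simply carved into numShards
equal, contiguous ranges - purely illustrative, not the actual code on the branch.)

  // Map a doc's hash to one of numShards equal, contiguous ranges.
  static int shardForHash(int hash, int numShards) {
    long offset = (long) hash - Integer.MIN_VALUE;   // shift the hash into [0, 2^32)
    long rangeSize = (1L << 32) / numShards;         // width of each shard's range
    int shard = (int) (offset / rangeSize);          // which range the hash falls into
    return Math.min(shard, numShards - 1);           // guard the top edge when 2^32 % numShards != 0
  }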


On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:

 Thanks for the quick response.  With that change (have not done
 numShards yet) shard1 got updated.  But now when executing the
 following queries I get information back from both, which doesn't seem
 right
 
 http://localhost:7574/solr/select/?q=*:*
 <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
 
 http://localhost:8983/solr/select?q=*:*
 <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
 
 
 
 On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...sorry bout that - so my first guess is that right now we are not 
 distributing a commit (easy to add, just have not done it).
 
 Right now I explicitly commit on each server for tests.
 
 Can you try explicitly committing on server1 after updating the doc on 
 server 2?
 
 I can start distributing commits tomorrow - been meaning to do it for my own 
 convenience anyhow.
 
 Also, you want to pass the sys property numShards=1 on startup. I think it 
 defaults to 3. That will give you one leader and one replica.
 
 - Mark
 
 On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
 
 So I couldn't resist, I attempted to do this tonight, I used the
 solrconfig you mentioned (as is, no modifications), I setup a 2 shard
 cluster in collection1, I sent 1 doc to 1 of the shards, updated it
 and sent the update to the other.  I don't see the modifications
 though I only see the original document.  The following is the test
 
  public void update() throws Exception {

    String key = "1";

    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("key", key);

    solrDoc.addField("content", "initial value");

    SolrServer server = servers
        .get("http://localhost:8983/solr/collection1");
    server.add(solrDoc);

    server.commit();

    solrDoc = new SolrInputDocument();
    solrDoc.addField("key", key);
    solrDoc.addField("content", "updated value");

    server = servers.get("http://localhost:7574/solr/collection1");

    UpdateRequest ureq = new UpdateRequest();
    ureq.setParam("update.chain", "distrib-update-chain");
    ureq.add(solrDoc);
    ureq.setParam("shards",
        "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
    ureq.setParam("self", "foo");
    ureq.setAction(ACTION.COMMIT, true, true);
    server.request(ureq);
    System.out.println("done");
  }
 
 key is my unique field in schema.xml
 
 What am I doing wrong?
 
 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range 
 X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you two ranges - then you start 
 splitting. This is where the current Splitter needs a little 
 modification. You decide which doc should go into which new index by 
 rehashing each doc id in the index you are splitting - if its hash is 
 greater than X/2, it goes into index1 - if its less, index2. I think 
 there are a couple current Splitter impls, but one of them does something 
 like, give me an id - now if the id's in the index are above that id, 
 goto index1, if below, index2. We need to instead do a quick hash rather 
 than simple id compare.
 
 Why do you need to 

Re: Configuring the Distributed

2011-12-01 Thread Mark Miller
Getting late - didn't really pay attention to your code I guess - why are you 
adding the first doc without specifying the distrib update chain? This is not 
really supported. It's going to just go to the server you specified - even with 
everything set up right, the update might then go to that same server or the 
other one depending on how it hashes. You really want to just always use the 
distrib update chain.  I guess I don't yet understand what you are trying to 
test. 

Sent from my iPad

On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote:

 Not sure offhand - but things will be funky if you don't specify the correct 
 numShards.
 
 The instance to shard assignment should be using numShards to assign. But 
 then the hash to shard mapping actually goes on the number of shards it finds 
 registered in ZK (it doesn't have to, but really these should be equal).
 
 So basically you are saying, I want 3 partitions, but you are only starting 
 up 2 nodes, and the code is just not happy about that I'd guess. For the 
 system to work properly, you have to fire up at least as many servers as 
 numShards.
 
 What are you trying to do? 2 partitions with no replicas, or one partition 
 with one replica?
 
 In either case, I think you will have better luck if you fire up at least as 
 many servers as the numShards setting. Or lower the numShards setting.
 
 This is all a work in progress by the way - what you are trying to test 
 should work if things are setup right though.
 
 - Mark
 
 
 On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:
 
 Thanks for the quick response.  With that change (have not done
 numShards yet) shard1 got updated.  But now when executing the
 following queries I get information back from both, which doesn't seem
 right
 
  http://localhost:7574/solr/select/?q=*:*
  <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
  
  http://localhost:8983/solr/select?q=*:*
  <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
 
 
 
 On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...sorry bout that - so my first guess is that right now we are not 
 distributing a commit (easy to add, just have not done it).
 
 Right now I explicitly commit on each server for tests.
 
 Can you try explicitly committing on server1 after updating the doc on 
 server 2?
 
 I can start distributing commits tomorrow - been meaning to do it for my 
 own convenience anyhow.
 
 Also, you want to pass the sys property numShards=1 on startup. I think it 
 defaults to 3. That will give you one leader and one replica.
 
 - Mark
 
 On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:
 
 So I couldn't resist, I attempted to do this tonight, I used the
 solrconfig you mentioned (as is, no modifications), I setup a 2 shard
 cluster in collection1, I sent 1 doc to 1 of the shards, updated it
 and sent the update to the other.  I don't see the modifications
 though I only see the original document.  The following is the test
 
  public void update() throws Exception {

   String key = "1";

   SolrInputDocument solrDoc = new SolrInputDocument();
   solrDoc.setField("key", key);

   solrDoc.addField("content", "initial value");

   SolrServer server = servers
       .get("http://localhost:8983/solr/collection1");
   server.add(solrDoc);

   server.commit();

   solrDoc = new SolrInputDocument();
   solrDoc.addField("key", key);
   solrDoc.addField("content", "updated value");

   server = servers.get("http://localhost:7574/solr/collection1");

   UpdateRequest ureq = new UpdateRequest();
   ureq.setParam("update.chain", "distrib-update-chain");
   ureq.add(solrDoc);
   ureq.setParam("shards",
       "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
   ureq.setParam("self", "foo");
   ureq.setAction(ACTION.COMMIT, true, true);
   server.request(ureq);
   System.out.println("done");
   }
 
 key is my unique field in schema.xml
 
 What am I doing wrong?
 
 On Thu, Dec 1, 2011 at 8:51 PM, Jamie Johnson jej2...@gmail.com wrote:
 Yes, the ZK method seems much more flexible.  Adding a new shard would
 be simply updating the range assignments in ZK.  Where is this
 currently on the list of things to accomplish?  I don't have time to
 work on this now, but if you (or anyone) could provide direction I'd
 be willing to work on this when I had spare time.  I guess a JIRA
 detailing where/how to do this could help.  Not sure if the design has
 been thought out that far though.
 
 On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com wrote:
 Right now lets say you have one shard - everything there hashes to range 
 X.
 
 Now you want to split that shard with an Index Splitter.
 
 You divide range X in two - giving you 

Re: Configuring the Distributed

2011-12-01 Thread Jamie Johnson
Really just trying to do a simple add-and-update test; the missing
chain is just proof that I don't understand exactly how this is
supposed to work.  I modified the code to this:

String key = "1";

SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.setField("key", key);

solrDoc.addField("content_mvtxt", "initial value");

SolrServer server = servers
    .get("http://localhost:8983/solr/collection1");

UpdateRequest ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
ureq.add(solrDoc);
ureq.setParam("shards",
    "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
ureq.setParam("self", "foo");
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
server.commit();

solrDoc = new SolrInputDocument();
solrDoc.addField("key", key);
solrDoc.addField("content_mvtxt", "updated value");

server = servers.get("http://localhost:7574/solr/collection1");

ureq = new UpdateRequest();
ureq.setParam("update.chain", "distrib-update-chain");
// ureq.deleteById("8060a9eb-9546-43ee-95bb-d18ea26a6285");
ureq.add(solrDoc);
ureq.setParam("shards",
    "localhost:8983/solr/collection1,localhost:7574/solr/collection1");
ureq.setParam("self", "foo");
ureq.setAction(ACTION.COMMIT, true, true);
server.request(ureq);
// server.add(solrDoc);
server.commit();
server = servers.get("http://localhost:8983/solr/collection1");

server.commit();
System.out.println("done");

but I'm still seeing the doc appear on both shards.  After the first
commit I see the doc on 8983 with "initial value".  After the second
commit I see the updated value on 7574 and the old value on 8983.  After the
final commit the doc on 8983 gets updated.

Is there something wrong with my test?

On Thu, Dec 1, 2011 at 11:17 PM, Mark Miller markrmil...@gmail.com wrote:
 Getting late - didn't really pay attention to your code I guess - why are you 
 adding the first doc without specifying the distrib update chain? This is not 
 really supported. It's going to just go to the server you specified - even 
 with everything setup right, the update might then go to that same server or 
 the other one depending on how it hashes. You really want to just always use 
 the distrib update chain.  I guess I don't yet understand what you are trying 
 to test.

 Sent from my iPad

 On Dec 1, 2011, at 10:57 PM, Mark Miller markrmil...@gmail.com wrote:

 Not sure offhand - but things will be funky if you don't specify the correct 
 numShards.

 The instance to shard assignment should be using numShards to assign. But 
 then the hash to shard mapping actually goes on the number of shards it 
 finds registered in ZK (it doesn't have to, but really these should be 
 equal).

 So basically you are saying, I want 3 partitions, but you are only starting 
 up 2 nodes, and the code is just not happy about that I'd guess. For the 
 system to work properly, you have to fire up at least as many servers as 
 numShards.

 What are you trying to do? 2 partitions with no replicas, or one partition 
 with one replica?

 In either case, I think you will have better luck if you fire up at least as 
 many servers as the numShards setting. Or lower the numShards setting.

 This is all a work in progress by the way - what you are trying to test 
 should work if things are setup right though.

 - Mark


 On Dec 1, 2011, at 10:40 PM, Jamie Johnson wrote:

 Thanks for the quick response.  With that change (have not done
 numShards yet) shard1 got updated.  But now when executing the
 following queries I get information back from both, which doesn't seem
 right

  http://localhost:7574/solr/select/?q=*:*
  <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>
 
  http://localhost:8983/solr/select?q=*:*
  <doc><str name="key">1</str><str name="content_mvtxt">updated value</str></doc>



 On Thu, Dec 1, 2011 at 10:21 PM, Mark Miller markrmil...@gmail.com wrote:
 Hmm...sorry bout that - so my first guess is that right now we are not 
 distributing a commit (easy to add, just have not done it).

 Right now I explicitly commit on each server for tests.

 Can you try explicitly committing on server1 after updating the doc on 
 server 2?

 I can start distributing commits tomorrow - been meaning to do it for my 
 own convenience anyhow.

 Also, you want to pass the sys property numShards=1 on startup. I think it 
 defaults to 3. That will give you one leader and one replica.

 - Mark

 On Dec 1, 2011, at 9:56 PM, Jamie Johnson wrote:

 So I 

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
Well, this goes both ways.

It is not that unusual to take a node down for maintenance of some kind or
even to have a node failure.  In that case, it is very nice to have the
load from the lost node be spread fairly evenly across the remaining
cluster.

Regarding the cost of having several micro-shards, they are also an
opportunity for threading the search.  Most sites don't have enough queries
coming in to occupy all of the cores in modern machines so threading each
query can actually be a substantial benefit in terms of query time.
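
(A hypothetical illustration of that point: fan one query out across several
micro-shards on a thread pool. The 'shardServers' list is an assumption here,
and merging of the per-shard results is left out - this is not how Solr's own
distributed search is implemented.)

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  // Query every micro-shard on its own thread and collect the raw responses.
  static List<QueryResponse> queryInParallel(List<SolrServer> shardServers,
                                             final String q) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(shardServers.size());
    try {
      List<Future<QueryResponse>> futures = new ArrayList<Future<QueryResponse>>();
      for (final SolrServer shard : shardServers) {
        futures.add(pool.submit(new Callable<QueryResponse>() {
          public QueryResponse call() throws Exception {
            return shard.query(new SolrQuery(q));   // one micro-shard per thread
          }
        }));
      }
      List<QueryResponse> results = new ArrayList<QueryResponse>();
      for (Future<QueryResponse> f : futures) {
        results.add(f.get());                       // result merging omitted
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }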

On Thu, Dec 1, 2011 at 6:37 PM, Mark Miller markrmil...@gmail.com wrote:

 To kick things off though, adding another partition should be a rare event
 if you plan carefully, and I think many will be able to handle the cost of
 splitting (you might even mark the replica you are splitting on so that
 it's not part of queries while its 'busy' splitting).



Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
With micro-shards, you can use random numbers for all placements with minor
constraints like avoiding replicas sitting in the same rack.  Since the
number of shards never changes, things stay very simple.
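
(A toy sketch of that placement idea, assuming we know each node's rack up
front and have at least as many racks as replicas per shard. The node list,
rack map, and method names are made up for illustration only.)

  import java.util.ArrayList;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Random;
  import java.util.Set;

  // Pick nodes at random for each replica of a micro-shard, skipping any
  // candidate whose rack already hosts one of the chosen replicas.
  static List<String> placeReplicas(List<String> nodes, Map<String, String> rackOf,
                                    int replicationFactor, Random rand) {
    List<String> chosen = new ArrayList<String>();
    Set<String> usedRacks = new HashSet<String>();
    while (chosen.size() < replicationFactor) {
      String candidate = nodes.get(rand.nextInt(nodes.size()));
      if (usedRacks.contains(rackOf.get(candidate))) {
        continue;   // same rack as an existing replica - try another node
      }
      usedRacks.add(rackOf.get(candidate));
      chosen.add(candidate);
    }
    return chosen;
  }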

On Thu, Dec 1, 2011 at 6:44 PM, Mark Miller markrmil...@gmail.com wrote:

 Sorry - missed something - you also have the added cost of shipping the
 new half index to all of the replicas of the original shard with the
 splitting method. Unless you somehow split on every replica at the same
 time - then of course you wouldn't be able to avoid the 'busy' replica, and
 it would probably be fairly hard to juggle.


 On Dec 1, 2011, at 9:37 PM, Mark Miller wrote:

  In this case we are still talking about moving a whole index at a time
 rather than lots of little documents. You split the index into two, and
 then ship one of them off.
 
  The extra cost you can avoid with micro sharding will be the cost of
 splitting the index - which could be significant for a very large index. I
 have not done any tests though.
 
  The cost of 20 micro-shards is that you will always have tons of
 segments unless you are very heavily merging - and even in the very unusual
 case of each micro shard being optimized, you have essentially 20 segments.
 That's the best case - the normal case is likely in the hundreds.
 
  This can be a fairly significant % hit at search time.
 
  You also have the added complexity of managing 20 indexes per node in
 solr code.
 
  I think that both options have their +/-'s and eventually we could
 perhaps support both.
 
  To kick things off though, adding another partition should be a rare
 event if you plan carefully, and I think many will be able to handle the
 cost of splitting (you might even mark the replica you are splitting on so
 that it's not part of queries while its 'busy' splitting).
 
  - Mark
 
  On Dec 1, 2011, at 9:17 PM, Ted Dunning wrote:
 
  Of course, resharding is almost never necessary if you use micro-shards.
  Micro-shards are shards small enough that you can fit 20 or more on a
  node.  If you have that many on each node, then adding a new node
 consists
  of moving some shards to the new machine rather than moving lots of
 little
  documents.
 
  Much faster.  As in thousands of times faster.
 
  On Thu, Dec 1, 2011 at 5:51 PM, Jamie Johnson jej2...@gmail.com
 wrote:
 
  Yes, the ZK method seems much more flexible.  Adding a new shard would
  be simply updating the range assignments in ZK.  Where is this
  currently on the list of things to accomplish?  I don't have time to
  work on this now, but if you (or anyone) could provide direction I'd
  be willing to work on this when I had spare time.  I guess a JIRA
  detailing where/how to do this could help.  Not sure if the design has
  been thought out that far though.
 
  On Thu, Dec 1, 2011 at 8:15 PM, Mark Miller markrmil...@gmail.com
 wrote:
  Right now lets say you have one shard - everything there hashes to
 range
  X.
 
  Now you want to split that shard with an Index Splitter.
 
  You divide range X in two - giving you two ranges - then you start
  splitting. This is where the current Splitter needs a little
 modification.
  You decide which doc should go into which new index by rehashing each
 doc
  id in the index you are splitting - if its hash is greater than X/2, it
  goes into index1 - if its less, index2. I think there are a couple
 current
  Splitter impls, but one of them does something like, give me an id -
 now if
  the id's in the index are above that id, goto index1, if below,
 index2. We
  need to instead do a quick hash rather than simple id compare.
 
  Why do you need to do this on every shard?
 
  The other part we need that we dont have is to store hash range
  assignments in zookeeper - we don't do that yet because it's not needed
  yet. Instead we currently just simply calculate that on the fly (too
 often
  at the moment - on every request :) I intend to fix that of course).
 
  At the start, zk would say, for range X, goto this shard. After the
  split, it would say, for range less than X/2 goto the old node, for
 range
  greater than X/2 goto the new node.
 
  - Mark
 
  On Dec 1, 2011, at 7:44 PM, Jamie Johnson wrote:
 
  hmmm.This doesn't sound like the hashing algorithm that's on the
  branch, right?  The algorithm you're mentioning sounds like there is
  some logic which is able to tell that a particular range should be
  distributed between 2 shards instead of 1.  So seems like a trade off
  between repartitioning the entire index (on every shard) and having a
  custom hashing algorithm which is able to handle the situation where
 2
  or more shards map to a particular range.
 
  On Thu, Dec 1, 2011 at 7:34 PM, Mark Miller markrmil...@gmail.com
  wrote:
 
  On Dec 1, 2011, at 7:20 PM, Jamie Johnson wrote:
 
  I am not familiar with the index splitter that is in contrib, but
 I'll
  take a look at it soon.  So the process sounds like it would be to
 run
  this on all of the current shards