Thanks Shawn for the detailed instructions. About the router: it is implicit.
About the replicas: I followed the example at http://wiki.apache.org/solr/SolrCloud

I start the shards with the following (paths and ports simplified):

    cd /.../solr/shard1/
    /usr/bin/java -Djetty.port=1 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun=localhost:0 -DnumShards=4 -jar start.jar > /.../log/shard_1.log

    cd /.../solr/shard2/
    /usr/bin/java -Djetty.port=2 -DzkHost=localhost:0 -jar start.jar > /.../log/shard_2.log

and the same thing for the two other shards, each on its own port.

To post a document (CSV file), I use:

    curl http://localhost:shardport/solr/update --data-binary file.csv -H 'Content-type:text/csv; charset=ISO-8859-1'

I just re-read the example page at http://wiki.apache.org/solr/SolrCloud and I see that there is no difference between starting a shard and starting a replica. I must be missing something:

From Example A (two shards):

    cd example2
    java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

From Example B (two shards with replicas):

    cd exampleB
    java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar

Thanks.
Thierry

On Mon, Aug 12, 2013 at 5:04 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 8/12/2013 4:50 PM, Thierry Thelliez wrote:
>
>> Hello, I am trying to set up a four-shard system for the first time. I do
>> not understand why the data on all the shards is growing at about the same
>> rate when I push the documents to only one shard.
>>
>> The four shards represent four calendar years. And for now, on a
>> development machine, these four shards run on four different ports.
>>
>> The first shard is started with Zookeeper.
>>
>> The log of the other shards is filled with something like:
>>
>> 7882051 [qtp1154079020-1245] INFO
>> org.apache.solr.update.processor.LogUpdateProcessor – [collection1]
>> webapp=/solr path=/update
>> params={distrib.from=http://x.y.z.4:50121/solr/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
>> {add=[14939-96467-304 (1443204912169091072), 14939-96467-308
>> (1443204912179576832), 14939-96467-310 (1443204912185868288),
>> 14939-96467-311 (1443204912192159744), 14939-96467-313
>> (1443204912204742656), 14939-96467-314 (1443204912220471296),
>> 14939-96467-318 (1443204912239345664), 14939-96467-319
>> (1443204912250880000), 14939-96467-322 (1443204912257171456),
>> 14939-96467-324 (1443204912263462912)]} 0 282
>>
>> What is getting written to the other shards? Is a separate index computed
>> on all four shards? I thought that when pushing a document to one shard,
>> only that shard would update its index.
>
> There are two possibilities.
>
> 1) You don't have four shards, you have four replicas of one shard. If
> this is happening, then they all will receive all documents.
>
> 2) You are using a router like compositeId instead of implicit. This will
> calculate the hash of the id field and evenly divide the documents among
> all the shards in the collection according to the hash value. If you
> create the collection with the implicit router, then documents should be
> indexed by the shard that received them.
>
> To see what router you have, click on Cloud in the admin UI, then click on
> Tree. Click the arrow to the left of '/collections' to open it. Click on
> collection1 (or whichever you are actually using) -- the actual name, not
> the arrow. Underneath the table that appears to the right will be "router"
> and its value.
>
> Thanks,
> Shawn
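
To make the difference between the two routers in the quoted reply concrete, here is a toy shell sketch. It is not Solr's code and makes several assumptions: the function names hash_route and implicit_route are made up for illustration, cksum stands in for the hash that compositeId really uses (MurmurHash3), and a simple modulo stands in for the per-shard hash ranges. It only shows why hashing the id spreads documents over all shards, while implicit routing leaves each document on the shard that received it.

```shell
#!/bin/sh
# Toy illustration only. Assumptions: cksum (a CRC) replaces Solr's real
# hash function, and "modulo number of shards" replaces Solr's hash ranges.

# hash_route ID NUM_SHARDS
# Prints a shard number in 1..NUM_SHARDS derived from a hash of the id.
hash_route() {
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( $crc % $2 + 1 ))
}

# implicit_route RECEIVING_SHARD
# Prints the shard the document was sent to: no redistribution happens.
implicit_route() {
  echo "$1"
}

# Send four ids from the log above to "shard 1" and compare both behaviours.
for id in 14939-96467-304 14939-96467-308 14939-96467-310 14939-96467-311; do
  echo "$id  hash-routed: shard$(hash_route "$id" 4)  implicit: shard$(implicit_route 1)"
done
```

With hash routing the target shard depends only on the id, so posting everything to one port still grows every shard. With implicit routing the target is simply the receiving shard.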