Couple things to start:

By default SolrCloud distributes updates a doc at a time. So if you have 1 
shard, whatever node you index too, it will send updates to the other. 
Replication is only used for recovery, not distributing data. So for some 
reason, there is an IOException when it tries to forward.

The other issue is not something that Ive seen reported. Can/did you try and do 
another hard commit to make sure you had the latest search open when checking 
the # of docs on each node? There was previously a race around commit that 
could cause some issues around expected visibility. 

If you are able to, you might try out a nightly build - 4.1 will be ready very 
soon and has numerous bug fixes for SolrCloud.

- Mark

On Dec 13, 2012, at 9:53 AM, John Nielsen <j...@mcb.dk> wrote:

> Hi all,
> 
> We are seeing a strange problem on our 2-node solr4 cluster. This problem
> has resultet in data loss.
> 
> We have two servers, varnish01 and varnish02. Zookeeper is running on
> varnish02, but in a separate jvm.
> 
> We index directly to varnish02 and we read from varnish01. Data is thus
> replicated from varnish02 to varnish01.
> 
> I found this in the varnish01 log:
> 
> *INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2}
> status=0 QTime=42
> Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2}
> status=0 QTime=41
> Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2}
> status=0 QTime=33
> Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/&update.distrib=TOLEADER&wt=javabin&version=2}
> status=0 QTime=33
> Dec 13, 2012 12:23:39 PM org.apache.solr.common.SolrException log
> SEVERE: shard update error StdNode:
> http://varnish02.lynero.net:8000/solr/default1_Norwegian/:org.apache.solr.client.solrj.SolrServerException:
> IOException occured when talking to server at:
> http://varnish02.lynero.net:8000/solr/default1_Norwegian
>    at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>    at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>    at
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:335)
>    at
> org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:309)
>    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>    at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>    at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>    at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>    at java.lang.Thread.run(Thread.java:636)
> Caused by: org.apache.http.NoHttpResponseException: The target server
> failed to respond
>    at
> org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:101)
>    at
> org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252)
>    at
> org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:282)
>    at
> org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247)
>    at
> org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:216)
>    at
> org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:298)
>    at
> org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
>    at
> org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:647)
>    at
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464)
>    at
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>    at
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>    at
> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>    at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>    ... 11 more
> 
> Dec 13, 2012 12:23:39 PM
> org.apache.solr.update.processor.DistributedUpdateProcessor doFinish
> INFO: try and ask http://varnish02.lynero.net:8000/solr to recover*
> 
> It looks like it is sending updates from varnish01 to varnish02. I am not
> sure for what since we only index on varnish02. Updates should never be
> going from varnish01 to varnish02.
> 
> Meanwhile on varnish02:
> 
> *INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=16
> Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=15
> Dec 13, 2012 12:23:36 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=16
> Dec 13, 2012 12:23:42 PM org.apache.solr.handler.admin.CoreAdminHandler
> handleRequestRecoveryAction
> INFO: It has been requested that we recover*
> *Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Danish] webapp=/solr path=/select
> params={facet=false&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url=
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822111&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> status=0 QTime=1
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Danish] webapp=/solr path=/select/
> params={fq=site_guid:(2810678)&q=win} hits=0 status=0 QTime=17
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Danish] webapp=/solr path=/select
> params={facet=on&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&group.distributed.second=true&version=2&df=text&fl=docid&shard.url=
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822111&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> status=0 QTime=1
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Danish] webapp=/solr path=/select
> params={facet=false&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url=
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822138&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=40&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> status=0 QTime=1
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Danish] webapp=/solr path=/select
> params={facet=on&sort=item_group_59700_name_int+asc,+variant_of_item_guid+asc&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&group.distributed.second=true&version=2&df=text&fl=docid&shard.url=
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397822138&group.field=groupby_variant_of_item_guid&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_59700_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&group.topgroups.groupby_variant_of_item_guid=2963217&group.topgroups.groupby_variant_of_item_guid=2963223&group.topgroups.groupby_variant_of_item_guid=2963219&group.topgroups.groupby_variant_of_item_guid=2963220&group.topgroups.groupby_variant_of_item_guid=2963221&group.topgroups.groupby_variant_of_item_guid=2963222&group.topgroups.groupby_variant_of_item_guid=2963224&group.topgroups.groupby_variant_of_item_guid=2963218&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=40&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> status=0 QTime=1
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=26
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=22
> Dec 13, 2012 12:23:42 PM org.apache.solr.update.DefaultSolrCoreState
> doRecovery
> Dec 13, 2012 12:23:42 PM org.apache.solr.update.DefaultSolrCoreState
> doRecovery
> INFO: Running recovery - first canceling any ongoing recovery
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=25
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=24
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=20
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=25
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=23
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=21
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=23
> Dec 13, 2012 12:23:42 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Norwegian] webapp=/solr path=/update params={distrib.from=
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/&update.distrib=FROMLEADER&wt=javabin&version=2}
> status=0 QTime=16
> Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy run
> INFO: Starting recovery process.  core=default1_Norwegian
> recoveringAfterStartup=false
> Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader
> updateClusterState
> INFO: Updating cloud state from ZooKeeper...
> Dec 13, 2012 12:23:42 PM
> org.apache.solr.update.processor.LogUpdateProcessor finish*
> 
> And less than a second later:
> 
> *Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: Attempting to PeerSync from
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/core=default1_Norwegian
> - recoveringAfterStartup=false
> Dec 13, 2012 12:23:42 PM org.apache.solr.update.PeerSync sync
> INFO: PeerSync: core=default1_Norwegian url=
> http://varnish02.lynero.net:8000/solr START replicas=[
> http://varnish01.lynero.net:8000/solr/default1_Norwegian/] nUpdates=100
> Dec 13, 2012 12:23:42 PM org.apache.solr.update.PeerSync sync
> WARNING: PeerSync: core=default1_Norwegian url=
> http://varnish02.lynero.net:8000/solr too many updates received since start
> - startingUpdates no longer overlaps with our currentUpdates
> Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: PeerSync Recovery was not successful - trying replication.
> core=default1_Norwegian
> Dec 13, 2012 12:23:42 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: Starting Replication Recovery. core=default1_Norwegian
> Dec 13, 2012 12:23:42 PM org.apache.solr.client.solrj.impl.HttpClientUtil
> createClient
> INFO: Creating new http client,
> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
> Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader$2
> process
> INFO: A cluster state change has occurred - updating...*
> 
> State change on varnish01 at the same time:
> 
> *Dec 13, 2012 12:23:42 PM org.apache.solr.common.cloud.ZkStateReader$2
> process
> INFO: A cluster state change has occurred - updating...*
> *
> *And a few seconds later on varnish02, the recovery finishes:
> *
> Dec 13, 2012 12:23:48 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: Replication Recovery was successful - registering as Active.
> core=default1_Norwegian
> Dec 13, 2012 12:23:48 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> INFO: Finished recovery process. core=default1_Norwegian
> Dec 13, 2012 12:23:48 PM org.apache.solr.core.SolrCore execute
> INFO: [default1_Danish] webapp=/solr path=/select
> params={facet=false&sort=item_group_56823_name_int+asc,+variant_of_item_guid+asc&group.distributed.first=true&facet.limit=1000&q.alt=*:*&q.alt=*:*&distrib=false&facet.method=enum&version=2&df=text&fl=docid&shard.url=
> varnish02.lynero.net:8000/solr/default1_Danish/|varnish01.lynero.net:8000/solr/default1_Danish/&NOW=1355397828395&group.field=groupby_variant_of_item_guid&facet.field=itemgroups_int_mv&fq=site_guid:(11440)&fq=item_type:(PRODUCT)&fq=language_guid:(1)&fq=item_group_56823_combination:(*)&fq=item_group_45879_combination:(*)&fq=is_searchable:(True)&querytype=Technical&mm=100%25&facet.missing=on&group.ngroups=true&facet.mincount=1&qf=%0a++++++++++text^0.5+name^1.2+searchable_text^0.8+typeahead_text^1.0+keywords^1.1+item_no^5.0%0a++++++++++ranking1_text^1.0+ranking2_text^2.0+ranking3_text^3.0%0a+++++++&wt=javabin&group.facet=true&defType=edismax&rows=0&facet.sort=lex&start=0&group=true&group.sort=name+asc&isShard=true}
> status=0 QTime=8
> Dec 13, 2012 12:23:48 PM org.apache.solr.common.cloud.ZkStateReader
> updateClusterState
> INFO: Updating cloud state from ZooKeeper... *
> 
> Which is picked up on varnish01:
> 
> *Dec 13, 2012 12:23:48 PM org.apache.solr.common.cloud.ZkStateReader$2
> process
> INFO: A cluster state change has occurred - updating...*
> 
> It looks like it replicated successfully, only it didnt. The
> default1_Norwegian core on varnish01 now has 55.071 docs and the same core
> on varnish02 has 35.088 docs.
> 
> I checked the log files for both JVM's and no stop-the-world GC were taking
> place.
> 
> There is also nothing in the zookeeper log of interest that I can see.
> 
> 
> -- 
> Med venlig hilsen / Best regards
> 
> *John Nielsen*
> Programmer
> 
> 
> 
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
> 
> Kundeservice: +45 9610 2824
> p...@mcb.dk
> www.mcb.dk

Reply via email to