Re: Solrcloud Index corruption
> Ahhh, ok. When you reloaded the cores, did you do it core-by-core?

Yes, but maybe we reloaded the wrong core or something like that. We also noticed that the startTime doesn't update in the admin UI when switching between cores (you have to reload the page). We still use 4.8.1, so maybe it is fixed in a later version. We will see after our next upgrade; if not, we will file an issue for it.

Martin

Erick Erickson wrote on 10.03.2015 18:21:

> Ahhh, ok. When you reloaded the cores, did you do it core-by-core? I can see how something could get dropped in that case. However, if you used the Collections API and two cores mysteriously failed to reload, that would be a bug. Assuming the replicas in question were up and running at the time you reloaded.
>
> Thanks for letting us know what's going on.
>
> Erick
>
> On Tue, Mar 10, 2015 at 4:34 AM, Martin de Vries wrote:
>
>> Hi,
>>
>>> this _sounds_ like you somehow don't have indexed="true" set for the field in question.
>>
>> We investigated a lot more. The CheckIndex tool didn't find any error. We now think the following happened:
>> - We changed the schema two months ago: we changed a field to indexed="true". We reloaded the cores, but two of them don't seem to have been reloaded (maybe we forgot).
>> - We reindexed all content. The new field worked fine.
>> - We think the leader changed to a server that didn't reload the core.
>> - After that the field stopped working for newly indexed documents.
>>
>> Thanks for your help.
>>
>> Martin
>>
>> Erick Erickson wrote on 06.03.2015 17:02:
>>
>>> bq: You say in our case some docs didn't made it to the node, but that's not really true: the docs can be found on the corrupted nodes when I search on ID. The docs are also complete. The problem is that the docs do not appear when I filter on certain fields
>>>
>>> This _sounds_ like you somehow don't have indexed="true" set for the field in question. But it also sounds like you're saying that search on that field works on some nodes but not on others; I'm assuming you're adding "&distrib=false" to verify this. It shouldn't be possible to have different schema.xml files on the different nodes, but you might try checking through the admin UI. Network burps shouldn't be related here. If the content is stored, then the info made it to Solr intact, so this issue shouldn't be related to that. Sounds like it may just be the bugs Mark is referencing; sorry, I don't have the JIRA numbers right off.
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Mar 5, 2015 at 4:46 PM, Shawn Heisey wrote:
>>>
>>>> On 3/5/2015 3:13 PM, Martin de Vries wrote:
>>>>
>>>>> I understand there is not a "master" in SolrCloud. In our case we use haproxy as a load balancer for every request. So when indexing, every document will be sent to a different Solr server, immediately after each other. Maybe SolrCloud is not able to handle that correctly?
>>>>
>>>> SolrCloud can handle that correctly, but currently sending index updates to a core that is not the leader of the shard will incur a significant performance hit, compared to always sending updates to the correct core. A small performance penalty would be understandable, because the request must be redirected, but what actually happens is a much larger penalty than anyone expected. We have an issue in Jira to investigate that performance issue and make it work as efficiently as possible. Indexing batches of documents is recommended, not sending one document per update request.
>>>>
>>>> General performance problems with Solr itself can lead to extremely odd and unpredictable behavior from SolrCloud. Most often these kinds of performance problems are related in some way to memory, either the Java heap or available memory in the system.
>>>>
>>>> http://wiki.apache.org/solr/SolrPerformanceProblems
>>>>
>>>> Thanks,
>>>> Shawn
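[Editor's note] Reloading every replica of a collection in one call (rather than core-by-core) goes through the Collections API, and per-core results can be compared with &distrib=false, as Erick suggests. A minimal sketch; host, port, collection, and core names are hypothetical placeholders, not the poster's actual setup:

```shell
# Reload all cores of a collection at once via the Collections API
# (host/port and the collection name "dn_collection" are hypothetical).
SOLR="http://localhost:8983/solr"
RELOAD_URL="${SOLR}/admin/collections?action=RELOAD&name=dn_collection"

# Query one specific core without distributed fan-out, to compare nodes:
CHECK_URL="${SOLR}/dn_collection_shard1_replica1/select?q=text:foo&distrib=false"

echo "$RELOAD_URL"
echo "$CHECK_URL"
# curl "$RELOAD_URL"    # run these against a live node
# curl "$CHECK_URL"
```

Running the same distrib=false query against each replica of a shard is the quickest way to spot the replica whose index disagrees with the others.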
Re: Solrcloud Index corruption
Hi,

> this _sounds_ like you somehow don't have indexed="true" set for the field in question.

We investigated a lot more. The CheckIndex tool didn't find any error. We now think the following happened:

- We changed the schema two months ago: we changed a field to indexed="true". We reloaded the cores, but two of them don't seem to have been reloaded (maybe we forgot).
- We reindexed all content. The new field worked fine.
- We think the leader changed to a server that didn't reload the core.
- After that the field stopped working for newly indexed documents.

Thanks for your help.

Martin

Erick Erickson wrote on 06.03.2015 17:02:

> bq: You say in our case some docs didn't made it to the node, but that's not really true: the docs can be found on the corrupted nodes when I search on ID. The docs are also complete. The problem is that the docs do not appear when I filter on certain fields
>
> This _sounds_ like you somehow don't have indexed="true" set for the field in question. But it also sounds like you're saying that search on that field works on some nodes but not on others; I'm assuming you're adding "&distrib=false" to verify this. It shouldn't be possible to have different schema.xml files on the different nodes, but you might try checking through the admin UI. Network burps shouldn't be related here. If the content is stored, then the info made it to Solr intact, so this issue shouldn't be related to that. Sounds like it may just be the bugs Mark is referencing; sorry, I don't have the JIRA numbers right off.
>
> Best,
> Erick
>
> On Thu, Mar 5, 2015 at 4:46 PM, Shawn Heisey wrote:
>
>> On 3/5/2015 3:13 PM, Martin de Vries wrote:
>>
>>> I understand there is not a "master" in SolrCloud. In our case we use haproxy as a load balancer for every request. So when indexing, every document will be sent to a different Solr server, immediately after each other. Maybe SolrCloud is not able to handle that correctly?
>>
>> SolrCloud can handle that correctly, but currently sending index updates to a core that is not the leader of the shard will incur a significant performance hit, compared to always sending updates to the correct core. A small performance penalty would be understandable, because the request must be redirected, but what actually happens is a much larger penalty than anyone expected. We have an issue in Jira to investigate that performance issue and make it work as efficiently as possible. Indexing batches of documents is recommended, not sending one document per update request.
>>
>> General performance problems with Solr itself can lead to extremely odd and unpredictable behavior from SolrCloud. Most often these kinds of performance problems are related in some way to memory, either the Java heap or available memory in the system.
>>
>> http://wiki.apache.org/solr/SolrPerformanceProblems
>>
>> Thanks,
>> Shawn
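[Editor's note] Shawn's advice to send batches rather than one document per update request can be sketched as a single JSON update call; the collection name and documents are hypothetical:

```shell
# Batch several documents into one update request instead of one request
# per document (collection name and doc fields are hypothetical).
BATCH_URL="http://localhost:8983/solr/dn_collection/update"
BATCH_BODY='[{"id":"1"},{"id":"2"},{"id":"3"}]'

echo "$BATCH_URL"
echo "$BATCH_BODY"
# curl -H 'Content-Type: application/json' -d "$BATCH_BODY" "$BATCH_URL"
# curl "$BATCH_URL?commit=true"   # commit once at the end, not per doc
```

Committing once per batch (or relying on autoCommit) rather than per document also avoids the searcher-churn that single-document updates cause.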
Re: Solrcloud Index corruption
Hi Erick,

Thank you for your detailed reply.

You say in our case some docs didn't make it to the node, but that's not really true: the docs can be found on the corrupted nodes when I search on ID. The docs are also complete. The problem is that the docs do not appear when I filter on certain fields (however, the fields are in the doc and have the right value when I search on ID). So something seems to be corrupt in the filter index. We will try CheckIndex; hopefully it is able to identify the problematic cores.

I understand there is not a "master" in SolrCloud. In our case we use haproxy as a load balancer for every request. So when indexing, every document will be sent to a different Solr server, immediately after each other. Maybe SolrCloud is not able to handle that correctly?

Thanks,

Martin

Erick Erickson wrote on 05.03.2015 19:00:

> Wait up. There's no "master" index in SolrCloud. Raw documents are forwarded to each replica, indexed and put in the local tlog. If a replica falls too far out of sync (say you take it offline), then the entire index _can_ be replicated from the leader and, if the leader's index was incomplete, then that might propagate the error. The practical consequence of this is that if _any_ replica has a complete index, you can recover.
>
> Before going there, though, the brute-force approach is to just re-index everything from scratch. That's likely easier, especially on indexes this size.
>
> Here's what I'd do. Assuming you have the Collections API calls for ADDREPLICA and DELETEREPLICA, then:
>
> 0> Identify the complete replicas. If you're lucky you have at least one for each shard.
> 1> Copy 1 good index from each shard somewhere just to have a backup.
> 2> DELETEREPLICA on all the incomplete replicas.
> 2.5> I might shut down all the nodes at this point and check that all the cores I'd deleted were gone. If any remnants exist, 'rm -rf deleted_core_dir'.
> 3> ADDREPLICA to get back the ones you removed. This should copy the entire index from the leader for each replica.
>
> As you do, the leadership will change, and after you've deleted all the incomplete replicas, one of the complete ones will be the leader and you should be OK.
>
> If you don't want to/can't use the Collections API, then:
>
> 0> Identify the complete replicas. If you're lucky you have at least one for each shard.
> 1> Shut 'em all down.
> 2> Copy the good index somewhere just to have a backup.
> 3> 'rm -rf data' for all the incomplete cores.
> 4> Bring up the good cores.
> 5> Bring up the cores that you deleted the data dirs from. What this should do is replicate the entire index from the leader. When you restart the good cores (step 4 above), they'll _become_ the leader.
>
> bq: Is it possible to make Solrcloud invulnerable for network problems
>
> I'm a little surprised that this is happening. It sounds like the network problems were such that some nodes weren't out of touch long enough for Zookeeper to sense that they were down and put them into recovery. Not sure there's any way to secure against that.
>
> bq: Is it possible to see if a core is corrupt?
>
> There's "CheckIndex", here's at least one link: http://java.dzone.com/news/lucene-and-solrs-checkindex
>
> What you're describing, though, is that docs just didn't make it to the node, _not_ that the index has unexpected bits, bad disk sectors and the like, so CheckIndex can't detect that. How would it know what _should_ have been in the index?
>
> bq: I noticed a difference in the "Gen" column on Overview - Replication. Does this mean there is something wrong?
>
> You cannot infer anything from this. In particular, the merging will be significantly different between a single full-reindex and what the state of segment merges is in an incrementally built index. The admin UI screen is rooted in the pre-cloud days; the Master/Slave thing is entirely misleading. In SolrCloud, since all the raw data is forwarded to all replicas, and any auto commits that happen may very well be slightly out of sync, the index size, number of segments, generations, and all that are pretty safely ignored.
>
> Best,
> Erick
>
> On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries wrote:
>
>> Hi Andrew,
>>
>> Even our master index is corrupt, so I'm afraid this won't help in our case.
>>
>> Martin
>>
>> Andrew Butkus wrote on 05.03.2015 16:45:
>>
>>> Force a fetchindex on slave from master command: http://slave_host:port/solr/replication?command=fetchindex - from http://wiki.apache.org/solr/SolrReplication
>>>
>>> The above command will download the whole index from master to slave. There are configuration options in Solr to make this problem happen less often (allowing it to recover from new documents added and only send the changes with a wider gap), but I can't remember what those were.
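[Editor's note] Erick's Collections-API route (steps 2 and 3 above) amounts to two HTTP calls per bad replica. A sketch; the host, collection, shard, and replica names are hypothetical stand-ins for the real ones:

```shell
# Sketch of the recovery dance Erick describes (all names hypothetical).
SOLR="http://localhost:8983/solr"
COLL="dn_collection"

# 2> drop an incomplete replica
DELETE_URL="${SOLR}/admin/collections?action=DELETEREPLICA&collection=${COLL}&shard=shard1&replica=core_node3"
# 3> add a fresh replica; it pulls the full index from the current leader
ADD_URL="${SOLR}/admin/collections?action=ADDREPLICA&collection=${COLL}&shard=shard1"

echo "$DELETE_URL"
echo "$ADD_URL"
# curl "$DELETE_URL" && curl "$ADD_URL"   # against a live cluster
```

Deleting all incomplete replicas first, as Erick notes, ensures a complete replica holds leadership before the fresh replicas replicate from it.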
RE: Solrcloud Index corruption
Hi Andrew,

Even our master index is corrupt, so I'm afraid this won't help in our case.

Martin

Andrew Butkus wrote on 05.03.2015 16:45:

> Force a fetchindex on slave from master command: http://slave_host:port/solr/replication?command=fetchindex - from http://wiki.apache.org/solr/SolrReplication
>
> The above command will download the whole index from master to slave. There are configuration options in Solr to make this problem happen less often (allowing it to recover from new documents added and only send the changes with a wider gap), but I can't remember what those were.
Solrcloud Index corruption
Hi,

We have index corruption on some cores on our Solrcloud running version 4.8.1. The index is corrupt on several servers. (For example: when we do an fq search we get results on some servers and not on others, while the stored document contains the field on all servers.) A full re-index of the content didn't help, so we created a new core and did the reindex on that one. We think the index corruption is caused by network issues we had a few weeks ago.

I hope someone can help us with some questions:

- Is it possible to make Solrcloud invulnerable to network problems like packet loss or connection errors? Will it for example help to use an SSL connection between the Solr servers?
- Is it possible to see if a core is corrupt? We noticed it only because we didn't find some documents while searching on the website, but we don't know whether other cores are corrupt. I noticed a difference in the "Gen" column on Overview - Replication. Does this mean there is something wrong? Or is there any other way to see the corruption?

Corrupt core:

                       Version        Gen      Size
  Master (Searching)   1425565575249  2023309  472.41 MB
  Master (Replicable)  1425566098510  2023310  -
  Slave (Searching)    1425565575253  2023308  472.38 MB

Re-created core:

                       Version        Gen      Size
  Master (Searching)   1425566108174  35       283.98 MB
  Master (Replicable)  1425566108174  35       -
  Slave (Searching)    1425566106674  35       288.24 MB

Kind regards,

Martin
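[Editor's note] On the "is it possible to see if a core is corrupt" question: later in the thread Erick points to Lucene's CheckIndex tool, which validates the index structure itself (it cannot detect documents that simply never arrived). A sketch of the invocation; jar and index paths are hypothetical and depend on the install, and the core should be offline (or the index copied) first:

```shell
# Run Lucene's CheckIndex against one core's index directory.
# Both paths below are hypothetical; use the lucene-core jar matching
# your Solr version, and stop Solr (or work on a copy) before running.
CHECK_CMD="java -cp /opt/solr/lib/lucene-core-4.8.1.jar org.apache.lucene.index.CheckIndex /var/solr/dn_shard1_replica1/data/index"

echo "$CHECK_CMD"
# eval "$CHECK_CMD"   # on a real server; add -fix only after backing up
```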
Different Solr versions in Solrcloud
Hi,

I have two questions about upgrading Solr:

- We upgrade Solr often, to match the latest version. We have a number of servers in a Solrcloud and prefer to upgrade one or two servers first and upgrade the other servers a few weeks later, when we are sure everything is stable. Is this the recommended way? Can Solr run different versions next to each other in a cloud?
- Do we need to adjust the luceneMatchVersion with every upgrade, and do we need to reindex after every upgrade? (It takes a lot of time to reindex all cores.)

Kind regards,

Martin
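[Editor's note] For reference, luceneMatchVersion is an element in each core's solrconfig.xml; leaving it at the old value keeps the old analysis behavior, and bumping it is what may require a reindex. A throwaway sketch of bumping it, run against a temp file rather than a live core (the version numbers are illustrative):

```shell
# Demo on a temp file; on a real node edit each core's conf/solrconfig.xml.
conf=$(mktemp)
echo '<luceneMatchVersion>4.6</luceneMatchVersion>' > "$conf"

# Bump the version after upgrading the binaries:
sed -i 's/4\.6/4.7/' "$conf"
cat "$conf"   # <luceneMatchVersion>4.7</luceneMatchVersion>
```

After changing it, the core must at least be reloaded for the new value to take effect.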
Re: SolrCloud constantly crashes after upgrading to Solr 4.7
We have been running stable for a full day now, so the bug is fixed.

Many thanks!

Martin
Re: SolrCloud constantly crashes after upgrading to Solr 4.7
> Martin, I’ve committed the SOLR-5875 fix, including to the lucene_solr_4_7 branch. Any chance you could test the fix?

Hi Steve,

I'm very happy you found the bug. We are running the version from SVN on one server and it has already been running fine for 5 hours. If it's still stable tomorrow then we are absolutely sure; I will report it here.

Marijn
Re: SolrCloud constantly crashes after upgrading to Solr 4.7
Hi,

When our server crashes the memory fills up fast, so I think it might be a specific query that causes our servers to crash. I think the query won't be logged because it doesn't finish. Is there anything we can do to see the currently running queries in the Solr server (so we can see them just before the crash)? A debug log might be another option, but I'm afraid our servers are too busy to find it in there.

Martin
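[Editor's note] One standard way to see queries that never finish: a JVM thread dump shows what every request thread is executing at that instant, so a jstack taken just before the crash usually exposes the offending query deep in a busy thread's stack. A sketch; the pgrep pattern for finding the Tomcat/Solr PID is an assumption about how the server is started:

```shell
# Dump the Solr JVM's threads; long-running queries appear in the stack
# frames of busy request threads. (The pgrep pattern is hypothetical.)
DUMP_CMD='jstack $(pgrep -f catalina) > /tmp/solr-threads.txt'

echo "$DUMP_CMD"
# eval "$DUMP_CMD"   # on a live server, e.g. from a watchdog loop that
#                    # fires when heap usage crosses a threshold
```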
Re: SolrCloud constantly crashes after upgrading to Solr 4.7
The memory leak seems to be in:

org.apache.solr.handler.component.ShardFieldSortedHitQueue

I think our issue might be related to this one, because this change was introduced in 4.7 and touches ShardFieldSortedHitQueue:

https://issues.apache.org/jira/browse/SOLR-5354

Is the memory leak a bug, or should a full reindex help?

Martin
Re: SolrCloud constantly crashes after upgrading to Solr 4.7
>> IndexSchema is using 62% of the memory
>
> That seems odd. Can you see what objects are taking all the RAM in the IndexSchema?

We investigated this and found out that a dictionary was loaded for each core, taking loads of memory. We then set shareSchema=true in the config. The memory usage decreased a lot and Solr is crashing less often, but the problem still exists. We sometimes see a "GC overhead limit exceeded" log entry now.

We made a new memory dump. It's about 4 GB. The strange thing is: Eclipse Memory Analyzer talks about "Size: 799,2 MB". It seems the rest is in the "Unreachable Objects" (2,5 GB). The "Unreachable Objects" are full of byte[] and BytesRef objects:

https://www.dropbox.com/s/6kysc21rkmr67r7/Screenshot%202014-03-07%2015.38.51.png

I think this is the memory leak?

> We'd need to actually see a large chunk of the end of the actual logfile.

I put the log here (anonymised some shard names):

https://www.dropbox.com/s/0seosviys5wrvzh/catalina.log

> Are there any messages in the operating system logs?

No, not at all.

> Full details about the computer, operating system,

Dell PowerEdge servers, 16GB RAM, Debian Wheezy

> Solr startup options

-server -verbose:gc -Xloggc:/var/log/jvm.log -Xmx4096m -Dcom.sun.management.jmxremote -Djava.awt.headless=true -DzkHost=192.168.40.30:2181,192.168.40.33:2181,192.168.40.37:2181/solr

> and your index

About 70 cores, 5 servers, 12GB indexes in total (every core has 2 shards, so it's 6 GB per server). The most used schema is:

https://www.dropbox.com/s/6fhlvsh6v1rxyck/schema.xml

Thanks,

Martin
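[Editor's note] For reference, the shareSchema switch mentioned above lives on the &lt;cores&gt; element of the legacy-style solr.xml in Solr 4.x; with it enabled, cores whose schema files are identical share one in-memory IndexSchema. A minimal sketch written to a temp file; everything except shareSchema="true" is illustrative boilerplate:

```shell
# Minimal legacy-style solr.xml showing where shareSchema is set.
# Only shareSchema="true" is the point; the rest is illustrative.
f=$(mktemp)
cat > "$f" <<'EOF'
<solr persistent="true">
  <cores adminPath="/admin/cores" shareSchema="true">
    <!-- core definitions ... -->
  </cores>
</solr>
EOF
grep shareSchema "$f"
```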
Re: SolrCloud constantly crashes after upgrading to Solr 4.7
We parsed the "Unreachable Objects" of the memory dump. The memory leak seems to be in:

org.apache.solr.handler.component.ShardFieldSortedHitQueue

https://www.dropbox.com/s/hdv49xlb4g4wi03/Screenshot%202014-03-07%2016.51.56.png

Martin
SolrCloud constantly crashes after upgrading to Solr 4.7
Hi,

We have 5 Solr servers in a Cloud with about 70 cores and 12GB indexes in total (every core has 2 shards, so it's 6 GB per server). After upgrading to Solr 4.7 the Solr servers are crashing constantly (each server about once per hour). We currently don't have any clue about the reason. We tried loads of different settings, but nothing works.

When a server crashes, the last log item is (most times) a "Broken pipe" error. The last queries / used cores are completely random (as far as we can see). We are running with the -Xloggc switch and during a crash it says:

10838.015: [Full GC 3141724K->3141724K(3522560K), 1.6936710 secs]
10839.710: [Full GC 3141724K->3141724K(3522560K), 1.5682250 secs]
10841.279: [Full GC 3141728K->3141726K(3522560K), 1.5735450 secs]
10842.854: [Full GC 3141727K->3141727K(3522560K), 1.5773380 secs]
10844.433: [Full GC 3141732K->3141687K(3522560K), 1.5696950 secs]
10846.003: [Full GC 3141698K->3141687K(3522560K), 1.5766940 secs]
10847.581: [Full GC 3141695K->3141688K(3522560K), 1.5879360 secs]
10849.170: [Full GC 3141695K->3141691K(3522560K), 1.5698630 secs]
10850.741: [Full GC 3141695K->3141689K(3522560K), 1.5643990 secs]
10852.307: [Full GC 3141693K->3141650K(3522560K), 1.5759150 secs]

We tried to increase the memory, but that didn't help. We increased the zkClientTimeout to 60 seconds, but that didn't help either. We made a memory dump with jmap. The IndexSchema is using 62% of the memory, but we don't know if that's a problem:

https://www.dropbox.com/s/eyom5c48vhl0q9i/Screenshot%202014-03-06%2023.32.41.png

Tomorrow we will downgrade each server to Solr 4.6.1 (we will need to reindex every core to do that) unless we find a solution. Does anyone have a clue what the problem can be?

Thanks!

Martin
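[Editor's note] The GC log above shows back-to-back Full GCs that reclaim almost nothing (the heap stays pinned at ~3.14 GB of a ~3.5 GB old generation), which is a JVM thrashing at its heap ceiling, not a crash in Solr itself. A sketch of startup options that add headroom and diagnostics; the sizes are illustrative examples, not tuning advice for this cluster:

```shell
# Illustrative JVM options: a larger fixed heap plus GC logging, so you
# can see whether collections actually reclaim memory over time.
# (Sizes are examples only; -XX flags below are standard HotSpot options.)
JVM_OPTS="-Xms6g -Xmx6g -verbose:gc -Xloggc:/var/log/solr-gc.log -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError"

echo "$JVM_OPTS"
```

-XX:+HeapDumpOnOutOfMemoryError is particularly useful here: it captures the heap at the exact moment of exhaustion, instead of requiring a manual jmap during the crash.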
Re: SolrCloud unstable
We did some more monitoring and have some new information: before the issue happens, the garbage collector's "collection count" increases a lot. The increase seems to start about an hour before the real problem occurs:

http://www.analyticsforapplications.com/GC.png

We tried both the G1 garbage collector and the regular one; the problem happens with both of them.

We use Java 1.6 on some servers. Will Java 1.7 be better?

Martin

Martin de Vries wrote on 12.11.2013 10:45:

> Hi,
>
> We have:
>
> Solr 4.5.1 - 5 servers
> 36 cores, 2 shards each, 2 servers per shard (every core is on 4 servers)
> about 4.5 GB total data on disk per server
> 4GB JVM memory per server, 3GB average in use
> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
> haproxy load balancing
>
> Our Solrcloud is very unstable. About once a week some cores go into recovery state or down state. Many timeouts occur and we have to restart servers to get them back to work. The failover doesn't work in many cases, because one server has the core in down state and the other in recovering state. Other cores work fine. When the cloud is stable I sometimes see log messages like:
>
> - shard update error StdNode: http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
> - forwarding update to http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed - retrying ...
> - null:ClientAbortException: java.io.IOException: Broken pipe
>
> Before the cloud problems start there are many large QTimes in the log (sometimes over 50 seconds), but there are no other errors until the recovery problems start.
>
> Any clue about what can be wrong?
>
> Kind regards,
>
> Martin
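[Editor's note] The rising collection count can be watched live with jstat from the JDK rather than via JMX graphs. A sketch; the pgrep pattern for locating the Solr JVM is an assumption about how it is launched:

```shell
# Poll GC counters of the Solr JVM every 5 seconds. A climbing FGC column
# while old-gen occupancy (O) never drops means the heap is simply too
# small for the working set. (The pgrep pattern is hypothetical.)
JSTAT_CMD='jstat -gcutil $(pgrep -f catalina) 5000'

echo "$JSTAT_CMD"
# eval "$JSTAT_CMD"   # on a live server
```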
SolrCloud unstable
Hi,

We have:

- Solr 4.5.1 - 5 servers
- 36 cores, 2 shards each, 2 servers per shard (every core is on 4 servers)
- about 4.5 GB total data on disk per server
- 4GB JVM memory per server, 3GB average in use
- Zookeeper 3.3.5 - 3 servers (one shared with Solr)
- haproxy load balancing

Our Solrcloud is very unstable. About once a week some cores go into recovery state or down state. Many timeouts occur and we have to restart servers to get them back to work. The failover doesn't work in many cases, because one server has the core in down state and the other in recovering state. Other cores work fine. When the cloud is stable I sometimes see log messages like:

- shard update error StdNode: http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
- forwarding update to http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed - retrying ...
- null:ClientAbortException: java.io.IOException: Broken pipe

Before the cloud problems start there are many large QTimes in the log (sometimes over 50 seconds), but there are no other errors until the recovery problems start.

Any clue about what can be wrong?

Kind regards,

Martin