Re: Solrcloud Index corruption

2015-03-12 Thread Martin de Vries

Ahhh, ok. When you reloaded the cores, did you do it core-by-core?


Yes, but maybe we reloaded the wrong core or something like that. We 
also noticed that the startTime doesn't update in the admin UI while 
switching between cores (you have to reload the page). We still use 
4.8.1, so maybe it is fixed in a later version. We will check after our 
next upgrade; if it still happens, we will file an issue for it.



Martin



Erick Erickson schreef op 10.03.2015 18:21:


Ahhh, ok. When you reloaded the cores, did you do it core-by-core?
I can see how something could get dropped in that case.

However, if you used the Collections API and two cores mysteriously
failed to reload that would be a bug. Assuming the replicas in question
were up and running at the time you reloaded.
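
For reference, a collection-wide reload can be done with a single
Collections API call, along these lines (a sketch; the host and
collection name are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"

That reloads every core of the collection on every live node, so
individual replicas can't be skipped the way they can with per-core
RELOAD calls against the CoreAdmin API.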

Thanks for letting us know what's going on.
Erick

On Tue, Mar 10, 2015 at 4:34 AM, Martin de Vries
 wrote:


Hi,

this _sounds_ like you somehow don't have indexed="true" set for the
field in question.

We investigated a lot more. The CheckIndex tool didn't find any error.
We now think the following happened:
- We changed the schema two months ago: we changed a field to
indexed="true". We reloaded the cores, but two of them don't seem to
have been reloaded (maybe we forgot).
- We reindexed all content. The new field worked fine.
- We think the leader changed to a server that didn't reload the core.
- After that the field stopped working for newly indexed documents.

Thanks for your help.

Martin

Erick Erickson schreef op 06.03.2015 17:02:


bq: You say in our case some docs didn't make it to the node, but
that's not really true: the docs can be found on the corrupted nodes
when I search on ID. The docs are also complete. The problem is that
the docs do not appear when I filter on certain fields

this _sounds_ like you somehow don't have indexed="true" set for the
field in question. But it also sounds like you're saying that search on
that field works on some nodes but not on others; I'm assuming you're
adding "&distrib=false" to verify this. It shouldn't be possible to
have different schema.xml files on the different nodes, but you might
try checking through the admin UI. Network burps shouldn't be related
here. If the content is stored, then the info made it to Solr intact,
so this issue shouldn't be related to that. Sounds like it may just
be the bugs Mark is referencing, sorry I don't have the JIRA numbers
right off.

Best,
Erick

On Thu, Mar 5, 2015 at 4:46 PM, Shawn Heisey  wrote:


On 3/5/2015 3:13 PM, Martin de Vries wrote:


I understand there is not a "master" in SolrCloud. In our case we
use haproxy as a load balancer for every request. So when
indexing every document will be sent to a different solr server,
immediately after each other. Maybe SolrCloud is not able to
handle that correctly?

SolrCloud can handle that correctly, but currently sending index
updates to a core that is not the leader of the shard will incur a
significant performance hit, compared to always sending updates to
the correct core. A small performance penalty would be
understandable, because the request must be redirected, but what
actually happens is a much larger penalty than anyone expected. We
have an issue in Jira to investigate that performance issue and
make it work as efficiently as possible. Indexing batches of
documents is recommended, not sending one document per update
request. General performance problems with Solr itself can lead to
extremely odd and unpredictable behavior from SolrCloud. Most often
these kinds of performance problems are related in some way to
memory, either the java heap or available memory in the system.
http://wiki.apache.org/solr/SolrPerformanceProblems [1]

Thanks,
Shawn




Links:
--
[1] http://wiki.apache.org/solr/SolrPerformanceProblems
[2] mailto:apa...@elyograg.org
[3] http://wiki.apache.org/solr/SolrPerformanceProblems


Re: Solrcloud Index corruption

2015-03-10 Thread Martin de Vries

Hi,


this _sounds_ like you somehow don't have indexed="true" set for the
field in question.


We investigated a lot more. The CheckIndex tool didn't find any error. 
We now think the following happened:
- We changed the schema two months ago: we changed a field to 
indexed="true". We reloaded the cores, but two of them don't seem to 
have been reloaded (maybe we forgot).
- We reindexed all content. The new field worked fine.
- We think the leader changed to a server that didn't reload the core.
- After that the field stopped working for newly indexed documents.

Thanks for your help.


Martin




Erick Erickson schreef op 06.03.2015 17:02:


bq: You say in our case some docs didn't make it to the node, but
that's not really true: the docs can be found on the corrupted nodes
when I search on ID. The docs are also complete. The problem is that
the docs do not appear when I filter on certain fields

this _sounds_ like you somehow don't have indexed="true" set for the
field in question. But it also sounds like you're saying that search
on that field works on some nodes but not on others, I'm assuming
you're adding "&distrib=false" to verify this. It shouldn't be
possible to have different schema.xml files on the different nodes,
but you might try checking through the admin UI.
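
As a sketch of that check (host, core and field names are placeholders),
you can query each replica directly and compare the counts:

curl "http://node1:8983/solr/mycollection_shard1_replica1/select?q=myfield:somevalue&distrib=false&rows=0"
curl "http://node2:8983/solr/mycollection_shard1_replica2/select?q=myfield:somevalue&distrib=false&rows=0"

With distrib=false each core answers only from its own index, so a
differing numFound points at the replica whose index is missing the
indexed terms for that field.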

Network burps shouldn't be related here. If the content is stored,
then the info made it to Solr intact, so this issue shouldn't be
related to that.

Sounds like it may just be the bugs Mark is referencing, sorry I don't
have the JIRA numbers right off.

Best,
Erick

On Thu, Mar 5, 2015 at 4:46 PM, Shawn Heisey  
wrote:



On 3/5/2015 3:13 PM, Martin de Vries wrote:

I understand there is not a "master" in SolrCloud. In our case we use
haproxy as a load balancer for every request. So when indexing every
document will be sent to a different solr server, immediately after
each other. Maybe SolrCloud is not able to handle that correctly?

SolrCloud can handle that correctly, but currently sending index
updates to a core that is not the leader of the shard will incur a
significant performance hit, compared to always sending updates to the
correct core. A small performance penalty would be understandable,
because the request must be redirected, but what actually happens is a
much larger penalty than anyone expected. We have an issue in Jira to
investigate that performance issue and make it work as efficiently as
possible. Indexing batches of documents is recommended, not sending one
document per update request. General performance problems with Solr
itself can lead to extremely odd and unpredictable behavior from
SolrCloud. Most often these kinds of performance problems are related
in some way to memory, either the java heap or available memory in the
system. http://wiki.apache.org/solr/SolrPerformanceProblems [1]

Thanks,

Shawn
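
As a sketch of the batching Shawn recommends (host, collection and
fields are placeholders), send many documents per update request rather
than one request per document:

curl "http://localhost:8983/solr/mycollection/update?commit=false" -H "Content-Type: application/json" -d '[{"id":"1","title":"doc one"},{"id":"2","title":"doc two"},{"id":"3","title":"doc three"}]'

Commits can then be left to the autoCommit/autoSoftCommit settings
instead of being issued per request.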




Links:
--
[1] http://wiki.apache.org/solr/SolrPerformanceProblems


Re: Solrcloud Index corruption

2015-03-05 Thread Martin de Vries

Hi Erick,

Thank you for your detailed reply.

You say in our case some docs didn't make it to the node, but that's 
not really true: the docs can be found on the corrupted nodes when I 
search on ID. The docs are also complete. The problem is that the docs 
do not appear when I filter on certain fields (however, the fields are in 
the doc and have the right value when I search on ID). So something 
seems to be corrupt in the filter index. We will try CheckIndex; 
hopefully it is able to identify the problematic cores.


I understand there is not a "master" in SolrCloud. In our case we use 
haproxy as a load balancer for every request. So when indexing every 
document will be sent to a different solr server, immediately after each 
other. Maybe SolrCloud is not able to handle that correctly?



Thanks,

Martin




Erick Erickson schreef op 05.03.2015 19:00:


Wait up. There's no "master" index in SolrCloud. Raw documents are
forwarded to each replica, indexed and put in the local tlog. If a
replica falls too far out of synch (say you take it offline), then the
entire index _can_ be replicated from the leader and, if the leader's
index was incomplete then that might propagate the error.

The practical consequence of this is that if _any_ replica has a
complete index, you can recover. Before going there though, the
brute-force approach is to just re-index everything from scratch.
That's likely easier, especially on indexes this size.

Here's what I'd do.

Assuming you have the Collections API calls for ADDREPLICA and
DELETEREPLICA, then:
0> Identify the complete replicas. If you're lucky you have at least
one for each shard.
1> Copy 1 good index from each shard somewhere just to have a backup.
2> DELETEREPLICA on all the incomplete replicas
2.5> I might shut down all the nodes at this point and check that all
the cores I'd deleted were gone. If any remnants exist, 'rm -rf
deleted_core_dir'.
3> ADDREPLICA to get the ones removed in step 2 back.

Step 3 should copy the entire index from the leader for each replica. As
you do, the leadership will change, and after you've deleted all the
incomplete replicas one of the complete ones will be the leader and
you should be OK.
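
A sketch of those two calls (collection, shard and replica names are
placeholders; the replica name is the core_nodeN entry shown in
clusterstate.json):

curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node3"
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=192.168.1.21:8983_solr"

The node parameter on ADDREPLICA is optional; without it Solr picks a
node for the new replica itself.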

If you don't want to/can't use the Collections API, then
0> Identify the complete replicas. If you're lucky you have at least
one for each shard.
1> Shut 'em all down.
2> Copy the good index somewhere just to have a backup.
3> 'rm -rf data' for all the incomplete cores.
4> Bring up the good cores.
5> Bring up the cores that you deleted the data dirs from.

What step 5 should do is replicate the entire index from the leader.
When you restart the good cores (step 4 above), they'll _become_ the
leader.

bq: Is it possible to make Solrcloud invulnerable for network 
problems

I'm a little surprised that this is happening. It sounds like the
network problems were such that some nodes weren't out of touch long
enough for Zookeeper to sense that they were down and put them into
recovery. Not sure there's any way to secure against that.

bq: Is it possible to see if a core is corrupt?
There's "CheckIndex", here's at least one link:
http://java.dzone.com/news/lucene-and-solrs-checkindex
What you're describing, though, is that docs just didn't make it to
the node, _not_ that the index has unexpected bits, bad disk sectors
and the like so CheckIndex can't detect that. How would it know what
_should_ have been in the index?
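
For completeness, a sketch of running it against a stopped core (the
lucene-core jar and the index path are placeholders; without -fix it is
read-only):

java -cp lucene-core-4.8.1.jar org.apache.lucene.index.CheckIndex /var/solr/mycore/data/index

It reports per-segment problems such as checksum or postings errors,
which is exactly why it can't tell you that a cleanly written segment is
simply missing documents or indexed terms.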

bq: I noticed a difference in the "Gen" column on Overview -
Replication. Does this mean there is something wrong?
You cannot infer anything from this. In particular, the merging will
be significantly different between a single full-reindex and what the
state of segment merges is in an incrementally built index.

The admin UI screen is rooted in the pre-cloud days, the Master/Slave
thing is entirely misleading. In SolrCloud, since all the raw data is
forwarded to all replicas, and any auto commits that happen may very
well be slightly out of sync, the index size, number of segments,
generations, and all that are pretty safely ignored.

Best,
Erick

On Thu, Mar 5, 2015 at 6:50 AM, Martin de Vries  wrote:

Hi Andrew, Even our master index is corrupt, so I'm afraid this won't
help in our case. Martin

Andrew Butkus schreef op 05.03.2015 16:45:

Force a fetchindex on slave from master command:
http://slave_host:port/solr/replication?command=fetchindex - from
http://wiki.apache.org/solr/SolrReplication [1] The above command
will download the whole index from master to slave, there are
configuration options in Solr to make this problem happen less often
(allowing it to recover from new documents added and only send the
changes with a wider gap) - but I can't remember what those were.




Links:
--
[1] http://wiki.apache.org/solr/SolrReplication


RE: Solrcloud Index corruption

2015-03-05 Thread Martin de Vries

Hi Andrew,

Even our master index is corrupt, so I'm afraid this won't help in our 
case.


Martin


Andrew Butkus schreef op 05.03.2015 16:45:


Force a fetchindex on slave from master command:
http://slave_host:port/solr/replication?command=fetchindex - from
http://wiki.apache.org/solr/SolrReplication

The above command will download the whole index from master to slave;
there are configuration options in Solr to make this problem happen less
often (allowing it to recover from new documents added and only send the
changes with a wider gap) - but I can't remember what those were.
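
A sketch of that call against a specific core (host and core name are
placeholders; on a multi-core setup the core name goes in the path):

curl "http://slave-host:8983/solr/mycore/replication?command=fetchindex"

The same handler also accepts command=details on master and slave, which
shows the index version and generation each side is on.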




Solrcloud Index corruption

2015-03-05 Thread Martin de Vries

Hi,

We have index corruption on some cores on our Solrcloud running version 
4.8.1. The index is corrupt on several servers. (for example: when we do 
an fq search we get results on some servers, on other servers we don't, 
while the stored document contains the field on all servers).


A full re-index of the content didn't help, so we created a new core 
and did the reindex on that one.


We think the index corruption is caused by network issues we had a few 
weeks ago. I hope someone can help us with some questions:
- Is it possible to make Solrcloud invulnerable to network problems 
like packet loss or connection errors? Will it for example help to use 
an SSL connection between the Solr servers?
- Is it possible to see if a core is corrupt? We only noticed because we 
didn't find some documents while searching on the website, but we don't 
know if other cores are corrupt. I noticed a difference in the "Gen" 
column on Overview - Replication. Does this mean there is something 
wrong? Or is there any other way to see the corruption?


Corrupt core:
Version Gen Size
Master (Searching)  1425565575249   2023309 472.41 MB
Master (Replicable) 1425566098510   2023310 -
Slave (Searching)   1425565575253   2023308 472.38 MB

Re-created core:
Version Gen Size
Master (Searching)  1425566108174   35  283.98 MB
Master (Replicable) 1425566108174   35  -
Slave (Searching)   1425566106674   35  288.24 MB



Kind regards,

Martin




Different Solr versions in Solrcloud

2014-06-03 Thread Martin de Vries

Hi,

I have two questions about upgrading Solr:

- We upgrade Solr often, to match the latest version. We have a number 
of servers in a Solrcloud and prefer to upgrade one or two servers first 
and upgrade the other servers a few weeks later, when we are sure 
everything is stable. Is this the recommended way? Can Solr run 
different versions next to each other in a cloud?


- Do we need to adjust the luceneMatchVersion with every upgrade and do 
we need a reindex after every upgrade? (it takes a lot of time to 
reindex all cores)



Kind regards,

Martin


Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-19 Thread Martin de Vries

We have been running stable for a full day now, so the bug has been fixed.

Many thanks!

Martin


Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-18 Thread Martin de Vries
Martin, I’ve committed the SOLR-5875 fix, including to the 
lucene_solr_4_7 branch.


Any chance you could test the fix?


Hi Steve,

I'm very happy you found the bug. We are running the version from SVN 
on one server and it has already been running fine for 5 hours. If it's 
still stable tomorrow then we can be absolutely sure; I will report it here.




Marijn


Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-10 Thread Martin de Vries

Hi,

When our server crashes the memory fills up fast. So I think it might 
be a specific query that causes our servers to crash. I think the query 
won't be logged because it doesn't finish. Is there anything we can do 
to see the currently running queries in the Solr server (so we can see 
them just before the crash)? A debug log might be another option, but 
I'm afraid our servers are too busy to find it in there.
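
(One way to capture at least part of this, as a sketch - the PID and log
path are placeholders - is to take periodic thread dumps; a dump doesn't
show the query parameters, but it does show which cores and search
components the busy threads are sitting in just before the crash:

jstack <solr_pid> >> /var/log/solr-threads.txt

Run from cron or a small loop, the last dump written before a crash
narrows down what was executing.)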



Martin


Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-07 Thread Martin de Vries

The memory leak seems to be in:
org.apache.solr.handler.component.ShardFieldSortedHitQueue


I think our issue might be related to this one, because this change has 
been introduced in 4.7 and has changes to ShardFieldSortedHitQueue:


https://issues.apache.org/jira/browse/SOLR-5354


Is the memory leak a bug, or should a full reindex help?


Martin


Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-07 Thread Martin de Vries

> IndexSchema is using 62% of the memory

That seems odd. Can you see what objects are taking all the RAM in the
IndexSchema?


We investigated this and found out that a dictionary was loaded for 
each core, taking loads of memory. We set the config to 
shareSchema=true. The memory usage decreased a lot and Solr is crashing 
less often, but the problem still exists. We sometimes see a "GC 
overhead limit exceeded" log entry now.


We made a new memory dump. It's about 4GB. The strange thing is: Eclipse 
Memory Analyzer talks about "Size: 799,2 MB". It seems the rest is in 
the "Unreachable Objects" (2,5 GB). The "Unreachable Objects" are full 
of byte[] and BytesRef objects:

https://www.dropbox.com/s/6kysc21rkmr67r7/Screenshot%202014-03-07%2015.38.51.png

I think this is the memory leak?



We'd need to actually see a large chunk of the end of the actual
logfile.


I put the log here (anonymised some shard names):
https://www.dropbox.com/s/0seosviys5wrvzh/catalina.log



Are there any messages in the operating system logs?


No, not at all.



Full details about the computer, operating system,


Dell PowerEdge servers, 16GB RAM, Debian Wheezy



Solr startup options


-server -verbose:gc -Xloggc:/var/log/jvm.log -Xmx4096m 
-Dcom.sun.management.jmxremote -Djava.awt.headless=true 
-DzkHost=192.168.40.30:2181,192.168.40.33:2181,192.168.40.37:2181/solr




and your index


About 70 cores, 5 servers, 12GB indexes in total (every core has 2 
shards, so it's 6 GB per server). The most used schema is:


https://www.dropbox.com/s/6fhlvsh6v1rxyck/schema.xml



Thanks,

Martin


Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-07 Thread Martin de Vries

We parsed the "Unreachable Objects" of the memory dump.

The memory leak seems to be in:
org.apache.solr.handler.component.ShardFieldSortedHitQueue

https://www.dropbox.com/s/hdv49xlb4g4wi03/Screenshot%202014-03-07%2016.51.56.png


Martin


SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-06 Thread Martin de Vries
 

Hi,

We have 5 Solr servers in a Cloud with about 70 cores and 12GB indexes
in total (every core has 2 shards, so it's 6 GB per server).

After upgrading to Solr 4.7 the Solr servers are crashing constantly
(each server about one time per hour). We currently don't have any clue
about the reason. We tried loads of different settings, but nothing
works out.

When a server crashes the last log item is (most times) a "Broken pipe"
error. The last queries / used cores are completely random (as far as we
can see).

We are running with the -Xloggc switch and during a crash it says:

10838.015: [Full GC 3141724K->3141724K(3522560K), 1.6936710 secs]
10839.710: [Full GC 3141724K->3141724K(3522560K), 1.5682250 secs]
10841.279: [Full GC 3141728K->3141726K(3522560K), 1.5735450 secs]
10842.854: [Full GC 3141727K->3141727K(3522560K), 1.5773380 secs]
10844.433: [Full GC 3141732K->3141687K(3522560K), 1.5696950 secs]
10846.003: [Full GC 3141698K->3141687K(3522560K), 1.5766940 secs]
10847.581: [Full GC 3141695K->3141688K(3522560K), 1.5879360 secs]
10849.170: [Full GC 3141695K->3141691K(3522560K), 1.5698630 secs]
10850.741: [Full GC 3141695K->3141689K(3522560K), 1.5643990 secs]
10852.307: [Full GC 3141693K->3141650K(3522560K), 1.5759150 secs]

We tried to increase the memory, but that didn't help. We increased the
zkClientTimeout to 60 seconds, but that didn't help.

We made a memory dump with jmap. The IndexSchema is using 62% of the
memory but we don't know if that's a problem:
https://www.dropbox.com/s/eyom5c48vhl0q9i/Screenshot%202014-03-06%2023.32.41.png
[1]
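
(As a sketch of that command - the PID and output path are placeholders:

jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr_pid>

With the "live" option jmap triggers a full GC and dumps only reachable
objects; leaving it out keeps unreachable garbage in the dump as well,
which can be useful when hunting for objects that should have been
collected.)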

Tomorrow we will downgrade each server to Solr 4.6.1; we will need to
reindex every core to do that, unless we have a solution.

Does anyone have a clue what the problem can be?

Thanks!

Martin




Links:
--
[1]
https://www.dropbox.com/s/eyom5c48vhl0q9i/Screenshot%202014-03-06%2023.32.41.png


Re: SolrCloud unstable

2013-11-22 Thread Martin de Vries
 

We did some more monitoring and have some new information:

Before the issue happens the garbage collector's "collection count"
increases a lot. The increase seems to start about an hour before the
real problem occurs:

http://www.analyticsforapplications.com/GC.png [1]

We tried both the G1 garbage collector and the regular one; the problem
happens with both of them.

We use Java 1.6 on some servers. Will Java 1.7 be better?

Martin 

Martin de Vries schreef op 12.11.2013 10:45: 

> Hi,
>
> We have:
>
> Solr 4.5.1 - 5 servers
> 36 cores, 2 shards each, 2 servers per shard (every core is on 4 servers)
> about 4.5 GB total data on disk per server
> 4GB JVM-Memory per server, 3GB average in use
> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
> haproxy load balancing
>
> Our Solrcloud is very unstable. About once a week some cores go into
> recovery state or down state. Many timeouts occur and we have to
> restart servers to get them back to work. The failover doesn't work in
> many cases, because one server has the core in down state and the other
> in recovering state. Other cores work fine. When the cloud is stable I
> sometimes see log messages like:
> - shard update error StdNode:
> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException:
> IOException occured when talking to server at:
> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
> - forwarding update to
> http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed -
> retrying ...
> - null:ClientAbortException: java.io.IOException: Broken pipe
>
> Before the cloud problems start there are many large QTimes in the
> log (sometimes over 50 seconds), but there are no other errors until
> the recovery problems start.
>
> Any clue about what can be wrong?
>
> Kind regards,
>
> Martin

 

Links:
--
[1]
http://www.analyticsforapplications.com/GC.png


SolrCloud unstable

2013-11-12 Thread Martin de Vries

Hi,

We have:

Solr 4.5.1 - 5 servers
36 cores, 2 shards each, 2 servers per shard (every core is on 4 
servers)

about 4.5 GB total data on disk per server
4GB JVM-Memory per server, 3GB average in use
Zookeeper 3.3.5 - 3 servers (one shared with Solr)
haproxy load balancing

Our Solrcloud is very unstable. About once a week some cores go into 
recovery state or down state. Many timeouts occur and we have to restart 
servers to get them back to work. The failover doesn't work in many 
cases, because one server has the core in down state and the other in 
recovering state. Other cores work fine. When the cloud is stable I 
sometimes see log messages like:
- shard update error StdNode: 
http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException: 
IOException occured when talking to server at: 
http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
- forwarding update to 
http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed - 
retrying ...

- null:ClientAbortException: java.io.IOException: Broken pipe

Before the cloud problems start there are many large QTimes in the 
log (sometimes over 50 seconds), but there are no other errors until the 
recovery problems start.



Any clue about what can be wrong?


Kind regards,

Martin