[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752699#comment-13752699 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I think we tracked this down on our side. While testing another part of the system, we noticed SYN flood warnings in the system logs. I believe the kernel was blocking traffic to the Solr port once it decided that Hadoop was attacking it. After turning off net.ipv4.tcp_syncookies and increasing net.ipv4.tcp_max_syn_backlog, the problem seems to have gone away. This also explains why I could still connect to Solr and insert documents from another machine even when access from the Hadoop cluster had died.

Highly parallel document insertion hangs SolrCloud
--------------------------------------------------

Key: SOLR-5081
URL: https://issues.apache.org/jira/browse/SOLR-5081
Project: Solr
Issue Type: Bug
Components: SolrCloud
Affects Versions: 4.3.1
Reporter: Mike Schrag
Attachments: threads.txt

If I do a highly parallel document load using a Hadoop cluster into an 18-node SolrCloud cluster, I can deadlock Solr every time. The ulimits on the nodes are:

    core file size (blocks, -c) 0
    data seg size (kbytes, -d) unlimited
    scheduling priority (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) 1031181
    max locked memory (kbytes, -l) unlimited
    max memory size (kbytes, -m) unlimited
    open files (-n) 32768
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    real-time priority (-r) 0
    stack size (kbytes, -s) 10240
    cpu time (seconds, -t) unlimited
    max user processes (-u) 515590
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited

The open file count is only around 4000 when this happens. If I bounce all the servers, things start working again, which makes me think this is Solr and not ZK. I'll attach the stack trace from one of the servers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
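The two kernel settings in Mike's fix can be inspected without root. As a minimal, read-only sketch (Linux only; the class and method names are hypothetical), the code below reads them through /proc/sys, where each sysctl key maps to a file path with its dots replaced by slashes. The comment does not say which backlog value was used, so none is suggested here; changing the values is normally done as root with `sysctl -w`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads Linux kernel settings via /proc/sys (read-only).
class SysctlCheck {

    // Map a sysctl key such as "net.ipv4.tcp_syncookies" to its /proc/sys file,
    // e.g. /proc/sys/net/ipv4/tcp_syncookies.
    static Path procPath(String key) {
        return Paths.get("/proc/sys", key.split("\\."));
    }

    // Return the trimmed current value of a sysctl key (requires a Linux host).
    static String read(String key) throws IOException {
        return new String(Files.readAllBytes(procPath(key)), StandardCharsets.US_ASCII).trim();
    }
}
```

On a Solr node, `SysctlCheck.read("net.ipv4.tcp_syncookies")` and `SysctlCheck.read("net.ipv4.tcp_max_syn_backlog")` return the values currently in effect, which is a quick way to confirm whether a node still has the settings Mike changed.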
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752814#comment-13752814 ]

Erick Erickson commented on SOLR-5081:
--------------------------------------

Mike: Thanks for letting us know! This is a tricky one.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748047#comment-13748047 ]

Kevin Osborn commented on SOLR-5081:
-----------------------------------

I may have this issue as well. I am posting batches of 1000 through SolrJ. I have autoCommit set to 15000 with openSearcher=false, and autoSoftCommit is set to 3. During my initial testing, I was able to recreate the problem after just a couple of updates. I then changed the limit on the number of open files for the process from 4096 to 15000. This seemed to help, but only to a point. If I send all my updates at once, they seem to succeed, but if there are pauses between updates, they seem to run into problems. I have also only seen this error when I have more than one node in my SolrCloud cluster.

I also took a look at netstat. There seemed to be a lot of connections between my two nodes. Could the frequency of my updates be overwhelming the connection from the leader to the replica? Deletes also fail, but queries still seem to work. Restarting the nodes fixes the problem.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730995#comment-13730995 ]

Yago Riveiro commented on SOLR-5081:
-----------------------------------

I have this problem too, but in my case, Solr hangs and I can't do more insertions without restarting the nodes. I do the insertions with a curl POST in JSON format, like:

    curl http://127.0.0.1:8983/solr/collection/update --data-binary @data -H 'Content-type:application/json'
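For comparison with the curl command, here is a minimal sketch of the same POST using the JDK's java.net.http client (Java 11+). The class name and the JSON body are hypothetical, and the body is passed as a string rather than read from the data file. The explicit request timeout is the main point: if a node stops answering, the client gets an HttpTimeoutException instead of hanging along with it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical Java equivalent of:
//   curl <base>/solr/<collection>/update --data-binary @data -H 'Content-type:application/json'
class JsonUpdatePost {

    // Build the POST; timeout bounds how long we wait for the response.
    static HttpRequest buildRequest(String baseUrl, String collection, String json, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/solr/" + collection + "/update"))
                .header("Content-Type", "application/json")
                .timeout(timeout)
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    // Send it; throws java.net.http.HttpTimeoutException if the node does not answer in time.
    static HttpResponse<String> send(HttpRequest req) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))  // also bound the TCP connect
                .build();
        return client.send(req, HttpResponse.BodyHandlers.ofString());
    }
}
```

For example, `send(buildRequest("http://127.0.0.1:8983", "collection", "[{\"id\":\"doc1\"}]", Duration.ofSeconds(30)))` posts one hypothetical document and fails after 30 seconds rather than blocking indefinitely.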
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13730269#comment-13730269 ]

Hoss Man commented on SOLR-5081:
-------------------------------

bq. I actually did this exact test when I was in this state originally, and the insert worked, which totally confused the situation for me.

OK... hold up... basically what you're saying is: "the first time I saw this problem (SolrCloud hangs and is deadlocked under heavy document insertion load), I tried to insert a single document and it worked."

...which makes no sense to me, because if that's the case, then what exactly do you mean by "hangs" and "deadlocked"?

So let's back up:

* what do you observe about your system that leads you to believe there is a problem?
* what aspect of your observations doesn't match what you expect?
* what do you expect to observe?
* how are you making these observations?

Wild shot in the dark: what does your indexing code look like? Is it possible that your indexing code is encountering some deadlock of its own, independent of anything happening in Solr? If you are using SolrJ, can you get thread dumps from your indexing client apps when you observe this deadlock situation? (Again: this info is useless unless we have a better understanding of what exactly you are observing that you think indicates a problem.)
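`jstack <pid>` is the usual way to take these dumps from outside the client process. As a minimal in-process alternative (a sketch using the standard java.lang.management API, not something SolrJ provides; the class name is hypothetical), an indexing client could log its own threads. One caveat when reading the results: the JVM's built-in deadlock detection only covers monitors and lock-style ("ownable") synchronizers such as ReentrantLock, so threads piled up on a java.util.concurrent.Semaphore, or blocked reading a socket, are not reported as deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical in-process thread dump for an indexing client JVM.
class ClientThreadDump {

    // Print each live thread's name, state, locks and the top of its stack.
    // ThreadInfo.toString() prints only the first few frames of each stack.
    static void dumpAll() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }

    // IDs of threads deadlocked on monitors or ownable synchronizers; empty if none.
    // Semaphore waits and socket reads are NOT detected by this check.
    static long[] deadlockedThreadIds() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        dumpAll();
        System.out.println("deadlocked threads: " + deadlockedThreadIds().length);
    }
}
```

Calling `dumpAll()` from a watchdog when indexing stalls gives the client-side picture Hoss asks for, to compare against server-side jstack output.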
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726746#comment-13726746 ]

Noble Paul commented on SOLR-5081:
---------------------------------

[~mikeschrag] Could you get any more thread dumps?
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726801#comment-13726801 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I grabbed more, and they all look basically the same as the attached one, which is to say it sort of looks like Solr isn't doing ANYTHING. I'm going to look into whether I'm crushing ZooKeeper, and maybe my requests aren't even getting to Solr.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726831#comment-13726831 ]

Erick Erickson commented on SOLR-5081:
--------------------------------------

Yeah, that is odd. The stack traces you sent basically showed no deadlocks, nothing interesting at all. I suspect pursuing whether anything is getting to Solr at all is a good idea.

Hmm, blunt-instrument test when the cluster is hung: what happens if you, say, submit a query directly to one of the nodes? Does it respond, or do you see anything in the Solr log on that node? Tip: adding distrib=false to the _query_ keeps it from sending sub-queries to other shards.

And I wonder what happens if you, say, use post.jar (it comes with the example) to try to send a doc to Solr when it's hung. Anything?

Clearly I'm grasping at straws here, but I'm kind of out of good ideas.
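Erick's blunt-instrument test can be scripted. A minimal sketch (Java 10+; the class name, the node address http://host1:8983 and the collection name are hypothetical, and the node is assumed to host a core of that collection) builds a single-node query URL with distrib=false and probes it with connect and read timeouts. A hung node then shows up as a SocketTimeoutException instead of a stuck probe. wt=json only asks for a JSON response.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical probe for "does this one node answer a local query?"
class NodeProbe {

    // Query URL for one node only: distrib=false stops that node from
    // sending sub-queries to the other shards.
    static String localQueryUrl(String nodeBaseUrl, String collection, String q) {
        return nodeBaseUrl + "/solr/" + collection + "/select?q="
                + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&distrib=false&wt=json";
    }

    // Returns the HTTP status code. Throws java.net.SocketTimeoutException if the
    // node does not accept the connection or answer within timeoutMillis.
    static int probe(String url, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis);
        conn.setReadTimeout(timeoutMillis);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Running `probe(localQueryUrl(node, "collection", "*:*"), 5000)` against each node in turn would show which nodes still answer while the load is stuck.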
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726848#comment-13726848 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I actually did this exact test when I was in this state originally, and the insert _worked_, which totally confused the situation for me. However, given that the traces show nothing, that supports the theory that the cluster isn't hung, but rather that I'm somehow not even getting that far in the Hadoop cluster. ZK was my best guess for something that could be an earlier-stage failure, but even then I would have expected the test insert to hang. So I need to do a little more forensics here and see if I can get a better picture of wtf is going on.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723853#comment-13723853 ]

Mark Miller commented on SOLR-5081:
-----------------------------------

This is likely the same issue that has come up before, and it has nothing to do with CloudSolrServer. It's more likely how we limit the number of threads that are used to forward on updates: the nodes can talk back and forth to each other, run out of threads, and deadlock. It's similar to the distrib deadlock issue. It's been a known issue for many months; I just have not had a chance to look into it closely yet.
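The failure mode Mark describes can be reproduced in a few lines. This is a toy model, not Solr's code: a fixed-size thread pool per "node" stands in for whatever limit Solr places on update-forwarding threads, and all names are hypothetical. Each client update holds one of its node's threads while it forwards to the peer node and waits for the answer. When every thread on both nodes is doing that, nothing is left to serve the forwarded work, so the handlers can only time out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Toy model of a forwarding deadlock between two nodes (not Solr code).
class ForwardingDeadlockDemo {

    static final class Node {
        final ExecutorService pool;  // bounded: at most N requests in flight
        Node peer;                   // node owning the shard we forward to

        Node(int threads) {
            pool = Executors.newFixedThreadPool(threads);
        }

        // Work done on the node that owns the target shard (trivial here).
        String indexLocally(String doc) {
            return "indexed " + doc;
        }

        // Client-facing handler: holds one of this node's threads while it
        // forwards to the peer and blocks waiting for the peer's answer.
        String handleClientUpdate(String doc, CountDownLatch allBusy, long waitMs) throws Exception {
            allBusy.countDown();
            allBusy.await();  // every client update holds a thread before forwarding starts
            Future<String> forwarded = peer.pool.submit(() -> peer.indexLocally(doc));
            return forwarded.get(waitMs, TimeUnit.MILLISECONDS);  // TimeoutException means stuck
        }
    }

    // Sends updatesPerNode client updates to each of two nodes at once.
    // Returns true if all complete, false if any handler timed out on its peer.
    static boolean run(int threadsPerNode, int updatesPerNode, long waitMs) {
        if (updatesPerNode > threadsPerNode) {
            throw new IllegalArgumentException("toy model assumes updatesPerNode <= threadsPerNode");
        }
        Node a = new Node(threadsPerNode);
        Node b = new Node(threadsPerNode);
        a.peer = b;
        b.peer = a;
        CountDownLatch allBusy = new CountDownLatch(2 * updatesPerNode);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < updatesPerNode; i++) {
            String docA = "docA-" + i;
            String docB = "docB-" + i;
            results.add(a.pool.submit(() -> a.handleClientUpdate(docA, allBusy, waitMs)));
            results.add(b.pool.submit(() -> b.handleClientUpdate(docB, allBusy, waitMs)));
        }
        boolean allDone = true;
        for (Future<String> f : results) {
            try {
                f.get();
            } catch (ExecutionException e) {
                allDone = false;  // the handler gave up waiting for its peer
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                allDone = false;
                break;
            }
        }
        a.pool.shutdownNow();
        b.pool.shutdownNow();
        return allDone;
    }

    public static void main(String[] args) {
        // Every thread on both nodes is busy forwarding: nobody serves the forwarded work.
        System.out.println("1 thread per node:  all done = " + run(1, 1, 500));
        // A spare thread on each node can serve the peer's forwarded update.
        System.out.println("2 threads per node: all done = " + run(2, 1, 2000));
    }
}
```

The timeout on `forwarded.get` exists only so the demo terminates. Without it, the first configuration would block forever, which matches the "hang" in the issue title. Adding threads only moves the threshold: with `run(2, 2, ...)` every thread is busy again.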
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723856#comment-13723856 ]

Erick Erickson commented on SOLR-5081:
--------------------------------------

Agreed, although we should be able to see somewhere in here the deadlock on the semaphore in SolrCmdDistributor that we saw before, and it's not in the stack trace we've seen so far.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723878#comment-13723878 ]

Mark Miller commented on SOLR-5081:
-----------------------------------

bq. the stack trace we've seen so far.

Those traces are suspect for the problem described, I think. Regardless, for this type of thing, it would be great to get the traces from a couple of machines rather than just one.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723996#comment-13723996 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

I'll kill it again today and grab traces from a few of the nodes.
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723378#comment-13723378 ]

Noble Paul commented on SOLR-5081:
---------------------------------

Can you please throw some more light on the system?

# numShards
# Replication factor
# maxShardsPerNode (I guess it is 1)
# Average size per doc
# VM startup params (-Xmx, -Xms, GC params, etc.)
# How are you indexing? Are you using SolrJ and the CloudSolrServer? How many clients are used to index the data?
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723402#comment-13723402 ]

Mike Schrag commented on SOLR-5081:
-----------------------------------

1. numShards=20
2. RF=3
3. maxShardsPerNode=1000 (aka just a big number... we overcommit shards in this environment)
4. Not very big... maybe 0.5-1k
5. -Xms10g -Xmx10g -XX:MaxPermSize=1G -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:CMSInitiatingOccupancyFraction=60 -XX:-OmitStackTraceInFastThrow
6. SolrJ + CloudSolrServer. When you say clients, do you mean threads, or actual client JVM instances? Talking more generically in terms of threads, I know it works at around 15-20 threads, but 100 threads makes it go sadfaced.
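Mike's numbers imply how many cores each node carries. A small worked sketch follows; the class name is hypothetical, and it assumes the replicas are spread as evenly as possible, which the comment does not confirm. 20 shards × 3 replicas gives 60 cores on 18 nodes, so each node hosts 3 or 4 cores (12 nodes with 3 and 6 with 4, since 12·3 + 6·4 = 60). The maxShardsPerNode=1000 cap therefore never comes into play. With RF=3, each update that reaches a shard leader is also forwarded to that shard's two other replicas.

```java
// Hypothetical arithmetic helper for an evenly spread SolrCloud collection.
class ReplicaLayout {

    // One core per shard per replica.
    static int totalReplicas(int numShards, int replicationFactor) {
        return numShards * replicationFactor;
    }

    // With an even spread, each node hosts floor(total/nodes) or ceil(total/nodes) cores.
    static int minPerNode(int total, int nodes) {
        return total / nodes;
    }

    static int maxPerNode(int total, int nodes) {
        return (total + nodes - 1) / nodes;
    }

    public static void main(String[] args) {
        int total = totalReplicas(20, 3);
        System.out.println(total + " cores over 18 nodes: "
                + minPerNode(total, 18) + "-" + maxPerNode(total, 18) + " per node");
    }
}
```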
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721756#comment-13721756 ] Erick Erickson commented on SOLR-5081: -- Can you do a jstack on one or more of your Solr servers? There's some distributed deadlock possibility and it would be good to see if this is the same problem.
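A rough way to compare dumps taken from several nodes is to tally thread states and the synchronizers that parked threads are waiting on. This Python sketch assumes the HotSpot `jstack <pid>` text layout (`java.lang.Thread.State: ...` lines and `- parking to wait for <0x...> (a ...)` lines); `summarize_jstack` and the sample dump are illustrative and are not taken from threads.txt.

```python
import re
from collections import Counter

# Assumed HotSpot jstack line formats.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")
PARK_RE = re.compile(r"- parking to wait for\s+<0x[0-9a-f]+> \(a ([\w.$]+)\)")


def summarize_jstack(dump: str) -> tuple:
    """Return (thread-state counts, counts of classes that threads are parked on)."""
    states = Counter(STATE_RE.findall(dump))
    parked_on = Counter(PARK_RE.findall(dump))
    return states, parked_on


# Hypothetical two-thread excerpt in jstack's layout.
SAMPLE = """\
"qtp1-1" prio=10 tid=0x1 nid=0x2 waiting on condition [0x3]
   java.lang.Thread.State: WAITING (parking)
\tat sun.misc.Unsafe.park(Native Method)
\t- parking to wait for  <0x00000000c0a1b2c3> (a java.util.concurrent.Semaphore$NonfairSync)
"qtp1-2" prio=10 tid=0x4 nid=0x5 runnable [0x6]
   java.lang.Thread.State: RUNNABLE
"""

states, parked_on = summarize_jstack(SAMPLE)
print(states)
print(parked_on)
```

On a real dump one would pass the file contents instead, e.g. `summarize_jstack(open("threads.txt").read())`, and look for large counts of threads parked on the same synchronizer across nodes.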
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721761#comment-13721761 ] Mike Schrag commented on SOLR-5081: --- I attached a jstack of one of them (threads.txt) during this event. Do you want me to reproduce and grab more?
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721764#comment-13721764 ] Mike Schrag commented on SOLR-5081: --- btw, in the run corresponding to that stack dump I had dropped the Hadoop cluster to single-record batches. I saw a note somewhere (I think from you?) that suggested increasing the semaphore permits, which I was about to test, too. It's not clear what a reasonable value is, but I jacked it up from *16 to *1024 and figured I'd go for broke :)
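Why a permit count could matter under cross-node forwarding can be seen in a toy model. This is a minimal Python sketch, not Solr's code: it assumes two hypothetical nodes, A and B, each capping concurrent work with a semaphore, and assumes every update holds a permit on the node that received it while it waits for a permit on the peer it forwards to. `run_cross_forwarding` and its parameters are illustrative names, and the timeout stands in for a wait that would otherwise never end.

```python
import threading


def run_cross_forwarding(permits: int, requests_per_node: int, timeout: float = 1.0) -> bool:
    """Toy model of two nodes that forward every update to each other.

    Returns True if every request finished, False if any request gave up
    waiting for a permit on the peer (i.e. the model stalled).
    """
    if not 0 < requests_per_node <= permits:
        raise ValueError("need 0 < requests_per_node <= permits")
    sems = {"A": threading.Semaphore(permits), "B": threading.Semaphore(permits)}
    peer = {"A": "B", "B": "A"}
    # Worst case: every request takes its local permit before any forwarding starts.
    all_holding = threading.Barrier(2 * requests_per_node)
    finished = []
    lock = threading.Lock()

    def handle(node: str) -> None:
        with sems[node]:                        # permit for the incoming update
            all_holding.wait()
            # Forward to the peer while still holding the local permit.
            if sems[peer[node]].acquire(timeout=timeout):
                sems[peer[node]].release()      # peer handled the update
                with lock:
                    finished.append(node)

    threads = [threading.Thread(target=handle, args=(node,))
               for node in ("A", "B") for _ in range(requests_per_node)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(finished) == len(threads)


print(run_cross_forwarding(permits=4, requests_per_node=3))  # True
print(run_cross_forwarding(permits=4, requests_per_node=4))  # False
```

In this model the stall appears as soon as concurrent requests per node reach the permit count, because every permit is then held by a request waiting on the other node. A larger pool (as in the *16 to *1024 change) moves that threshold rather than removing it.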
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721767#comment-13721767 ] Erick Erickson commented on SOLR-5081: -- You really want to go for broke? Try SOLR-4816 (note: this assumes you're indexing from SolrJ). The deadlock I've seen has to do with intra-shard routing: forwarding packets to other shards can, if there are enough of them, lead to this situation. That JIRA is about having SolrJ send the documents directly to the right leader, so the leader does not have to route the docs to other shards. We'd be really interested to see if that works in the real world... NOTE: I'm not sure what the current state of that patch is; I think it was ready to rock-n-roll but just missed the cut for 4.4.
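The idea behind SOLR-4816, as described above, is that the client works out which shard each document belongs to and sends it straight to that shard's leader. A minimal Python sketch of the client-side grouping step, under loudly stated assumptions: it hashes the id with `zlib.crc32` as a stand-in (SolrCloud's compositeId router uses MurmurHash3 and its own range layout, so these assignments will not match a real cluster), and `shard_for` / `group_by_shard` are hypothetical helpers, not SolrJ APIs.

```python
import zlib
from collections import defaultdict

HASH_SPACE = 2 ** 32  # unsigned 32-bit hash space, split into contiguous ranges


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard index by hash range.

    Stand-in hash: zlib.crc32, NOT the MurmurHash3 used by SolrCloud's router.
    """
    h = zlib.crc32(doc_id.encode("utf-8"))   # 0 <= h < 2**32 in Python 3
    width = HASH_SPACE // num_shards          # equal-width ranges
    return min(h // width, num_shards - 1)    # last range absorbs the remainder


def group_by_shard(doc_ids, num_shards: int) -> dict:
    """Bucket ids so each batch could be sent straight to its shard's leader."""
    batches = defaultdict(list)
    for doc_id in doc_ids:
        batches[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(batches)


batches = group_by_shard([f"doc{i}" for i in range(100)], num_shards=20)
print(sum(len(ids) for ids in batches.values()))  # 100
```

Looking up each shard's current leader in the cluster state and sending the batch there is not modelled here.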
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721781#comment-13721781 ] Mike Schrag commented on SOLR-5081: --- (that's with the latest SOLR-4816 patch applied)
[jira] [Commented] (SOLR-5081) Highly parallel document insertion hangs SolrCloud
[ https://issues.apache.org/jira/browse/SOLR-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13721780#comment-13721780 ] Mike Schrag commented on SOLR-5081: --- No luck :( Whatever this hang is doesn't appear to be the same as that.