[jira] [Comment Edited] (SOLR-3685) Solr Cloud sometimes skipped peersync attempt and replicated instead due to tlog flags not being cleared when no updates were buffered during a previous replication.

Markus Jelsma (JIRA) Thu, 16 Aug 2012 11:49:41 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436214#comment-13436214
 ]


Markus Jelsma edited comment on SOLR-3685 at 8/17/12 5:48 AM:
--------------------------------------------------------------

We didn't think mmap could be the cause, but we nevertheless tried that once on 
a smaller cluster and saw a lot of memory consumption again, after which it was 
killed.
I can see whether I can run one or two of the nodes with NIOFS and let the 
others run with mmap. We don't automatically restart cores, so it should run 
fine if we temporarily change the config in ZooKeeper and restart two nodes.

edit: each core has a ~2.5GB index.
                
> Solr Cloud sometimes skipped peersync attempt and replicated instead due to 
> tlog flags not being cleared when no updates were buffered during a previous 
> replication.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3685
>                 URL: https://issues.apache.org/jira/browse/SOLR-3685
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java), SolrCloud
>    Affects Versions: 4.0-ALPHA
>         Environment: Debian GNU/Linux Squeeze 64bit
> Solr 5.0-SNAPSHOT 1365667M - markus - 2012-07-25 19:09:43
>            Reporter: Markus Jelsma
>            Assignee: Yonik Seeley
>            Priority: Critical
>             Fix For: 4.0, 5.0
>
>         Attachments: info.log, oom-killer.log
>
>
> There's a serious problem with restarting nodes: old or unused index 
> directories are not cleaned up, replication starts suddenly, and Java is 
> killed by the OS due to excessive memory allocation. Since SOLR-1781 was 
> fixed, index directories get cleaned up when a node is restarted cleanly. 
> However, old or unused index directories still pile up if Solr crashes or is 
> killed by the OS, which is what happens here.
> We have a six-node 64-bit Linux test cluster with each node having two 
> shards. There's 512MB RAM available and no swap. Each index is roughly 27MB, 
> so about 50MB per node; this fits easily and works fine. However, if a node 
> is restarted, Solr will consistently crash because it immediately eats 
> up all RAM. If swap is enabled, Solr will eat an additional few hundred MB 
> right after startup.
> This cannot be solved by restarting Solr; it will just crash again and leave 
> index directories in place until the disk is full. The only way I can restart 
> a node safely is to delete the index directories and have it replicate from 
> another node. If I then restart the node, it will crash almost consistently.
> I'll attach a log of one of the nodes.

