Well, such checks could be put in, but they don't get past the basic problem.

bq: If the segments are out of date and we are pulling from another
node before coming "online" why aren't the old segments deleted?

Because you run the risk of losing _all_ your data and having nothing
at all. The process is:
1> pull all the segments down
2> rewrite the segments file

Until <2>, you can still use your old index. Also consider a full
sync in master/slave mode: I optimize on the master, Solr detects
that it will be a full sync, and... if the old segments had been
deleted up front, the entire active index would be gone before a
single new file arrived.
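
To make that ordering concrete, here is a minimal sketch in Java of the
two-phase idea. It is NOT the actual IndexFetcher code; the class,
method, and directory names are made up for illustration only:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Stream;

    public class TwoPhaseFetchSketch {
        // Phase 1: download everything into a temp dir; the live index in
        // indexDir is untouched and still fully usable if anything fails.
        // Phase 2: only after every file is safely on disk, move the new
        // files in and publish the new segments file.
        void fullFetch(Path indexDir, Path tmpDir) throws IOException {
            downloadAllSegments(tmpDir);                       // phase 1
            try (Stream<Path> files = Files.list(tmpDir)) {
                for (Path f : (Iterable<Path>) files::iterator) {
                    Files.move(f, indexDir.resolve(f.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);  // phase 2
                }
            }
        }

        void downloadAllSegments(Path tmpDir) throws IOException {
            // fetch each segment file from the leader/master (omitted)
        }
    }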

bq: Is this something that can be enabled in the master solrconfig.xml file?
No.

bq: ...is there a reason a basic disk space check isn't done ....
That would not be very robust. Say the index is 1G and I have
1.5G of free space. Replication makes the check and starts. However,
while the download is running, segments are merged, consuming another
0.75G. Boom, disk full again.

Additionally, any such check would be per core. What if 10 cores
start replication like that at once, which is exactly what would
happen if you have 10 replicas of the same shard in one JVM? Each
core's check can look fine in isolation while together they blow
past the available space.
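
For what it's worth, the kind of pre-flight check being asked about
would look roughly like the sketch below. This is not something Solr's
ReplicationHandler does, and the names are invented; it just shows the
race: the free-space snapshot can be stale by the time the download
(plus any concurrent merge, plus any other core's replication into the
same partition) actually needs the space.

    import java.io.File;

    public class NaiveSpaceCheck {
        // Snapshot the free space and compare it to the expected download
        // size. The value can be stale immediately: a background merge or
        // another core replicating into the same partition may consume
        // the "free" space before (or while) the fetch runs.
        static boolean looksLikeEnoughSpace(File indexDir, long bytesToFetch) {
            return indexDir.getUsableSpace() > bytesToFetch;
        }
    }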

And all this masks your real problem: you didn't have enough disk
space to optimize in the first place. Even during regular indexing w/o
optimizing, Lucene segment merging can theoretically merge all
your segments at once. Therefore you always need at _least_ as much
free space on your disks as all your indexes occupy to be sure you
won't hit a disk-full problem; anything else is a band-aid. That said,
refusing to even start the fetch when there isn't enough free disk
space isn't a bad idea; it's just not foolproof.
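
If you want to apply that rule of thumb operationally, a small
standalone check along these lines can warn you when free space drops
below the total index size. The data directory path is hypothetical
and this is not a Solr feature, just a sketch:

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.stream.Stream;

    public class FreeSpaceSanityCheck {
        public static void main(String[] args) throws IOException {
            // Hypothetical data directory; point this at your own cores.
            Path dataDir = Paths.get("/var/solr/data");

            // Total bytes occupied by every file under the data directory.
            long indexBytes;
            try (Stream<Path> files = Files.walk(dataDir)) {
                indexBytes = files.filter(Files::isRegularFile)
                                  .mapToLong(p -> p.toFile().length())
                                  .sum();
            }

            long freeBytes = Files.getFileStore(dataDir).getUsableSpace();
            if (freeBytes < indexBytes) {
                System.out.println("WARNING: free space " + freeBytes
                    + " bytes is less than total index size " + indexBytes
                    + " bytes; a big merge or a full sync could fill the disk.");
            }
        }
    }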

Best,
Erick


On Mon, Nov 28, 2016 at 8:39 AM, Michael Joyner <mich...@newsrx.com> wrote:
> Hello all,
>
> I'm running out of space when trying to restart nodes to get a cluster
> back up fully operational after a node ran out of space during an optimize.
>
> It appears to be trying to do a full sync from another node, but doesn't
> take care to check available space before starting downloads and doesn't
> delete the out of date segment files before attempting to do the full sync.
>
> If the segments are out of date and we are pulling from another node before
> coming "online" why aren't the old segments deleted? Is this something that
> can be enabled in the master solrconfig.xml file?
>
> It seems to know what size the segments are before they are transferred, so is
> there a reason a basic disk space check isn't done for the target partition,
> with an immediate abort if the destination's space looks like it would
> go negative, before attempting the sync? Is this something that can be enabled in
> the master solrconfig.xml file? This would be a lot more useful (IMHO) than
> waiting for a full sync to complete only to run out of space after several
> hundred gigs of data is transferred with automatic cluster recovery failing
> as a result.
>
> This happens when doing a 'sudo service solr restart'
>
> (Workaround: shut down the offending node, manually delete the segment index
> folders and tlog files, then start the node again.)
>
> Exception:
>
> WARN  - 2016-11-28 16:15:16.291; org.apache.solr.handler.IndexFetcher$FileFetcher; Error in fetching file: _2f6i.cfs (downloaded 2317352960 of 5257809205 bytes)
> java.io.IOException: No space left on device
>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>     at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>     at sun.nio.ch.IOUtil.write(IOUtil.java:65)
>     at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:211)
>     at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
>     at java.nio.channels.Channels.writeFully(Channels.java:101)
>     at java.nio.channels.Channels.access$000(Channels.java:61)
>     at java.nio.channels.Channels$1.write(Channels.java:174)
>     at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419)
>     at java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73)
>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>     at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
>     at org.apache.solr.handler.IndexFetcher$DirectoryFile.write(IndexFetcher.java:1634)
>     at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1491)
>     at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1429)
>     at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:855)
>     at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:434)
>     at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:251)
>     at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
>     at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:156)
>     at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:408)
>     at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
>
> -Mike
>
