Another thing I have noticed is that out of my master + 15 slaves, two
slaves always carry a higher inode load. For example, right now I am
running an intensive job that takes about an hour to finish, and two
slaves have been showing steadily increasing inode consumption (they
are about 10% above the rest of the slaves and the master).
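For reference, a quick way to watch this per node is df's inode mode; a
rough sketch (the hostnames are just placeholders, not our actual nodes):

  # report inode usage instead of block usage on each node
  for h in master slave01 slave02; do
    echo "== $h =="
    ssh "$h" df -i /
  done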
Ognen
On 3/24/14, 7:00 AM, Ognen Duzlevski wrote:
Patrick, correct. I have a 16-node cluster. On 14 of the 16 machines,
inode usage was about 50%. Of the two remaining slaves, one had inode
usage of 96% and the other 100%. When I went into /tmp on these two
nodes, there were a bunch of /tmp/spark* subdirectories, which I
deleted. This brought the inode consumption back down to 50%, and the
job ran successfully to completion. The slave with 100% inode usage had
the message in spark/work/app/<number>/stdout that the filesystem was
running out of disk space (which I posted in the original email that
started this thread).
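The cleanup itself was just removing those leftover directories, roughly
like this (the -mtime guard is my own addition so that directories of
still-running jobs are left alone):

  # remove Spark scratch directories under /tmp older than one day
  sudo find /tmp -maxdepth 1 -name 'spark*' -mtime +1 -exec rm -rf {} +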
What is interesting is that only two of the 16 nodes had this
problem :)
Ognen
On 3/24/14, 12:57 AM, Patrick Wendell wrote:
Ognen - just so I understand: the issue is that there weren't enough
inodes, and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know, because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com> wrote:
I would love to work on this (and other) stuff if I can bother
someone with
questions offline or on a dev mailing list.
Ognen
On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up. 100% inode utilization is an issue I
haven't seen raised before, and it raises another issue that is not on
our current roadmap for state cleanup: cleaning up data that was left
behind by a crashed or interrupted process.
On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
<og...@plainvanillagames.com> wrote:
Bleh, strike that, one of my slaves was at 100% inode utilization
on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.
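Such a cron entry might look roughly like this (the schedule and the
one-day age cutoff are arbitrary choices on my part):

  # /etc/cron.d/spark-tmp-cleanup: nightly cleanup of stale Spark scratch dirs
  30 3 * * * root find /tmp -maxdepth 1 -name 'spark*' -mtime +1 -exec rm -rf {} +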
Thanks (and sorry for the noise)!
Ognen
On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
Aaron, thanks for replying. I am very much puzzled as to what is going
on. A job that used to run on the same cluster is failing with this
mysterious message about not having enough disk space, when in fact I
can see through "watch df -h" that free space is always hovering around
3+ GB on the disk and free inodes are at 50% (this is on the master). I
went through spark/work/app*/stderr, stdout, and spark/logs/*out on
each slave, and there is no mention of "too many open files" failures
on any of the slaves or on the master :(
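For anyone checking the same thing, this is easy to script rather than
doing it by hand; a rough sketch (hostnames and the spark install path
are placeholders):

  # grep every executor/daemon log on each slave for the open-files error
  for h in slave01 slave02; do
    ssh "$h" 'grep -l "Too many open files" spark/work/app*/std* spark/logs/*out 2>/dev/null'
  done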
Thanks
Ognen
On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle and post-shuffle
stages), P^2 files are created. With spark.shuffle.consolidateFiles
turned on, we would instead create only P files. Disk space consumption,
however, is largely unaffected by the number of partitions unless each
partition is particularly small.
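For anyone who wants to try that flag, one way a 0.9-era standalone
deployment might set it is via SPARK_JAVA_OPTS in spark-env.sh (just a
sketch; check the configuration docs for your version):

  # conf/spark-env.sh: pass the property as a JVM system property
  export SPARK_JAVA_OPTS="-Dspark.shuffle.consolidateFiles=true $SPARK_JAVA_OPTS"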
You might look at the actual executors' logs, as it's possible that
this
error was caused by an earlier exception, such as "too many open
files".
On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
<og...@plainvanillagames.com> wrote:
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp
if /tmp is full. Actually, it's recommended to have multiple local disks
and set it to a comma-separated list of directories, one per disk.
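As a concrete sketch of that (the mount points are examples only, each
entry should sit on a separate physical disk, and this again uses the
0.9-era SPARK_JAVA_OPTS mechanism):

  # conf/spark-env.sh: spread Spark scratch space over two local disks
  export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt/disk1/spark,/mnt/disk2/spark $SPARK_JAVA_OPTS"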
Matei, does the number of tasks/partitions in a transformation affect
disk space consumption? Or inode consumption?
Thanks,
Ognen
--
"A distributed system is one in which the failure of a computer you
didn't
even know existed can render your own computer unusable"
-- Leslie Lamport
--
"No matter what they ever do to us, we must always act for the love
of our
people and the earth. We must not react out of hatred against those
who have
no sense."
-- John Trudell