I would love to work on this (and other) stuff if I can bother someone
with questions offline or on a dev mailing list.
Ognen
On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up. 100% inode utilization is an issue I
haven't seen raised before, and it points to something that is not
on our current roadmap for state cleanup: cleaning up data that was
left behind by a crashed or interrupted process.
On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
<og...@plainvanillagames.com> wrote:
Bleh, strike that, one of my slaves was at 100% inode utilization
on the file system. It was /tmp/spark* leftovers that apparently
did not get cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.
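For example, something along these lines in the crontab (untested
sketch; the one-day age cutoff is my own guess, tune it so directories
belonging to running jobs are never touched):

    # nightly at 3am: remove /tmp/spark* leftovers older than a day
    0 3 * * * find /tmp -maxdepth 1 -name 'spark*' -mtime +1 -exec rm -rf {} +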
Thanks (and sorry for the noise)!
Ognen
On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
Aaron, thanks for replying. I am very much puzzled as to what is
going on. A job that used to run on this same cluster is now failing
with this mysterious message about not having enough disk space,
when in fact "watch df -h" shows free space hovering around 3+GB
on the disk and free inodes at 50% (this is on the master). I went
through the spark/work/app*/stderr and stdout files and the
spark/logs/*out files on each slave, and there is no mention of
"too many open files" failures on any of the slaves or on the
master :(
Thanks
Ognen
On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle and
post-shuffle stages), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would instead
create only P files. Disk space consumption is largely unaffected
by the number of partitions, however, unless each partition is
particularly small.
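For reference, a minimal sketch of turning this on when building the
SparkContext (the app name here is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Enable shuffle file consolidation so a shuffle with P partitions
    // writes on the order of P files instead of P^2.
    val conf = new SparkConf()
      .setAppName("ShuffleExample") // placeholder name
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)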
You might look at the actual executors' logs, as it's possible
that this error was caused by an earlier exception, such as "too
many open files".
On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
<og...@plainvanillagames.com> wrote:
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere
other than /tmp if /tmp is full. Actually it's recommended
to have multiple local disks and set it to a
comma-separated list of directories, one per disk.
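For example, a minimal sketch assuming two local disks mounted at
/mnt/disk1 and /mnt/disk2 (both paths and the app name are
placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Spread Spark's scratch/shuffle space across two local disks
    // instead of the default /tmp.
    val conf = new SparkConf()
      .setAppName("MyApp") // placeholder name
      .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
    val sc = new SparkContext(conf)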
Matei, does the number of tasks/partitions in a
transformation affect disk space consumption? Or inode
consumption?
Thanks,
Ognen
--
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
-- Leslie Lamport
--
"No matter what they ever do to us, we must always act for the love of our people
and the earth. We must not react out of hatred against those who have no sense."
-- John Trudell