Another thing I have noticed is that out of my master + 15 slaves, two
slaves always carry a higher inode load. For example, right now I am
running an intensive job that takes about an hour to finish, and two
slaves have been showing steadily increasing inode consumption (they
are about 10% above the rest of the slaves and the master).
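For reference, a quick way to watch this per node is df's inode mode; a
rough sketch (the hostnames are just placeholders, not our actual nodes):

  # report inode usage instead of block usage on each node
  for h in master slave01 slave02; do
    echo "== $h =="
    ssh "$h" df -i /
  done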
Ognen
On 3/24/14, 7:00 AM, Ognen Duzlevski wrote:
Patrick, correct. I have a 16-node cluster. On 14 of the 16 machines,
inode usage was about 50%. Of the two remaining slaves, one had inode
usage of 96% and the other 100%. When I went into /tmp on these two
nodes, there were a bunch of /tmp/spark* subdirectories, which I
deleted. This brought the inode consumption back down to 50%, and the
job ran successfully to completion. The slave with 100% inode usage had
the message in spark/work/app/<number>/stdout that the filesystem was
running out of disk space (which I posted in the original email that
started this thread).
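The cleanup itself was just removing those leftover directories, roughly
like this (the -mtime guard is my own addition so that directories of
still-running jobs are left alone):

  # remove Spark scratch directories under /tmp older than one day
  sudo find /tmp -maxdepth 1 -name 'spark*' -mtime +1 -exec rm -rf {} +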
What is interesting is that only two of the 16 nodes had this
problem :)
Ognen
On 3/24/14, 12:57 AM, Patrick Wendell wrote:
Ognen - just so I understand: the issue is that there weren't enough
inodes, and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know, because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com> wrote:
I would love to work on this (and other) stuff if I can bother
someone with
questions offline or on a dev mailing list.
Ognen
On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up. 100% inode utilization is an issue I
haven't seen raised before, and it raises another issue that is not on
our current roadmap for state cleanup: cleaning up data that was left
behind by a crashed or interrupted process.
On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
<og...@plainvanillagames.com> wrote:
Bleh, strike that, one of my slaves was at 100% inode utilization
on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.
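Such a cron entry might look roughly like this (the schedule and the
one-day age cutoff are arbitrary choices on my part):

  # /etc/cron.d/spark-tmp-cleanup: nightly cleanup of stale Spark scratch dirs
  30 3 * * * root find /tmp -maxdepth 1 -name 'spark*' -mtime +1 -exec rm -rf {} +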
Thanks (and sorry for the noise)!
Ognen
On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
Aaron, thanks for replying. I am very much puzzled as to what is going
on. A job that used to run on the same cluster is failing with this
mysterious message about not having enough disk space, when in fact I
can see through "watch df -h" that free space is always hovering around
3+ GB on the disk and free inodes are at 50% (this is on the master). I
went through spark/work/app*/stderr, stdout, and spark/logs/*out on
each slave, and there is no mention of "too many open files" failures
on any of the slaves or on the master :(
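For anyone checking the same thing, this is easy to script rather than
doing it by hand; a rough sketch (hostnames and the spark install path
are placeholders):

  # grep every executor/daemon log on each slave for the open-files error
  for h in slave01 slave02; do
    ssh "$h" 'grep -l "Too many open files" spark/work/app*/std* spark/logs/*out 2>/dev/null'
  done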
Thanks
Ognen
On 3/23/14, 8:38 PM, Aaron Davidson wrote:
By default, with P partitions (for both the pre-shuffle and post-shuffle
stages), P^2 files are created. With spark.shuffle.consolidateFiles
turned on, we would instead create only P files. Disk space consumption,
however, is largely unaffected by the number of partitions unless each
partition is particularly small.
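For anyone who wants to try that flag, one way a 0.9-era standalone
deployment might set it is via SPARK_JAVA_OPTS in spark-env.sh (just a
sketch; check the configuration docs for your version):

  # conf/spark-env.sh: pass the property as a JVM system property
  export SPARK_JAVA_OPTS="-Dspark.shuffle.consolidateFiles=true $SPARK_JAVA_OPTS"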
You might look at the actual executors' logs, as it's possible that
this
error was caused by an earlier exception, such as "too many open
files".
On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
<og...@plainvanillagames.com> wrote:
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
You can set spark.local.dir to put this data somewhere other than /tmp
if /tmp is full. Actually, it's recommended to have multiple local disks
and set it to a comma-separated list of directories, one per disk.
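As a concrete sketch of that (the mount points are examples only, each
entry should sit on a separate physical disk, and this again uses the
0.9-era SPARK_JAVA_OPTS mechanism):

  # conf/spark-env.sh: spread Spark scratch space over two local disks
  export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt/disk1/spark,/mnt/disk2/spark $SPARK_JAVA_OPTS"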
Matei, does the number of tasks/partitions in a transformation affect
disk space consumption? Or inode consumption?
Thanks,
Ognen
--
"A distributed system is one in which the failure of a computer you
didn't
even know existed can render your own computer unusable"
-- Leslie Lamport
--
"No matter what they ever do to us, we must always act for the love
of our
people and the earth. We must not react out of hatred against those
who have
no sense."
-- John Trudell