I would love to work on this (and other things) if I can bother someone with questions offline or on the dev mailing list.
Ognen

On 3/23/14, 10:04 PM, Aaron Davidson wrote:
Thanks for bringing this up. 100% inode utilization is an issue I haven't seen raised before, and it points to another item that is not on our current roadmap: state cleanup (i.e., removing data that was not fully cleaned up by a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski <og...@plainvanillagames.com> wrote:

    Bleh, strike that, one of my slaves was at 100% inode utilization
    on the file system. It was /tmp/spark* leftovers that apparently
    did not get cleaned up properly after failed or interrupted jobs.
    Mental note - run a cron job on all slaves and the master to clean
    up /tmp/spark* regularly.
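    A rough sketch of what that cron entry might look like (the hourly
    schedule, one-day retention window and path depth are assumptions,
    not something I have tested):

        # remove /tmp/spark* leftovers untouched for more than a day
        0 * * * * find /tmp -maxdepth 1 -name 'spark*' -mmin +1440 -exec rm -rf {} +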

    Thanks (and sorry for the noise)!
    Ognen


    On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
    Aaron, thanks for replying. I am very much puzzled as to what is
    going on. A job that used to run on the same cluster is failing
    with this mysterious message about not having enough disk space,
    when in fact I can see through "watch df -h" that free space is
    always hovering around 3+ GB and free inodes are at 50% (this is
    on the master). I went through spark/work/app*/stderr and stdout
    and spark/logs/*out on each slave and found no mention of "too
    many open files" failures on any of the slaves or on the master :(
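    Concretely, the checks amounted to something along these lines (the
    grep pattern is only an approximation of the exception text):

        df -h          # free disk space per filesystem
        df -i          # free inodes per filesystem
        grep -ri "too many open files" spark/work/app*/std* spark/logs/*out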

    Thanks
    Ognen

    On 3/23/14, 8:38 PM, Aaron Davidson wrote:
    By default, with P partitions (for both the pre-shuffle stage
    and post-shuffle), there are P^2 files created.
    With spark.shuffle.consolidateFiles turned on, we would instead
    create only P files. Disk space consumption, however, is largely
    unaffected by the number of partitions unless each partition is
    particularly small.
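    One way to try that, assuming properties are passed to the JVM as
    -D options (e.g. via SPARK_JAVA_OPTS in spark-env.sh on this
    version of Spark), would be:

        export SPARK_JAVA_OPTS="-Dspark.shuffle.consolidateFiles=true"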

    You might look at the actual executors' logs, as it's possible
    that this error was caused by an earlier exception, such as "too
    many open files".


    On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
    <og...@plainvanillagames.com> wrote:

        On 3/23/14, 5:49 PM, Matei Zaharia wrote:
        You can set spark.local.dir to put this data somewhere
        other than /tmp if /tmp is full. Actually it's recommended
        to have multiple local disks and set it to a
        comma-separated list of directories, one per disk.
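        For example (the mount points below are just placeholders; the
        exact way to pass the property depends on how the cluster is
        launched, spark-env.sh being one option):

            export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt1/spark,/mnt2/spark"
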
        Matei, does the number of tasks/partitions in a
        transformation influence something in terms of disk space
        consumption? Or inode consumption?

        Thanks,
        Ognen



-- "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
    -- Leslie Lamport



--
"No matter what they ever do to us, we must always act for the love of our people 
and the earth. We must not react out of hatred against those who have no sense."
-- John Trudell
