Thanks for bringing this up. 100% inode utilization is an issue I haven't seen raised before, and it points to another item that is not yet on our roadmap: state cleanup, i.e. removing data that a crashed process did not fully clean up.
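As an interim workaround, the cron cleanup Ognen describes below can be sketched roughly like this. It is an assumption-laden sketch, not a vetted tool: the target root, the 24-hour age threshold, and the crontab path are placeholders, and you should confirm nothing live matches the pattern before enabling it.

```shell
# Sketch of a periodic /tmp/spark* cleanup (placeholders: root directory,
# 24-hour threshold). Only directories untouched for over a day are removed,
# to avoid deleting scratch space that running jobs may still be using.
cleanup_spark_tmp() {
    root="${1:-/tmp}"
    # -maxdepth 1: only top-level spark* entries; -mmin +1440: older than 24h.
    find "$root" -maxdepth 1 -name 'spark*' -mmin +1440 -exec rm -rf {} +
}

# Hypothetical crontab entry, hourly on every slave and the master:
#   0 * * * * /usr/local/bin/cleanup_spark_tmp.sh
```

Running `df -i` alongside `df -h` in the same job would also have surfaced the inode exhaustion earlier, since byte usage looked healthy here.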
On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski <og...@plainvanillagames.com> wrote:

> Bleh, strike that, one of my slaves was at 100% inode utilization on the
> file system. It was /tmp/spark* leftovers that apparently did not get
> cleaned up properly after failed or interrupted jobs.
> Mental note - run a cron job on all slaves and the master to clean up
> /tmp/spark* regularly.
>
> Thanks (and sorry for the noise)!
> Ognen
>
> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>
>> Aaron, thanks for replying. I am very much puzzled as to what is going
>> on. A job that used to run on the same cluster is failing with this
>> mysterious message about not having enough disk space, when in fact I
>> can see through "watch df -h" that the free space is always hovering
>> around 3+ GB on the disk and the free inodes are at 50% (this is on the
>> master). I went through each slave's spark/work/app*/stderr and stdout
>> and spark/logs/*out files, and there is no mention of "too many open
>> files" failures on any of the slaves nor on the master :(
>>
>> Thanks
>> Ognen
>>
>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>
>>> By default, with P partitions (for both the pre-shuffle and
>>> post-shuffle stages), there are P^2 files created.
>>> With spark.shuffle.consolidateFiles turned on, we would instead create
>>> only P files. Disk space consumption is largely unaffected, however,
>>> by the number of partitions, unless each partition is particularly
>>> small.
>>>
>>> You might look at the actual executors' logs, as it's possible that
>>> this error was caused by an earlier exception, such as "too many open
>>> files".
>>>
>>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>>> <og...@plainvanillagames.com> wrote:
>>>
>>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>>
>>>>> You can set spark.local.dir to put this data somewhere other than
>>>>> /tmp if /tmp is full. Actually it's recommended to have multiple
>>>>> local disks and set it to a comma-separated list of directories, one
>>>>> per disk.
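Aaron's figures, and his "too many open files" suspicion, can be sanity-checked from a shell. The partition count below is purely illustrative, not measured from this cluster:

```shell
# Illustrative arithmetic for the shuffle-file counts Aaron quotes:
# P^2 files by default, P with spark.shuffle.consolidateFiles.
P=1000
echo "$((P * P)) shuffle files without consolidation, $P with it"

# The per-process open-file limit on this node; a low value here makes a
# "too many open files" failure plausible once P^2 shuffle files are in play.
ulimit -n
```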
>>>>
>>>> Matei, does the number of tasks/partitions in a transformation
>>>> influence anything in terms of disk space consumption? Or inode
>>>> consumption?
>>>>
>>>> Thanks,
>>>> Ognen
>
> --
> "A distributed system is one in which the failure of a computer you didn't
> even know existed can render your own computer unusable"
> -- Leslie Lamport
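Pulling the thread's two suggestions together, a configuration sketch might look like this. The property names are as given by Aaron and Matei; the directory paths are hypothetical placeholders, and the exact mechanism for setting properties varies by Spark version:

```
# Sketch only: paths are illustrative, one local directory per physical disk.
spark.local.dir                 /mnt/disk1/spark,/mnt/disk2/spark
spark.shuffle.consolidateFiles  true
```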