Hi, can you paste the output of the `df -h` command here?
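For example, something like the following would show both the overall usage and where the space is going (the path in the comment is the local-dirs location from the report below; `LOCAL_DIR` is just an illustrative variable, and the usercache/filecache note assumes the usual NodeManager layout):

```shell
# Replace with your yarn.nodemanager.local-dirs location, e.g.
# /l/ssd/achutest/localstore/yarn-nm from the report below.
LOCAL_DIR="${LOCAL_DIR:-/}"

# Overall usage of the filesystem backing the local dirs; the
# NodeManager marks the dir bad once Use% crosses the 90% threshold.
df -h "$LOCAL_DIR"

# Where the space is actually going (usercache typically holds
# per-application intermediate data, filecache localized resources).
du -sh "$LOCAL_DIR"/* 2>/dev/null || true
```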
Regards,
Sidharth

On Wednesday, April 12, 2017, Albert Chu <ch...@llnl.gov> wrote:
> Hi,
>
> I have a cluster where we have a parallel networked file system for our
> major data storage, and our nodes have ~750G of local SSD space. To
> speed things up, we configure yarn.nodemanager.local-dirs to use the
> local SSD for local caching.
>
> Recently, I've been trying to do a terasort of 2 terabytes of data over
> 8 nodes with Hadoop 2.7.3. That's about 6000 gigs of local SSD space
> for caching, or 5400 gigs once Hadoop applies its 90% disk-full check.
>
> I always get disk-full errors such as the following when running:
>
> 2017-04-11 12:31:44,062 WARN
> org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection:
> Directory /l/ssd/achutest/localstore/yarn-nm error, used space above
> threshold of 90.0%, removing from list of valid directories
> 2017-04-11 12:31:44,063 INFO
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService:
> Disk(s) failed: 1/1 local-dirs are bad: /l/ssd/achutest/localstore/yarn-nm;
> 2017-04-11 12:31:44,063 ERROR
> org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService:
> Most of the disks failed. 1/1 local-dirs are bad:
> /l/ssd/achutest/localstore/yarn-nm;
>
> What I don't understand is how I am getting disk-full errors. Within
> terasort, I should have at most 2000 gigs of mapped intermediate data
> and at most 2000 gigs of merged data in the reducers. Even allowing for
> some overhead from Hadoop, I should have more than enough space for this
> benchmark to complete, given that maps and reducers are spread out
> evenly across the nodes.
>
> So my assumption is that something else is being cached in local-dirs
> that I'm not accounting for. Is there any other data I should consider
> when coming up with my estimates?
>
> One guess I had: is it possible that spilled data from reducer merges
> is not deleted until a reducer completes?
> Given my example above, could the total amount of merged data in the
> reducers then exceed 2000 gigs at some point?
>
> Al
>
> --
> Albert Chu
> ch...@llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory

--
Regards,
Sidharth Kumar | Mob: +91 8197 555 599 | LinkedIn
<https://www.linkedin.com/in/sidharthkumar2792/>
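For reference, the capacity arithmetic in the quoted message can be checked with a short sketch. The node count, per-node SSD size, and sort size are taken from the report; the 90% figure is the disk-utilization threshold mentioned in the log, and the "expected peak" is the naive estimate from the message (map output plus reduce-side merged data), not a claim about terasort's actual footprint:

```python
# Back-of-the-envelope check of the local-dir capacity estimate
# from the quoted message (8 nodes x ~750 GB local SSD).
nodes = 8
ssd_per_node_gb = 750
disk_full_threshold = 0.90   # threshold from the DirectoryCollection log
sort_size_gb = 2000          # 2 TB terasort input

raw_capacity_gb = nodes * ssd_per_node_gb            # 6000 GB
usable_gb = raw_capacity_gb * disk_full_threshold    # 5400 GB

# Naive expectation from the message: intermediate map data plus
# merged reduce data, each bounded by the full sort size.
expected_peak_gb = 2 * sort_size_gb                  # 4000 GB

print(f"usable: {usable_gb:.0f} GB, expected peak: {expected_peak_gb} GB")
# By this estimate the sort fits with ~1400 GB to spare, which is why
# extra data in local-dirs (e.g. undeleted spill files) is suspected.
```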