HDFS in Hadoop 1.0 only times out a bad DataNode after 20 minutes by default. 
Until then the NameNode will happily direct requests to the bad DataNode, and 
each request then has to time out individually.
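
For reference, roughly where that interval comes from (the property names and 
the formula are from memory of the 1.x FSNamesystem code, so double-check them 
against your hdfs-site.xml):

    // Minimal sketch, assuming the stock Hadoop 1.x heartbeat settings.
    import org.apache.hadoop.conf.Configuration;

    public class DeadNodeInterval {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // DataNode heartbeat period (seconds) and NameNode recheck period (millis).
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
        long recheckMs = conf.getLong("heartbeat.recheck.interval", 5 * 60 * 1000);
        // A DataNode is only declared dead after 2 * recheck + 10 * heartbeat.
        long expireMs = 2 * recheckMs + 10 * 1000 * heartbeatSec;
        System.out.println("DataNode declared dead after ~" + (expireMs / 60000) + " minutes");
      }
    }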

So after a while (or a clean restart of everything) this should have fixed 
itself. Did it?


In Hadoop 2.0 (or 2.2?) our own Nicolas Liochon added another state for 
DataNodes: after 30s (or so) of unreachability the NameNode no longer directs 
requests to such a DataNode unless no other DataNode is available for the 
block in question.
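
The knobs for that (names and defaults from memory, so verify against the 2.x 
docs; the stale interval defaults to 30s and read-avoidance may be off by 
default):

    // Minimal sketch of reading the Hadoop 2.x "stale DataNode" settings.
    import org.apache.hadoop.conf.Configuration;

    public class StaleNodeSettings {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // After this long without a heartbeat the DataNode is marked stale
        // (well before it is declared dead).
        long staleMs = conf.getLong("dfs.namenode.stale.datanode.interval", 30000);
        // When enabled, the NameNode orders stale DataNodes last in the block
        // locations it hands to readers.
        boolean avoidStaleReads = conf.getBoolean("dfs.namenode.avoid.read.stale.datanode", false);
        System.out.println("stale after " + staleMs + " ms, avoided for reads: " + avoidStaleReads);
      }
    }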

-- Lars



________________________________
 From: Amit Sela <[email protected]>
To: [email protected]; lars hofhansl <[email protected]> 
Sent: Tuesday, July 8, 2014 9:38 AM
Subject: Re: FileNotFoundException in bulk load
 

I think Lars is right. We ended up with errors in the RAID on that
regionserver the next day.

Still, shouldn't HDFS supply one of the replicas? Why did the audit log
show a successful open, a successful rename, and then retried opens where it
finally threw the exception?





On Sun, Jul 6, 2014 at 8:17 PM, lars hofhansl <[email protected]> wrote:

> If we continue the discussion there we should reopen the JIRA.
> That's fine if the exception is identical; otherwise open a new one if this
> is a different issue.
>
> At first blush this looks a bit like a temporary unavailability of HDFS.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Ted Yu <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Sunday, July 6, 2014 8:01 AM
> Subject: Re: FileNotFoundException in bulk load
>
>
> The IOExceptions likely came from the store.assertBulkLoadHFileOk() call.
>
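> For what it's worth, a quick way to see from the client side what is actually
> left under the MR output directory when this happens (just an illustrative
> listing, not HBase's internal check; the path is the one from your stack trace):
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileStatus;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     public class ListBulkLoadHFiles {
>       public static void main(String[] args) throws Exception {
>         FileSystem fs = FileSystem.get(new Configuration());
>         // Family directories (metadata, gen, gen1, ...) under the job output.
>         Path outputDir = new Path("/data/output_jobs/output_websites/HFiles_20140705");
>         for (FileStatus family : fs.listStatus(outputDir)) {
>           if (!family.isDir()) continue;
>           // Print every HFile still present, to compare against the exception.
>           for (FileStatus hfile : fs.listStatus(family.getPath())) {
>             System.out.println(hfile.getPath() + " (len=" + hfile.getLen() + ")");
>           }
>         }
>       }
>     }
>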
> HBASE-4030 seems to be a better place for future discussion since you can
> attach regionserver log(s) there.
>
> Cheers
>
>
>
> On Sun, Jul 6, 2014 at 5:23 AM, Amit Sela <[email protected]> wrote:
>
> > The audit log shows that the same regionserver opened one of the HFiles,
> > renamed it (moving it from the MR output dir into the HBase region
> > directory), and then tried to open it again from the MR output dir
> > (repeating 10 times).
> > Open-Rename-10xOpen appears in that order in the audit log, with only
> > milliseconds between entries, all on the same regionserver.
> >
> >
> > On Sun, Jul 6, 2014 at 2:38 PM, Ted Yu <[email protected]> wrote:
> >
> > > Have you checked the NameNode audit log to see which client deleted the
> > > files?
> > >
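> > > Something like this against the NameNode's audit log would narrow it down
> > > (the hdfs-audit.log name and the cmd=delete / src= line format are from
> > > memory, so adjust to your layout):
> > >
> > >     import java.io.BufferedReader;
> > >     import java.io.FileReader;
> > >
> > >     public class FindDeletes {
> > >       public static void main(String[] args) throws Exception {
> > >         String auditLog = args.length > 0 ? args[0] : "hdfs-audit.log";
> > >         BufferedReader in = new BufferedReader(new FileReader(auditLog));
> > >         String line;
> > >         while ((line = in.readLine()) != null) {
> > >           // Keep only delete operations that touched the bulk-load output dir;
> > >           // the ugi=/ip= fields on the same line identify the client.
> > >           if (line.contains("cmd=delete")
> > >               && line.contains("/data/output_jobs/output_websites/HFiles_20140705")) {
> > >             System.out.println(line);
> > >           }
> > >         }
> > >         in.close();
> > >       }
> > >     }
> > >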
> > > Thanks
> > >
> > > On Jul 6, 2014, at 4:19 AM, Amit Sela <[email protected]> wrote:
> > >
> > > > I have a bulk load job that has been running daily for months, when
> > > > suddenly I got a FileNotFoundException.
> > > >
> > > > Googling it, I found HBASE-4030 and noticed someone reporting that it
> > > > started to re-appear in 0.94.8.
> > > >
> > > > I'm running Hadoop 1.0.4 and HBase 0.94.12.
> > > >
> > > > Has anyone else encountered this problem lately?
> > > >
> > > > Should I re-open the JIRA?
> > > >
> > > > Thanks,
> > > >
> > > > Amit.
> > > >
> > > > *On the client side this is the Exception:*
> > > >
> > > > java.net.SocketTimeoutException: Call to node.xxx.com/xxx.xxx.xxx.xxx:PORT failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.xxx.xxx.xxx:PORT remote=node.xxx.com/xxx.xxx.xxx.xxx:PORT]
> > > > org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles$3@29f2a6e3,
> > > > org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.io.MultipleIOException: 6 exceptions
> > > > [java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/metadata/88fd743853cf4f8a862fb19646027a48,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/gen/31c4c5cea9b348dbb6bb94115a483877,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/gen/5762c45aaf4f408ba748a989f7be9647,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/gen1/2ee02a005b654704a092d16c5c713373,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/gen1/618251330a1842a797de4b304d341a02,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/metadata/3955039392ce4f49aee5f58218a61be1]
> > > > at org.apache.hadoop.io.MultipleIOException.createIOException(MultipleIOException.java:47)
> > > > at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3673)
> > > > at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3622)
> > > > at org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFiles(HRegionServer.java:2930)
> > > > at sun.reflect.GeneratedMethodAccessor70.invoke(Unknown Source)
> > > > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:601)
> > > > at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > > > at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> > > >
> > > > *On the regionserver:*
> > > >
> > > > ERROR org.apache.hadoop.hbase.regionserver.HRegion: There were one or more IO errors when checking if the bulk load is ok.
> > > > org.apache.hadoop.io.MultipleIOException: 6 exceptions
> > > > [java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/metadata/88fd743853cf4f8a862fb19646027a48,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/gen/31c4c5cea9b348dbb6bb94115a483877,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/gen/5762c45aaf4f408ba748a989f7be9647,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/gen1/2ee02a005b654704a092d16c5c713373,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/gen1/618251330a1842a797de4b304d341a02,
> > > > java.io.FileNotFoundException: File does not exist: /data/output_jobs/output_websites/HFiles_20140705/metadata/3955039392ce4f49aee5f58218a61be1]
> > > >        at org.apache.hadoop.io.MultipleIOException.createIOException(MultipleIOException.java:47)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3673)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegion.bulkLoadHFiles(HRegion.java:3622)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegionServer.bulkLoadHFiles(HRegionServer.java:2930)
> > > >        at sun.reflect.GeneratedMethodAccessor70.invoke(Unknown Source)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >        at java.lang.reflect.Method.invoke(Method.java:601)
> > > >        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > > >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> > > >
> > > > followed by:
> > > >
> > > > ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call next(4522610431482097770, 250), rpc version=1, client version=29, methodsFingerPrint=-1368823753 from x <http://82.80.29.145:51311>xx.xxx.xxx.xxx after 12507 ms, since caller disconnected
> > > >        at org.apache.hadoop.hbase.ipc.HBaseServer$Call.throwExceptionIfCallerDisconnected(HBaseServer.java:436)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3980)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3890)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3880)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2648)
> > > >        at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >        at java.lang.reflect.Method.invoke(Method.java:601)
> > > >        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > > >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> > > > 2014-07-06 03:52:14,278 [IPC Server handler 28 on 8041] ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call next(7354511084312054096, 250), rpc version=1, client version=29, methodsFingerPrint=-1368823753 from x <http://82.80.29.145:51311/>xx.xxx.xxx.xxx after 9476 ms, since caller disconnected
> > > >        at org.apache.hadoop.hbase.ipc.HBaseServer$Call.throwExceptionIfCallerDisconnected(HBaseServer.java:436)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3980)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3890)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3880)
> > > >        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2648)
> > > >        at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
> > > >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > >        at java.lang.reflect.Method.invoke(Method.java:601)
> > > >        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > > >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
