Here is the stack trace...

Caused by: java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:267)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
        at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
        at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
        ... 15 more
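That trace is the SequenceFile reader hitting EOF partway through decoding a record length, which is what you would expect when it opens a file whose last flush landed mid-record. One detail worth noting about the writer code quoted below: SequenceFile.Writer.sync() only writes a sync marker into the stream so that readers can resynchronize; it does not push buffered bytes out to the datanodes. On a Hadoop 2.x client the writer should also expose hflush()/hsync() for that. A minimal sketch of the writer side under those assumptions (the path and the inline record feed are hypothetical stand-ins):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class PeriodicSyncWriter {

  public static final long SYNC_EVERY_LINES = 1000;

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/user/beacon/2014072117/example.seq"); // hypothetical path

    DefaultCodec codec = new DefaultCodec();
    codec.setConf(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, p,
        Text.class, Text.class, CompressionType.BLOCK, codec);
    try {
      long linesWritten = 0;
      for (String line : new String[] { "a", "b", "c" }) { // stand-in feed
        writer.append(new Text(line), new Text(line));
        linesWritten++;
        if (linesWritten % SYNC_EVERY_LINES == 0) {
          writer.sync();   // writes a sync marker only; data may stay buffered
          writer.hflush(); // actually pushes buffered bytes to the datanodes
        }
      }
    } finally {
      writer.close(); // close() flushes and publishes the final file length
    }
  }
}

Even with hflush(), the NameNode typically does not learn the new length until a block completes or the file is closed, so an open file can list as 0 bytes in hadoop dfs -ls yet still serve data, which matches the behavior described below.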
On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo <[email protected]> wrote:

> Currently using:
>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-hdfs</artifactId>
>   <version>2.3.0</version>
> </dependency>
>
> I have this piece of code that creates the writer:
>
> writer = SequenceFile.createWriter(fs, conf, p, Text.class, Text.class,
>     CompressionType.BLOCK, codec);
>
> Then I have a piece of code like this...
>
> public static final long SYNC_EVERY_LINES = 1000;
> if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0) {
>   meta.getWriter().sync();
> }
>
> And I commonly see:
>
> [ecapriolo@staging-hadoop-cdh-67-14 ~]$ hadoop dfs -ls /user/beacon/2014072117
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
>
> Found 12 items
> -rw-r--r--   3 service-igor supergroup  1065682 2014-07-21 17:50 /user/beacon/2014072117/0bb6cd71-70ac-405a-a8b7-b8caf9af8da1
> -rw-r--r--   3 service-igor supergroup  1029041 2014-07-21 17:40 /user/beacon/2014072117/1b0ef6b3-bd51-4100-9d4b-1cecdd565f93
> -rw-r--r--   3 service-igor supergroup  1002096 2014-07-21 17:10 /user/beacon/2014072117/34e2acb4-2054-44df-bbf7-a4ce7f1e5d1b
> -rw-r--r--   3 service-igor supergroup  1028450 2014-07-21 17:30 /user/beacon/2014072117/41c7aa62-d27f-4d53-bed8-df2fb5803c92
> -rw-r--r--   3 service-igor supergroup        0 2014-07-21 17:50 /user/beacon/2014072117/5450f246-7623-4bbd-8c97-8176a0c30351
> -rw-r--r--   3 service-igor supergroup  1084873 2014-07-21 17:30 /user/beacon/2014072117/8b36fbca-6f5b-48a3-be3c-6df6254c3db2
> -rw-r--r--   3 service-igor supergroup  1043108 2014-07-21 17:20 /user/beacon/2014072117/949da11a-247b-4992-b13a-5e6ce7e51e9b
> -rw-r--r--   3 service-igor supergroup   986866 2014-07-21 17:10 /user/beacon/2014072117/979bba76-4d2e-423f-92f6-031bc41f6fbd
> -rw-r--r--   3 service-igor supergroup        0 2014-07-21 17:50 /user/beacon/2014072117/b76db189-054f-4dac-84a4-a65f39a6c1a9
> -rw-r--r--   3 service-igor supergroup  1040931 2014-07-21 17:50 /user/beacon/2014072117/bba6a677-226c-4982-8fb2-4b136108baf1
> -rw-r--r--   3 service-igor supergroup  1012137 2014-07-21 17:40 /user/beacon/2014072117/be940202-f085-45bb-ac84-51ece2e1ba47
> -rw-r--r--   3 service-igor supergroup  1028467 2014-07-21 17:20 /user/beacon/2014072117/c336e0c8-76e7-40e7-98e2-9f529f25577b
>
> Sometimes, even though they show as 0 bytes, you can read data from them.
> Sometimes it blows up with a stack trace I have lost.
>
> On Tue, Jul 22, 2014 at 5:45 PM, Bertrand Dechoux <[email protected]> wrote:
>
>> I looked at the source out of curiosity: in the latest version (2.4), the
>> header is flushed during writer creation. Of course, the key/value classes
>> are provided at that point. By 0 bytes, do you really mean even without
>> the header? Or 0 bytes of payload?
>>
>> On Tue, Jul 22, 2014 at 11:05 PM, Bertrand Dechoux <[email protected]> wrote:
>>
>>> The header is expected to carry the full names of the key class and the
>>> value class, so if those are only detected with the first record (?) then
>>> indeed the file cannot respect its own format.
>>>
>>> I haven't tried it, but LazyOutputFormat should solve your problem [a
>>> usage sketch follows at the end of this thread]:
>>>
>>> https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/lib/LazyOutputFormat.html
>>>
>>> Regards
>>>
>>> Bertrand Dechoux
>>>
>>> On Tue, Jul 22, 2014 at 10:39 PM, Edward Capriolo <[email protected]> wrote:
>>>
>>>> I have two processes: one writes sequence files directly to HDFS; the
>>>> other is a Hive table that reads those files.
>>>>
>>>> All works well, with the exception that I am only flushing the files
>>>> periodically. SequenceFileInputFormat gets angry when it encounters
>>>> 0-byte seq files.
>>>>
>>>> I was considering a flush and sync on the first record write. I was also
>>>> thinking it should be possible to hack SequenceFileInputFormat to skip
>>>> 0-byte files rather than throw the exception it sometimes does from
>>>> readFully() [a sketch of that also follows the thread].
>>>>
>>>> Anyone ever tackled this?
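On the LazyOutputFormat suggestion in the thread above: it defers creating the output file, header included, until a task actually emits its first record, so empty part files never appear. It only applies when the sequence files come out of a MapReduce job, though, not when they are written directly with SequenceFile.createWriter as in the quoted code. A minimal usage sketch against the old mapred API, assuming a Text/Text job (the job wiring here is hypothetical):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.LazyOutputFormat;

public class LazyOutputExample {

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(LazyOutputExample.class);
    job.setJobName("lazy-seqfile-output"); // hypothetical job name

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Wrap the real output format: no part file (and no header) is created
    // until the first record is actually written by the task.
    LazyOutputFormat.setOutputFormatClass(job, SequenceFileOutputFormat.class);

    FileOutputFormat.setOutputPath(job, new Path("/user/beacon/out")); // hypothetical
    // mapper/reducer setup and JobClient.runJob(job) elided
  }
}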
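As for hacking the input format to skip 0-byte files: one way to do it without patching Hadoop is to subclass SequenceFileInputFormat and drop empty files in listStatus() before splits are computed, so the record reader never opens a header-less file. A minimal sketch against the old mapred API (the class name is hypothetical, and for the Hive table it would have to be wired in as the table's INPUTFORMAT):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class NonEmptySequenceFileInputFormat<K, V>
    extends SequenceFileInputFormat<K, V> {

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    FileStatus[] all = super.listStatus(job);
    List<FileStatus> nonEmpty = new ArrayList<FileStatus>();
    for (FileStatus f : all) {
      if (f.getLen() > 0) { // drop files the NameNode still reports as empty
        nonEmpty.add(f);
      }
    }
    return nonEmpty.toArray(new FileStatus[nonEmpty.size()]);
  }
}

The caveat, per the listing above, is that an open file can report a length of 0 while still holding readable data, so this filter may also hide files that are mid-write; pairing it with a writer-side flush on the first record keeps that window small.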
