RE: Key is null in map when OrcNewInputFormat is used as Input Format Class

2014-08-22 Thread John Zeng
FYI: I have created the following JIRA task:

https://issues.apache.org/jira/browse/HIVE-7853

-Original Message-
From: John Zeng [mailto:john.z...@dataguise.com] 
Sent: Friday, August 8, 2014 10:33 AM
To: dev@hive.apache.org
Subject: RE: Key is null in map when OrcNewInputFormat is used as Input Format Class

Any update from anybody?  Should I file a bug?

Thanks

-Original Message-
From: John Zeng [mailto:john.z...@dataguise.com] 
Sent: Wednesday, August 6, 2014 10:17 AM
To: dev@hive.apache.org
Subject: Key is null in map when OrcNewInputFormat is used as Input Format Class

Dear OrcNewInputFormat owner,

When using OrcNewInputFormat as the input format class for my MapReduce job, I 
find that the key passed to my map method is always null, which gives me no way 
to get the row number there. By comparison, RCFileInputFormat (for RC files) 
passes the row number as the key to the map method, so I know which row I am 
processing.

Is there any workaround to get the row number from my map method?  Of course, I 
can count the rows myself, but that has two problems: #1 I have to assume the 
rows arrive in order; #2 I will get duplicate (and wrong) row numbers whenever a 
big input file is broken into multiple file splits, because my map method then 
runs in separate tasks on different data nodes (a sketch of that counting 
workaround is below).  At this point, I am really looking for a better way to 
get the row number of each row processed in the map method.
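
For reference, here is roughly what that per-split counting workaround looks 
like. This is only a sketch: as far as I can tell, OrcNewInputFormat always 
hands the mapper a NullWritable key and an OrcStruct value, and the counter 
below restarts for every split, which is exactly problem #2 above.

import java.io.IOException;

import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class OrcRowCountingMapper
    extends Mapper<NullWritable, OrcStruct, LongWritable, Text> {

  // Counts rows seen by this mapper only; it resets for every split, so it is
  // a row offset within the current split, not a global row number.
  private long rowInSplit = 0;

  @Override
  protected void map(NullWritable key, OrcStruct value, Context context)
      throws IOException, InterruptedException {
    rowInSplit++;
    context.write(new LongWritable(rowInSplit), new Text(value.toString()));
  }
}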

Here is what I have in my map logs:

[2014-08-06 09:39:25 DEBUG com..hadoop.orcfile.OrcFileMap]: Mapper 
Input Key: (null)
[2014-08-06 09:39:25 DEBUG com..hadoop.orcfile.OrcFileMap]: Mapper 
Input Value: {Q8151, T9976, 69976, 8156756, 966798161, 
97898989898, Laura, laura...@gmail.com}

My map method is:

protected void map(Object key, Writable value, Context context)
    throws IOException, InterruptedException {
  logger.debug("Mapper Input Key: " + key);
  logger.debug("Mapper Input Value: " + value.toString());
  ...
}

Thanks

John


ORC file block size

2014-07-25 Thread John Zeng
Hi,  owner of org.apache.hadoop.hive.ql.io.orc.WriterImpl.java:

When writing an ORC file using the following code:

Writer writer = OrcFile.createWriter(new Path("/my_file_path"),
    OrcFile.writerOptions(conf)
        .inspector(inspector)
        .stripeSize(my_stripe_size)
        .bufferSize(my_buffer_size)
        .version(OrcFile.Version.V_0_12));

/** code to prepare tslist **/

for (Timestamp ts : tslist) {
  writer.addRow(ts);
}

writer.close();

I got following error:

org.apache.hadoop.ipc.RemoteException(java.io.IOException): Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): 20 < 1048576
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2215)
   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)
   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:505)
   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)

After debugging into the code, I found this line in WriterImpl.java (line 169):

this.blockSize = Math.min(MAX_BLOCK_SIZE, 2 * stripeSize);

Basically, the block size is set to twice the stripe size.  Since there is no 
guarantee that the stripe size is at least half of the minimum block size 
(1048576 by default, as defined by dfs.namenode.fs-limits.min-block-size), this 
exception is inevitable for small stripe sizes; the 20 in the error above is 
simply twice the tiny stripe size I passed in.

Do we need to change the code so that blockSize is at least the value configured 
by dfs.namenode.fs-limits.min-block-size?  It would also be nice to have a 
separate writer option for blockSize instead of always deriving it as twice the 
stripe size.  A rough sketch of what I have in mind is below.
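
Something along these lines is what I am imagining. This is only a sketch under 
my own assumptions: the MAX_BLOCK_SIZE constant here is a placeholder (the real 
value in WriterImpl may differ), and how it would actually be wired into the 
writer is not shown.

import org.apache.hadoop.conf.Configuration;

class BlockSizeSketch {

  // Placeholder for WriterImpl's MAX_BLOCK_SIZE; the real constant may differ.
  private static final long MAX_BLOCK_SIZE = 256L * 1024 * 1024;

  static long pickBlockSize(Configuration conf, long stripeSize) {
    // Never go below the name node's configured minimum block size
    // (dfs.namenode.fs-limits.min-block-size, 1 MB by default).
    long minBlockSize =
        conf.getLong("dfs.namenode.fs-limits.min-block-size", 1024 * 1024);
    // Current behavior is Math.min(MAX_BLOCK_SIZE, 2 * stripeSize); the added
    // Math.max keeps a tiny stripe size from producing an illegal block size.
    return Math.min(MAX_BLOCK_SIZE, Math.max(minBlockSize, 2 * stripeSize));
  }
}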

Thanks

John


Replacement for 'nextColumnsBatch' method in RCFile.Reader

2014-05-21 Thread John Zeng
Hi, All,

I noticed that ‘nextColumnsBatch’ is marked as deprecated in the RCFile.Reader class.

Which method is meant to replace ‘nextColumnsBatch’, and why was it deprecated?
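
In case it helps to show where I am coming from, here is the row-wise read path 
I am using in the meantime. This is just a sketch, and I am not sure whether 
this next()/getCurrentRow() pattern is the intended replacement for the 
column-batch API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.RCFile;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.io.LongWritable;

public class RcFileRowReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    RCFile.Reader reader =
        new RCFile.Reader(path.getFileSystem(conf), path, conf);
    LongWritable rowId = new LongWritable();
    BytesRefArrayWritable row = new BytesRefArrayWritable();
    try {
      // Read one row at a time instead of fetching a whole column batch.
      while (reader.next(rowId)) {
        reader.getCurrentRow(row);
        System.out.println("row " + rowId.get() + ": " + row.size() + " columns");
      }
    } finally {
      reader.close();
    }
  }
}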

Thanks

John