Just to follow up… 

This appears to be a bug in the Hive version of the code… fixed in the ORC 
library proper. NOTE: there are two different libraries. 

As to why the write fails: Hive's OrcSerdeRow is just a shim that hands the 
raw row and its ObjectInspector through to the ORC record writer. Its write() 
and readFields() methods simply throw, so the row cannot be serialized across 
the shuffle, and the map output collector's exact-class check on the value 
rejects it even before that, which is why the "Type mismatch" message looks 
so odd.
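
If you can move to the standalone ORC library (org.apache.orc:orc-mapreduce), 
its row types are designed to cross the shuffle wrapped in OrcKey/OrcValue. A 
rough driver sketch, assuming orc-mapreduce 1.1+; the schema and job wiring 
here are illustrative, not taken from the original post:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.orc.mapred.OrcStruct;
    import org.apache.orc.mapred.OrcValue;
    import org.apache.orc.mapreduce.OrcOutputFormat;

    public class OrcShuffleDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Schema of the rows wrapped in OrcValue during the shuffle, and of
        // the final output file (both hypothetical).
        conf.set("orc.mapred.map.output.value.schema", "struct<name:string,age:int>");
        conf.set("orc.mapred.output.schema", "struct<name:string,age:int>");

        Job job = Job.getInstance(conf, "orc shuffle sketch");
        job.setMapOutputKeyClass(Text.class);
        // OrcValue, unlike Hive's OrcSerdeRow, has working write()/readFields().
        job.setMapOutputValueClass(OrcValue.class);
        job.setNumReduceTasks(1);              // one reducer -> one output file
        job.setOutputFormatClass(OrcOutputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(OrcStruct.class);
        // Mapper/Reducer classes, input format, and paths omitted for brevity.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

In the mapper you build an OrcStruct, wrap it in an OrcValue, and 
context.write() it; the reducer unwraps the structs and writes them out.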

Documentation is a bit lax… but in terms of design, it's better to do the ORC 
build completely in the reducer: the mapper code stays cleaner, and nothing 
ORC-specific ever has to cross the shuffle.
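
Concretely, the reduce-side build looks something like the sketch below. The 
Row class, the field names, and the tab-separated intermediate encoding are 
made up for illustration; the OrcSerde/ObjectInspector pattern is the same 
one used in the blog post linked in the quoted mail:

    import java.io.IOException;

    import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class OrcBuildingReducer extends Reducer<Text, Text, NullWritable, Writable> {

      // Hypothetical row layout; reflection on this class gives the inspector.
      public static class Row {
        public String name;
        public int age;
        Row(String name, int age) { this.name = name; this.age = age; }
      }

      private final OrcSerde serde = new OrcSerde();
      private final ObjectInspector inspector =
          ObjectInspectorFactory.getReflectionObjectInspector(
              Row.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text value : values) {
          // The mapper emitted plain tab-separated Text, so nothing
          // ORC-specific ever crossed the shuffle; the OrcSerdeRow is built
          // here and handed straight to the ORC record writer without being
          // re-serialized.
          String[] fields = value.toString().split("\t");
          Row row = new Row(fields[0], Integer.parseInt(fields[1]));
          context.write(NullWritable.get(), serde.serialize(row, inspector));
        }
      }
    }

In the driver, job.setOutputFormatClass(OrcNewOutputFormat.class) plus 
job.setNumReduceTasks(1) gives the single ORC file, and the map output 
key/value classes stay ordinary Writables such as Text.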


> On Oct 19, 2016, at 11:00 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
> 
> Hi, 
> Since I am not on the ORC mailing list, and since the ORC Java code is in 
> the Hive APIs, this seems like a good place to start. ;-)
> 
> 
> So… 
> 
> Ran into a little problem… 
> 
> One of my developers was writing a map/reduce job to read records from a 
> source and, after some filtering, write the result set to an ORC file. 
> There’s an example of how to do this at:
> http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html
> 
> So far, so good. 
> But now here’s the problem: the source data is large, which means many 
> mappers, and after the filter the output is only a small fraction of that 
> size. So we want everything to go to a single (identity) reducer, so that 
> we get only a single file. 
> 
> Here’s the snag. 
> 
> We were using the OrcSerde class to serialize the data and generate an ORC 
> row, which we then wrote to the file. 
> 
> Looking at the source code for OrcSerde, OrcSerde.serialize() returns an 
> OrcSerdeRow.
> see: 
> http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java
> 
> OrcSerdeRow implements Writable, and as the example code shows, in a 
> map-only job context.write(Text, Writable) works. 
> 
> However, if we try to turn this into a full map/reduce job, we hit a 
> runtime problem: context.write() throws the following exception:
> "Error: java.io.IOException: Type mismatch in value from map: expected 
> org.apache.hadoop.io.Writable, received 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow"
> 
> 
> The goal was to reduce the ORC rows and then write them out in the reducer. 
> 
> I’m curious as to why context.write() fails. 
> The error is a bit cryptic: since OrcSerdeRow implements Writable, the 
> message doesn’t seem to make sense. 
> 
> 
> Now the quick fix is to borrow ArrayListWritable from Giraph, pack the 
> fields into an ArrayListWritable, and pass that to the reducer, which then 
> uses it to generate the ORC file. 
> 
> I’m trying to figure out why context.write() fails when sending to the 
> reducer, while it works as a map-side write.
> 
> The documentation on the ORC site is… well… to be polite… lacking. ;-) 
> 
> I have some ideas about why it doesn’t work; however, I would like to 
> confirm my suspicions. 
> 
> Thx
> 
> -Mike
> 
> 

