I don't think there is a performance difference between the 1.x API and the 2.x API.
But it's not a big issue for your change: only com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java (https://github.com/databricks/spark-xml/blob/master/src/main/java/com/databricks/hadoop/mapreduce/lib/input/XmlInputFormat.java) needs to change, right? Porting it to the 2.x API is not a big change. If you agree, I can do it, but I cannot promise to finish within one or two weeks because of my day job.

> On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi all,
>
> I am writing this email to both the user group and the dev group since it is applicable to both.
>
> I am now working on the Spark XML datasource (https://github.com/databricks/spark-xml).
> It uses an InputFormat implementation which I downgraded to the Hadoop 1.x API for version compatibility.
>
> However, I found that the internal JSON datasource and others in Databricks use the Hadoop 2.x API, obtaining TaskAttemptContextImpl by reflection, because TaskAttemptContext is a class in Hadoop 1.x and an interface in Hadoop 2.x.
>
> So, I looked through the code for some advantages of the Hadoop 2.x API, but I couldn't find any.
> I wonder if there are advantages to using the Hadoop 2.x API.
>
> I understand that it is still preferable to use the Hadoop 2.x APIs, at least for future compatibility, but somehow I feel it may not be worth using Hadoop 2.x if that requires reflecting a method.
>
> I would appreciate it if you could leave a comment at
> https://github.com/databricks/spark-xml/pull/14
> as well as send back a reply if there is a good explanation.
>
> Thanks!
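For reference, the reflection workaround mentioned above (needed because TaskAttemptContext is a concrete class in Hadoop 1.x but an interface, implemented by TaskAttemptContextImpl, in Hadoop 2.x) can be sketched as a "try the 2.x class name, fall back to the 1.x one" lookup. This is only an illustrative sketch, not code from Spark or spark-xml: the helper name is made up, and java.util stand-in class names are used so it runs without Hadoop on the classpath. Against Hadoop you would try "org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl" first and fall back to "org.apache.hadoop.mapreduce.TaskAttemptContext", then invoke the (Configuration, TaskAttemptID) constructor reflectively.

```java
import java.lang.reflect.Constructor;

public class ReflectCompat {
    // Return whichever class is available: the preferred (Hadoop 2.x-style)
    // name if it is on the classpath, otherwise the fallback (1.x-style) name.
    static Class<?> loadFirstAvailable(String preferred, String fallback)
            throws ClassNotFoundException {
        try {
            return Class.forName(preferred);
        } catch (ClassNotFoundException e) {
            return Class.forName(fallback);
        }
    }

    public static void main(String[] args) throws Exception {
        // "com.example.DoesNotExist" simulates the 2.x impl class being
        // absent on a Hadoop 1.x classpath; java.util.ArrayList stands in
        // for the 1.x concrete class.
        Class<?> clazz = loadFirstAvailable(
                "com.example.DoesNotExist", "java.util.ArrayList");
        // Once the class is resolved, instantiate it reflectively, as the
        // datasources do with the (Configuration, TaskAttemptID) constructor.
        Constructor<?> ctor = clazz.getConstructor();
        Object instance = ctor.newInstance();
        System.out.println(clazz.getName());
    }
}
```

The point of the pattern is that the same compiled artifact works against either Hadoop version; the cost is the small amount of reflection boilerplate, which is why sticking to one API where possible is simpler.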