Hi all,

I am sending this email to both the user and dev lists, since it is
applicable to both.

I am currently working on the Spark XML datasource (
https://github.com/databricks/spark-xml).
It uses an InputFormat implementation, which I downgraded to the Hadoop 1.x
API for version compatibility.

However, I found that the internal JSON datasource and others in Databricks
use the Hadoop 2.x API, dealing with TaskAttemptContextImpl by looking up
the relevant method via reflection, because TaskAttemptContext is a class in
Hadoop 1.x but an interface in Hadoop 2.x.
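For context, the pattern in question is roughly the following: instead of referencing the implementation class directly (which would not compile against the other Hadoop major version), the class is looked up and constructed by name at runtime. This is only a minimal, self-contained sketch of that reflection trick; java.lang.StringBuilder stands in for TaskAttemptContextImpl so the sketch runs without Hadoop on the classpath, and the names here are illustrative, not the actual Spark code.

```scala
object ReflectionSketch {
  // Look up a class by name and invoke its single-String constructor
  // reflectively, so the caller never references the class at compile time.
  def newInstanceByName(className: String, arg: String): AnyRef = {
    val clazz = Class.forName(className)
    val ctor = clazz.getConstructor(classOf[String])
    ctor.newInstance(arg).asInstanceOf[AnyRef]
  }

  def main(args: Array[String]): Unit = {
    // "java.lang.StringBuilder" stands in for the Hadoop 2.x
    // TaskAttemptContextImpl class name used in the real code.
    val sb = newInstanceByName("java.lang.StringBuilder", "hadoop")
    println(sb.toString) // prints "hadoop"
  }
}
```

The point of the trick is that the same compiled artifact can run against either Hadoop major version, at the cost of reflective lookups and the loss of compile-time type checking.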

So, I looked through the code for advantages of the Hadoop 2.x API, but I
could not find any.
I wonder whether there are real advantages to using the Hadoop 2.x API here.

I understand that using the Hadoop 2.x APIs is still preferable, at least to
accommodate future differences, but I feel it may not be worth resorting to
reflection just to stay on Hadoop 2.x.

I would appreciate it if you could leave a comment at
https://github.com/databricks/spark-xml/pull/14, or reply to this email if
there is a good explanation.

Thanks!
