> I wanted to know why is it necessary to remove the Hive jars from the
> Spark build as mentioned on this

Because SparkSQL was originally based on Hive and still uses the Hive AST
to parse SQL.

The org.apache.spark.sql.hive package contains the parser, which has
hard references to Hive's internal AST, which is unfortunately
auto-generated code (HiveParser.TOK_TABNAME etc).

Every time Hive makes a release, those constants change in value. They
are private API precisely because they carry no backwards-compat
guarantee, and SparkSQL violates that by depending on them directly.
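
To make the failure mode concrete, here is a toy sketch (Python, not
Hive code). The token lists, the numbering scheme and the helper names
are all made up for illustration; the only assumption carried over from
above is that the generated ids shift when the grammar changes.

```python
# Toy model: assume the parser generator numbers tokens in declaration
# order, so adding one token shifts every token declared after it.
def generate_token_ids(token_names, first_id=4):
    return {name: first_id + i for i, name in enumerate(token_names)}

# Hypothetical grammars for two releases (illustrative token lists only).
HIVE_V1_TOKENS = generate_token_ids(
    ["TOK_SELECT", "TOK_TABNAME", "TOK_WHERE"])
HIVE_V2_TOKENS = generate_token_ids(
    ["TOK_SELECT", "TOK_SUBQUERY", "TOK_TABNAME", "TOK_WHERE"])

# A client built against v1 has the v1 value baked in.
COMPILED_TOK_TABNAME = HIVE_V1_TOKENS["TOK_TABNAME"]

def is_table_name(node_type):
    """What the v1-built client believes is a table-name node."""
    return node_type == COMPILED_TOK_TABNAME

# At runtime the v2 parser hands back v2 ids.
real_table_node = HIVE_V2_TOKENS["TOK_TABNAME"]
other_node = HIVE_V2_TOKENS["TOK_SUBQUERY"]

print(is_table_name(real_table_node))  # real table name is not recognised
print(is_table_name(other_node))       # unrelated node is mistaken for one
```

Nothing fails at link time in this scenario; the parse tree is simply
misread, which is why the versions have to match exactly.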

So Hive-on-Spark forces mismatched versions of Hive classes, because it's
a circular dependency of Hive(v1) -> Spark -> Hive(v2) due to the basic
laws of causality.

Spark cannot depend on a version of Hive that is unreleased, and a
Hive-on-Spark release cannot depend on a version of Spark that is
unreleased.
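
The ordering argument can be sketched in a few lines of Python. The
timeline and the version labels ("A", "X", "B") are hypothetical
placeholders, not real release numbers, and the rule "depend on the
newest thing already released" is a simplifying assumption.

```python
# Hypothetical release timeline, oldest first.
RELEASES = [
    ("hive", "A"),   # an older Hive release
    ("spark", "X"),  # Spark, which can only build against released Hive
    ("hive", "B"),   # the Hive-on-Spark release, built against Spark X
]

def newest_released_before(project, index):
    """Newest release of `project` that existed before position `index`."""
    for name, version in reversed(RELEASES[:index]):
        if name == project:
            return version
    return None

def hive_bundled_in_spark_for(hive_version):
    """Which Hive version the Spark used by `hive_version` was built against."""
    hive_idx = RELEASES.index(("hive", hive_version))
    spark = newest_released_before("spark", hive_idx)
    if spark is None:
        return None
    spark_idx = RELEASES.index(("spark", spark))
    return newest_released_before("hive", spark_idx)

bundled = hive_bundled_in_spark_for("B")
print(bundled)  # always older than "B", never "B" itself
```

Whatever the real version numbers are, the Hive classes inside the
Spark build are older than the Hive that is running on top of it, so
they have to be removed from the Spark build.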

Cheers,
Gopal

