Re: How spark and hive integrate in long term?

2014-11-22 Thread Cheng Lian

Hey Zhan,

This is a great question. We are also seeking a stable API/protocol 
that works with multiple Hive versions (esp. 0.12+). SPARK-4114 
https://issues.apache.org/jira/browse/SPARK-4114 was opened for this. 
I did some research into HCatalog recently, but I must confess that I'm 
not an expert on HCatalog; I spent only one day exploring it. So 
please don't hesitate to correct me if any of the conclusions I made 
below are wrong.


First, although the HCatalog API is more pleasant to work with, it is 
unfortunately incomplete: it only provides a subset of the most 
commonly used operations. For example, |HCatCreateTableDesc| maps only a 
subset of |CreateTableDesc|; properties like |storeAsSubDirectories|, 
|skewedColNames| and |skewedColValues| are missing. It is also impossible 
to alter table properties via the HCatalog API (Spark SQL uses this to 
implement the |ANALYZE| command). The |hcat| CLI tool provides the 
features missing from the HCatalog API via the raw Metastore API, and is 
structurally similar to the old Hive CLI.


Second, the HCatalog API itself doesn't ensure compatibility; it's the 
Thrift protocol that matters. HCatalog is built directly upon the raw 
Metastore API, and talks the same Metastore Thrift protocol. The problem 
we encountered in Spark SQL is that we usually deploy Spark SQL Hive 
support with an embedded (for testing) or local mode Metastore, and 
this makes us suffer from things like Metastore database schema changes. 
If the Hive Metastore Thrift protocol is guaranteed to be downward 
compatible, then hopefully we can resort to a remote mode Metastore and 
always depend on the most recent Hive APIs. I had a glance at the Thrift 
protocol version handling code in Hive, and it seems that downward 
compatibility is not an issue. However, I didn't find any official 
documents about Thrift protocol compatibility.
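For reference, the metastore deployment modes discussed above are selected 
through standard Hive configuration; a minimal sketch (the host name, port, 
and database path are placeholders, not values from this thread):

```xml
<!-- hive-site.xml, remote mode: talk Thrift to a standalone metastore
     service, so only the Thrift protocol needs to stay compatible -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>

<!-- local/embedded mode: hive.metastore.uris is left unset and the client
     opens the metastore database directly, which is what exposes it to
     metastore database schema changes -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
```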


That said, in the future we hopefully can depend only on the most recent 
Hive dependencies and remove the Hive shim layer introduced in branch 
1.2. Users who run exactly the same version of Hive as Spark SQL can use 
either a remote or a local/embedded Metastore, while users who want to 
interact with existing legacy Hive clusters have to set up a remote 
Metastore and let the Thrift protocol handle compatibility.


— Cheng

On 11/22/14 6:51 AM, Zhan Zhang wrote:


Spark and Hive integration is now a very nice feature, but I am wondering
what the long-term roadmap is for Spark's integration with Hive. Both
projects are undergoing fast improvement and change. Currently, my
understanding is that the Spark SQL Hive support relies on the Hive
metastore and basic parser to operate, and the Thrift server intercepts
Hive queries and executes them with its own engine.

With every release of Hive, significant effort is needed on the Spark side
to support it.

For the metastore part, we may possibly replace it with HCatalog. But given
the dependency of other parts on Hive, e.g., metastore, thriftserver,
HCatalog may not be able to help much.

Does anyone have any insight or idea in mind?

Thanks.

Zhan Zhang



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/How-spark-and-hive-integrate-in-long-term-tp9482.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How spark and hive integrate in long term?

2014-11-22 Thread Cheng Lian
I should emphasize that this is still a quick and rough conclusion; I will 
investigate it in more detail after the 1.2.0 release. In any case, we 
would really like to make Hive support in Spark SQL as smooth and clean as 
possible for both developers and end users.






Re: How spark and hive integrate in long term?

2014-11-22 Thread Zhan Zhang
Thanks, Cheng, for the insights.

Regarding HCatalog, I did some initial investigation too and agree with 
you; as of now, it does not seem like a good solution. I will try to talk 
to the Hive people to see whether there is such a guarantee of downward 
compatibility for the Thrift protocol. By the way, I tried some basic 
functions using a hive-0.13 client connected to a hive-0.14 metastore, and 
they look compatible.
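The cross-version check described above can be reproduced roughly as 
follows; the metastore host, port, and table name are placeholders, and 
this assumes the hive-0.14 metastore is already running as a remote Thrift 
service:

```shell
# From a hive-0.13 client installation, point at the remote 0.14 metastore
# (hive.metastore.uris is the standard property selecting remote mode)
hive --hiveconf hive.metastore.uris=thrift://metastore-014-host:9083 \
     -e "SHOW TABLES; DESCRIBE some_table;"
```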

Thanks.

Zhan Zhang



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You

Re: How spark and hive integrate in long term?

2014-11-22 Thread Patrick Wendell
There are two distinct topics when it comes to Hive integration. Part
of the 1.3 roadmap will likely be better defining the plan for Hive
integration as Hive releases future versions.

1. Ability to interact with Hive metastores from different versions.
== I.e., if a user has a metastore, can Spark SQL read the data? This
one we need to solve either by asking Hive for a stable metastore Thrift
API, or by adding sufficient features to the HCatalog API so that we can
use it.

2. Compatibility with HQL over time as Hive adds new features.
== This relates to how often we update our internal library
dependency on Hive and/or build support for new Hive features
internally.


How spark and hive integrate in long term?

2014-11-21 Thread Zhan Zhang
Spark and Hive integration is now a very nice feature, but I am wondering
what the long-term roadmap is for Spark's integration with Hive. Both
projects are undergoing fast improvement and change. Currently, my
understanding is that the Spark SQL Hive support relies on the Hive
metastore and basic parser to operate, and the Thrift server intercepts
Hive queries and executes them with its own engine.

With every release of Hive, significant effort is needed on the Spark side
to support it.

For the metastore part, we may possibly replace it with HCatalog. But given
the dependency of other parts on Hive, e.g., metastore, thriftserver,
HCatalog may not be able to help much.

Does anyone have any insight or idea in mind?

Thanks.

Zhan Zhang






Re: How spark and hive integrate in long term?

2014-11-21 Thread Zhan Zhang
Thanks, Dean, for the information.

Hive-on-Spark is nice, but Spark SQL has the advantage of taking full 
advantage of Spark, allowing users to manipulate tables as RDDs through 
native Spark support.

When I tried to upgrade the current hive-0.13.1 support to hive-0.14.0, I 
found the Hive parser is no longer compatible. In the meantime, the new 
features introduced in hive-0.14.0, e.g., ACID, are not there yet. 
Meanwhile, spark-0.12 also has some nice features added that are supported 
by the Thrift server too, e.g., hive-0.13 support, table caching, etc.

Given that both have more and more features being added, it would be great 
if users could take advantage of both. Currently, Spark SQL gives us such 
benefits partially, but I am wondering how to maintain such integration in 
the long term.
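To make the "manipulate the table as RDD" point concrete, here is a minimal 
Spark 1.x Scala sketch; the table name is a placeholder, and it assumes a 
Spark build with Hive support and an accessible metastore:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-rdd-sketch"))
val hiveContext = new HiveContext(sc)

// Query a Hive table through Spark SQL; in the 1.2 era the result is a
// SchemaRDD, which is itself an RDD[Row], so native RDD operations apply
val rows = hiveContext.sql("SELECT key, value FROM some_hive_table")
val keys = rows.map(row => row.getInt(0))

// Table caching, one of the features also exposed through the Thrift server
hiveContext.cacheTable("some_hive_table")
```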

Thanks.

Zhan Zhang

On Nov 21, 2014, at 3:12 PM, Dean Wampler deanwamp...@gmail.com wrote:

 I can't comment on plans for Spark SQL's support for Hive, but several
 companies are porting Hive itself onto Spark:
 
 http://blog.cloudera.com/blog/2014/11/apache-hive-on-apache-spark-the-first-demo/
 
 I'm not sure if they are leveraging the old Shark code base or not, but it
 appears to be a fresh effort.
 
 dean
 
 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com
 






Re: How spark and hive integrate in long term?

2014-11-21 Thread Ted Yu
bq. spark-0.12 also has some nice feature added

Minor correction: you meant Spark 1.2.0 I guess

Cheers
