Re: Hive From Spark: Jdbc VS sparkContext

2017-11-22 Thread Nicolas Paris
Hey

I have finally improved the spark-hive SQL performance a lot.

I had a problem with a topology_script.py that produced huge error traces in
the logs and degraded spark performance in python mode; I simply ported the
python2 scripts to be python3 ready.
I also had a problem with broadcast variables while joining tables, so I just
deactivated that functionality.
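For reference, a minimal sketch of how broadcast joins can be disabled globally
in Spark SQL (assuming a SparkSession named spark; the actual change here may have
been done differently):

// Setting the auto-broadcast threshold to -1 turns off size-based broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// The same can be passed at submit time: --conf spark.sql.autoBroadcastJoinThreshold=-1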

As a result, our users are now able to use spark-hive with very limited
resources (2 executors with 4 cores) and get decent performance for
analytics.

Compared to JDBC Presto, this has several advantages:
- an integrated solution
- a single security layer (hive/kerberos)
- direct, partitioned, lazy datasets versus complicated jdbc dataset management
- more robust for analytics with less memory (apparently)

However, Presto still makes sense for sub-second analytics, OLTP-like
queries and data discovery.

On Nov 5, 2017 at 13:57, Nicolas Paris wrote:
> Hi
> 
> After some testing, I have been quite disappointed with the hiveContext way of
> accessing hive tables.
> 
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers. This does not allow them to query huge
> datasets, because too little memory is allocated (or maybe I am missing
> something).
> 
> If using Hive jdbc, Hive resources are shared by all my users and then the
> queries are able to finish.
> 
> I have then been testing other jdbc-based approaches and for now, "presto"
> looks like the most appropriate solution to access hive tables.
> 
> In order to load huge datasets into spark, the proposed approach is to
> use a distributed Presto CTAS to build an ORC dataset, and access that
> dataset through spark's dataframe loader ability (instead of direct jdbc
> access, which would break the driver memory).
> 
> 
> 
> On Oct 15, 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> > 
> > without the hive thrift server, if you try to run a select * on a table 
> > which
> > has around 10,000 partitions, SPARK will give you some surprises. PRESTO 
> > works
> > fine in these scenarios, and I am sure SPARK community will soon learn from
> > their algorithms.
> > 
> > 
> > Regards,
> > Gourav
> > 
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris  wrote:
> > 
> > > I do not think that SPARK will automatically determine the partitions.
> > Actually
> > > it does not automatically determine the partitions. In case a table 
> > has a
> > few
> > > million records, it all goes through the driver.
> > 
> > Hi Gourav
> > 
> > Actually the spark jdbc driver is able to deal directly with partitions.
> > Spark creates a jdbc connection for each partition.
> > 
> > All details explained in this post :
> > http://www.gatorsmile.io/numpartitionsinjdbc/
> > 
> > Also an example with greenplum database:
> > http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
> > 
> > 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Yes, my thought exactly. Kindly let me know if you need any help porting it to
pyspark.

On Mon, Nov 6, 2017 at 8:54 AM, Nicolas Paris  wrote:

> On Nov 5, 2017 at 22:46, ayan guha wrote:
> > Thank you for the clarification. That was my understanding too. However,
> > how does one provide the upper bound, as it changes for every call in real
> > life? For example, it is not required for sqoop.
>
> True.  AFAIK sqoop begins by doing a
> "select min(column_split), max(column_split)
> from () as query;"
> and then splits the result.
>
> I was thinking of doing the same with a wrapper around spark jdbc that would
> infer the number of partitions and the upper/lower bounds itself.
>
>


-- 
Best Regards,
Ayan Guha


Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 22:46, ayan guha wrote:
> Thank you for the clarification. That was my understanding too. However, how does one
> provide the upper bound, as it changes for every call in real life? For example,
> it is not required for sqoop.

True.  AFAIK sqoop begins by doing a
"select min(column_split), max(column_split)
from () as query;"
and then splits the result.

I was thinking of doing the same with a wrapper around spark jdbc that would
infer the number of partitions and the upper/lower bounds itself.
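A rough sketch of what such a wrapper could look like (the function name and option
values are illustrative; it assumes a numeric split column and omits driver/credential
options):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: fetch min/max of the split column first, then hand them to
// the partitioned jdbc reader so that each partition opens its own connection.
def readWithInferredBounds(spark: SparkSession, url: String, table: String,
                           splitCol: String, numPartitions: Int): DataFrame = {
  val bounds = spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", s"(select min($splitCol) as lo, max($splitCol) as hi from $table) b")
    .load()
    .first()
  spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", table)
    .option("partitionColumn", splitCol)
    .option("lowerBound", bounds.getAs[Any]("lo").toString)
    .option("upperBound", bounds.getAs[Any]("hi").toString)
    .option("numPartitions", numPartitions.toString)
    .load()
}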


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Thank you for the clarification. That was my understanding too. However, how
does one provide the upper bound, as it changes for every call in real life? For
example, it is not required for sqoop.


On Mon, 6 Nov 2017 at 8:20 am, Nicolas Paris  wrote:

> On Nov 5, 2017 at 22:02, ayan guha wrote:
> > Can you confirm if JDBC DF Reader actually loads all data from source to
> > driver memory and then distributes to the executors?
>
> Apparently yes, when no partition column is used.
>
>
> > And this is true even when a
> > partition column is provided?
>
> No; in this case each worker issues its own jdbc call, as described in the
> documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
>
> --
Best Regards,
Ayan Guha


Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 22:02, ayan guha wrote:
> Can you confirm if JDBC DF Reader actually loads all data from source to
> driver memory and then distributes to the executors?

Apparently yes, when no partition column is used.


> And this is true even when a
> partition column is provided?

No; in this case each worker issues its own jdbc call, as described in the
documentation:
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
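For illustration, a minimal partitioned read using the DataFrameReader.jdbc overload
that takes explicit bounds (url, table, column and bounds are placeholders):

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val props = new Properties()          // user/password/driver would go here

// Spark splits [lowerBound, upperBound] of the numeric column "id" into
// numPartitions ranges and issues one jdbc query per partition on the executors.
val df = spark.read.jdbc(
  "jdbc:postgresql://dbhost/mydb",    // placeholder url
  "people",                           // table
  "id",                               // partition column
  0L,                                 // lower bound
  1000000L,                           // upper bound
  8,                                  // number of partitions
  props)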


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Hi

Can you confirm if JDBC DF Reader actually loads all data from source to
driver memory and then distributes to the executors? And this is true even
when a partition column is provided?

Best
Ayan

On Mon, Nov 6, 2017 at 3:00 AM, David Hodeffi <
david.hode...@niceactimize.com> wrote:

> Testing Spark group e-mail
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


RE: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread David Hodeffi
Testing Spark group e-mail



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 14:11, Gourav Sengupta wrote:
> thanks a ton for your kind response. Have you used SPARK Session ? I think 
> that
> hiveContext is a very old way of solving things in SPARK, and since then new
> algorithms have been introduced in SPARK. 

I will give sparkSession a try.

> It will be a lot of help, given how kind you have been by sharing your
> experience, if you could kindly share your code as well and provide details
> like SPARK , HADOOP, HIVE, and other environment version and details.

I am testing an HDP 2.6 distribution, with:
SPARK: 2.1.1
HADOOP: 2.7.3
HIVE: 1.2.1000
PRESTO: 1.87

> After all, no one wants to use SPARK 1.x version to solve problems anymore,
> though I have seen couple of companies who are stuck with these versions as
> they are using in house deployments which they cannot upgrade because of
> incompatibility issues.

I didn't know hiveContext was the legacy spark way. I will give sparkSession a
try and draw conclusions. After all, I would prefer to provide our users a
single, uniform framework such as spark, instead of multiple complicated
layers such as spark + whatever jdbc access.
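For what it's worth, a minimal sketch of the SparkSession equivalent (Spark 2.x;
the database/table names are placeholders):

import org.apache.spark.sql.SparkSession

// enableHiveSupport() wires the session to the Hive metastore
// (hive-site.xml on the classpath), replacing the old hiveContext.
val spark = SparkSession.builder()
  .appName("hive-from-spark")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT count(1) FROM mydb.people").show()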

> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris  wrote:
> 
> Hi
> 
> After some testing, I have been quite disappointed with the hiveContext way of
> accessing hive tables.
> 
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers. This does not allow them to query huge
> datasets, because too little memory is allocated (or maybe I am missing
> something).
> 
> If using Hive jdbc, Hive resources are shared by all my users and then the
> queries are able to finish.
> 
> I have then been testing other jdbc-based approaches and for now, "presto"
> looks like the most appropriate solution to access hive tables.
> 
> In order to load huge datasets into spark, the proposed approach is to
> use a distributed Presto CTAS to build an ORC dataset, and access that
> dataset through spark's dataframe loader ability (instead of direct jdbc
> access, which would break the driver memory).
> 
> 
> 
> On Oct 15, 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> >
> > without the hive thrift server, if you try to run a select * on a table
> which
> > has around 10,000 partitions, SPARK will give you some surprises. PRESTO
> works
> > fine in these scenarios, and I am sure SPARK community will soon learn
> from
> > their algorithms.
> >
> >
> > Regards,
> > Gourav
> >
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris 
> wrote:
> >
> >     > I do not think that SPARK will automatically determine the
> partitions.
> >     Actually
> >     > it does not automatically determine the partitions. In case a 
> table
> has a
> >     few
> >     > million records, it all goes through the driver.
> >
> >     Hi Gourav
> >
> >     Actually the spark jdbc driver is able to deal directly with partitions.
> >     Spark creates a jdbc connection for each partition.
> >
> >     All details explained in this post :
> >     http://www.gatorsmile.io/numpartitionsinjdbc/
> >
> >     Also an example with greenplum database:
> >     http://engineering.pivotal.io/post/getting-started-with-
> greenplum-spark/
> >
> >
> 
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Gourav Sengupta
Hi Nicolas,


thanks a ton for your kind response. Have you used SPARK Session ? I think
that hiveContext is a very old way of solving things in SPARK, and since
then new algorithms have been introduced in SPARK.

It will be a lot of help, given how kind you have been by sharing your
experience, if you could kindly share your code as well and provide details
like SPARK , HADOOP, HIVE, and other environment version and details.

After all, no one wants to use SPARK 1.x version to solve problems anymore,
though I have seen couple of companies who are stuck with these versions as
they are using in house deployments which they cannot upgrade because of
incompatibility issues.


Regards,
Gourav Sengupta


On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris  wrote:

> Hi
>
> After some testing, I have been quite disappointed with the hiveContext way of
> accessing hive tables.
> 
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers. This does not allow them to query huge
> datasets, because too little memory is allocated (or maybe I am missing
> something).
> 
> If using Hive jdbc, Hive resources are shared by all my users and then the
> queries are able to finish.
> 
> I have then been testing other jdbc-based approaches and for now, "presto"
> looks like the most appropriate solution to access hive tables.
> 
> In order to load huge datasets into spark, the proposed approach is to
> use a distributed Presto CTAS to build an ORC dataset, and access that
> dataset through spark's dataframe loader ability (instead of direct jdbc
> access, which would break the driver memory).
>
>
>
> On Oct 15, 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> >
> > without the hive thrift server, if you try to run a select * on a table
> which
> > has around 10,000 partitions, SPARK will give you some surprises. PRESTO
> works
> > fine in these scenarios, and I am sure SPARK community will soon learn
> from
> > their algorithms.
> >
> >
> > Regards,
> > Gourav
> >
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris 
> wrote:
> >
> > > I do not think that SPARK will automatically determine the
> partitions.
> > Actually
> > > it does not automatically determine the partitions. In case a
> table has a
> > few
> > > million records, it all goes through the driver.
> >
> > Hi Gourav
> >
> > Actually the spark jdbc driver is able to deal directly with partitions.
> > Spark creates a jdbc connection for each partition.
> >
> > All details explained in this post :
> > http://www.gatorsmile.io/numpartitionsinjdbc/
> >
> > Also an example with greenplum database:
> > http://engineering.pivotal.io/post/getting-started-with-
> greenplum-spark/
> >
> >
>


Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
Hi

After some testing, I have been quite disappointed with the hiveContext way of
accessing hive tables.

The main problem is resource allocation: I have tons of users and they
get a limited subset of workers. This does not allow them to query huge
datasets, because too little memory is allocated (or maybe I am missing
something).

If using Hive jdbc, Hive resources are shared by all my users and then the
queries are able to finish.

I have then been testing other jdbc-based approaches and for now, "presto"
looks like the most appropriate solution to access hive tables.

In order to load huge datasets into spark, the proposed approach is to
use a distributed Presto CTAS to build an ORC dataset, and access that
dataset through spark's dataframe loader ability (instead of direct jdbc
access, which would break the driver memory).
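Concretely, a sketch of that flow (catalog, table names and paths are illustrative,
and it assumes a SparkSession named spark):

// Step 1 (run in Presto, outside spark): a distributed CTAS that materialises
// the aggregate as an ORC table on HDFS, for example:
//   CREATE TABLE hive.tmp.people_agg WITH (format = 'ORC') AS
//   SELECT dept, count(*) AS n FROM hive.prod.people GROUP BY dept
//
// Step 2 (in spark): load the resulting ORC files directly, no jdbc involved.
val peopleAgg = spark.read.format("orc")
  .load("hdfs://cluster/apps/hive/warehouse/tmp.db/people_agg")  // illustrative path
peopleAgg.createOrReplaceTempView("people_agg")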



On Oct 15, 2017 at 19:24, Gourav Sengupta wrote:
> Hi Nicolas,
> 
> without the hive thrift server, if you try to run a select * on a table which
> has around 10,000 partitions, SPARK will give you some surprises. PRESTO works
> fine in these scenarios, and I am sure SPARK community will soon learn from
> their algorithms.
> 
> 
> Regards,
> Gourav
> 
> On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris  wrote:
> 
> > I do not think that SPARK will automatically determine the partitions.
> Actually
> > it does not automatically determine the partitions. In case a table has 
> a
> few
> > million records, it all goes through the driver.
> 
> Hi Gourav
> 
> Actually the spark jdbc driver is able to deal directly with partitions.
> Spark creates a jdbc connection for each partition.
> 
> All details explained in this post :
> http://www.gatorsmile.io/numpartitionsinjdbc/
> 
> Also an example with greenplum database:
> http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
> 
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Gourav Sengupta
Hi Nicolas,

without the hive thrift server, if you try to run a select * on a table
which has around 10,000 partitions, SPARK will give you some surprises.
PRESTO works fine in these scenarios, and I am sure SPARK community will
soon learn from their algorithms.


Regards,
Gourav

On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris  wrote:

> > I do not think that SPARK will automatically determine the partitions.
> Actually
> > it does not automatically determine the partitions. In case a table has
> a few
> > million records, it all goes through the driver.
>
> Hi Gourav
>
> Actually the spark jdbc driver is able to deal directly with partitions.
> Spark creates a jdbc connection for each partition.
>
> All details explained in this post :
> http://www.gatorsmile.io/numpartitionsinjdbc/
>
> Also an example with greenplum database:
> http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
>


Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
> I do not think that SPARK will automatically determine the partitions. 
> Actually
> it does not automatically determine the partitions. In case a table has a few
> million records, it all goes through the driver.

Hi Gourav

Actually the spark jdbc driver is able to deal directly with partitions.
Spark creates a jdbc connection for each partition.

All the details are explained in this post:
http://www.gatorsmile.io/numpartitionsinjdbc/

Also an example with greenplum database:
http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
Hi Gourav

> what if the table has partitions and sub-partitions? 

Well, this also works with multiple ORC files having the same schema:
val people = sqlContext.read.format("orc").load("hdfs://cluster/people*")
Am I missing something?

> And you do not want to access the entire data?

This works for static datasets; when new data comes in via batch
processes, the spark application has to be reloaded to pick up the new files
in the folder.
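To address the partition/sub-partition case without reading everything, a sketch
(paths and partition column names are illustrative) that relies on partition
discovery and pruning:

// If the ORC files follow a Hive-style layout (.../people/year=2017/month=10/...),
// the reader discovers year/month as partition columns, and a filter on them only
// touches the matching directories instead of the whole dataset.
val people = sqlContext.read
  .option("basePath", "hdfs://cluster/people")  // keep partition columns in the schema
  .format("orc")
  .load("hdfs://cluster/people")
people.where("year = 2017 AND month = 10").count()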


>> On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris  wrote:
> 
> On Oct 3, 2017 at 20:08, Nicolas Paris wrote:
> > I wonder the differences accessing HIVE tables in two different ways:
> > - with jdbc access
> > - with sparkContext
> 
> Well there is also a third way to access the hive data from spark:
> - with direct file access (here ORC format)
> 
> 
> For example:
> 
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val people = sqlContext.read.format("orc").load("hdfs://cluster//orc_
> people")
> people.createOrReplaceTempView("people")
> sqlContext.sql("SELECT count(1) FROM people WHERE ...").show()
> 
> 
> This method looks much faster than both:
> - with jdbc access
> - with sparkContext
> 
> Any experience on that ?
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Gourav Sengupta
Hi Nicolas,

what if the table has partitions and sub-partitions? And you do not want to
access the entire data?


Regards,
Gourav

On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris  wrote:

> On Oct 3, 2017 at 20:08, Nicolas Paris wrote:
> > I wonder the differences accessing HIVE tables in two different ways:
> > - with jdbc access
> > - with sparkContext
>
> Well there is also a third way to access the hive data from spark:
> - with direct file access (here ORC format)
>
>
> For example:
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
> val people = sqlContext.read.format("orc").load("hdfs://cluster//orc_
> people")
> people.createOrReplaceTempView("people")
> sqlContext.sql("SELECT count(1) FROM people WHERE ...").show()
>
>
> This method looks much faster than both:
> - with jdbc access
> - with sparkContext
>
> Any experience on that ?
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
On Oct 3, 2017 at 20:08, Nicolas Paris wrote:
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext

Well there is also a third way to access the hive data from spark:
- with direct file access (here ORC format)


For example:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val people = sqlContext.read.format("orc").load("hdfs://cluster//orc_people")
people.createOrReplaceTempView("people")
sqlContext.sql("SELECT count(1) FROM people WHERE ...").show()


This method looks much faster than both:
- with jdbc access
- with sparkContext

Any experience on that ?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-13 Thread Kabeer Ahmed
My take on this might sound a bit different. Here are a few points to consider
below:

1. Going through  Hive JDBC means that the application is restricted by the # 
of queries that can be compiled. HS2 can only compile one SQL at a time and if 
users have bad SQL, it can take a long time just to compile (not map reduce). 
This will reduce the query throughput i.e. # of queries you can fire through 
the JDBC.

2. Going through Hive JDBC does have an advantage that HMS service is 
protected. The JIRA: https://issues.apache.org/jira/browse/HIVE-13884 does 
protect HMS from crashing - because at the end of the day retrieving metadata 
about a Hive table that may have millions or simply put 1000s of partitions 
hits jvm limit on the array size that it can hold for the metadata retrieved. 
JVM array size limit is hit and there is a crash on HMS. So in effect this is 
good to have to protect HMS & the relational database on its back end.

Note: the Hive community does propose moving the database to HBase, which scales, but
I don't think this will get implemented any time soon.

3. Going through the SparkContext, Spark directly interfaces with the Hive
MetaStore. I have tried to put the sequence of the code flow below. The bit I didn't
have time to dive into is that I believe if the table is really large, i.e. say the
table has more than 32K partitions (the size of a short), then some sort of
slicing does occur (I didn't have time to track down this piece of code, but from
experience this does seem to happen).

Code flow:
Spark uses Hive External catalog - goo.gl/7CZcDw
HiveClient version of getPartitions is -> goo.gl/ZAEsqQ
HiveClientImpl of getPartitions is: -> goo.gl/msPrr5
The Hive call is made at: -> goo.gl/TB4NFU
ThriftHiveMetastore.java ->  get_partitions_ps_with_auth

A -1 value is sent from Spark all the way through to the Hive Metastore thrift call.
So in effect, for large tables, 32K partitions are retrieved at a time. This has also
led to a few HMS crashes, but I have yet to confirm whether this is really the
cause.
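As an aside, a small sketch of keeping that partition retrieval bounded from the
Spark side (the config key is a real Spark SQL setting; the table, filter and the
SparkSession named spark are illustrative):

// Push the partition predicate into the metastore call itself, so only the
// matching partitions are fetched instead of the full partition list.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
spark.sql("SELECT count(1) FROM logs WHERE day = '2017-10-13'").show()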


Based on the 3 points above, I would prefer to use the SparkContext. If the cause
of the crashes is indeed the retrieval of a high # of partitions, then I may opt for
the JDBC route.

Thanks
Kabeer.


On Fri, 13 Oct 2017 09:22:37 +0200, Nicolas Paris wrote:
>> In case a table has a few
>> million records, it all goes through the driver.
>
> This seems clear in JDBC mode: the driver gets all the rows and then
> spreads the RDD over the executors.
>
> I'd say that most use cases use SQL to aggregate huge datasets
> and retrieve a small number of rows to be then transformed for ML tasks.
> Using JDBC then offers the robustness of HIVE to produce a small aggregated
> dataset in spark, while SPARK SQL uses RDDs to produce the small dataset from
> the huge one.
>
> It is not very clear how SPARK SQL deals with a huge HIVE table. Does it load
> everything into memory and crash, or does this never happen?
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


--
Sent using Dekko from my Ubuntu device

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-13 Thread Nicolas Paris
> In case a table has a few
> million records, it all goes through the driver.

This seems clear in JDBC mode: the driver gets all the rows and then
spreads the RDD over the executors.

I'd say that most use cases use SQL to aggregate huge datasets
and retrieve a small number of rows to be then transformed for ML tasks.
Using JDBC then offers the robustness of HIVE to produce a small aggregated
dataset in spark, while SPARK SQL uses RDDs to produce the small dataset
from the huge one.

It is not very clear how SPARK SQL deals with a huge HIVE table. Does it load
everything into memory and crash, or does this never happen?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread Gourav Sengupta
Hi,

I do not think that SPARK will automatically determine the partitions.
Actually it does not automatically determine the partitions. In case a
table has a few million records, it all goes through the driver.


Of course, I have only tried JDBC connections in AURORA, Oracle and Postgres.

Regards,
Gourav Sengupta

On Tue, Oct 10, 2017 at 10:14 PM, weand  wrote:

> Is Hive from Spark via JDBC working for you? In case it does, I would be
> interested in your setup :-)
>
> We can't get this working. See bug here, especially my last comment:
> https://issues.apache.org/jira/browse/SPARK-21063
>
> Regards
> Andreas
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


RE: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread Walia, Reema
I am able to connect to Spark via JDBC - tested with Squirrel. I am referencing 
all the jars of current Spark distribution under 
/usr/hdp/current/spark2-client/jars/*

Thanks,
Reema


-Original Message-
From: weand [mailto:andreas.we...@gmail.com] 
Sent: Tuesday, October 10, 2017 5:14 PM
To: user@spark.apache.org
Subject: Re: Hive From Spark: Jdbc VS sparkContext

  [ External Email ]

Is Hive from Spark via JDBC working for you? In case it does, I would be 
interested in your setup :-)

We can't get this working. See bug here, especially my last comment:
https://issues.apache.org/jira/browse/SPARK-21063

Regards
Andreas



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

_

This message is for the designated recipient only and may contain privileged, 
proprietary
or otherwise private information. If you have received it in error, please 
notify the sender
immediately and delete the original. Any other use of the email by you is 
prohibited.

Dansk - Deutsch - Espanol - Francais - Italiano - Japanese - Nederlands - Norsk 
- Portuguese - Chinese
Svenska: 
http://www.cardinalhealth.com/en/support/terms-and-conditions-english.html


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread weand
Is Hive from Spark via JDBC working for you? In case it does, I would be
interested in your setup :-)

We can't get this working. See bug here, especially my last comment:
https://issues.apache.org/jira/browse/SPARK-21063

Regards
Andreas



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread ayan guha
That is not correct, IMHO. If I am not wrong, Spark will still load the data in
the executors, running some stats on the data itself to identify the
partitions.

On Tue, Oct 10, 2017 at 9:23 PM, 郭鹏飞  wrote:

>
> > On Oct 4, 2017 at 2:08 AM, Nicolas Paris  wrote:
> >
> > Hi
> >
> > I wonder the differences accessing HIVE tables in two different ways:
> > - with jdbc access
> > - with sparkContext
> >
> > I would say that jdbc is better since it uses HIVE that is based on
> > map-reduce / TEZ and then works on disk.
> > Using spark rdd can lead to memory errors on very huge datasets.
> >
> >
> > Anybody knows or can point me to relevant documentation ?
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
> The jdbc will load the data into the driver node; this may slow things
> down, and may OOM.
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha


Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread 郭鹏飞

> On Oct 4, 2017 at 2:08 AM, Nicolas Paris  wrote:
> 
> Hi
> 
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext
> 
> I would say that jdbc is better since it uses HIVE that is based on
> map-reduce / TEZ and then works on disk. 
> Using spark rdd can lead to memory errors on very huge datasets.
> 
> 
> Anybody knows or can point me to relevant documentation ?
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


The jdbc will load the data into the driver node; this may slow things down, and
may OOM.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Hive From Spark: Jdbc VS sparkContext

2017-10-04 Thread ayan guha
Well, the obvious point is security: Ranger and Sentry can secure jdbc
endpoints only. On the performance aspect, I am equally curious.

On Wed, 4 Oct 2017 at 10:30 pm, Gourav Sengupta 
wrote:

> Hi,
>
> I am genuinely curious to see whether any one responds to this question.
>
> It's very hard to shake off JAVA, OOPs and JDBC's :)
>
>
>
> Regards,
> Gourav Sengupta
>
> On Tue, Oct 3, 2017 at 7:08 PM, Nicolas Paris  wrote:
>
>> Hi
>>
>> I wonder the differences accessing HIVE tables in two different ways:
>> - with jdbc access
>> - with sparkContext
>>
>> I would say that jdbc is better since it uses HIVE that is based on
>> map-reduce / TEZ and then works on disk.
>> Using spark rdd can lead to memory errors on very huge datasets.
>>
>>
>> Anybody knows or can point me to relevant documentation ?
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
> --
Best Regards,
Ayan Guha


Re: Hive From Spark: Jdbc VS sparkContext

2017-10-04 Thread Gourav Sengupta
Hi,

I am genuinely curious to see whether any one responds to this question.

It's very hard to shake off JAVA, OOPs and JDBC's :)



Regards,
Gourav Sengupta

On Tue, Oct 3, 2017 at 7:08 PM, Nicolas Paris  wrote:

> Hi
>
> I wonder the differences accessing HIVE tables in two different ways:
> - with jdbc access
> - with sparkContext
>
> I would say that jdbc is better since it uses HIVE that is based on
> map-reduce / TEZ and then works on disk.
> Using spark rdd can lead to memory errors on very huge datasets.
>
>
> Anybody knows or can point me to relevant documentation ?
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


RE: Hive From Spark

2014-08-25 Thread Andrew Lee
Hi Du,
I didn't notice the ticket was updated recently. SPARK-2848 is a sub-task of
SPARK-2420, and it's already resolved in Spark 1.1.0. It looks like SPARK-2420
will be released in Spark 1.2.0 according to the current JIRA status.
I'm tracking branch-1.1 instead of master and haven't seen the changes
merged. I'm still seeing guava 14.0.1, so I don't think SPARK-2848 has been merged
yet.
It would be great to have someone confirm or clarify the expectation.
 From: l...@yahoo-inc.com.INVALID
 To: van...@cloudera.com; alee...@hotmail.com
 CC: user@spark.apache.org
 Subject: Re: Hive From Spark
 Date: Sat, 23 Aug 2014 00:08:47 +
 
 I thought the fix had been pushed to the apache master ref. commit
 [SPARK-2848] Shade Guava in uber-jars By Marcelo Vanzin on 8/20. So my
 previous email was based on own build of the apache master, which turned
 out not working yet.
 
 Marcelo: Please correct me if I got that commit wrong.
 
 Thanks,
 Du
 
 
 
 On 8/22/14, 11:41 AM, Marcelo Vanzin van...@cloudera.com wrote:
 
 SPARK-2420 is fixed. I don't think it will be in 1.1, though - might
 be too risky at this point.
 
 I'm not familiar with spark-sql.
 
 On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee alee...@hotmail.com wrote:
  Hopefully there could be some progress on SPARK-2420. It looks like
 shading
  may be the voted solution among downgrading.
 
  Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark
  1.1.2?
 
  By the way, regarding bin/spark-sql? Is this more of a debugging tool
 for
  Spark job integrating with Hive?
  How does people use spark-sql? I'm trying to understand the rationale
 and
  motivation behind this script, any idea?
 
 
  Date: Thu, 21 Aug 2014 16:31:08 -0700
 
  Subject: Re: Hive From Spark
  From: van...@cloudera.com
  To: l...@yahoo-inc.com.invalid
  CC: user@spark.apache.org; u...@spark.incubator.apache.org;
  pwend...@gmail.com
 
 
  Hi Du,
 
  I don't believe the Guava change has made it to the 1.1 branch. The
  Guava doc says hashInt was added in 12.0, so what's probably
  happening is that you have and old version of Guava in your classpath
  before the Spark jars. (Hadoop ships with Guava 11, so that may be the
  source of your problem.)
 
  On Thu, Aug 21, 2014 at 4:23 PM, Du Li l...@yahoo-inc.com.invalid
 wrote:
   Hi,
  
   This guava dependency conflict problem should have been fixed as of
   yesterday according to
 https://issues.apache.org/jira/browse/SPARK-2420
  
   However, I just got java.lang.NoSuchMethodError:
  
   
 com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/Ha
 shCode;
   by the following code snippet and “mvn3 test” on Mac. I built the
 latest
   version of spark (1.1.0-SNAPSHOT) and installed the jar files to the
   local
   maven repo. From my pom file I explicitly excluded guava from almost
 all
   possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
   hadoop-client. This snippet is abstracted from a larger project. So
 the
   pom.xml includes many dependencies although not all are required by
 this
   snippet. The pom.xml is attached.
  
   Anybody knows what to fix it?
  
   Thanks,
   Du
   ---
  
   package com.myself.test
  
   import org.scalatest._
   import org.apache.hadoop.io.{NullWritable, BytesWritable}
   import org.apache.spark.{SparkContext, SparkConf}
   import org.apache.spark.SparkContext._
  
   class MyRecord(name: String) extends Serializable {
   def getWritable(): BytesWritable = {
   new
   
 BytesWritable(Option(name).getOrElse(\\N).toString.getBytes(UTF-8))
   }
  
   final override def equals(that: Any): Boolean = {
   if( !that.isInstanceOf[MyRecord] )
   false
   else {
   val other = that.asInstanceOf[MyRecord]
   this.getWritable == other.getWritable
   }
   }
   }
  
   class MyRecordTestSuite extends FunSuite {
   // construct an MyRecord by Consumer.schema
   val rec: MyRecord = new MyRecord(James Bond)
  
   test(generated SequenceFile should be readable from spark) {
   val path = ./testdata/
  
   val conf = new SparkConf(false).setMaster(local).setAppName(test
 data
   exchange with Hive)
   conf.set(spark.driver.host, localhost)
   val sc = new SparkContext(conf)
   val rdd = sc.makeRDD(Seq(rec))
   rdd.map((x: MyRecord) = (NullWritable.get(), x.getWritable()))
   .saveAsSequenceFile(path)
  
   val bytes = sc.sequenceFile(path, classOf[NullWritable],
   classOf[BytesWritable]).first._2
   assert(rec.getWritable() == bytes)
  
   sc.stop()
   System.clearProperty(spark.driver.port)
   }
   }
  
  
   From: Andrew Lee alee...@hotmail.com
   Reply-To: user@spark.apache.org user@spark.apache.org
   Date: Monday, July 21, 2014 at 10:27 AM
   To: user@spark.apache.org user@spark.apache.org,
   u...@spark.incubator.apache.org u...@spark.incubator.apache.org
  
   Subject: RE: Hive From Spark
  
   Hi All,
  
   Currently, if you are running Spark HiveContext API with Hive 0.12,
 it
   won't
   work due to the following 2 libraries which are not consistent with
 Hive

Re: Hive From Spark

2014-08-22 Thread Du Li
I thought the fix had been pushed to the apache master ref. commit
[SPARK-2848] Shade Guava in uber-jars By Marcelo Vanzin on 8/20. So my
previous email was based on own build of the apache master, which turned
out not working yet.

Marcelo: Please correct me if I got that commit wrong.

Thanks,
Du



On 8/22/14, 11:41 AM, Marcelo Vanzin van...@cloudera.com wrote:

SPARK-2420 is fixed. I don't think it will be in 1.1, though - might
be too risky at this point.

I'm not familiar with spark-sql.

On Fri, Aug 22, 2014 at 11:25 AM, Andrew Lee alee...@hotmail.com wrote:
 Hopefully there could be some progress on SPARK-2420. It looks like
shading
 may be the voted solution among downgrading.

 Any idea when this will happen? Could it happen in Spark 1.1.1 or Spark
 1.1.2?

 By the way, regarding bin/spark-sql? Is this more of a debugging tool
for
 Spark job integrating with Hive?
 How does people use spark-sql? I'm trying to understand the rationale
and
 motivation behind this script, any idea?


 Date: Thu, 21 Aug 2014 16:31:08 -0700

 Subject: Re: Hive From Spark
 From: van...@cloudera.com
 To: l...@yahoo-inc.com.invalid
 CC: user@spark.apache.org; u...@spark.incubator.apache.org;
 pwend...@gmail.com


 Hi Du,

 I don't believe the Guava change has made it to the 1.1 branch. The
 Guava doc says hashInt was added in 12.0, so what's probably
 happening is that you have and old version of Guava in your classpath
 before the Spark jars. (Hadoop ships with Guava 11, so that may be the
 source of your problem.)

 On Thu, Aug 21, 2014 at 4:23 PM, Du Li l...@yahoo-inc.com.invalid
wrote:
  Hi,
 
  This guava dependency conflict problem should have been fixed as of
  yesterday according to
https://issues.apache.org/jira/browse/SPARK-2420
 
  However, I just got java.lang.NoSuchMethodError:
 
  
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/Ha
shCode;
  by the following code snippet and “mvn3 test” on Mac. I built the
latest
  version of spark (1.1.0-SNAPSHOT) and installed the jar files to the
  local
  maven repo. From my pom file I explicitly excluded guava from almost
all
  possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
  hadoop-client. This snippet is abstracted from a larger project. So
the
  pom.xml includes many dependencies although not all are required by
this
  snippet. The pom.xml is attached.
 
  Anybody knows what to fix it?
 
  Thanks,
  Du
  ---
 
  package com.myself.test
 
  import org.scalatest._
  import org.apache.hadoop.io.{NullWritable, BytesWritable}
  import org.apache.spark.{SparkContext, SparkConf}
  import org.apache.spark.SparkContext._
 
  class MyRecord(name: String) extends Serializable {
  def getWritable(): BytesWritable = {
  new
  
BytesWritable(Option(name).getOrElse(\\N).toString.getBytes(UTF-8))
  }
 
  final override def equals(that: Any): Boolean = {
  if( !that.isInstanceOf[MyRecord] )
  false
  else {
  val other = that.asInstanceOf[MyRecord]
  this.getWritable == other.getWritable
  }
  }
  }
 
  class MyRecordTestSuite extends FunSuite {
  // construct an MyRecord by Consumer.schema
  val rec: MyRecord = new MyRecord(James Bond)
 
  test(generated SequenceFile should be readable from spark) {
  val path = ./testdata/
 
  val conf = new SparkConf(false).setMaster(local).setAppName(test
data
  exchange with Hive)
  conf.set(spark.driver.host, localhost)
  val sc = new SparkContext(conf)
  val rdd = sc.makeRDD(Seq(rec))
  rdd.map((x: MyRecord) = (NullWritable.get(), x.getWritable()))
  .saveAsSequenceFile(path)
 
  val bytes = sc.sequenceFile(path, classOf[NullWritable],
  classOf[BytesWritable]).first._2
  assert(rec.getWritable() == bytes)
 
  sc.stop()
  System.clearProperty(spark.driver.port)
  }
  }
 
 
  From: Andrew Lee alee...@hotmail.com
  Reply-To: user@spark.apache.org user@spark.apache.org
  Date: Monday, July 21, 2014 at 10:27 AM
  To: user@spark.apache.org user@spark.apache.org,
  u...@spark.incubator.apache.org u...@spark.incubator.apache.org
 
  Subject: RE: Hive From Spark
 
  Hi All,
 
  Currently, if you are running Spark HiveContext API with Hive 0.12,
it
  won't
  work due to the following 2 libraries which are not consistent with
Hive
  0.12 and Hadoop as well. (Hive libs aligns with Hadoop libs, and as a
  common
  practice, they should be consistent to work inter-operable).
 
  These are under discussion in the 2 JIRA tickets:
 
  https://issues.apache.org/jira/browse/HIVE-7387
 
  https://issues.apache.org/jira/browse/SPARK-2420
 
  When I ran the command by tweaking the classpath and build for Spark
  1.0.1-rc3, I was able to create table through HiveContext, however,
when
  I
  fetch the data, due to incompatible API calls in Guava, it breaks.
This
  is
   critical since it needs to map the columns to the RDD schema.
 
  Hive and Hadoop are using an older version of guava libraries
(11.0.1)
  where
  Spark Hive is using guava 14.0.1+.
  The community isn't willing to downgrade to 11.0.1 which

Re: Hive From Spark

2014-08-21 Thread Du Li
Hi,

This guava dependency conflict problem should have been fixed as of yesterday 
according to https://issues.apache.org/jira/browse/SPARK-2420

However, I just got java.lang.NoSuchMethodError: 
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
by the following code snippet and “mvn3 test” on Mac. I built the latest 
version of spark (1.1.0-SNAPSHOT) and installed the jar files to the local 
maven repo. From my pom file I explicitly excluded guava from almost all 
possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and 
hadoop-client. This snippet is abstracted from a larger project. So the pom.xml 
includes many dependencies although not all are required by this snippet. The 
pom.xml is attached.

Anybody knows what to fix it?

Thanks,
Du
---

package com.myself.test

import org.scalatest._
import org.apache.hadoop.io.{NullWritable, BytesWritable}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

class MyRecord(name: String) extends Serializable {
  def getWritable(): BytesWritable = {
    new BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
  }

  final override def equals(that: Any): Boolean = {
    if( !that.isInstanceOf[MyRecord] )
      false
    else {
      val other = that.asInstanceOf[MyRecord]
      this.getWritable == other.getWritable
    }
  }
}

class MyRecordTestSuite extends FunSuite {
  // construct an MyRecord by Consumer.schema
  val rec: MyRecord = new MyRecord("James Bond")

  test("generated SequenceFile should be readable from spark") {
    val path = "./testdata/"

    val conf = new SparkConf(false).setMaster("local").setAppName("test data exchange with Hive")
    conf.set("spark.driver.host", "localhost")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(Seq(rec))
    rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
      .saveAsSequenceFile(path)

    val bytes = sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable]).first._2
    assert(rec.getWritable() == bytes)

    sc.stop()
    System.clearProperty("spark.driver.port")
  }
}


From: Andrew Lee alee...@hotmail.com
Reply-To: user@spark.apache.org
Date: Monday, July 21, 2014 at 10:27 AM
To: user@spark.apache.org, u...@spark.incubator.apache.org
Subject: RE: Hive From Spark

Hi All,

Currently, if you are running Spark HiveContext API with Hive 0.12, it won't 
work due to the following 2 libraries which are not consistent with Hive 0.12 
and Hadoop as well. (Hive libs aligns with Hadoop libs, and as a common 
practice, they should be consistent to work inter-operable).

These are under discussion in the 2 JIRA tickets:

https://issues.apache.org/jira/browse/HIVE-7387

https://issues.apache.org/jira/browse/SPARK-2420

When I ran the command by tweaking the classpath and build for Spark 1.0.1-rc3, 
I was able to create table through HiveContext, however, when I fetch the data, 
due to incompatible API calls in Guava, it breaks. This is critical since it 
needs to map the columns to the RDD schema.

Hive and Hadoop are using an older version of guava libraries (11.0.1) where 
Spark Hive is using guava 14.0.1+.
The community isn't willing to downgrade to 11.0.1 which is the current version 
for Hadoop 2.2 and Hive 0.12.
Be aware of protobuf version as well in Hive 0.12 (it uses protobuf 2.4).


scala>

scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala>

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@34bee01a

scala>

scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==
Native command: executed by Hive

scala> hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:104
== Query Plan ==
Native command: executed by Hive

scala>

scala> // Queries are expressed in HiveQL

scala> hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102

Re: Hive From Spark

2014-08-21 Thread Marcelo Vanzin
Hi Du,

I don't believe the Guava change has made it to the 1.1 branch. The
Guava doc says hashInt was added in 12.0, so what's probably
happening is that you have and old version of Guava in your classpath
before the Spark jars. (Hadoop ships with Guava 11, so that may be the
source of your problem.)

On Thu, Aug 21, 2014 at 4:23 PM, Du Li l...@yahoo-inc.com.invalid wrote:
 Hi,

 This guava dependency conflict problem should have been fixed as of
 yesterday according to https://issues.apache.org/jira/browse/SPARK-2420

 However, I just got java.lang.NoSuchMethodError:
 com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
 by the following code snippet and “mvn3 test” on Mac. I built the latest
 version of spark (1.1.0-SNAPSHOT) and installed the jar files to the local
 maven repo. From my pom file I explicitly excluded guava from almost all
 possible dependencies, such as spark-hive_2.10-1.1.0.SNAPSHOT, and
 hadoop-client. This snippet is abstracted from a larger project. So the
 pom.xml includes many dependencies although not all are required by this
 snippet. The pom.xml is attached.

 Anybody knows what to fix it?

 Thanks,
 Du
 ---

 package com.myself.test

 import org.scalatest._
 import org.apache.hadoop.io.{NullWritable, BytesWritable}
 import org.apache.spark.{SparkContext, SparkConf}
 import org.apache.spark.SparkContext._

 class MyRecord(name: String) extends Serializable {
   def getWritable(): BytesWritable = {
 new
 BytesWritable(Option(name).getOrElse(\\N).toString.getBytes(UTF-8))
   }

   final override def equals(that: Any): Boolean = {
 if( !that.isInstanceOf[MyRecord] )
   false
 else {
   val other = that.asInstanceOf[MyRecord]
   this.getWritable == other.getWritable
 }
   }
 }

 class MyRecordTestSuite extends FunSuite {
   // construct an MyRecord by Consumer.schema
   val rec: MyRecord = new MyRecord(James Bond)

   test(generated SequenceFile should be readable from spark) {
 val path = ./testdata/

 val conf = new SparkConf(false).setMaster(local).setAppName(test data
 exchange with Hive)
 conf.set(spark.driver.host, localhost)
 val sc = new SparkContext(conf)
 val rdd = sc.makeRDD(Seq(rec))
 rdd.map((x: MyRecord) = (NullWritable.get(), x.getWritable()))
   .saveAsSequenceFile(path)

 val bytes = sc.sequenceFile(path, classOf[NullWritable],
 classOf[BytesWritable]).first._2
 assert(rec.getWritable() == bytes)

 sc.stop()
 System.clearProperty(spark.driver.port)
   }
 }


 From: Andrew Lee alee...@hotmail.com
 Reply-To: user@spark.apache.org user@spark.apache.org
 Date: Monday, July 21, 2014 at 10:27 AM
 To: user@spark.apache.org user@spark.apache.org,
 u...@spark.incubator.apache.org u...@spark.incubator.apache.org

 Subject: RE: Hive From Spark

 Hi All,

 Currently, if you are running Spark HiveContext API with Hive 0.12, it won't
 work due to the following 2 libraries which are not consistent with Hive
 0.12 and Hadoop as well. (Hive libs aligns with Hadoop libs, and as a common
 practice, they should be consistent to work inter-operable).

 These are under discussion in the 2 JIRA tickets:

 https://issues.apache.org/jira/browse/HIVE-7387

 https://issues.apache.org/jira/browse/SPARK-2420

 When I ran the command by tweaking the classpath and build for Spark
 1.0.1-rc3, I was able to create table through HiveContext, however, when I
 fetch the data, due to incompatible API calls in Guava, it breaks. This is
  critical since it needs to map the columns to the RDD schema. This is

 Hive and Hadoop are using an older version of guava libraries (11.0.1) where
 Spark Hive is using guava 14.0.1+.
 The community isn't willing to downgrade to 11.0.1 which is the current
 version for Hadoop 2.2 and Hive 0.12.
 Be aware of protobuf version as well in Hive 0.12 (it uses protobuf 2.4).

 scala

 scala import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext

 scala import org.apache.spark.sql.hive._
 import org.apache.spark.sql.hive._

 scala

 scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 hiveContext: org.apache.spark.sql.hive.HiveContext =
 org.apache.spark.sql.hive.HiveContext@34bee01a

 scala

 scala hiveContext.hql(CREATE TABLE IF NOT EXISTS src (key INT, value
 STRING))
 res0: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[0] at RDD at SchemaRDD.scala:104
 == Query Plan ==
 Native command: executed by Hive

 scala hiveContext.hql(LOAD DATA LOCAL INPATH
 'examples/src/main/resources/kv1.txt' INTO TABLE src)
 res1: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[3] at RDD at SchemaRDD.scala:104
 == Query Plan ==
 Native command: executed by Hive

 scala

 scala // Queries are expressed in HiveQL

 scala hiveContext.hql(FROM src SELECT key,
 value).collect().foreach(println)
 java.lang.NoSuchMethodError:
 com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
 at
 org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util

RE: Hive From Spark

2014-07-22 Thread Andrew Lee
Hi Sean,
Thanks for clarifying. I re-read SPARK-2420 and now have a better understanding.
From a user perspective, what would you recommend for building Spark with Hive
0.12 / 0.13+ libraries moving forward, and deploying to a production cluster that
runs an older version of Hadoop (e.g. 2.2 or 2.4)?
My concern is that there's going to be a lag in technology adoption, and since
Spark is moving fast, its libraries may always be newer. Protobuf is one good
example, hence the shading. From a business point of view, if there is no benefit to
upgrading a library, the chances that this will happen with high priority are low,
due to stability concerns and the cost of re-running the entire test suite. Just by
observation, there are still a lot of people running Hadoop 2.2 instead of 2.4 or
2.5, and releases and upgrades depend on other big players such as
Cloudera, Hortonworks, etc. for their distros. Not to mention the process of
upgrading itself.
Is there any benefit to using Guava 14 in Spark? I believe there is usually some
competitive reason why Spark chose Guava 14; however, I'm not sure anyone has
raised that in the conversation, so I don't know whether it is necessary.
Looking forward to seeing Hive on Spark work soon. Please let me know if
there's any help or feedback I can provide.
Thanks, Sean.


 From: so...@cloudera.com
 Date: Mon, 21 Jul 2014 18:36:10 +0100
 Subject: Re: Hive From Spark
 To: user@spark.apache.org
 
 I haven't seen anyone actively 'unwilling' -- I hope not. See
 discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I
 sketch what a downgrade means. I think it just hasn't gotten a looking
 over.
 
 Contrary to what I thought earlier, the conflict does in fact cause
 problems in theory, and you show it causes a problem in practice. Not
 to mention it causes issues for Hive-on-Spark now.
 
 On Mon, Jul 21, 2014 at 6:27 PM, Andrew Lee alee...@hotmail.com wrote:
  Hive and Hadoop are using an older version of guava libraries (11.0.1) where
  Spark Hive is using guava 14.0.1+.
  The community isn't willing to downgrade to 11.0.1 which is the current
  version for Hadoop 2.2 and Hive 0.12.
  

RE: Hive From Spark

2014-07-21 Thread Andrew Lee
$$iwC.<init>(<console>:19)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
at $iwC$$iwC$$iwC.<init>(<console>:28)
at $iwC$$iwC.<init>(<console>:30)
at $iwC.<init>(<console>:32)
at <init>(<console>:34)
at .<init>(<console>:38)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 From: hao.ch...@intel.com
 To: user@spark.apache.org; u...@spark.incubator.apache.org
 Subject: RE: Hive From Spark
 Date: Mon, 21 Jul 2014 01:14:19 +
 
 JiaJia, I've checkout the latest 1.0 branch, and then do the following steps:
 SPARK_HIVE=true sbt/sbt clean assembly
 cd examples
 ../bin/run-example sql.hive.HiveFromSpark
 
 It works well in my local
 
 From your log output, it shows Invalid method name: 'get_table', seems an 
 incompatible jar version or something wrong between the Hive metastore 
 service and client, can you double check the jar versions of Hive metastore 
 service or thrift?
 
 
 -Original Message-
 From: JiajiaJing [mailto:jj.jing0...@gmail.com] 
 Sent: Saturday, July 19, 2014 7:29 AM
 To: u...@spark.incubator.apache.org
 Subject: RE: Hive From Spark
 
 Hi Cheng Hao,
 
 Thank you very much for your reply.
 
 Basically, the program runs on Spark 1.0.0 and Hive 0.12.0 .
 
 Some setups of the environment are done by running SPARK_HIVE=true sbt/sbt 
 assembly/assembly, including the jar in all the workers, and copying the 
 hive-site.xml to spark's conf dir. 
 
 And then run the program as:   ./bin/run-example 
 org.apache.spark.examples.sql.hive.HiveFromSpark  
 
 It's good to know that this example runs well on your machine, could you 
  please give me some insight about what you have done as well?
 
 Thank you very much!
 
 Jiajia
 
 
 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
  

Re: Hive From Spark

2014-07-21 Thread Sean Owen
I haven't seen anyone actively 'unwilling' -- I hope not. See
discussion at https://issues.apache.org/jira/browse/SPARK-2420 where I
sketch what a downgrade means. I think it just hasn't gotten a looking
over.

Contrary to what I thought earlier, the conflict does in fact cause
problems in theory, and you show it causes a problem in practice. Not
to mention it causes issues for Hive-on-Spark now.

On Mon, Jul 21, 2014 at 6:27 PM, Andrew Lee alee...@hotmail.com wrote:
 Hive and Hadoop are using an older version of guava libraries (11.0.1) where
 Spark Hive is using guava 14.0.1+.
 The community isn't willing to downgrade to 11.0.1 which is the current
 version for Hadoop 2.2 and Hive 0.12.


RE: Hive From Spark

2014-07-20 Thread Cheng, Hao
JiaJia, I've checkout the latest 1.0 branch, and then do the following steps:
SPARK_HIVE=true sbt/sbt clean assembly
cd examples
../bin/run-example sql.hive.HiveFromSpark

It works well on my local machine

From your log output, it shows Invalid method name: 'get_table', which suggests an
incompatible jar version or a mismatch between the Hive metastore service
and client. Can you double check the jar versions of the Hive metastore service or
thrift?


-Original Message-
From: JiajiaJing [mailto:jj.jing0...@gmail.com] 
Sent: Saturday, July 19, 2014 7:29 AM
To: u...@spark.incubator.apache.org
Subject: RE: Hive From Spark

Hi Cheng Hao,

Thank you very much for your reply.

Basically, the program runs on Spark 1.0.0 and Hive 0.12.0 .

Some setups of the environment are done by running SPARK_HIVE=true sbt/sbt 
assembly/assembly, including the jar in all the workers, and copying the 
hive-site.xml to spark's conf dir. 

And then run the program as:   ./bin/run-example 
org.apache.spark.examples.sql.hive.HiveFromSpark  

It's good to know that this example runs well on your machine; could you please
give me some insight about what you have done as well?

Thank you very much!

Jiajia







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: Hive From Spark

2014-07-18 Thread JiajiaJing
Hi Cheng Hao,

Thank you very much for your reply.

Basically, the program runs on Spark 1.0.0 and Hive 0.12.0 .

Some setups of the environment are done by running SPARK_HIVE=true sbt/sbt
assembly/assembly, including the jar in all the workers, and copying the
hive-site.xml to spark's conf dir. 

And then run the program as:   ./bin/run-example
org.apache.spark.examples.sql.hive.HiveFromSpark  

It's good to know that this example runs well on your machine; could you
please give me some insight about what you have done as well?

Thank you very much!

Jiajia







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.