Re: Hive From Spark: Jdbc VS sparkContext

2017-11-22 Thread Nicolas Paris
Hey. Finally I improved the spark-hive SQL performance a lot. I had a problem with a topology_script.py that produced huge error traces in the logs and reduced Spark performance in Python mode. I just corrected the python2 scripts to be python3-ready. I had some problem with broadcast variables while

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Yes, my thought exactly. Kindly let me know if you need any help to port it to pyspark. On Mon, Nov 6, 2017 at 8:54 AM, Nicolas Paris wrote: > On Nov 5, 2017 at 22:46, ayan guha wrote: > > Thank you for the clarification. That was my understanding too. However > how to > >

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 22:46, ayan guha wrote: > Thank you for the clarification. That was my understanding too. However how to > provide the upper bound as it changes for every call in real life. For example > it is not required for sqoop.  True. AFAIK sqoop begins with doing a "select
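Sqoop's trick of discovering the bounds with a min/max query before the main import can be reproduced by hand. A minimal Scala sketch, assuming a `spark` session; the JDBC URL, table and column names are hypothetical (and, as noted elsewhere in this thread, hive2 URLs may hit SPARK-21063):

```scala
// Fetch the current bounds in a one-row query, then reuse them for the partitioned read.
val boundsQuery = "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM people) AS bounds"
val bounds = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hs2-host:10000/default")
  .option("dbtable", boundsQuery)
  .load()
  .collect()(0)

val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hs2-host:10000/default")
  .option("dbtable", "people")
  .option("partitionColumn", "id")
  .option("lowerBound", bounds.get(0).toString)   // bounds recomputed on every call
  .option("upperBound", bounds.get(1).toString)
  .option("numPartitions", "16")
  .load()
```

This answers the "upper bound changes for every call" concern at the cost of one extra round trip to the source before the read.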

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Thank you for the clarification. That was my understanding too. However, how do you provide the upper bound, as it changes for every call in real life? For example, it is not required for sqoop. On Mon, 6 Nov 2017 at 8:20 am, Nicolas Paris wrote: > On Nov 5, 2017 at 22:02, ayan

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 22:02, ayan guha wrote: > Can you confirm if the JDBC DF reader actually loads all data from source to > driver > memory and then distributes to the executors? Apparently yes, when not using a partition column. > And this is true even when a > partition column is provided? No,
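The partitioned JDBC read being discussed can be sketched as follows. A minimal Scala sketch, assuming an existing `spark` session; the URL, table and column names are hypothetical:

```scala
// With these four options, each of the 16 tasks opens its own JDBC connection
// and fetches only its stride of rows (id ranges derived from the bounds),
// so the data does not funnel through the driver.
val people = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hs2-host:10000/default")
  .option("dbtable", "people")
  .option("partitionColumn", "id")  // must be a numeric column (or date/timestamp in newer Spark)
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "16")
  .load()
```

Without `partitionColumn`, the same read runs as a single query over a single connection, which matches the "all rows through one reader" behavior described above.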

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Hi. Can you confirm if the JDBC DF reader actually loads all data from source to driver memory and then distributes it to the executors? And is this true even when a partition column is provided? Best, Ayan On Mon, Nov 6, 2017 at 3:00 AM, David Hodeffi < david.hode...@niceactimize.com> wrote: > Testing

RE: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread David Hodeffi
Testing Spark group e-mail

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 14:11, Gourav Sengupta wrote: > thanks a ton for your kind response. Have you used SPARK Session? I think > that > hiveContext is a very old way of solving things in SPARK, and since then new > algorithms have been introduced in SPARK.  I will give sparkSession a try.

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Gourav Sengupta
Hi Nicolas, thanks a ton for your kind response. Have you used Spark Session? I think that hiveContext is a very old way of solving things in SPARK, and since then new algorithms have been introduced in SPARK. It will be a lot of help, given how kind you have been by sharing your experience,
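For reference, the SparkSession replacement for hiveContext that Gourav alludes to looks roughly like this (a sketch; since Spark 2.0, HiveContext is deprecated in favor of a SparkSession with Hive support enabled):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-access")
  .enableHiveSupport()   // replaces the old HiveContext; wires in the Hive metastore
  .getOrCreate()

// Hive tables are then queried directly through the session:
val counts = spark.sql("SELECT count(*) FROM mydb.people")
```

The database and table names are placeholders; the point is only that `spark.sql` on a Hive-enabled session takes over everything `hiveContext.sql` used to do.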

Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
Hi. After some testing, I have been quite disappointed with the hiveContext way of accessing hive tables. The main problem is resource allocation: I have tons of users and they get a limited subset of workers. Then this does not allow querying huge datasets because too little memory is allocated (or maybe I
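The resource-allocation limits described here are usually tuned when the session is launched. A hedged sketch of the relevant settings (all values illustrative, not recommendations):

```scala
import org.apache.spark.sql.SparkSession

// Dynamic allocation lets a shared cluster grow/shrink executors per user
// instead of pinning a fixed, possibly too-small subset of workers.
val spark = SparkSession.builder()
  .appName("hive-heavy-query")
  .config("spark.executor.memory", "8g")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "40")
  .enableHiveSupport()
  .getOrCreate()
```

Whether dynamic allocation is available also depends on the cluster manager (it additionally needs the external shuffle service on YARN), so this is a starting point rather than a fix.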

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Gourav Sengupta
Hi Nicolas, without the hive thrift server, if you try to run a select * on a table which has around 10,000 partitions, SPARK will give you some surprises. PRESTO works fine in these scenarios, and I am sure SPARK community will soon learn from their algorithms. Regards, Gourav On Sun, Oct 15,

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
> I do not think that SPARK will automatically determine the partitions. > Actually > it does not automatically determine the partitions. In case a table has a few > million records, it all goes through the driver. Hi Gourav. Actually the spark jdbc driver is able to deal directly with partitions.

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
Hi Gourav > what if the table has partitions and sub-partitions? Well, this also works with multiple orc files having the same schema: val people = sqlContext.read.format("orc").load("hdfs://cluster/people*") Am I missing something? > And you do not want to access the entire data? This works for
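For a Hive-partitioned layout (directories like `people/year=2017/month=10/`), partition pruning can be kept even with the direct-file approach shown above. A sketch, with hypothetical paths and partition columns:

```scala
// basePath tells Spark to treat the year=/month= directories as partition
// columns, so the filter below prunes whole directories instead of
// scanning every ORC file under people/.
val people = spark.read
  .format("orc")
  .option("basePath", "hdfs://cluster/people")
  .load("hdfs://cluster/people/year=*/month=*")
  .filter("year = 2017 AND month = 10")
```

This addresses the "partitions and sub-partitions" objection: the glob read does not have to touch the entire dataset as long as the filter hits the partition columns.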

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Gourav Sengupta
Hi Nicolas, what if the table has partitions and sub-partitions? And you do not want to access the entire data? Regards, Gourav On Sun, Oct 15, 2017 at 12:55 PM, Nicolas Paris wrote: > Le 03 oct. 2017 à 20:08, Nicolas Paris écrivait : > > I wonder the differences

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-15 Thread Nicolas Paris
On Oct 3, 2017 at 20:08, Nicolas Paris wrote: > I wonder about the differences between accessing HIVE tables in two different ways: > - with jdbc access > - with sparkContext Well, there is also a third way to access the hive data from spark: - with direct file access (here ORC format) For example: val

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-13 Thread Kabeer Ahmed
My take on this might sound a bit different. Here are few points to consider below: 1. Going through Hive JDBC means that the application is restricted by the # of queries that can be compiled. HS2 can only compile one SQL at a time and if users have bad SQL, it can take a long time just to

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-13 Thread Nicolas Paris
> In case a table has a few > million records, it all goes through the driver. This sounds clear in JDBC mode: the driver gets all the rows and then it spreads the RDD over the executors. I'd say that most use cases deal with SQL to aggregate huge datasets and retrieve a small number of rows to be

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread Gourav Sengupta
Hi, I do not think that SPARK will automatically determine the partitions. Actually it does not automatically determine the partitions. In case a table has a few million records, it all goes through the driver. Of course, I have only tried JDBC connections in AURORA, Oracle and Postgres.

RE: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread Walia, Reema
> Is Hive from Spark via JDBC working for you? In case it does, I would be interested in your setup :-) We can't get this working. See bug here, especially my last comment: https://issues.apache.org

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread weand
Is Hive from Spark via JDBC working for you? In case it does, I would be interested in your setup :-) We can't get this working. See bug here, especially my last comment: https://issues.apache.org/jira/browse/SPARK-21063 Regards, Andreas

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread ayan guha
That is not correct, IMHO. If I am not wrong, Spark will still load data in the executors, by running some stats on the data itself to identify partitions. On Tue, Oct 10, 2017 at 9:23 PM, 郭鹏飞 wrote: > > > On Oct 4, 2017 at 2:08 AM, Nicolas Paris wrote: > > >

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-10 Thread 郭鹏飞
> On Oct 4, 2017 at 2:08 AM, Nicolas Paris wrote: > > Hi > > I wonder about the differences between accessing HIVE tables in two different ways: > - with jdbc access > - with sparkContext > > I would say that jdbc is better since it uses HIVE that is based on > map-reduce / TEZ and then works on

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-04 Thread ayan guha
Well, the obvious point is security. Ranger and Sentry can secure jdbc endpoints only. For the performance aspect, I am equally curious. On Wed, 4 Oct 2017 at 10:30 pm, Gourav Sengupta wrote: > Hi, > > I am genuinely curious to see whether any one responds to this

Re: Hive From Spark: Jdbc VS sparkContext

2017-10-04 Thread Gourav Sengupta
Hi, I am genuinely curious to see whether anyone responds to this question. It's very hard to shake off JAVA, OOP and JDBC :) Regards, Gourav Sengupta On Tue, Oct 3, 2017 at 7:08 PM, Nicolas Paris wrote: > Hi > > I wonder the differences accessing HIVE tables in two

Hive From Spark: Jdbc VS sparkContext

2017-10-03 Thread Nicolas Paris
Hi. I wonder about the differences between accessing HIVE tables in two different ways: - with jdbc access - with sparkContext. I would say that jdbc is better since it uses HIVE, which is based on map-reduce / TEZ, and then works on disk. Using spark rdd can lead to memory errors on very huge datasets. Anybody
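The two access paths in question can be sketched side by side. A minimal Scala sketch, assuming a Hive-enabled `spark` session; the URL and table name are hypothetical:

```scala
// Path 1: JDBC through HiveServer2 — the query is compiled and executed
// inside Hive (MR/Tez), and Spark only consumes the result set.
val viaJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hs2-host:10000/default")
  .option("dbtable", "people")
  .load()

// Path 2: metastore integration — Spark plans the query itself and reads
// the table's files directly from HDFS with its own executors.
val viaContext = spark.sql("SELECT * FROM people")
```

The trade-off debated in this thread follows from that split: path 1 inherits Hive's disk-based engine and its security layer, path 2 inherits Spark's in-memory execution and its memory pressure.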