Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Yes, my thought exactly. Kindly let me know if you need any help porting it
to pyspark.

On Mon, Nov 6, 2017 at 8:54 AM, Nicolas Paris  wrote:

> On Nov 5, 2017 at 22:46, ayan guha wrote:
> > Thank you for the clarification. That was my understanding too. However,
> > how do we provide the upper bound, as it changes for every call in real
> > life? For example, it is not required for sqoop.
>
> True.  AFAIK sqoop begins by running
> "select min(column_split), max(column_split)
> from () as query;"
> and then splits the result.
>
> I was thinking of doing the same with a wrapper around Spark JDBC that
> would infer the number of partitions and the upper/lower bounds itself.
>
>


-- 
Best Regards,
Ayan Guha


Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 22:46, ayan guha wrote:
> Thank you for the clarification. That was my understanding too. However,
> how do we provide the upper bound, as it changes for every call in real
> life? For example, it is not required for sqoop.

True.  AFAIK sqoop begins by running
"select min(column_split), max(column_split)
from () as query;"
and then splits the result.

I was thinking of doing the same with a wrapper around Spark JDBC that
would infer the number of partitions and the upper/lower bounds itself.
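
A rough sketch of such a wrapper (Scala; the JDBC URL, table, and split
column below are placeholders, not taken from any real setup):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-bounds-sketch").getOrCreate()

    val url      = "jdbc:postgresql://dbhost:5432/mydb"  // placeholder
    val table    = "myschema.mytable"                    // placeholder
    val splitCol = "id"                                  // placeholder numeric column

    // 1) Ask the database for the bounds of the split column, sqoop-style.
    val bounds = spark.read
      .format("jdbc")
      .option("url", url)
      .option("dbtable", s"(select min($splitCol) as lo, max($splitCol) as hi from $table) as b")
      .load()
      .collect()(0)

    // 2) Partitioned read using the inferred bounds; numPartitions is arbitrary here.
    val df = spark.read
      .format("jdbc")
      .option("url", url)
      .option("dbtable", table)
      .option("partitionColumn", splitCol)
      .option("lowerBound", bounds.getAs[Any]("lo").toString)
      .option("upperBound", bounds.getAs[Any]("hi").toString)
      .option("numPartitions", "8")
      .load()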





Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Thank you for the clarification. That was my understanding too. However, how
do we provide the upper bound, as it changes for every call in real life? For
example, it is not required for sqoop.


On Mon, 6 Nov 2017 at 8:20 am, Nicolas Paris  wrote:

> On Nov 5, 2017 at 22:02, ayan guha wrote:
> > Can you confirm whether the JDBC DataFrame reader actually loads all data
> > from the source into driver memory and then distributes it to the executors?
>
> Apparently yes, when not using a partition column.
>
>
> > And is this true even when a
> > partition column is provided?
>
> No. In this case, each worker sends its own JDBC call, as described in the
> documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
>
>
> --
Best Regards,
Ayan Guha


Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 22:02, ayan guha wrote:
> Can you confirm whether the JDBC DataFrame reader actually loads all data
> from the source into driver memory and then distributes it to the executors?

Apparently yes, when not using a partition column.


> And is this true even when a
> partition column is provided?

No. In this case, each worker sends its own JDBC call, as described in the
documentation:
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
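
For reference, a minimal partitioned read along the lines of that
documentation would look roughly like this (Scala; assuming an existing
SparkSession named `spark`, and with placeholder names throughout):

    // One JDBC connection per partition: the range lowerBound..upperBound on
    // the partition column is split into numPartitions slices.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder
      .option("dbtable", "myschema.mytable")                // placeholder
      .option("partitionColumn", "id")                      // placeholder
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load()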





Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread ayan guha
Hi

Can you confirm whether the JDBC DataFrame reader actually loads all data
from the source into driver memory and then distributes it to the executors?
And is this true even when a partition column is provided?

Best
Ayan

On Mon, Nov 6, 2017 at 3:00 AM, David Hodeffi <
david.hode...@niceactimize.com> wrote:

> Testing Spark group e-mail
>
>
>


-- 
Best Regards,
Ayan Guha


Re: spark-avro aliases incompatible

2017-11-05 Thread Gourav Sengupta
Hi Gaspar,

could you please provide details regarding the environment, versions,
libraries, and code snippets?

For example: the SPARK version, OS, distribution, whether it is running on
YARN, etc., and all other details.


Regards,
Gourav Sengupta

On Sun, Nov 5, 2017 at 9:03 AM, Gaspar Muñoz  wrote:

> Hi there,
>
> I use the Avro format to store historical data because of Avro schema
> evolution. I manage external schemas and read them using the avroSchema
> option, so we have been able to add and delete columns.
>
> The problem is that when I introduced aliases, the Spark process didn't work
> as expected, and then I read in the spark-avro library that "At the moment, it
> ignores docs, aliases and other properties present in the Avro file".
>
> How do you manage aliases and column renaming? Is there any workaround?
>
> Thanks in advance.
>
> Regards
>
> --
> Gaspar Muñoz Soria
>
> Vía de las dos Castillas, 33, Ática 4, 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 828 6473
>


RE: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread David Hodeffi
Testing Spark group e-mail






Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
On Nov 5, 2017 at 14:11, Gourav Sengupta wrote:
> thanks a ton for your kind response. Have you used SPARK Session? I think that
> hiveContext is a very old way of solving things in SPARK, and since then new
> algorithms have been introduced in SPARK.

I will give sparkSession a try.

> It will be a lot of help, given how kind you have been in sharing your
> experience, if you could also share your code and provide details such as the
> SPARK, HADOOP, and HIVE versions and other environment details.

I am testing an HDP 2.6 distribution with:
SPARK: 2.1.1
HADOOP: 2.7.3
HIVE: 1.2.1000
PRESTO: 1.87

> After all, no one wants to use SPARK 1.x versions to solve problems anymore,
> though I have seen a couple of companies who are stuck with these versions as
> they are using in-house deployments which they cannot upgrade because of
> incompatibility issues.

I didn't know hiveContext was the legacy Spark way. I will give sparkSession a
try and then conclude. After all, I would prefer to provide our users with a
single, uniform framework such as Spark, instead of multiple complicated layers
such as Spark plus some JDBC access.
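
For what it is worth, a minimal sketch of the sparkSession entry point
(the Spark 2.x replacement for hiveContext), assuming Hive support is
available on the cluster; the table name is a placeholder:

    import org.apache.spark.sql.SparkSession

    // SparkSession with Hive support replaces the old hiveContext in Spark 2.x.
    val spark = SparkSession.builder()
      .appName("hive-access-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Query a Hive table directly through the session.
    val df = spark.sql("select * from mydb.mytable")  // placeholder table
    df.show(10)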

> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris  wrote:
> 
> Hi
> 
> After some testing, I have been quite disappointed with the hiveContext way of
> accessing hive tables.
>
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers. Then this does not allow them to query huge
> datasets because too little memory is allocated (or maybe I am missing
> something).
>
> If using Hive jdbc, Hive resources are shared by all my users and then
> queries are able to finish.
>
> Then I have been testing other jdbc-based approaches and for now, "presto"
> looks like the most appropriate solution to access hive tables.
>
> In order to load huge datasets into spark, the proposed approach is to
> use presto distributed CTAS to build an ORC dataset, and access that
> dataset through spark's dataframe loader ability (instead of direct jdbc
> access that would break the driver memory).
> 
> 
> 
> On Oct 15, 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> >
> > without the hive thrift server, if you try to run a select * on a table
> > which has around 10,000 partitions, SPARK will give you some surprises.
> > PRESTO works fine in these scenarios, and I am sure the SPARK community
> > will soon learn from their algorithms.
> >
> >
> > Regards,
> > Gourav
> >
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris  wrote:
> >
> >     > I do not think that SPARK will automatically determine the
> >     > partitions. Actually it does not automatically determine the
> >     > partitions. In case a table has a few million records, it all
> >     > goes through the driver.
> >
> >     Hi Gourav
> >
> >     Actually, spark's jdbc driver is able to deal directly with partitions.
> >     Spark creates a jdbc connection for each partition.
> >
> >     All details are explained in this post:
> >     http://www.gatorsmile.io/numpartitionsinjdbc/
> >
> >     Also an example with the greenplum database:
> >     http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
> >
> >
> 
> 




Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Gourav Sengupta
Hi Nicolas,


thanks a ton for your kind response. Have you used SPARK Session? I think
that hiveContext is a very old way of solving things in SPARK, and since
then new algorithms have been introduced in SPARK.

It will be a lot of help, given how kind you have been in sharing your
experience, if you could also share your code and provide details such as the
SPARK, HADOOP, and HIVE versions and other environment details.

After all, no one wants to use SPARK 1.x versions to solve problems anymore,
though I have seen a couple of companies who are stuck with these versions as
they are using in-house deployments which they cannot upgrade because of
incompatibility issues.


Regards,
Gourav Sengupta


On Sun, Nov 5, 2017 at 12:57 PM, Nicolas Paris  wrote:

> Hi
>
> After some testing, I have been quite disappointed with the hiveContext way of
> accessing hive tables.
>
> The main problem is resource allocation: I have tons of users and they
> get a limited subset of workers. Then this does not allow them to query huge
> datasets because too little memory is allocated (or maybe I am missing
> something).
>
> If using Hive jdbc, Hive resources are shared by all my users and then
> queries are able to finish.
>
> Then I have been testing other jdbc-based approaches and for now, "presto"
> looks like the most appropriate solution to access hive tables.
>
> In order to load huge datasets into spark, the proposed approach is to
> use presto distributed CTAS to build an ORC dataset, and access that
> dataset through spark's dataframe loader ability (instead of direct jdbc
> access that would break the driver memory).
>
>
>
> On Oct 15, 2017 at 19:24, Gourav Sengupta wrote:
> > Hi Nicolas,
> >
> > without the hive thrift server, if you try to run a select * on a table
> > which has around 10,000 partitions, SPARK will give you some surprises.
> > PRESTO works fine in these scenarios, and I am sure the SPARK community
> > will soon learn from their algorithms.
> >
> >
> > Regards,
> > Gourav
> >
> > On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris  wrote:
> >
> > > I do not think that SPARK will automatically determine the
> > > partitions. Actually it does not automatically determine the
> > > partitions. In case a table has a few million records, it all
> > > goes through the driver.
> >
> > Hi Gourav
> >
> > Actually, spark's jdbc driver is able to deal directly with partitions.
> > Spark creates a jdbc connection for each partition.
> >
> > All details are explained in this post:
> > http://www.gatorsmile.io/numpartitionsinjdbc/
> >
> > Also an example with the greenplum database:
> > http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
> >
> >
>


Re: Hive From Spark: Jdbc VS sparkContext

2017-11-05 Thread Nicolas Paris
Hi

After some testing, I have been quite disappointed with the hiveContext way of
accessing hive tables.

The main problem is resource allocation: I have tons of users and they
get a limited subset of workers. Then this does not allow them to query huge
datasets because too little memory is allocated (or maybe I am missing
something).

If using Hive jdbc, Hive resources are shared by all my users and then
queries are able to finish.

Then I have been testing other jdbc-based approaches and for now, "presto"
looks like the most appropriate solution to access hive tables.

In order to load huge datasets into spark, the proposed approach is to
use presto distributed CTAS to build an ORC dataset, and access that
dataset through spark's dataframe loader ability (instead of direct jdbc
access that would break the driver memory).
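
A rough sketch of that flow (Scala; the Presto catalog, table, and HDFS
path are placeholders, and the CTAS statement is run through a Presto
client, not through Spark):

    // Step 1 (run in Presto, shown only as a comment; placeholder names):
    //   CREATE TABLE hive.staging.big_extract
    //   WITH (format = 'ORC')
    //   AS SELECT * FROM hive.warehouse.big_table;
    //
    // Step 2: read the resulting ORC dataset from Spark, so the data never
    // goes through the driver as it would with a plain JDBC read.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val df = spark.read.orc("hdfs:///warehouse/staging.db/big_extract")  // placeholder path
    println(df.count())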



On Oct 15, 2017 at 19:24, Gourav Sengupta wrote:
> Hi Nicolas,
> 
> without the hive thrift server, if you try to run a select * on a table which
> has around 10,000 partitions, SPARK will give you some surprises. PRESTO works
> fine in these scenarios, and I am sure the SPARK community will soon learn from
> their algorithms.
> 
> 
> Regards,
> Gourav
> 
> On Sun, Oct 15, 2017 at 3:43 PM, Nicolas Paris  wrote:
> 
> > I do not think that SPARK will automatically determine the partitions.
> > Actually it does not automatically determine the partitions. In case a
> > table has a few million records, it all goes through the driver.
>
> Hi Gourav
>
> Actually, spark's jdbc driver is able to deal directly with partitions.
> Spark creates a jdbc connection for each partition.
>
> All details are explained in this post:
> http://www.gatorsmile.io/numpartitionsinjdbc/
>
> Also an example with the greenplum database:
> http://engineering.pivotal.io/post/getting-started-with-greenplum-spark/
> 
> 




spark-avro aliases incompatible

2017-11-05 Thread Gaspar Muñoz
Hi there,

I use the Avro format to store historical data because of Avro schema
evolution. I manage external schemas and read them using the avroSchema
option, so we have been able to add and delete columns.
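
(For context, a minimal sketch of that kind of read, assuming the
com.databricks.spark.avro package and placeholder paths, not the exact
code from this setup:)

    import org.apache.avro.Schema
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("avro-schema-sketch").getOrCreate()

    // Parse the externally managed schema; the path is a placeholder.
    val externalSchema = new Schema.Parser().parse(new java.io.File("/schemas/events.avsc"))

    // Read the historical Avro data against that schema so added/deleted
    // columns are resolved through schema evolution.
    val df = spark.read
      .format("com.databricks.spark.avro")
      .option("avroSchema", externalSchema.toString)
      .load("/data/historical/events")  // placeholder path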

The problem is that when I introduced aliases, the Spark process didn't work
as expected, and then I read in the spark-avro library that "At the moment, it
ignores docs, aliases and other properties present in the Avro file".

How do you manage aliases and column renaming? Is there any workaround?

Thanks in advance.

Regards

-- 
Gaspar Muñoz Soria

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473