Re: JDBC Very Slow

2016-09-16 Thread Takeshi Yamamuro
Hi,

It'd be better to set `predicates` in jdbc arguments for loading in
parallel.
See:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L200
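
For example, a minimal sketch against the 1.6 API (the partitioning column "id" and the range boundaries below are assumptions; pick an indexed numeric column in your own table):

// Connection properties instead of URL query parameters.
val props = new java.util.Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")

// Each predicate becomes one partition, so Spark opens one JDBC
// connection per predicate instead of doing a single full scan.
val predicates = Array(
  "id >= 0 AND id < 1000000",
  "id >= 1000000 AND id < 2000000",
  "id >= 2000000")

val jdbcDF = sqlContext.read.jdbc(
  "jdbc:postgresql://dbserver:port/database",
  "schema.table",
  predicates,
  props)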

// maropu

On Sat, Sep 17, 2016 at 7:46 AM, Benjamin Kim  wrote:

> I am testing this in spark-shell. I am following the Spark documentation
> by simply adding the PostgreSQL driver to the Spark Classpath.
>
> SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell
>
>
> Then, I run the code below to connect to the PostgreSQL database to query.
> This is when I have problems.
>
> Thanks,
> Ben
>
>
> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet  wrote:
>
> Hi! Can you split the init code from the current command? I think that is
> the main problem in your code.
> On Sep 16, 2016, 8:26 PM, "Benjamin Kim" 
> wrote:
>
>> Has anyone using Spark 1.6.2 encountered very slow responses from pulling
>> data from PostgreSQL using JDBC? I can get to the table and see the schema,
>> but when I do a show, it takes very long or keeps timing out.
>>
>> The code is simple.
>>
>> val jdbcDF = sqlContext.read.format("jdbc").options(
>> Map("url" -> "jdbc:postgresql://dbserver:po
>> rt/database?user=user&password=password",
>>"dbtable" -> “schema.table")).load()
>>
>> jdbcDF.show
>>
>>
>> If anyone can help, please let me know.
>>
>> Thanks,
>> Ben
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: JDBC Very Slow

2016-09-16 Thread Benjamin Kim
I am testing this in spark-shell. I am following the Spark documentation by 
simply adding the PostgreSQL driver to the Spark Classpath.

SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell

Then, I run the code below to connect to the PostgreSQL database to query. This 
is when I have problems.

Thanks,
Ben


> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet  wrote:
> 
> Hi! Can you split the init code from the current command? I think that is the
> main problem in your code.
> 
> On Sep 16, 2016, 8:26 PM, "Benjamin Kim"  wrote:
> Has anyone using Spark 1.6.2 encountered very slow responses from pulling 
> data from PostgreSQL using JDBC? I can get to the table and see the schema, 
> but when I do a show, it takes very long or keeps timing out.
> 
> The code is simple.
> 
> val jdbcDF = sqlContext.read.format("jdbc").options(
> Map("url" -> 
> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>"dbtable" -> “schema.table")).load()
> 
> jdbcDF.show
> 
> If anyone can help, please let me know.
> 
> Thanks,
> Ben
> 



Re: JDBC Very Slow

2016-09-16 Thread Nikolay Zhebet
Hi! Can you split the init code from the current command? I think that is the
main problem in your code.
On Sep 16, 2016, 8:26 PM, "Benjamin Kim" 
wrote:

> Has anyone using Spark 1.6.2 encountered very slow responses from pulling
> data from PostgreSQL using JDBC? I can get to the table and see the schema,
> but when I do a show, it takes very long or keeps timing out.
>
> The code is simple.
>
> val jdbcDF = sqlContext.read.format("jdbc").options(
> Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&
> password=password",
>"dbtable" -> “schema.table")).load()
>
> jdbcDF.show
>
>
> If anyone can help, please let me know.
>
> Thanks,
> Ben
>
>


JDBC Very Slow

2016-09-16 Thread Benjamin Kim
Has anyone using Spark 1.6.2 encountered very slow responses from pulling data 
from PostgreSQL using JDBC? I can get to the table and see the schema, but when 
I do a show, it takes very long or keeps timing out.

The code is simple.

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> 
"jdbc:postgresql://dbserver:port/database?user=user&password=password",
   "dbtable" -> “schema.table")).load()

jdbcDF.show

If anyone can help, please let me know.

Thanks,
Ben



Re: DataFrame/JDBC very slow performance

2015-08-26 Thread Dhaval Patel
Thanks Michael, much appreciated!

> Nothing should be held in memory for a query like this (other than a single
> count per partition), so I don't think that is the problem.  There is
> likely an error buried somewhere.

Regarding your comments above: I don't get any error, just NULL as the
return value. I have tried digging deeper in the logs, etc., but couldn't
spot anything. Are there any other suggestions for spotting such buried
errors?

Thanks,
Dhaval

On Mon, Aug 24, 2015 at 6:38 PM, Michael Armbrust 
wrote:

>> Much appreciated! I am not comparing with "select count(*)" for
>> performance, but it was one simple thing I tried in order to check the
>> performance :). I think it now makes sense, since Spark tries to extract
>> all records before doing the count. I thought submitting an aggregate
>> function query over JDBC/Teradata would let Teradata do the heavy lifting.
>>
>
> We currently only push down filters since there is a lot of variability in
> what types of aggregations various databases support.  You can manually
> pushdown whatever you want by replacing the table name with a subquery
> (i.e. "(SELECT ... FROM ...)")
>
>>    - How come my second query (for 5B records) didn't return anything
>> even after long processing? If I understood correctly, Spark would try to
>> fit it in memory and, if not, might use disk space, which I have
>> available?
>>
>
> Nothing should be held in memory for a query like this (other than a
> single count per partition), so I don't think that is the problem.  There
> is likely an error buried somewhere.
>
>
>>  - Am I supposed to do any Spark-related tuning to make it work?
>>
>> My main need is to access data from these large table(s) on demand and
>> provide aggregated and calculated results much more quickly; that is why
>> I was trying out Spark. As a next step, I am thinking of exporting the
>> data to Parquet files and giving that a try. Do you have any suggestions
>> on how to deal with the problem?
>>
>
> Exporting to Parquet will likely be a faster option than trying to query
> through JDBC, since we have many more opportunities for parallelism here.
>


Re: DataFrame/JDBC very slow performance

2015-08-24 Thread Michael Armbrust
>
> Much appreciated! I am not comparing with "select count(*)" for
> performance, but it was one simple thing I tried in order to check the
> performance :). I think it now makes sense, since Spark tries to extract
> all records before doing the count. I thought submitting an aggregate
> function query over JDBC/Teradata would let Teradata do the heavy lifting.
>

We currently only push down filters since there is a lot of variability in
what types of aggregations various databases support.  You can manually
pushdown whatever you want by replacing the table name with a subquery
(i.e. "(SELECT ... FROM ...)")

>    - How come my second query (for 5B records) didn't return anything
> even after long processing? If I understood correctly, Spark would try to
> fit it in memory and, if not, might use disk space, which I have
> available?
>

Nothing should be held in memory for a query like this (other than a single
count per partition), so I don't think that is the problem.  There is
likely an error buried somewhere.


>  - Am I supposed to do any Spark-related tuning to make it work?
>
> My main need is to access data from these large table(s) on demand and
> provide aggregated and calculated results much more quickly; that is why
> I was trying out Spark. As a next step, I am thinking of exporting the
> data to Parquet files and giving that a try. Do you have any suggestions
> on how to deal with the problem?
>

Exporting to Parquet will likely be a faster option than trying to query
through JDBC, since we have many more opportunities for parallelism here.
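
For instance, a minimal sketch of that workflow in Scala (assuming a DataFrame jdbcDF already loaded via JDBC as above; the output path is a placeholder):

// One-time export: pull the table through JDBC and persist it as Parquet.
jdbcDF.write.parquet("/path/to/table.parquet")

// Later queries read the columnar Parquet copy in parallel, without
// going back to the database.
val parquetDF = sqlContext.read.parquet("/path/to/table.parquet")
parquetDF.count()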


DataFrame/JDBC very slow performance

2015-08-24 Thread Dhaval Patel
I am trying to access a mid-size Teradata table (~100 million rows) via
JDBC in standalone mode on a single node (local[*]). When I tried with a BIG
table (5B records), no results were returned upon completion of the query.

I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24
cores, 126G RAM).

I have tried several memory setups and tuning options to make it work
faster, but none of them made a huge impact.

I am sure there is something I am missing. Below is my final try, which
took about 11 minutes to get this simple count, vs. only 40 seconds using a
JDBC connection through R to get the count.


bin/pyspark --driver-memory 40g --executor-memory 40g

df = sqlContext.read.jdbc("jdbc:teradata://..)
df.count()

