[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2018-03-15 Thread Jo Desmet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401300#comment-16401300
 ] 

Jo Desmet commented on SPARK-8008:
--

Too bad this issue is not considered high priority. Too often I run into the 
problem of needing to process billions of records. The only way to handle 
this is to create a huge number of partitions and then throttle using 
spark.executor.cores. However, that setting effectively throttles my entire 
RDD, not just the portion that loads from the database. It would be hugely 
beneficial to be able to restrict not only the number of partitions, but also 
the task concurrency at any point in my RDD.
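
For what it is worth, a minimal sketch of the closest knob available today: 
the JDBC source's numPartitions option caps how many partitions (and 
therefore concurrent database connections) a single read uses, independently 
of spark.executor.cores. The connection URL, table, and column names below 
are hypothetical.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-concurrency-cap").getOrCreate()

// Hypothetical connection details, table, and partition column.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "big_table")
  .option("user", "reader")
  .option("password", "secret")
  // Stripe the key range; lowerBound/upperBound only set the stride,
  // they do not filter rows.
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "10000000000")
  // At most 8 partitions, and therefore at most 8 concurrent connections.
  .option("numPartitions", "8")
  .load()
{code}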

> JDBC data source can overload the external database system due to high
> concurrency
> ------------------------------------------------------------------------
>
>          Key: SPARK-8008
>          URL: https://issues.apache.org/jira/browse/SPARK-8008
>      Project: Spark
>   Issue Type: Bug
>   Components: SQL
>     Reporter: Rene Treffer
>     Priority: Major
>
> Spark tries to load as many partitions as possible in parallel, which can
> in turn overload the database, even though it would be possible to load all
> partitions at a lower concurrency.
> It would be nice to either limit the maximum concurrency or at least warn
> about this behavior.





[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-07-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633004#comment-14633004
 ] 

Reynold Xin commented on SPARK-8008:


[~rtreffer] I'm going to close this one for now because I don't know if there 
are any feasible things we can do in the short term. If we come up with a 
solution, we can reopen the ticket.








[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-06-13 Thread Rene Treffer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584777#comment-14584777
 ] 

Rene Treffer commented on SPARK-8008:
-

Well, it turns out I was wrong: union does not work as expected. I'm also 
running into very strange problems on load.

I'm now doing the load on a per-day basis, storing it to parquet files, which 
in turn requires SPARK-4176 and the SPARK-7897 improvement that [~rxin] just 
merged.
I'm simply using a folder structure that should work for loading (the extra 
field is serialized in the path).
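
Purely as an illustration of that per-day approach, here is a sketch using 
the current DataFrame API; the table, date column, and output paths are made 
up.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("per-day-jdbc-dump").getOrCreate()

// Hypothetical table, date column, and output location.
val days = Seq("2015-06-01", "2015-06-02", "2015-06-03")

days.foreach { day =>
  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/mydb")
    .option("dbtable", s"(SELECT * FROM events WHERE day = '$day') AS t")
    .option("user", "reader")
    .option("password", "secret")
    .load()

  // One directory per day; the day value lives only in the path
  // (Hive-style layout, so it comes back as a column on read).
  df.write.mode("overwrite").parquet(s"/data/events/day=$day")
}

// Reading the root directory recovers `day` as a partition column.
val all = spark.read.parquet("/data/events")
{code}

The foreach also runs the loads sequentially, so the database only ever sees 
one extraction at a time.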







[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-06-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570450#comment-14570450
 ] 

Reynold Xin commented on SPARK-8008:


[~rtreffer] can you submit a patch to the JDBC data source write path to 
throttle there using your union hack?







[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568047#comment-14568047
 ] 

Michael Armbrust commented on SPARK-8008:
-

Out of curiosity, if you are not caching and going directly into a shuffle, is 
this actually bad for memory consumption?  Do we not stream into the shuffle 
files?







[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-06-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568076#comment-14568076
 ] 

Reynold Xin commented on SPARK-8008:


That's still bad from a fault-tolerance perspective though.







[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-06-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568048#comment-14568048
 ] 

Reynold Xin commented on SPARK-8008:


There could still be a sorting or aggregation happening, right?







[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568054#comment-14568054
 ] 

Michael Armbrust commented on SPARK-8008:
-

Well yes, but if you read my suggestion, it was to extract with a few 
partitions and then {{repartition}} before doing further work.
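
A minimal sketch of that pattern, assuming the DataFrameReader.jdbc variant 
that takes a partition column and bounds; the connection details, table, 
bounds, and partition counts are hypothetical.

{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("few-then-repartition").getOrCreate()

val props = new Properties()
props.setProperty("user", "reader")
props.setProperty("password", "secret")

// Read with only 4 partitions so at most 4 JDBC connections hit the
// database at the same time.
val raw = spark.read.jdbc(
  "jdbc:mysql://db-host:3306/mydb",  // JDBC url
  "events",                          // table
  "id",                              // column used to split the read
  0L,                                // lower bound of that column
  1000000000L,                       // upper bound of that column
  4,                                 // number of partitions / connections
  props)

// Fan out before the expensive transformations so the rest of the job
// still uses the whole cluster.
val wide = raw.repartition(400)
{code}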







[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-06-01 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568125#comment-14568125
 ] 

Michael Armbrust commented on SPARK-8008:
-

After talking more offline with [~rxin], it does seem like unioning many 
smaller partitions (effectively serializing a bunch of smaller loads and then 
combining them) is also a pretty reasonable approach. It would be awesome if 
you could report back on your partition-count experiments and add this, plus 
a description of your other workaround, to this ticket! Then we can try to 
build one of them into the data source natively.
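
For the record, a rough sketch of the union idea using the current DataFrame 
API. The table name, key range, and slice size are invented; each slice is 
materialized before the next one starts, so only one connection is open at a 
time, and for data that does not fit in memory you would write each slice out 
instead of caching it.

{code:scala}
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("union-hack-sketch").getOrCreate()

val props = new Properties()
props.setProperty("user", "reader")
props.setProperty("password", "secret")

val sliceSize = 100000000L

// Load the key space slice by slice; each slice uses a single partition,
// so only one JDBC connection is open at any time.
val slices: Seq[DataFrame] = (0L until 1000000000L by sliceSize).map { lo =>
  val hi = lo + sliceSize
  val slice = spark.read.jdbc(
    "jdbc:mysql://db-host:3306/mydb",
    s"(SELECT * FROM events WHERE id >= $lo AND id < $hi) AS s",
    props)
  // Force the load now so the slices run one after another, not in parallel.
  slice.persist(StorageLevel.MEMORY_AND_DISK)
  slice.count()
  slice
}

// Combine the already-materialized slices for downstream processing.
val all = slices.reduce(_ union _)
{code}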







[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency

2015-06-01 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568052#comment-14568052
 ] 

Reynold Xin commented on SPARK-8008:


Basically, if you are really loading terabytes of data like Rene was doing 
here, it's not a good idea to do it with 5 partitions.



