[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401300#comment-16401300 ] Jo Desmet commented on SPARK-8008:

Too bad that this issue is not considered high priority. I frequently run into the problem of needing to process billions of records. The only way to handle this is to create a huge number of partitions and then throttle using spark.executor.cores. However, that setting effectively throttles my entire RDD, not just the portion that loads from the database. It would be hugely beneficial if I could restrict not only the number of partitions, but also the task concurrency at any point in my RDD.

> JDBC data source can overload the external database system due to high concurrency
> --
> Key: SPARK-8008
> URL: https://issues.apache.org/jira/browse/SPARK-8008
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Rene Treffer
> Priority: Major
>
> Spark tries to load as many partitions as possible in parallel, which can in turn overload the database, although it would be possible to load all partitions given a lower concurrency.
> It would be nice to either limit the maximum concurrency or at least warn about this behavior.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
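The throttling this comment asks for, capping concurrency of the load stage alone while leaving later stages free to run wide, can be sketched outside Spark with a bounded worker pool. This is a minimal illustration of the idea, not Spark API: `load_partition` and the rows it returns are hypothetical stand-ins for per-partition JDBC queries.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for fetching one JDBC partition; in Spark this
# would be a query with a WHERE clause bounding the partition column.
def load_partition(part_id):
    return [f"row-{part_id}-{i}" for i in range(3)]

def load_all(partition_ids, max_db_connections=4):
    # The pool size caps how many loads hit the database at once,
    # independent of how much parallelism later stages use.
    with ThreadPoolExecutor(max_workers=max_db_connections) as pool:
        chunks = pool.map(load_partition, partition_ids)
        return [row for chunk in chunks for row in chunk]

# 100 logical partitions, but never more than 4 concurrent DB reads.
rows = load_all(range(100), max_db_connections=4)
```

The point is that the cap applies only to the database-facing stage, whereas spark.executor.cores throttles every stage of the job.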
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633004#comment-14633004 ] Reynold Xin commented on SPARK-8008:

[~rtreffer] I'm going to close this one for now because I don't know of anything feasible we can do in the short term. If we come up with a solution, we can reopen the ticket.
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584777#comment-14584777 ] Rene Treffer commented on SPARK-8008:

Well, it turns out I was wrong: union does not work as expected, and I'm also running into very strange problems on load. I'm now loading on a per-day basis and storing to Parquet files, which in turn requires SPARK-4176 and the SPARK-7897 improvement that [~rxin] just merged. I'm simply using a folder structure that should work for loading (the extra field is serialized in the path).
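The "extra field serialized in the path" idea can be sketched as generating one output path per day. The layout below (Hive-style `day=` folders under a made-up `/data/events` base) is a guess at what the commenter means, chosen because such folders are discoverable as a partition column on read; it is not taken from the ticket.

```python
from datetime import date, timedelta

def daily_paths(start, end, base="/data/events"):
    # One load per day, written under a key=value partition folder so the
    # extra "day" field is recoverable from the path when loading back.
    d = start
    while d <= end:
        yield d, f"{base}/day={d.isoformat()}"
        d += timedelta(days=1)

paths = [p for _, p in daily_paths(date(2015, 6, 1), date(2015, 6, 3))]
```

Loading day by day also serializes the database reads, which is itself a form of the throttling this ticket asks for.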
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570450#comment-14570450 ] Reynold Xin commented on SPARK-8008:

[~rtreffer] Can you submit a patch to the JDBC data source write path to throttle there using your union hack?
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568047#comment-14568047 ] Michael Armbrust commented on SPARK-8008:

Out of curiosity, if you are not caching and going directly into a shuffle, is this actually bad for memory consumption? Do we not stream into the shuffle files?
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568076#comment-14568076 ] Reynold Xin commented on SPARK-8008:

That's still bad from a fault-tolerance perspective, though.
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568048#comment-14568048 ] Reynold Xin commented on SPARK-8008:

There could still be a sorting or aggregation happening, right?
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568054#comment-14568054 ] Michael Armbrust commented on SPARK-8008:

Well yes, but if you read my suggestion, it was to extract with few partitions and then {{repartition}} before doing further work.
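The extract-then-repartition suggestion can be illustrated with a plain-Python simulation: read into only a few chunks (so the database sees few connections), then rebalance into many chunks for the expensive downstream work. This is a sketch of the shuffle effect of {{repartition}}, not Spark code; the chunk contents are invented.

```python
from itertools import chain

def repartition(chunks, n):
    # Rebalance rows from a few input chunks into n roughly equal chunks,
    # mimicking what a shuffle-based repartition does to an RDD.
    rows = list(chain.from_iterable(chunks))
    return [rows[i::n] for i in range(n)]

# Extract with only 2 "partitions", so the database serves just 2 reads...
extracted = [[f"row-{i}" for i in range(0, 50)],
             [f"row-{i}" for i in range(50, 100)]]
# ...then fan out to 16 partitions for the downstream stages.
balanced = repartition(extracted, 16)
```

The trade-off is one extra shuffle of the full dataset in exchange for a low, fixed load on the database.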
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568125#comment-14568125 ] Michael Armbrust commented on SPARK-8008:

After talking more offline with [~rxin], it does seem like unioning many smaller partitions (effectively serializing a bunch of smaller loads and then combining them) is also a pretty reasonable approach. It would be awesome if you could report back on your partition-count experiments and add this, along with a description of your other workaround, to this ticket! Then we can try to build one of them into the data source natively.
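The union approach described here, batches of small loads run one after another and their results combined, can be sketched as follows. This is a stdlib simulation of the pattern, not the Spark implementation: `load_partition` and the batch size are hypothetical, and in Spark each batch result would be a DataFrame combined via union.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical single-partition load; in Spark each of these would be a
# small jdbc() read covering one slice of the partition column.
def load_partition(part_id):
    return [f"row-{part_id}-{i}" for i in range(2)]

def load_in_batches(partition_ids, batch_size=4):
    # Each batch loads its partitions in parallel, but batches run one
    # after another, so the database never sees more than batch_size
    # concurrent reads; results are then combined ("unioned") in order.
    combined = []
    for start in range(0, len(partition_ids), batch_size):
        batch = partition_ids[start:start + batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            for chunk in pool.map(load_partition, batch):
                combined.extend(chunk)
    return combined

rows = load_in_batches(list(range(10)), batch_size=4)
```

The batch size plays the role of the maximum-concurrency knob the ticket asks for, while the total partition count stays high enough to keep each query small.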
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568052#comment-14568052 ] Reynold Xin commented on SPARK-8008:

Basically, if you are really loading terabytes of data like Rene was doing here, it's not a good idea to do it on 5 partitions.