[ https://issues.apache.org/jira/browse/SPARK-22299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061359#comment-17061359 ]

Pushpinder Heer commented on SPARK-22299:
-----------------------------------------

This shows as resolved, but no fix version is indicated. Is this still active? Thanks!

> Use OFFSET and LIMIT for JDBC DataFrameReader striping
> ------------------------------------------------------
>
>                 Key: SPARK-22299
>                 URL: https://issues.apache.org/jira/browse/SPARK-22299
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0
>            Reporter: Zack Behringer
>            Priority: Minor
>              Labels: bulk-closed
>
> Loading a large table (300M rows) over JDBC can be split into partitioned 
> tasks using the column, numPartitions, lowerBound and upperBound parameters 
> of DataFrameReader.jdbc() (see the first sketch below), but that becomes 
> troublesome if the column's values are skewed or fragmented (e.g. somebody 
> used a global sequence for the partition column instead of a sequence 
> specific to the table, or the table has become fragmented by deletes, etc.).
> This can be worked around by using a modulus operation on the column (see 
> the second sketch below), but that will be slow unless there is already an 
> index on the modulus expression with the exact numPartitions value, so it 
> doesn't scale well if you want to change the number of partitions. Another 
> way would be to use an expression index on a hash of the partition column, 
> but I'm not sure whether JDBC striping is smart enough to create hash 
> ranges for each stripe from hashes of the lower and upper bound parameters. 
> If it is, that is great, but it still requires a very large index just for 
> this use case.
> A less invasive approach would be to use the table's physical ordering 
> along with OFFSET and LIMIT, so that only the total number of records would 
> need to be known beforehand in order to distribute evenly; no indexes 
> needed. I realize that OFFSET and LIMIT are not standard SQL keywords.
> I also see that a list of custom predicates can be defined. I haven't tried 
> embedding numPartitions-specific predicates, each with its own OFFSET and 
> LIMIT range (the third sketch below shows what that might look like).
> Some relational databases take quite a long time to count the number of 
> records in order to determine the stripe size, though, so this can also be 
> troublesome. Could a feature similar to "spark.sql.files.maxRecordsPerFile" 
> be used in conjunction with the number of executors to read manageable 
> batches (internally using OFFSET and LIMIT) until there are no more 
> available results? A driver-side version of that loop is sketched last 
> below.
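
For reference, a minimal sketch of the range-based striping described in the
first paragraph; the connection URL and the table/column names are
hypothetical stand-ins:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-striping").getOrCreate()
    val url = "jdbc:postgresql://dbhost/dbname" // hypothetical connection URL
    val props = new Properties()                // driver/user/password as needed

    // Spark splits [lowerBound, upperBound) into 64 equal-width ranges of "id"
    // and runs one SELECT ... WHERE id >= x AND id < y query per partition.
    // Sparse or skewed ids leave some partitions with far more rows than others.
    val df = spark.read.jdbc(
      url, "big_table", "id",
      1L,           // lowerBound
      300000000L,   // upperBound
      64,           // numPartitions
      props)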
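
The modulus workaround maps naturally onto the custom-predicate overload of
jdbc(). A sketch, reusing the url/props from above; without an index on the
modulus expression, each of the generated queries still scans the whole table:

    // One predicate per partition; each becomes that partition's WHERE clause,
    // so rows spread evenly no matter how gappy the id sequence is. Without an
    // index on (id % 64), every one of the 64 queries still scans the table.
    val numPartitions = 64
    val modPredicates = (0 until numPartitions)
      .map(i => s"id % $numPartitions = $i")
      .toArray

    val dfByMod = spark.read.jdbc(url, "big_table", modPredicates, props)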
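
Embedding OFFSET and LIMIT in the predicates themselves appears to be
possible on databases such as PostgreSQL or MySQL, because Spark splices each
predicate into a generated "SELECT ... FROM big_table WHERE <predicate>".
This is a fragile, non-portable trick rather than a supported API, and it
assumes the total row count is known up front:

    val totalRows = 300000000L // must be known, or counted, beforehand
    val stripe = totalRows / numPartitions + 1

    // Each predicate is spliced verbatim after WHERE, so the ORDER BY / LIMIT /
    // OFFSET tail rides along into the generated statement. Deep offsets still
    // cost the database O(offset) rows to skip, so later stripes read slowly.
    val offsetPredicates = (0 until numPartitions)
      .map(i => s"1=1 ORDER BY id LIMIT $stripe OFFSET ${i * stripe}")
      .toArray

    val dfByOffset = spark.read.jdbc(url, "big_table", offsetPredicates, props)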
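
Finally, the batched reading proposed in the last paragraph can at least be
approximated from the driver with today's API, by wrapping OFFSET/LIMIT in a
dbtable subquery and looping until a batch comes back short. This is only a
sketch of the idea, not the built-in feature being requested:

    // Pull the table in bounded batches from the driver; stop when a batch
    // comes back short. Each iteration is a fresh query against the database.
    val batchSize = 5000000L
    var offset = 0L
    var done = false
    while (!done) {
      val batch = spark.read.jdbc(
        url,
        s"(SELECT * FROM big_table ORDER BY id LIMIT $batchSize OFFSET $offset) AS b",
        props)
      val n = batch.count()
      // ... write or process `batch` here ...
      done = n < batchSize
      offset += batchSize
    }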


