[ https://issues.apache.org/jira/browse/SPARK-22299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17061359#comment-17061359 ]
Pushpinder Heer commented on SPARK-22299:
-----------------------------------------
This shows resolved, but no version indicated. Is this still active? Thanks!

> Use OFFSET and LIMIT for JDBC DataFrameReader striping
> ------------------------------------------------------
>
>                 Key: SPARK-22299
>                 URL: https://issues.apache.org/jira/browse/SPARK-22299
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0
>            Reporter: Zack Behringer
>            Priority: Minor
>              Labels: bulk-closed
>
> Loading a large table (300M rows) from JDBC can be partitioned into tasks
> using the column, numPartitions, lowerBound and upperBound parameters on
> DataFrameReader.jdbc(), but that becomes troublesome if the column is
> skewed/fragmented (e.g. somebody used a global sequence for the partition
> column instead of a sequence specific to the table, or the table became
> fragmented by deletes).
> This can be worked around by using a modulus operation on the column, but
> that will be slow unless there is already an index on the modulus
> expression with the exact numPartitions value, so that doesn't scale well if
> you want to change the number of partitions. Another way would be to use an
> expression index on a hash of the partition column, but I'm not sure if JDBC
> striping is smart enough to create hash ranges for each stripe using hashes
> of the lower and upper bound parameters. If it is, that is great, but still
> that requires a very large index just for this use case.
> A less invasive approach would be to use the table's physical ordering along
> with OFFSET and LIMIT, so that only the total number of records to read would
> need to be known beforehand in order to distribute evenly; no indexes needed.
> I realize that OFFSET and LIMIT are not standard SQL keywords.
> I also see that a list of custom predicates can be defined.
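As a rough illustration of the two ideas above, here is a minimal Python sketch. The `modulus_predicates` helper builds WHERE-clause fragments of the kind that can be passed to Spark's `DataFrameReader.jdbc(url, table, predicates, connectionProperties)` overload; the `offset_limit_ranges` helper shows the arithmetic for evenly splitting a known row count into per-partition OFFSET/LIMIT stripes. Both helper names are hypothetical, not Spark API, and whether OFFSET/LIMIT can actually be embedded per-predicate is exactly the open question in this issue.

```python
def modulus_predicates(num_partitions, col="id"):
    # WHERE-clause fragments, one per partition, for the modulus workaround.
    # Each would be passed as one element of the `predicates` list to
    # DataFrameReader.jdbc(); slow without a matching expression index.
    return [f"{col} % {num_partitions} = {i}" for i in range(num_partitions)]

def offset_limit_ranges(total_rows, num_partitions):
    # Split a known total row count into (offset, limit) stripes that cover
    # the table exactly once, distributing any remainder across the first
    # partitions. Requires counting the rows up front.
    base, rem = divmod(total_rows, num_partitions)
    ranges, offset = [], 0
    for i in range(num_partitions):
        limit = base + (1 if i < rem else 0)
        ranges.append((offset, limit))
        offset += limit
    return ranges
```

For example, `offset_limit_ranges(300_000_000, 200)` yields 200 non-overlapping stripes whose limits sum to the full row count; each stripe's OFFSET/LIMIT pair would then need to be attached to its task's query, which is the part Spark's JDBC reader does not currently support.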
> I haven't tried
> that to see if I can embed numPartitions-specific predicates, each with its
> own OFFSET and LIMIT range.
> Some relational databases take quite a long time to count the number of
> records in order to determine the stripe size, though, so this can also be
> troublesome. Could a feature similar to "spark.sql.files.maxRecordsPerFile"
> be used in conjunction with the number of executors to read manageable
> batches (internally using OFFSET and LIMIT) until there are no more available
> results?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org