We want to extract data from MySQL and run the computation in Spark SQL. The SQL explain plan looks like this:
    ...REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema: struct<N_NATIONKEY:int,N_NAME:string,N_REGIONKEY:int,N_COMMENT:string>
    +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST], false, 0
       +- Exchange(coordinator id: 266374831) hashpartitioning(r_regionkey#203, 200), coordinator[target post-shuffle partition size: 67108864]
          +- *(19) Project [R_REGIONKEY#203]
             +- *(19) Filter ((isnotnull(r_name#204) && (r_name#204 = AFRICA)) && isnotnull(r_regionkey#203))
                +- InMemoryTableScan [R_REGIONKEY#203, r_name#204], [isnotnull(r_name#204), (r_name#204 = AFRICA), isnotnull(r_regionkey#203)]
                      +- InMemoryRelation [R_REGIONKEY#203, R_NAME#204, R_COMMENT#205], true, 10000, StorageLevel(disk, memory, 1 replicas)
                            +- *(1) Scan JDBCRelation(region) [numPartitions=1] [R_REGIONKEY#203,R_NAME#204,R_COMMENT#205] PushedFilters: [], ReadSchema: struct<R_REGIONKEY:int,R_NAME:string,R_COMMENT:string>

As you can see, every JDBCRelation is converted to an InMemoryRelation. Because the JDBC table is very large, the data cannot all fit into memory, and the job fails with an OOM. Is there an option to make Spark SQL spill to disk when memory is not enough?
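For context, here is a minimal sketch of the kind of setup that produces such a plan: a table read over JDBC and then cached, which is what materializes it as an InMemoryRelation. The connection URL, credentials, table name and the explicit `persist()` call are assumptions for illustration, not taken from the actual job; `StorageLevel.MEMORY_AND_DISK` is shown only as one place where a disk-backed storage level would be specified.

```scala
// Sketch only: assumed setup, not the original job's code.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object JdbcCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-cache-sketch")
      .getOrCreate()

    // Read the MySQL table over JDBC (placeholder URL and credentials).
    val region = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/tpch") // assumed URL
      .option("dbtable", "region")
      .option("user", "user")
      .option("password", "password")
      .load()

    // Caching the DataFrame is what produces the InMemoryRelation in the plan.
    // persist(StorageLevel.MEMORY_AND_DISK) lets cached partitions that do not
    // fit in memory spill to local disk instead of staying memory-only.
    region.persist(StorageLevel.MEMORY_AND_DISK)
    region.createOrReplaceTempView("region")

    spark.sql("SELECT r_regionkey FROM region WHERE r_name = 'AFRICA'").show()

    spark.stop()
  }
}
```

Note that the InMemoryRelation in the plan above already reports `StorageLevel(disk, memory, 1 replicas)`, which suggests a memory-and-disk level may already be in effect; the sketch only illustrates where such a storage level is set when caching.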