We want to extract data from MySQL and run the computation in Spark SQL. The SQL explain plan looks like this:
    ...REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema: struct<N_NATIONKEY:int,N_NAME:string,N_REGIONKEY:int,N_COMMENT:string>
    +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST], false, 0
       +- Exchange(coordinator id: 266374831) hashpartitioning(r_regionkey#203, 200), coordinator[target post-shuffle partition size: 67108864]
          +- *(19) Project [R_REGIONKEY#203]
             +- *(19) Filter ((isnotnull(r_name#204) && (r_name#204 = AFRICA)) && isnotnull(r_regionkey#203))
                +- InMemoryTableScan [R_REGIONKEY#203, r_name#204], [isnotnull(r_name#204), (r_name#204 = AFRICA), isnotnull(r_regionkey#203)]
                      +- InMemoryRelation [R_REGIONKEY#203, R_NAME#204, R_COMMENT#205], true, 10000, StorageLevel(disk, memory, 1 replicas)
                            +- *(1) Scan JDBCRelation(region) [numPartitions=1] [R_REGIONKEY#203,R_NAME#204,R_COMMENT#205] PushedFilters: [], ReadSchema: struct<R_REGIONKEY:int,R_NAME:string,R_COMMENT:string>

As you can see, every JDBCRelation is converted to an InMemoryRelation. Because the JDBC table is very large, the data cannot all fit into memory, and the job fails with an OOM. Is there an option to make Spark SQL spill to disk when memory is not enough?
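For context, here is a minimal sketch of the kind of setup that produces such a plan: a table read over JDBC and then cached, which is what materializes it as an InMemoryRelation. The connection URL, credentials, table name and the explicit `persist()` call are assumptions for illustration, not taken from the actual job; `StorageLevel.MEMORY_AND_DISK` is shown only as one place where a disk-backed storage level would be specified.

```scala
// Sketch only: assumed setup, not the original job's code.
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object JdbcCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-cache-sketch")
      .getOrCreate()

    // Read the MySQL table over JDBC (placeholder URL and credentials).
    val region = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/tpch") // assumed URL
      .option("dbtable", "region")
      .option("user", "user")
      .option("password", "password")
      .load()

    // Caching the DataFrame is what produces the InMemoryRelation in the plan.
    // persist(StorageLevel.MEMORY_AND_DISK) lets cached partitions that do not
    // fit in memory spill to local disk instead of staying memory-only.
    region.persist(StorageLevel.MEMORY_AND_DISK)
    region.createOrReplaceTempView("region")

    spark.sql("SELECT r_regionkey FROM region WHERE r_name = 'AFRICA'").show()

    spark.stop()
  }
}
```

Note that the InMemoryRelation in the plan above already reports `StorageLevel(disk, memory, 1 replicas)`, which suggests a memory-and-disk level may already be in effect; the sketch only illustrates where such a storage level is set when caching.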