You want to use `Dataset.persist(StorageLevel.MEMORY_AND_DISK)`?

On Thu, Apr 12, 2018 at 1:12 PM, Louis Hust <louis.h...@gmail.com> wrote:
> We want to extract data from MySQL and run calculations in Spark SQL.
> The SQL explain plan looks like below:
>
> REGIONKEY#177,N_COMMENT#178] PushedFilters: [], ReadSchema:
> struct<N_NATIONKEY:int,N_NAME:string,N_REGIONKEY:int,N_COMMENT:string>
>    +- *(20) Sort [r_regionkey#203 ASC NULLS FIRST], false, 0
>       +- Exchange(coordinator id: 266374831) hashpartitioning(r_regionkey#203, 200),
>          coordinator[target post-shuffle partition size: 67108864]
>          +- *(19) Project [R_REGIONKEY#203]
>             +- *(19) Filter ((isnotnull(r_name#204) && (r_name#204 = AFRICA)) && isnotnull(r_regionkey#203))
>                +- InMemoryTableScan [R_REGIONKEY#203, r_name#204],
>                   [isnotnull(r_name#204), (r_name#204 = AFRICA), isnotnull(r_regionkey#203)]
>                   +- InMemoryRelation [R_REGIONKEY#203, R_NAME#204, R_COMMENT#205],
>                      true, 10000, StorageLevel(disk, memory, 1 replicas)
>                      +- *(1) Scan JDBCRelation(region) [numPartitions=1]
>                         [R_REGIONKEY#203,R_NAME#204,R_COMMENT#205] PushedFilters: [],
>                         ReadSchema: struct<R_REGIONKEY:int,R_NAME:string,R_COMMENT:string>
>
> As you can see, every JDBCRelation is converted to an InMemoryRelation. Because the
> JDBC table is so big, not all of the data can fit into memory, and an OOM occurs.
> Is there some option to make Spark SQL use disk when memory is not enough?

--
---
Takeshi Yamamuro
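A minimal sketch of the `persist` suggestion above, for reference. This assumes a running SparkSession and a reachable MySQL instance; the JDBC URL, credentials, and database name here are placeholders, not values from the thread (only the `region` table and the AFRICA filter appear in the quoted plan):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistRegionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-persist-sketch")
      .getOrCreate()

    // Read the JDBC table; url/user/password are placeholders.
    val region = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db") // placeholder
      .option("dbtable", "region")
      .option("user", "user")                     // placeholder
      .option("password", "password")             // placeholder
      .load()

    // MEMORY_AND_DISK keeps partitions in memory and spills the ones
    // that do not fit to local disk, rather than failing with an OOM.
    region.persist(StorageLevel.MEMORY_AND_DISK)
    region.createOrReplaceTempView("region")

    spark.sql(
      "SELECT r_regionkey FROM region WHERE r_name = 'AFRICA'"
    ).show()

    spark.stop()
  }
}
```

Note that the quoted plan already shows `StorageLevel(disk, memory, 1 replicas)` on the InMemoryRelation, so whether this resolves the OOM depends on where the memory pressure actually occurs.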