[ https://issues.apache.org/jira/browse/CARBONDATA-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhichao Zhang resolved CARBONDATA-1366.
----------------------------------------
    Resolution: Fixed

> When sort_scope=global_sort, use 'StorageLevel.MEMORY_AND_DISK_SER' instead
> of 'StorageLevel.MEMORY_AND_DISK' for 'convertRDD' persisting to improve
> loading performance
> ---------------------------------------------------------------------------
>
>                 Key: CARBONDATA-1366
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1366
>             Project: CarbonData
>          Issue Type: Bug
>          Components: data-load, spark-integration
>   Affects Versions: 1.2.0
>            Reporter: Zhichao Zhang
>            Assignee: Zhichao Zhang
>            Priority: Minor
>             Fix For: 1.2.0
>
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> My testing environment and configs are as follows:
> Env:
>   6 executors, 9G mem + 6 cores per executor
> Configs:
>   SINGLE_PASS=true
>   SORT_SCOPE=GLOBAL_SORT
>   spark.memory.fraction=0.5
> With 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)' in method
> 'org.apache.carbondata.spark.load.DataLoadProcessBuilderOnSpark.loadDataUsingGlobalSort',
> it takes about 7.2 min to load 144136697 lines (10.9 G parquet files).
> With 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK)', it takes about
> 9.5 min to load 144136697 lines.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
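
For readers unfamiliar with the two storage levels compared in the description, here is a minimal standalone sketch of the persist call. It is not the CarbonData code. The object name, the toy data, and the local master setting are assumptions made for illustration, and 'convertRDD' here is only a stand-in for the RDD of converted rows in 'loadDataUsingGlobalSort'. MEMORY_AND_DISK_SER keeps each cached partition as serialized bytes rather than as deserialized Java objects, which usually lowers memory use at the cost of extra CPU when the data is read back.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative sketch only: object name, data, and local master are assumptions.
object PersistLevelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-level-sketch")
      .master("local[2]")                      // assumption: local run for the sketch
      .config("spark.memory.fraction", "0.5")  // same value as in the issue's configs
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for 'convertRDD': keyed rows that will be sorted globally.
    val convertRDD = sc.parallelize(1 to 1000000, 6).map(i => (i % 1000, s"row-$i"))

    // Cache serialized: partitions that do not fit in memory spill to disk.
    convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // The cached RDD is read more than once: a count, then a range-partitioned sort.
    val rowCount = convertRDD.count()
    val smallest = convertRDD.sortByKey().first()

    println(s"rows=$rowCount, first after sort=$smallest, level=${convertRDD.getStorageLevel}")

    convertRDD.unpersist()
    spark.stop()
  }
}
```

Switching the argument to StorageLevel.MEMORY_AND_DISK gives the deserialized variant for comparison. Timings from a toy run like this say nothing about the 7.2 min and 9.5 min figures reported above, which depend on the data volume and cluster described in the issue.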