[jira] [Commented] (CARBONDATA-1281) Disk hotspot found during data loading
[ https://issues.apache.org/jira/browse/CARBONDATA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104607#comment-16104607 ] xuchuanyin commented on CARBONDATA-1281: Here is the configuration used in my test, for others to reference.

# ENV
3 HUAWEI RH2288 nodes, each with 24 cores (E5-2667 @ 2.90GHz), 256GB memory, and 11 SAS disks

# USE CASE
88 million records in CSV format
340+ columns per record
no dictionary columns
TABLE_BLOCKSIZE 64
INVERTED_INDEX on about 9 columns

# CONF
parameter                            | value | original value
carbon.number.of.cores               | 20    |
carbon.number.of.cores.while.loading | 14    |
sort.inmemory.size.inmb              | 2048  | 1024
offheap.sort.chunk.size.inmb         | 128   | 64
carbon.sort.intermediate.files.limit | 20    | 20
carbon.sort.file.buffer.size         | 50    | 20
carbon.use.local.dir                 | true  | false
carbon.use.multiple.dir              | true  | false

# RESULT
Using `LOAD DATA INPATH `, the load took about 6 minutes. Observing nmon, the I/O usage of the disks was quite even.

> Disk hotspot found during data loading
> --
>
> Key: CARBONDATA-1281
> URL: https://issues.apache.org/jira/browse/CARBONDATA-1281
> Project: CarbonData
> Issue Type: Improvement
> Components: core, data-load
> Affects Versions: 1.1.0
> Reporter: xuchuanyin
> Assignee: xuchuanyin
> Fix For: 1.2.0
>
> Time Spent: 17.5h
> Remaining Estimate: 0h
>
> # Scenario
> We recently performed a massive data load. The input data is about 71GB in CSV format and has about 88 million records. We do not use any dictionary encoding. Our test environment has three nodes, each with 11 disks configured as YARN executor directories. We submit the load command through JDBCServer. The JDBCServer instance has three executors in total, one on each node. The load takes about 10 minutes (varying by +-3 minutes between runs).
> Observing nmon during the load, we found:
> 1. many CPU waits in the first half of the load;
> 2. only a single disk sees heavy writes and almost reaches its bottleneck (avg. 80MB/s, max. 150MB/s on a SAS disk);
> 3. the other disks are quite idle.
> # Analysis
> During data loading, CarbonData reads and sorts data locally (the default sort scope) and writes the temp files to local disk. In my case there is only one executor per node, so CarbonData writes all the temp files to one disk (the container directory, i.e. a YARN local directory), resulting in a single-disk hotspot.
> # Modification
> We should support multiple directories for writing temp files, to avoid the disk hotspot.
> Ps: I have implemented this in my environment and the result is quite promising: the load takes about 6 minutes (10 minutes before the improvement).

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
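The tuning above would be applied in carbon.properties on each node. A minimal sketch of the changed entries, using the property names exactly as listed in the comment (the defaults shown are the "original value" column from the comment, not independently verified):

```properties
# Sketch of the carbon.properties tuning from the test above.
carbon.number.of.cores=20
carbon.number.of.cores.while.loading=14
# original value: 1024
sort.inmemory.size.inmb=2048
# original value: 64
offheap.sort.chunk.size.inmb=128
carbon.sort.intermediate.files.limit=20
# original value: 20
carbon.sort.file.buffer.size=50
# write sort temp files to local (YARN) dirs rather than a single fixed dir
carbon.use.local.dir=true
# spread sort temp files across multiple local dirs (the fix from this issue)
carbon.use.multiple.dir=true
```

Note that Java-style properties files do not support trailing inline comments, so the annotations are kept on their own lines.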
[jira] [Commented] (CARBONDATA-1281) Disk hotspot found during data loading
[ https://issues.apache.org/jira/browse/CARBONDATA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089614#comment-16089614 ] xuchuanyin commented on CARBONDATA-1281: [~Bjangir] I've checked the code and cannot find the property `carbon.tempstore.locations`. Do you mean the property `carbon.tempstore.location` in the CarbonData source code? That property does not resolve the hotspot problem during loading.
[jira] [Commented] (CARBONDATA-1281) Disk hotspot found during data loading
[ https://issues.apache.org/jira/browse/CARBONDATA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089510#comment-16089510 ] Babulal commented on CARBONDATA-1281: - Hi, can you please try the option carbon.tempstore.locations in carbon.properties? It accepts multiple disks for the local store/sort. Thanks, Babu