[jira] [Commented] (CARBONDATA-1281) Disk hotspot found during data loading

2017-07-28 Thread xuchuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104607#comment-16104607
 ] 

xuchuanyin commented on CARBONDATA-1281:


Here is the configuration used in my test, for others' reference.

# ENV

3 HUAWEI RH2288 nodes, each with 24 cores (E5-2667 @ 2.90GHz), 256 GB memory,
and 11 SAS disks

# USE CASE

88 billion records in CSV format

340+ columns per record

No dictionary columns

TABLE_BLOCKSIZE 64

INVERTED_INDEX on about 9 columns

# CONF

parameter                              value   origin-value
carbon.number.of.cores                 20
carbon.number.of.cores.while.loading   14
sort.inmemory.size.inmb                2048    1024
offheap.sort.chunk.size.inmb           128     64
carbon.sort.intermediate.files.limit   20      20
carbon.sort.file.buffer.size           50      20
carbon.use.local.dir                   true    false
carbon.use.multiple.dir                true    false
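The table above maps to entries in carbon.properties. A minimal sketch, with values copied from the table (exact placement of each key is deployment-specific):

```properties
# Sketch of carbon.properties for this test (values from the table above).
carbon.number.of.cores=20
carbon.number.of.cores.while.loading=14
sort.inmemory.size.inmb=2048
offheap.sort.chunk.size.inmb=128
carbon.sort.intermediate.files.limit=20
carbon.sort.file.buffer.size=50
# The two keys below spread sort temp files across multiple local
# directories, which is the fix discussed in this issue.
carbon.use.local.dir=true
carbon.use.multiple.dir=true
```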

# RESULT

Using `LOAD DATA INPATH `, the load took about 6 minutes.
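A hedged sketch of the load command, since the actual statement was not given (the table name and HDFS path here are hypothetical; the OPTIONS shown are standard CarbonData CSV load options):

```sql
-- Hypothetical example; the real table name and input path were not posted.
LOAD DATA INPATH 'hdfs://ns1/data/input'
INTO TABLE big_table
OPTIONS('DELIMITER'=',', 'HEADER'='false');
```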

Observing NMON, I/O is spread quite evenly across the disks.

> Disk hotspot found during data loading
> --
>
> Key: CARBONDATA-1281
> URL: https://issues.apache.org/jira/browse/CARBONDATA-1281
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core, data-load
>Affects Versions: 1.1.0
>Reporter: xuchuanyin
>Assignee: xuchuanyin
> Fix For: 1.2.0
>
>  Time Spent: 17.5h
>  Remaining Estimate: 0h
>
> # Scenario
> Currently we have done a massive data loading. The input data is about 71 GB 
> in CSV format and has about 88 million records. When using CarbonData, we do 
> not use any dictionary encoding. Our testing environment has three nodes, and 
> each of them has 11 disks as the YARN executor directory. We submit the load 
> command through JDBCServer. The JDBCServer instance has three executors in 
> total, one on each node. The loading takes about 10 minutes 
> (±3 min from run to run).
> We observed the nmon information during the loading and found:
> 1. lots of CPU waits in the first half of loading;
> 2. only one single disk has many writes and almost reaches its bottleneck 
> (avg. 80 MB/s, max. 150 MB/s on a SAS disk);
> 3. the other disks are quite idle.
> # Analysis
> During data loading, CarbonData reads and sorts data locally (default scope) 
> and writes the temp files to local disk. In my case, there is only one 
> executor per node, so CarbonData writes all the temp files to one 
> disk (the container directory or YARN local directory), resulting in a single 
> disk hotspot.
> # Modification
> We should support multiple directories for writing temp files to avoid the 
> disk hotspot.
> PS: I have improved this in my environment and the result is quite 
> promising: the loading takes about 6 minutes (10 minutes before the improvement).
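The modification described above can be sketched as a simple round-robin over the configured local directories, so that the i-th temp file lands on disk (i mod N). This is an illustrative sketch only; the class and method names below are hypothetical, not CarbonData's actual API:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the fix: instead of writing all intermediate
// sort files to a single YARN local directory, rotate writes across
// every configured directory so I/O spreads over all disks.
public class TempDirPicker {
    private final List<String> dirs;
    private final AtomicInteger next = new AtomicInteger(0);

    public TempDirPicker(List<String> configuredLocalDirs) {
        this.dirs = configuredLocalDirs;
    }

    // Round-robin selection: successive calls cycle through the dirs.
    public String pick() {
        int i = Math.floorMod(next.getAndIncrement(), dirs.size());
        return dirs.get(i);
    }

    public static void main(String[] args) {
        TempDirPicker p = new TempDirPicker(
                List.of("/data1/tmp", "/data2/tmp", "/data3/tmp"));
        System.out.println(p.pick()); // /data1/tmp
        System.out.println(p.pick()); // /data2/tmp
        System.out.println(p.pick()); // /data3/tmp
        System.out.println(p.pick()); // /data1/tmp
    }
}
```

With one executor per node, such a picker spreads temp-file writes across all 11 disks instead of hammering one.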



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (CARBONDATA-1281) Disk hotspot found during data loading

2017-07-17 Thread xuchuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089614#comment-16089614
 ] 

xuchuanyin commented on CARBONDATA-1281:


[~Bjangir] I've checked the code and cannot find the property 
`carbon.tempstore.locations`. 

Do you mean the property `carbon.tempstore.location` in the CarbonData source code? 
That property does not resolve the hotspot problem during loading.








[jira] [Commented] (CARBONDATA-1281) Disk hotspot found during data loading

2017-07-17 Thread Babulal (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089510#comment-16089510
 ] 

Babulal commented on CARBONDATA-1281:
-

Hi,
Can you please try the option carbon.tempstore.locations
in carbon.properties? It accepts multiple disks for the local store/sort.

Thanks
Babu




