[jira] [Commented] (CARBONDATA-4106) Compaction is not working properly

2021-01-18 Thread Ajantha Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267160#comment-17267160
 ] 

Ajantha Bhat commented on CARBONDATA-4106:
--

Closing the defect, as it is not an issue and the current compaction cannot be
useful in this corner case.

> Compaction is not working properly
> --
>
> Key: CARBONDATA-4106
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4106
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.0.1
> Environment: Apache spark 2.4.5, carbonData 2.0.1
>Reporter: suyash yadav
>Priority: Major
> Fix For: 2.0.1
>
> Attachments: describe_fact_probe_1
>
>
> Hi Team,
> We are using Apache CarbonData 2.0.1 for one of our POCs, and we observed that
> we are not getting the expected benefit from using compaction (both major and
> minor).
> Please find below the details of the issue we are facing:
> *Name of the table used*: fact_365_1_probe_1
> +*Number of rows:*+
> select count(*) from fact_365_1_probe_1
> +--------+
> |count(1)|
> +--------+
> |76963753|
> +--------+
> *Sample data from the table:*
> +-------------------+--------------------------+------------------------------------+------------------+-------------+-------------------+
> |ts                 |metric                    |tags_id                             |value             |epoch        |ts2                |
> +-------------------+--------------------------+------------------------------------+------------------+-------------+-------------------+
> |2021-01-07 21:05:00|Probe.Duplicate.Poll.Count|c8dead9b-87ae-46ae-8703-bc2b7bfba5d4|39.611356797970274|1610033757768|2021-01-07 00:00:00|
> |2021-01-07 23:50:00|Probe.Duplicate.Poll.Count|62351ef2-f2ce-49d1-a2fd-a0d1e5f6a1b9|72.70658115131307 |1610043742516|2021-01-07 00:00:00|
> +-------------------+--------------------------+------------------------------------+------------------+-------------+-------------------+
>  
> [^describe_fact_probe_1]
>  
> I have attached the describe output, which shows the other details of
> the table.
> The size of the table is 3.24 GB, and even after running minor or major
> compaction the size remains almost the same.
> So we are not getting any benefit from running compaction. Could you please
> review the shared details and help us identify whether we are missing
> something here, or whether there is a bug?
> Also, we need answers to the following questions about CarbonData storage:
> 1. For decimal values, how does storage behave if one row has 20 digits after
> the decimal point and a second row has only 5 digits after the decimal point?
> How, and by how much, would the storage taken differ?
> 2. My second question is: if I have two tables, one with the same value
> repeated for 100 rows and the other with different values for 100 rows, how
> will CarbonData behave as far as storage is concerned? Which table will take
> less storage, or will both take the same storage?
> 3. Also, for the string datatype, could you please describe how its storage
> is defined?
>  
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CARBONDATA-4106) Compaction is not working properly

2021-01-14 Thread Ajantha Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265783#comment-17265783
 ] 

Ajantha Bhat commented on CARBONDATA-4106:
--

In your case, each load is mapped to one partition folder (different from the
previous loads), and compaction on a partitioned table can only merge data
within a partition. So, for you, it will not combine data across partitions,
and the table looks the same before and after compaction. Compaction can be
useful only if a load has multiple partition values and the next loads contain
the partition values of previous loads.
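
For reference, a minimal sketch of how this can be checked, assuming the
standard CarbonData compaction and segment statements (the table name is the
one from this issue):

-- Trigger compaction; on a partitioned table this merges files only within
-- the same partition folder.
ALTER TABLE fact_365_1_probe_1 COMPACT 'MINOR';
ALTER TABLE fact_365_1_probe_1 COMPACT 'MAJOR';

-- Inspect the segments and their reported data size before and after; if every
-- load went into a new partition, the layout and total size will look unchanged.
SHOW SEGMENTS FOR TABLE fact_365_1_probe_1;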




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CARBONDATA-4106) Compaction is not working properly

2021-01-14 Thread Ajantha Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265778#comment-17265778
 ] 

Ajantha Bhat commented on CARBONDATA-4106:
--

0. Compaction cannot guarantee a reduction in table size; it can only merge
small files into bigger files (with this, IO time can be reduced during queries).
Reducing the total table size depends on many factors, including data
cardinality.
Also, your table is already partitioned; compaction will try to merge the
segments within the same partition, so it will not make much difference for a
few segments.

1. Within a column, we group 32000 rows into pages, so the final storage data
type depends on all the values in the column page. We try to apply adaptive and
delta encoding to these 32000 values so that they can be stored in less space
than the actual data type would take.

2. The table with the same value repeated for 100 rows will be smaller in
storage, as we apply RLE encoding and compression (see the sketch after point 3).

3. By default, strings undergo dictionary encoding, and we store the encoded
INT values. If the cardinality in a blocklet exceeds the dictionary threshold,
the dictionary cannot be used; in that case we fall back to storing the string
itself in a byte-array format.
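
To illustrate point 2, here is a minimal sketch assuming Spark SQL with
CarbonData; the table names repeated_vals and distinct_vals are hypothetical
and used only for this comparison:

-- Hypothetical tables for comparing storage of repeated vs. distinct values.
CREATE TABLE repeated_vals (id BIGINT, name STRING) STORED AS carbondata;
CREATE TABLE distinct_vals (id BIGINT, name STRING) STORED AS carbondata;

-- 100 rows with the same string value: RLE, dictionary encoding and
-- compression shrink this well.
INSERT INTO repeated_vals SELECT id, 'same_value' FROM range(100);

-- 100 rows with distinct string values: far less repetition to exploit.
INSERT INTO distinct_vals SELECT id, concat('value_', CAST(id AS STRING)) FROM range(100);

-- Compare the data size reported for each table's segments.
SHOW SEGMENTS FOR TABLE repeated_vals;
SHOW SEGMENTS FOR TABLE distinct_vals;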




--
This message was sent by Atlassian Jira
(v8.3.4#803005)