[jira] [Created] (CARBONDATA-4207) MV data getting lost

2021-06-11 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4207:


 Summary: MV data getting lost
 Key: CARBONDATA-4207
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4207
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.1
Reporter: suyash yadav
 Fix For: 2.0.1


Hi Team,

We have observed one more issue. We had created a table and a timeseries MV 
on it. We had loaded almost 15 hours of data into it, and when we were loading 
the 16th hour of data the load failed for some reason, but it caused the MV to 
go empty: our MV now has zero rows. Could you please let us know if this is a 
bug or if this is how it is supposed to work? Our MV does not use any avg 
function, so loading into the MV should have been incremental, and in that case 
the MV should not have been impacted when the subsequent hourly load to the 
main table failed. Please have a look into this issue and let us know what 
information you need.
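
For reference, a minimal sketch of how we check the MV state after the failed 
load (assuming SHOW MATERIALIZED VIEWS exists in 2.0.1 and that the MV, named 
Interface_Level_Agg_10min_MV_04062021 per the audit log below, is queryable by 
name):

scala> spark.sql("show materialized views").show(false)
scala> spark.sql("select count(*) from Interface_Level_Agg_10min_MV_04062021").show()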

 
scala> spark.sql("insert into Flow_TS_2day_stats_04062021 select 
start_time,end_time,source_ip_address,destintion_ip_address,appname,protocol_id,source_tos,src_as,dst_as,source_mask,destination_mask,dst_tos,input_pkt,output_pkt,input_byt,output_byt,source_port,destination_port,in_interface,out_interface
 from Flow_TS_1day_stats_24052021  where start_time>='2021-03-04 07:00:00' and 
start_time< '2021-03-04 09:00:00'").show()
 
scala> spark.sql("insert into Flow_TS_2day_stats_04062021 select 
start_time,end_time,source_ip_address,destintion_ip_address,appname,protocol_id,source_tos,src_as,dst_as,source_mask,destination_mask,dst_tos,input_pkt,output_pkt,input_byt,output_byt,source_port,destination_port,in_interface,out_interface
 from Flow_TS_1day_stats_24052021  where start_time>='2021-03-04 15:00:00' and 
start_time< '2021-03-04 16:00:00'").show()
21/06/06 14:25:33 AUDIT audit: {"time":"June 6, 2021 2:25:33 PM IST","username":"root","opName":"INSERT INTO","opId":"4069819623887063","opStatus":"START"}
21/06/06 14:44:14 AUDIT audit: {"time":"June 6, 2021 2:44:14 PM IST","username":"root","opName":"INSERT INTO","opId":"4070940294400824","opStatus":"START"}
21/06/06 16:06:05 AUDIT audit: {"time":"June 6, 2021 4:06:05 PM IST","username":"root","opName":"INSERT INTO","opId":"4070940294400824","opStatus":"SUCCESS","opTime":"4911240 ms","table":"default.Interface_Level_Agg_10min_MV_04062021","extraInfo":{"SegmentId":"6","DataSize":"4.52GB","IndexSize":"108.27KB"}}
21/06/06 16:06:09 AUDIT audit: {"time":"June 6, 2021 4:06:09 PM IST","username":"root","opName":"INSERT INTO","opId":"4069819623887063","opStatus":"SUCCESS","opTime":"6036073 ms","table":"default.flow_ts_2day_stats_04062021","extraInfo":{"SegmentId":"6","DataSize":"12.37GB","IndexSize":"262.43KB"}}

[^Stack_Trace]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4205) MINOR compaction getting triggered by itself while inserting data to a table

2021-06-09 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4205:


 Summary: MINOR compaction getting triggered by itself while inserting data to a table
 Key: CARBONDATA-4205
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4205
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.1
 Environment: apache carbondata 2.0.1, hadoop 2.7.2, spark 2.4.5
Reporter: suyash yadav


Hi Team, we have created a table and also created a timeseries MV on it. Later 
we tried to insert some data from another table into this newly created table, 
but we observed that while inserting, a MINOR compaction on the MV gets 
triggered by itself. It doesn't happen for every insert, but whenever we insert 
the 6th to 7th hour of data, and then the 14th to 15th hour of data, the MINOR 
compaction gets triggered. Could you tell us why the MINOR compaction is 
getting triggered by itself?
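
One possible explanation worth checking: CarbonData's threshold-based auto 
compaction (the default COMPACTION_LEVEL_THRESHOLD of 4,3 merges segments once 
enough loads accumulate, which would roughly line up with compactions appearing 
around the 7th and 14th-15th loads). A hedged sketch of steering this from the 
spark shell follows; carbon.enable.auto.load.merge and 
carbon.compaction.level.threshold are documented CarbonData properties, but 
whether they govern the MV's own segments in 2.0.1 is an assumption to verify:

import org.apache.carbondata.core.util.CarbonProperties

val props = CarbonProperties.getInstance()
// stop segments from being merged automatically after each load
props.addProperty("carbon.enable.auto.load.merge", "false")
// or keep auto merge but raise the threshold (default is "4,3")
props.addProperty("carbon.compaction.level.threshold", "6,5")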



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4182) Performance issue when multiple loads are happening to the same table for the same interval with 2 MVs

2021-05-11 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4182:


 Summary: Performance issue when multiple loads are happening to the same table for the same interval with 2 MVs
 Key: CARBONDATA-4182
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4182
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.1.0
 Environment: Apache carbon 2.1.0
Reporter: suyash yadav


Hi Team,

We need your help to resolve one of the performance issues that we are facing. 
Please see the below details about the table structure and schema implemented 
at our end:

1. We have 25 tables, and 2 MVs created on these tables for hour and day 
granularity.
 2. One table can have more than one CSV for the same interval, and whenever 
multiple CSVs are there as input to a table, sequential loading takes place.
 3. For different tables, data loading is parallel, but whenever 2 CSVs are 
there for the same table, that table is loaded sequentially, one CSV and then 
the other.
 4. We have observed that for 1 minute of CSV data to be loaded into the table, 
it is taking approximately 45 to 60 minutes, which is creating a huge backlog.

We need your help to resolve this performance issue, as there is no use in the 
data if 15 minutes of data takes more than 15 minutes to load. Users are not 
going to wait and will get fed up with this slowness.

Kindly advice.

Regards
Suyash Yadav



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4177) Performance issue with query

2021-05-06 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4177:


 Summary: Performance issue with query
 Key: CARBONDATA-4177
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4177
 Project: CarbonData
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.1
Reporter: suyash yadav
 Fix For: 2.0.1


Hi Team,
We are working on a POC using carbondata 2.0.1 and have come across a 
performance issue. Below are the details:
1. Table creation query:

==
spark.sql("create table Flow_TS_1day_stats_16042021(start_time 
timestamp,end_time timestamp,source_ip_address string,destintion_ip_address 
string,appname string,protocol_name string,source_tos smallint,in_interface 
smallint,out_interface smallint,src_as bigint,dst_as bigint,source_mask 
smallint,destination_mask smallint, dst_tos smallint,input_pkt bigint,input_byt 
bigint,output_pkt bigint,output_byt bigint,source_port int,destination_port 
int) stored as carbondata TBLPROPERTIES 
('local_dictionary_enable'='false')").show()

Two MVs are there on this table. Below are the queries for those MVs:

1. Network MV
===


spark.sql("create materialized view 
Network_Level_Agg_10min_MV_with_ip_15042021_again as select 
timeseries(end_time,'ten_minute') as end_time,source_ip_address, 
destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,sum(input_pkt)
 as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as 
output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_15042021_again 
group by 
timeseries(end_time,'ten_minute'),source_ip_address,destintion_ip_address, 
appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as 
order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)

2. Interface MV
===
spark.sql("create materialized view Interface_Level_Agg_10min_MV_16042021 as 
select timeseries(end_time,'ten_minute') as end_time, 
source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface,sum(input_pkt)
 as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as 
output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_16042021 group 
by timeseries(end_time,'ten_minute'), 
source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface
 order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)


We are firing the below query for fetching data, and it is taking almost 10 
seconds:

Select appname,input_byt from Flow_TS_1day_stats_16042021 where end_time >= 
'2021-03-02 00:00:00' and end_time < '2021-03-03 00:00:00' group by 
appname,input_byt order by input_byt desc LIMIT 10

 


The above query fetches only 10 records, but it takes almost 10 seconds to 
complete.
Could you please review the above schemas and help us understand how we can 
improve the query execution time? We are expecting the response to be in 
subseconds.

Table Name: RAW Table (1 Day - 300K/Sec), #Records: 2592000
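
One experiment that may help, sketched under the assumption that altering 
SORT_COLUMNS via SET TBLPROPERTIES is supported for this table in 2.0.1: make 
end_time a sort column so the date-range filter can be served by min/max 
blocklet pruning (newly loaded segments pick up the new sort order; existing 
segments only after a re-sorting compaction):

scala> spark.sql("alter table Flow_TS_1day_stats_16042021 set tblproperties('SORT_COLUMNS'='end_time,appname')").show()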


Regards
Suyash Yadav



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4165) Carbondata summing up two values of same timestamp.

2021-04-07 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4165:


 Summary: Carbondata summing up two values of same timestamp.
 Key: CARBONDATA-4165
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4165
 Project: CarbonData
  Issue Type: Wish
  Components: core
Affects Versions: 2.0.1
 Environment: apache carbondata 2.0.1, apache spark 2.4.5 hadoop 2.7.2
Reporter: suyash yadav
 Fix For: 2.0.1


Hi Team,

 

We have seen a behaviour while using Carbondata 2.0.1 that if we get 2 values 
for the same timestamp, it sums both values and stores the result as one value. 
Instead, we need it to discard the previous value and use the latest one.

 

Please let us know if there is any functionality already available in 
carbondata to handle duplicate values by itself, or if there is any plan to 
implement such functionality.
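
A hedged sketch of deduplicating in Spark before the insert, keeping only the 
latest row per timestamp; the staging/target table names and the load_time 
tie-breaker column are illustrative, not from our schema:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// keep only the most recent row for each timestamp before loading into Carbon
val deduped = spark.table("staging_table")
  .withColumn("rn", row_number().over(
    Window.partitionBy("ts").orderBy(col("load_time").desc)))
  .filter(col("rn") === 1)
  .drop("rn")
deduped.createOrReplaceTempView("deduped_staging")
spark.sql("insert into target_carbon_table select * from deduped_staging")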



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4150) Information about indexed datamap

2021-03-16 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4150:


 Summary: Information about indexed datamap
 Key: CARBONDATA-4150
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4150
 Project: CarbonData
  Issue Type: Wish
  Components: core
Affects Versions: 2.0.1
 Environment: apache 2.0.1 spark 2.4.5 hadoop 2.7.2
Reporter: suyash yadav
 Fix For: 2.0.1


Hi Team,

 

We would like to know detailed information about the indexed datamap and 
possible use cases for this datamap.

So please help us in getting answers to the below queries:

1) What is an indexed datamap, and what are the related use cases?
2) How is it to be used (see the sketch after this list)?
3) Are there any reference documents?
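
For context, a minimal sketch of creating a bloom filter index in the 2.0.x 
CREATE INDEX syntax; the table/column names are illustrative, and the exact 
syntax and properties should be confirmed against the 2.0.1 documentation:

scala> spark.sql("CREATE INDEX src_ip_bloom ON TABLE flow_table(src_ip) AS 'bloomfilter' PROPERTIES('BLOOM_SIZE'='640000','BLOOM_FPP'='0.00001')").show()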
 
 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4132) Number of records not matching in MVs

2021-02-18 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4132:


 Summary: Number of records not matching in MVs
 Key: CARBONDATA-4132
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4132
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.1
 Environment: Apache carbondata 2.0.1
Reporter: suyash yadav
 Fix For: 2.0.1


Hi Team, 

We are working on a POC where we need to insert 300K records/second into a 
table on which we have already created timeseries MVs with minute, hour, and 
day granularity.

 

As per our understanding, the minute-based MV should contain 300K records until 
the insertion of the next minute's data. Also, the hour- and day-based MVs 
should contain 300K records until the arrival of the next hour's and next day's 
data respectively.

 

But the count of records in the MV is not coming out as per our expectation; it 
is always more than we expect.

The strange thing is, when we drop the MV and re-create it after inserting the 
data into the table, the count of records comes out correct. So it is clear 
that there is no problem with the MV definition or the data.

 

Kindly help us in resolving this issue on priority. Please find more details 
below:

Table definition:

===

spark.sql("create table Flow_Raw_TS(export_ms bigint,exporter_ip 
string,pkt_seq_num bigint,flow_seq_num int,src_ip string,dst_ip 
string,protocol_id smallint,src_tos smallint,dst_tos smallint,raw_src_tos 
smallint,raw_dst_tos smallint,src_mask smallint,dst_mask smallint,tcp_bits 
int,src_port int,in_if_id bigint,in_if_entity_id bigint,in_if_enabled 
boolean,dst_port int,out_if_id bigint,out_if_entity_id bigint,out_if_enabled 
boolean,direction smallint,in_octets bigint,out_octets bigint,in_packets 
bigint,out_packets bigint,next_hop_ip string,bgp_src_as_num 
bigint,bgp_dst_as_num bigint,bgp_next_hop_ip string,end_ms timestamp,start_ms 
timestamp,app_id string,app_name string,src_ip_group string,dst_ip_group 
string,policy_qos_classification_hierarchy string,policy_qos_queue_id 
bigint,worker_id int,day bigint ) stored as carbondata TBLPROPERTIES 
('local_dictionary_enable'='false')



MV definition:

 

==

Minute based

spark.sql("create materialized view Flow_Raw_TS_agg_001_min as select 
timeseries(end_ms,'minute') as 
end_ms,src_ip,dst_ip,app_name,in_if_id,src_tos,src_ip_group,dst_ip_group,protocol_id,bgp_src_as_num,
 bgp_dst_as_num,policy_qos_classification_hierarchy, 
policy_qos_queue_id,sum(in_octets) as octects, sum(in_packets) as packets, 
sum(out_packets) as out_packets, sum(out_octets) as out_octects FROM 
Flow_Raw_TS group by 
timeseries(end_ms,'minute'),src_ip,dst_ip,app_name,in_if_id,src_tos,src_ip_group,
 
dst_ip_group,protocol_id,bgp_src_as_num,bgp_dst_as_num,policy_qos_classification_hierarchy,
 policy_qos_queue_id").show()

Hour based

val startTime = System.nanoTime
spark.sql("create materialized view Flow_Raw_TS_agg_001_hour as select 
timeseries(end_ms,'hour') as end_ms,app_name,sum(in_octets) as octects, 
sum(in_packets) as packets, sum(out_packets) as out_packets, sum(out_octets) as 
out_octects, in_if_id,src_tos,src_ip_group, dst_ip_group,protocol_id,src_ip, 
dst_ip,bgp_src_as_num, bgp_dst_as_num,policy_qos_classification_hierarchy, 
policy_qos_queue_id FROM Flow_Raw_TS group by 
timeseries(end_ms,'hour'),in_if_id,app_name,src_tos,src_ip_group,dst_ip_group,protocol_id,src_ip,
 dst_ip,bgp_src_as_num,bgp_dst_as_num,policy_qos_classification_hierarchy, 
policy_qos_queue_id").show()
val endTime = System.nanoTime
val elapsedSeconds = (endTime - startTime) / 1e9d
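
For reference, a hedged way to cross-check the minute MV against an equivalent 
aggregation computed directly on the base table; date_trunc is used here as a 
plain-Spark stand-in for timeseries(end_ms,'minute'), and the MV is assumed to 
be queryable by name:

val mvRows = spark.sql("select count(*) from flow_raw_ts_agg_001_min").first().getLong(0)
val expectedGroups = spark.sql("""
  select count(*) from (
    select date_trunc('minute', end_ms) as m, src_ip, dst_ip, app_name, in_if_id, src_tos,
           src_ip_group, dst_ip_group, protocol_id, bgp_src_as_num, bgp_dst_as_num,
           policy_qos_classification_hierarchy, policy_qos_queue_id
    from Flow_Raw_TS
    group by date_trunc('minute', end_ms), src_ip, dst_ip, app_name, in_if_id, src_tos,
             src_ip_group, dst_ip_group, protocol_id, bgp_src_as_num, bgp_dst_as_num,
             policy_qos_classification_hierarchy, policy_qos_queue_id) g""").first().getLong(0)
println(s"MV rows: $mvRows, expected distinct groups: $expectedGroups")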



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4120) select queries against carbondata tables getting stuck when fired through Apache Hive

2021-02-04 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4120:


 Summary: select queries against carbondata tables getting stuck 
when fired through Apache Hive
 Key: CARBONDATA-4120
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4120
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.1
 Environment: Apache Hive 3.1.2, Apache carbondata 2.0.1
Reporter: suyash yadav


Hi Team, we need one more help. We have created a table which has around 172 
million records, and we have connected to this table through Apache Hive. But 
whenever we run select count(*) on this table through Hive, the query gets 
stuck. We can run the query successfully through the spark shell, but through 
Hive it always gets stuck. One more observation: whenever we run any query 
which contains a join, the query gets stuck. Also, with a where clause the 
query executes against the smaller table, but when we run it against the bigger 
table it also gets stuck. Could you guys guide us on how we can run all these 
queries successfully without any issue?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4116) Concurrent Data Loading Issue

2021-02-02 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4116:


 Summary: Concurrent Data Loading Issue
 Key: CARBONDATA-4116
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4116
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 1.6.1, 2.0.1
 Environment: Apache carbondata 2.0.1
Reporter: suyash yadav


Even though Carbon claims that it can support concurrent data loading together 
with table compaction, in fact it cannot. We faced a data inconsistency issue 
in Carbon 1.6.1 when loading data concurrently into a table together with 
compaction. That is why we implemented table locking for the load data, 
compact, and clean files commands. All this is due to the manipulation of the 
table's metadata file, i.e. .tablestatus.

 

We are facing an issue where concurrent data loading together with compaction 
leads to data inconsistency. What is the way to fix this, as we want concurrent 
loading together with compaction?
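
A hedged sketch of the locking configuration we are experimenting with; 
carbon.lock.type and the concurrent lock retry settings are documented 
CarbonData properties, though whether they fully close this race in 2.0.1 is 
exactly our question:

import org.apache.carbondata.core.util.CarbonProperties

val props = CarbonProperties.getInstance()
// use ZooKeeper-based locks so load/compact/clean coordinate across drivers
// (a reachable ZooKeeper quorum must also be configured for the cluster)
props.addProperty("carbon.lock.type", "ZOOKEEPERLOCK")
// retry acquiring the table status lock instead of failing the concurrent load
props.addProperty("carbon.concurrent.lock.retries", "100")
props.addProperty("carbon.concurrent.lock.retry.timeout.sec", "1")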



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4108) How to connect carbondata with Hive

2021-01-14 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4108:


 Summary: How to connect carbondata with Hive
 Key: CARBONDATA-4108
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4108
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.1
 Environment: apache carbondata 2.0.1, spark 2.4.5, Hive 2.0
Reporter: suyash yadav
 Fix For: 2.0.1


Hi Team,

We would like to know how to connect Hive with carbondata. We are doing a POC 
in which we need to access carbondata tables through Hive, but we need this 
configuration with a username and password. So our Hive connection should have 
some username and password configuration to connect to the carbondata tables.

 

Could you guys please review the above requirement and suggest steps to achieve 
the same.
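
For illustration, a hedged sketch of connecting to HiveServer2 with credentials 
over JDBC from Scala; the host, port, user, and password are placeholders, and 
the hive-jdbc driver is assumed to be on the classpath:

import java.sql.DriverManager

// authenticate to HiveServer2 with a username and password, then query a Carbon table
val conn = DriverManager.getConnection(
  "jdbc:hive2://hive-host:10000/default", "hive_user", "hive_password")
val rs = conn.createStatement().executeQuery("select count(*) from some_carbon_table")
while (rs.next()) println(rs.getLong(1))
conn.close()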



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4106) Compaction is not working properly

2021-01-12 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4106:


 Summary: Compaction is not working properly
 Key: CARBONDATA-4106
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4106
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.1
 Environment: Apache spark 2.4.5, carbonData 2.0.1
Reporter: suyash yadav
 Fix For: 2.0.1
 Attachments: describe_fact_probe_1

Hi Team,

We are using apache carbondata 2.0.1 for one of our POCs and we observed that 
we are not getting the expected benefit from compaction (both major and minor).

Please find below details for the issue we are facing:

Name of the table used: fact_365_1_probe_1

Number of rows:

select count(*) from fact_365_1_probe_1
+--------+
|count(1)|
+--------+
|76963753|
+--------+

Sample data from the table:
==

+-------------------+--------------------------+------------------------------------+------------------+-------------+-------------------+
|ts                 |metric                    |tags_id                             |value             |epoch        |ts2                |
+-------------------+--------------------------+------------------------------------+------------------+-------------+-------------------+
|2021-01-07 21:05:00|Probe.Duplicate.Poll.Count|c8dead9b-87ae-46ae-8703-bc2b7bfba5d4|39.611356797970274|1610033757768|2021-01-07 00:00:00|
|2021-01-07 23:50:00|Probe.Duplicate.Poll.Count|62351ef2-f2ce-49d1-a2fd-a0d1e5f6a1b9|72.70658115131307 |1610043742516|2021-01-07 00:00:00|
+-------------------+--------------------------+------------------------------------+------------------+-------------+-------------------+

[^describe_fact_probe_1]
 
I have attached the describe output, which will show you the other details of 
the table.

The size of the table is 3.24 GB, and even after running minor or major 
compaction the size remains almost the same.

So we are not getting any benefit by running compaction. Could you please 
review the shared details and help us identify whether we are missing something 
here, or whether there is a bug?
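
One possibility worth checking, sketched below: compaction writes a new merged 
segment but leaves the original segments on disk until they are explicitly 
cleaned, so the table size can look unchanged until CLEAN FILES runs:

scala> spark.sql("alter table fact_365_1_probe_1 compact 'major'").show()
scala> // compacted segments should show status "Compacted" alongside the new merged segment
scala> spark.sql("show segments for table fact_365_1_probe_1").show(100, false)
scala> // physically remove the compacted/stale segments to reclaim space
scala> spark.sql("clean files for table fact_365_1_probe_1").show()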


Also, we need answers to the following questions about carbondata storage:

1. In the case of decimal values, how does the storage behave? If one row has 
20 digits after the decimal and a second row has only 5 digits after the 
decimal, how and what would be the difference in the storage taken?

2. If we have two tables, and one table has the same value for 100 rows while 
the other table has different values for 100 rows, how will Carbon behave as 
far as storage is concerned in this scenario? Which table will take less 
storage, or will both take the same?

3. Also, for the string datatype, could you please describe how storage is 
handled for it?
 






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4085) How to improve query execution time further

2020-12-15 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4085:


 Summary: How to improve query execution time further
 Key: CARBONDATA-4085
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4085
 Project: CarbonData
  Issue Type: Improvement
  Components: sql
Affects Versions: 2.0.1
Reporter: suyash yadav
 Fix For: 2.0.1


Hi Team,

We are doing a POC where we would like our query execution to be faster, mostly 
in the range of 3 to 4 seconds.

We have read the carbon documents, where it is claimed that carbondata can scan 
PETABYTES of data and present results in 3 to 4 seconds, which does not seem to 
be the case as per our observation.

Our table size is 1.6 billion rows, and the query fetches only 4K records, but 
it still takes around 22 to 25 seconds to execute.

Below is our query that we are firing:

==

spark.sql("select ts,resource,metric,value from fact_timestamp_global left join 
tags_10_days_test on fact_timestamp_global.tags_id= tags_10_days_test.id where 
metric in ('Outbound Utilization (percent)','Inbound Utilization (percent)') 
and resource='10.212.7.98_if:<0001>' and ts>='2020-09-28 00:00:00' and 
ts<='2020-09-28 23:55:55'").show(false)

=



Definition of fact_timestamp_global is like below:



spark.sql("create table Fact_timestamp_GLOBAL(ts timestamp,metric 
string,tags_id string,value double) partitioned by (ts2 timestamp) stored as 
carbondata TBLPROPERTIES 
('SORT_COLUMNS'='ts,metric','SORT_SCOPE'='GLOBAL_SORT')").show()

==

Definition of tags_10_days_test is like below:



spark.sql("create table tags_10_days_test(id string,resource string) stored as 
carbondata TBLPROPERTIES('SORT_COLUMNS'='id,resource')").show()

==
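
One hedged experiment: if tags_10_days_test is small enough to fit in executor 
memory, forcing a broadcast join avoids shuffling the 1.6-billion-row fact 
table; the hint below is standard Spark 2.4 syntax:

spark.sql("""select /*+ BROADCAST(tags_10_days_test) */ ts, resource, metric, value
  from fact_timestamp_global left join tags_10_days_test
    on fact_timestamp_global.tags_id = tags_10_days_test.id
  where metric in ('Outbound Utilization (percent)','Inbound Utilization (percent)')
    and resource = '10.212.7.98_if:<0001>'
    and ts >= '2020-09-28 00:00:00' and ts <= '2020-09-28 23:55:55'""").show(false)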

 

Kindly go through the above points and help us improve the query performance 
further.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CARBONDATA-4079) Queries with Date range are taking time

2020-12-09 Thread suyash yadav (Jira)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247041#comment-17247041
 ] 

suyash yadav commented on CARBONDATA-4079:
--

Below is the complete background of the issue.

 

We created two tables, named fact_10days_data_ts and tags_final; below are the 
descriptions of these two tables.

 

spark.sql("desc extended fact_timestamp_global").show(100,false)

col_name                     | data_type | comment
ts                           | timestamp | null
metric                       | string    | null
tags_id                      | string    | null
value                        | double    | null
ts2                          | timestamp | null

## Detailed Table Information
Database                     | default
Table                        | fact_timestamp_global
Owner                        | root
Created                      | Wed Dec 09 19:31:48 MYT 2020
Location                     | hdfs://10.212.5.2:7200/Data/default/fact_timestamp_global
External                     | false
Transactional                | true
Streaming                    | false
Table Block Size             | 1024 MB
Table Blocklet Size          | 64 MB
Comment                      |
Bad Record Path              |
Min Input Per Node Per Load  | 0.0B

## Index Information
Sort Scope                   | global_sort
Sort Columns                 | ts, metric
Inverted Index Columns       |
Cached Min/Max Index Columns | All columns
Min/Max Index Cache Level    | BLOCK
Table page size in mb        |

## Encoding Information
Local Dictionary Enabled     | true
Local Dictionary Threshold   | 1
Local Dictionary Include     | metric,tags_id

## Compaction Information
MAJOR_COMPACTION_SIZE        | 1024
AUTO_LOAD_MERGE              | false
COMPACTION_LEVEL_THRESHOLD   | 4,3
COMPACTION_PRESERVE_SEGMENTS | 0
ALLOWED_COMPACTION_DAYS      | 0

## Partition Information
Partition Type               | NATIVE_HIVE
Partition Columns            | ts2:TIMESTAMP

[jira] [Created] (CARBONDATA-4079) Queries with Date range are taking time

2020-12-09 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4079:


 Summary: Queries with Date range are taking time
 Key: CARBONDATA-4079
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4079
 Project: CarbonData
  Issue Type: Improvement
  Components: data-query
Affects Versions: 2.1.0
Reporter: suyash yadav


Hi Team,

We are doing a POC to understand how we can improve the performance of queries 
fired against tables created in apache carbondata.

Below is the sample query:

 

spark.sql("select ts,resource,metric,value from fact_timestamp_global left 
join tags_10_Days_test on fact_timestamp_global.tags_id= tags_10_Days_test.id 
where metric in ('Outbound Utilization (percent)','Inbound Utilization 
(percent)') and resource='10.212.7.98_if:<0001>' and ts between '2020-09-21 
00:00:00' and '2020-09-21 12:55:55' group by 
ts,resource,metric,value").show(1,false)

As you can see, the above query contains a date-range filter. We have noticed 
that due to this date-range filter the query time comes to around 15 seconds, 
which is not proving useful, as we have to bring the query execution time down 
to 3 to 4 seconds.

Could you please review the above query and suggest a better way of framing it, 
especially the date-range filter, that could help us reach the desired query 
execution time?
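
One hedged experiment, based on the desc extended output in the comment above 
showing fact_timestamp_global partitioned by ts2: adding an explicit ts2 
predicate (the day-truncated timestamp) should let partition pruning cut the 
scan before the ts range is evaluated:

spark.sql("""select ts, resource, metric, value from fact_timestamp_global
  left join tags_10_Days_test on fact_timestamp_global.tags_id = tags_10_Days_test.id
  where ts2 = '2020-09-21 00:00:00'
    and metric in ('Outbound Utilization (percent)','Inbound Utilization (percent)')
    and resource = '10.212.7.98_if:<0001>'
    and ts between '2020-09-21 00:00:00' and '2020-09-21 12:55:55'
  group by ts, resource, metric, value""").show(1, false)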

 

In case you need more details, please do let me know.

 

Regards

Suyash Yadav



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CARBONDATA-4025) storage space for MV is double to that of a table on which MV has been created.

2020-10-19 Thread suyash yadav (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216618#comment-17216618
 ] 

suyash yadav commented on CARBONDATA-4025:
--

Hi Team,

 

Can somebody look into this request?

 

Regards

Suyash Yadav

> storage space for MV is double to that of a table on which MV has been 
> created.
> ---
>
> Key: CARBONDATA-4025
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4025
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.0.1
> Environment: Apache carbondata 2.0.1
> Apache spark 2.4.5
> Hadoop 2.7.2
>Reporter: suyash yadav
>Priority: Major
>
> We are doing a POC based on carbondata, but we have observed that when we
> create an MV on a table with a timeseries function of the same granularity,
> the MV takes double the space of the table.
>  
> In my scenario, my table has 1.3 million records and the MV also has the same
> number of records, but the size of the table is 3.6 MB while the size of the
> MV is around 6.5 MB.
> This is really important for us, as critical business decisions are getting
> affected due to this behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CARBONDATA-4025) storage space for MV is double to that of a table on which MV has been created.

2020-10-14 Thread suyash yadav (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213792#comment-17213792
 ] 

suyash yadav commented on CARBONDATA-4025:
--

Hi, any update on this?

> storage space for MV is double to that of a table on which MV has been 
> created.
> ---
>
> Key: CARBONDATA-4025
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4025
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.0.1
> Environment: Apache carbondata 2.0.1
> Apache spark 2.4.5
> Hadoop 2.7.2
>Reporter: suyash yadav
>Priority: Major
>
> We are doing a POC based on carbondata, but we have observed that when we
> create an MV on a table with a timeseries function of the same granularity,
> the MV takes double the space of the table.
>  
> In my scenario, my table has 1.3 million records and the MV also has the same
> number of records, but the size of the table is 3.6 MB while the size of the
> MV is around 6.5 MB.
> This is really important for us, as critical business decisions are getting
> affected due to this behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4025) strage size of MV is double to that of a table.

2020-10-09 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4025:


 Summary: strage size of MV is double to that of a table.
 Key: CARBONDATA-4025
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4025
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.1
 Environment: Apache carbondata 2.0.1
Apache spark 2.4.5
Hadoop 2.7.2
Reporter: suyash yadav


We are doing a POC based on carbondata, but we have observed that when we 
create an MV on a table with a timeseries function of the same granularity, the 
MV takes double the space of the table.

 

In my scenario, my table has 1.3 million records and the MV also has the same 
number of records, but the size of the table is 3.6 MB while the size of the MV 
is around 6.5 MB.

This is really important for us, as critical business decisions are getting 
affected due to this behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-4025) storage space for MV is double to that of a table on which MV has been created.

2020-10-09 Thread suyash yadav (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

suyash yadav updated CARBONDATA-4025:
-
Summary: storage space for MV is double to that of a table on which MV has 
been created.  (was: strage size of MV is double to that of a table.)

> storage space for MV is double to that of a table on which MV has been 
> created.
> ---
>
> Key: CARBONDATA-4025
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4025
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.0.1
> Environment: Apache carbondata 2.0.1
> Apache spark 2.4.5
> Hadoop 2.7.2
>Reporter: suyash yadav
>Priority: Major
>
> We are doing a POC based on carbondata, but we have observed that when we
> create an MV on a table with a timeseries function of the same granularity,
> the MV takes double the space of the table.
>  
> In my scenario, my table has 1.3 million records and the MV also has the same
> number of records, but the size of the table is 3.6 MB while the size of the
> MV is around 6.5 MB.
> This is really important for us, as critical business decisions are getting
> affected due to this behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3951) Documentation not available for upgrading carbondata from older to newer version

2020-08-14 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-3951:


 Summary: Documentation not available for upgrading carbondata from 
older to newer version
 Key: CARBONDATA-3951
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3951
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 1.6.1
 Environment: RHEL 7.x
Reporter: suyash yadav
 Fix For: 2.0.0


Hi Team,

 

We are doing a POC where we want to upgrade carbondata inside our application 
from an old version to the latest version.

 

Due to lack of documentation, we are not able to move ahead, as we are not sure 
about the steps that need to be followed to perform the upgrade and then verify 
that it has been done successfully without any issue.

 

We tried replacing the jars with newer ones, but we got the below exception:

 

    

Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.util.CarbonReflectionUtils$.getSessionState(CarbonReflectionUtils.scala:223)
    at org.apache.spark.sql.CarbonSession.sessionState$lzycompute(CarbonSession.scala:57)
    at org.apache.spark.sql.CarbonSession.sessionState(CarbonSession.scala:56)
    at org.apache.spark.sql.CarbonSession$CarbonBuilder.getOrCreateCarbonSession(CarbonSession.scala:258)
    at org.apache.spark.sql.CarbonSession$CarbonBuilder.getOrCreateCarbonSession(CarbonSession.scala:165)
    at persistent.diamond.Server.start(Server.java:176)
    at persistent.diamond.Server.main(Server.java:595)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.parser.AbstractSqlParser: method <init>()V not found
    at org.apache.spark.sql.parser.CarbonSparkSqlParser.<init>(CarbonSparkSqlParser.scala:39)
    at org.apache.spark.sql.hive.CarbonSessionStateBuilder.sqlParser$lzycompute(CarbonSessionState.scala:204)
    at org.apache.spark.sql.hive.CarbonSessionStateBuilder.sqlParser(CarbonSessionState.scala:204)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:292)
    ... 11 more

 

Could you please help here to resolve the above issue, and clearly define the 
steps for the upgrade.

 

Regards

Suyash Yadav



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3354) how to use filters in datamaps

2019-04-12 Thread suyash yadav (JIRA)
suyash yadav created CARBONDATA-3354:


 Summary: how to use filters in datamaps
 Key: CARBONDATA-3354
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3354
 Project: CarbonData
  Issue Type: Task
  Components: core
Affects Versions: 1.5.2
 Environment: apache carbon data 1.5.x
Reporter: suyash yadav
 Fix For: NONE


Hi Team,

 

We are doing a POC on apache carbon data so that we can verify whether this 
database is capable of handling the amount of data we are collecting from 
network devices.

 

We are stuck on a few of our datamap-related activities and have the below 
queries:

 
 1. How to use time-based filters while creating a datamap. We tried a 
time-based condition while creating a datamap, but it didn't work.
 2. How to create a timeseries datamap on a column holding epoch time values. 
Our query is like below: carbon.sql("CREATE DATAMAP test ON TABLE 
carbon_RT_test USING 'timeseries' DMPROPERTIES 
('event_time'='endMs','minute_granularity'='1',) AS SELECT sum(inOctets) FROM 
carbon_RT_test GROUP BY inIfId")
 3. In the above query, endMs holds an epoch time value.
 4. We got an error like below: "Timeseries event time is only supported on 
Timestamp column" (a hedged workaround is sketched after this list).
 5. Also, we need to know if we can have a time granularity other than 1; for 
example, in the above query, can we have minute_granularity='5'?
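
A hedged workaround sketch for the Timestamp restriction: materialize a proper 
timestamp column from the epoch-millis endMs and point the datamap's event_time 
at it. The CTAS form, the /1000 milliseconds-to-seconds conversion, and the 
carbon_RT_test_ts/end_ts names are assumptions, not from the original setup:

// derive a real Timestamp column from epoch milliseconds so 'event_time' can use it
carbon.sql("create table carbon_RT_test_ts stored as carbondata as select *, cast(endMs / 1000 as timestamp) as end_ts from carbon_RT_test")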



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)