[jira] [Created] (CARBONDATA-4083) Refactor Update and Support Update Atomicity

2020-12-13 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4083:
---

 Summary: Refactor Update and Support Update Atomicity
 Key: CARBONDATA-4083
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4083
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


Currently, we modify the tablestatus file several times in the update 
flow. In total, 4 tablestatus write operations break atomicity to a certain 
extent, which may incur dirty data under update failure scenarios.

The first write happens when writing the delta files: we first update the 
updatedeltastarttime and updatedeltaendtime in the tablestatus, then delete 
some segments, which brings 2 tablestatus write operations.



The second write happens when inserting the new data; just like the first, it 
brings 2 tablestatus write operations.

Also, auto compaction doesn't work for UPDATE: an UPDATE won't trigger MINOR 
compaction even when carbon.merge.auto.compaction is turned on.
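
As an illustration of one way to restore atomicity (a minimal sketch, not the 
current CarbonData implementation): batch all status changes of one UPDATE and 
publish them with a single write-to-temp-then-rename, so a failure mid-update 
leaves the old tablestatus intact.

{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardCopyOption}

// Sketch only: publish all status changes of one UPDATE in a single
// write. The content goes to a temporary file first and is then renamed
// over the old tablestatus; on POSIX filesystems the rename is atomic,
// so a crash before it leaves the previous tablestatus untouched.
def writeTableStatusAtomically(tablePath: String, newContent: String): Unit = {
  val target = Paths.get(tablePath, "Metadata", "tablestatus")
  val temp = Paths.get(tablePath, "Metadata", "tablestatus.tmp")
  Files.write(temp, newContent.getBytes(StandardCharsets.UTF_8))
  Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE)
}
{code}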



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4063) Refactor getBlockId and getShortBlockId function

2020-11-29 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4063:
---

 Summary: Refactor getBlockId and getShortBlockId function
 Key: CARBONDATA-4063
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4063
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


Currently, the getBlockId and getShortBlockId functions are too complex and unreadable.

They should be refactored to be simpler and more readable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4045) Add TPCDS TestCase

2020-10-26 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4045:
---

 Summary: Add TPCDS TestCase
 Key: CARBONDATA-4045
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4045
 Project: CarbonData
  Issue Type: Test
Reporter: Xingjun Hao


There is no TPC-DS test case in the current source code, which makes it 
difficult to debug TPC-DS on a small dataset. A TPC-DS test case would also 
help to find possible issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4044) Fix dirty data in indexfile while IUD with stale data in segment folder

2020-10-26 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4044:
---

 Summary: Fix dirty data in indexfile while IUD with stale data in 
segment folder
 Key: CARBONDATA-4044
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4044
 Project: CarbonData
  Issue Type: Bug
Reporter: Xingjun Hao


xx.mergecarbonindex and xx.segment record the index file list of a segment. 
Currently, we generate xx.mergeindexfile and xx.segment by listing all index 
files in the segment folder (including carbonindex and mergecarbonindex files), 
which leads to dirty data when there is stale data in the segment folder.

For example, suppose there is a stale index file "0_1603763776.carbonindex" in 
the segment_0 folder.

While loading, a new carbonindex "0_16037752342.carbonindex" is written. When 
merging carbonindex files, we expect to merge only 0_16037752342.carbonindex, 
but since we pick up every carbonindex in the segment folder, both 
"0_1603763776.carbonindex" and "0_16037752342.carbonindex" will be merged and 
recorded into the segment file.

 

While updating, the same problem exists.
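
A minimal sketch of the fix idea, assuming the load tracks the names of the 
index files it wrote (currentLoadIndexFiles below is a hypothetical input): 
merge only the tracked files instead of everything the folder listing returns.

{code:scala}
import java.io.File

// Sketch only: merge the index files this load actually wrote instead of
// trusting a folder listing that may contain stale files from old loads.
def indexFilesToMerge(segmentFolder: File,
                      currentLoadIndexFiles: Set[String]): Seq[File] = {
  Option(segmentFolder.listFiles()).getOrElse(Array.empty[File])
    .filter(_.getName.endsWith(".carbonindex"))
    .filter(f => currentLoadIndexFiles.contains(f.getName)) // drop stale files
    .toSeq
}
{code}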

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4032) Drop partition command clean other partition dictionaries

2020-10-13 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4032:
---

 Summary: Drop partition command clean other partition dictionaries
 Key: CARBONDATA-4032
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4032
 Project: CarbonData
  Issue Type: Bug
  Components: sql
Affects Versions: 2.0.1
Reporter: Xingjun Hao
 Fix For: 2.1.0


1. CREATE TABLE droppartition (id STRING, sales STRING) PARTITIONED BY (dtm 
STRING) STORED AS carbondata

2. insert into droppartition values ('01', '0', '20200907'),('03', '0', 
'20200908')

3. insert overwrite table droppartition partition (dtm=20200908) select * from 
droppartition where dtm = 20200907;
insert overwrite table droppartition partition (dtm=20200909) select * from 
droppartition where dtm = 20200907;

4. alter table droppartition drop partition (dtm=20200909)

After step 4, the directory of partition "20200908" was deleted as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4028) Fail to unlock during update

2020-10-11 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4028:
---

 Summary: Fail to unlock during update
 Key: CARBONDATA-4028
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4028
 Project: CarbonData
  Issue Type: Bug
Reporter: Xingjun Hao


In the update flow, we unpersist the dataset before unlocking, so the unlock 
will fail once the dataset unpersist is interrupted.
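
A minimal sketch of the guard: move the unlock into a finally block so an 
interrupted unpersist cannot skip it (the java.util.concurrent Lock here 
stands in for the real CarbonData lock handle).

{code:scala}
import java.util.concurrent.locks.Lock
import org.apache.spark.sql.{Dataset, Row}

// Sketch only: unlocking in finally guarantees the lock is released even
// if the unpersist call is interrupted or throws.
def unpersistAndUnlock(dataset: Dataset[Row], updateLock: Lock): Unit = {
  try {
    dataset.unpersist()
  } finally {
    updateLock.unlock()
  }
}
{code}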



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4027) Fix the wrong modifiedtime of loading files in insert stage

2020-10-11 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4027:
---

 Summary: Fix the wrong modifiedtime of loading files in insert 
stage
 Key: CARBONDATA-4027
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4027
 Project: CarbonData
  Issue Type: Bug
Reporter: Xingjun Hao


In the insert stage flow, an empty file with the suffix '.loading' marks a 
stage as being 'in processing'. We update the modified time of the '.loading' 
file to record the insert stage start time, which can be used to calculate 
timeouts and helps retry and recovery.

Previously, we used the setModifiedTime function to update the modified time, 
which has a serious bug:

for S3 files, the setModifiedTime operation does not take effect, leading to an 
incorrect insert stage start time for the '.loading' file.
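
A minimal sketch of one workaround, using the Hadoop FileSystem API (an 
assumption, not necessarily the merged fix): instead of relying on 
setModifiedTime, delete and recreate the empty marker file so the object store 
assigns a fresh last-modified time.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: object stores such as S3 ignore modified-time updates, so
// refresh the marker's timestamp by recreating the empty file, which
// forces the store to assign a new last-modified time.
def touchLoadingMarker(fs: FileSystem, marker: Path): Unit = {
  if (fs.exists(marker)) {
    fs.delete(marker, false)
  }
  fs.create(marker).close() // empty '.loading' file, fresh timestamp
}
{code}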

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4026) Thread leakage while Loading

2020-10-11 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4026:
---

 Summary: Thread leakage while Loading
 Key: CARBONDATA-4026
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4026
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 2.0.1
Reporter: Xingjun Hao
 Fix For: 2.1.0


Some code paths in Inserting/Loading/InsertStage/IndexServer don't shut down 
their ExecutorService. This leads to thread leakage, which degrades the 
performance of the driver and the executors.
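
A minimal sketch of the standard remedy: pair every ExecutorService with a 
shutdown in a finally block once the submitted work has completed.

{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

// Sketch only: always shut the pool down in finally, otherwise its
// non-daemon worker threads leak and accumulate across loads.
val pool = Executors.newFixedThreadPool(4)
try {
  val futures = (1 to 8).map(i => pool.submit(new Runnable {
    override def run(): Unit = println(s"task $i")
  }))
  futures.foreach(_.get()) // wait for all submitted tasks
} finally {
  pool.shutdown()
  pool.awaitTermination(1, TimeUnit.MINUTES)
}
{code}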



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4016) NPE and FileNotFound in Show Segments and Insert Stage

2020-09-29 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4016:
---

 Summary: NPE and FileNotFound in Show Segments and Insert Stage
 Key: CARBONDATA-4016
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4016
 Project: CarbonData
  Issue Type: Bug
  Components: flink-integration, spark-integration
Affects Versions: 2.0.1
Reporter: Xingjun Hao
 Fix For: 2.1.0


# In insert stage, when Spark reads stages that are being written by Flink at 
the same time, a JSON format exception is thrown.
 # In show segments with STAGE, when reading stages that are being written by 
Flink or deleted by Spark, a JSON format exception is thrown.
 # Show segments loads partition info even for non-partition tables, which 
shall be avoided.
 # In getLastModifiedTime of the tablestatus, if the loadendtime is empty, 
getLastModifiedTime throws an NPE (see the sketch below).
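
A minimal sketch of the guard for the last case, over a hypothetical, 
simplified load-metadata record: fall back to the load start time when the end 
time is empty instead of parsing it.

{code:scala}
// Sketch only: a hypothetical record whose timestamps are kept as
// strings and may be empty for in-flight loads.
case class LoadDetail(loadStartTime: String, loadEndTime: String)

def lastModifiedTime(detail: LoadDetail): Long = {
  val end = detail.loadEndTime
  // guard the empty end time instead of letting the parse blow up
  if (end == null || end.isEmpty) detail.loadStartTime.toLong
  else end.toLong
}
{code}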



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-4014) Support Change Column Comment

2020-09-26 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-4014:
---

 Summary: Support Change Column Comment
 Key: CARBONDATA-4014
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4014
 Project: CarbonData
  Issue Type: New Feature
  Components: sql
Affects Versions: 2.0.1
Reporter: Xingjun Hao
 Fix For: 2.1.0


Now, we support adding a comment in CREATE TABLE and ADD COLUMN, but we do not 
support altering the comment of a specified column.

We shall support altering the comment with the Hive syntax

"ALTER TABLE table_name CHANGE [COLUMN] col_name col_name data_type [COMMENT 
col_comment]"

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3945) NPE while Data Loading

2020-08-07 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3945:
---

 Summary: NPE while Data Loading 
 Key: CARBONDATA-3945
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3945
 Project: CarbonData
  Issue Type: Bug
Reporter: Xingjun Hao


# getLastModifiedTime of LoadMetadataDetails fails because 
"updateDeltaEndTimestamp is an empty string".
 # In the getCommittedIndexFile function, an NPE happens because "segmentfile 
is null" in unusual cases.
 # Cleaning temp files fails because "partitionInfo is null" in unusual cases.
 # When calculating sizeInBytes of CarbonRelation in unusual cases, it needs to 
collect the directory size, but the directory path only works for 
non-partition tables; for partition tables, a FileNotFoundException is thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3944) Delete stage files was interrupted when IOException happen

2020-08-07 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3944:
---

 Summary: Delete stage files was interrupted when IOException happen
 Key: CARBONDATA-3944
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3944
 Project: CarbonData
  Issue Type: Bug
Reporter: Xingjun Hao


In the insert stage flow, the stage files are deleted with a retry mechanism, 
but when an IOException happens (due to a network abnormality, etc.), the whole 
delete-stage flow is interrupted, which is unexpected.
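
A minimal sketch of the intended behavior, with a hypothetical delete callback: 
retry each stage file a few times and, on a persistent IOException, log and 
move on instead of aborting the whole flow.

{code:scala}
import java.io.IOException

// Sketch only: a failure on one file must not interrupt the deletion of
// the remaining stage files.
def deleteStageFiles(files: Seq[String], delete: String => Unit,
                     maxRetries: Int = 3): Unit = {
  files.foreach { file =>
    var attempt = 0
    var done = false
    while (!done && attempt < maxRetries) {
      try {
        delete(file)
        done = true
      } catch {
        case e: IOException =>
          attempt += 1
          if (attempt == maxRetries) {
            // give up on this file only; the loop continues with the rest
            println(s"failed to delete $file after $maxRetries tries: ${e.getMessage}")
          }
      }
    }
  }
}
{code}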



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3940) Fail to commit the output of task due to Rename IOException in the Loading processing

2020-08-05 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3940:

Summary: Fail to commit the output of task due to Rename IOException in the 
Loading processing  (was: Fail to commit the output of task due to Rename 
IOException in the Data Loading)

> Fail to commit the output of task due to Rename IOException in the Loading 
> processing
> -
>
> Key: CARBONDATA-3940
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3940
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Xingjun Hao
>Priority: Major
>
> During the load process, commitTask fails with high probability. The 
> exception stack shows that it was thrown by HadoopMapReduceCommitProtocol, 
> not CarbonSQLHadoopMapReduceCommitProtocol, implying that there is a 
> class type error in the initialization of the "Committer", which should have 
> been initialized as CarbonSQLHadoopMapReduceCommitProtocol but was incorrectly 
> initialized to HadoopMapReduceCommitProtocol.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3940) Fail to commit the output of task due to Rename IOException in the Data Loading

2020-08-05 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3940:
---

 Summary: Fail to commit the output of task due to Rename 
IOException in the Data Loading
 Key: CARBONDATA-3940
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3940
 Project: CarbonData
  Issue Type: Bug
Reporter: Xingjun Hao


During the load process, commitTask fails with high probability. The 
exception stack shows that it was thrown by HadoopMapReduceCommitProtocol, not 
CarbonSQLHadoopMapReduceCommitProtocol, implying that there is a class 
type error in the initialization of the "Committer", which should have been 
initialized as CarbonSQLHadoopMapReduceCommitProtocol but was incorrectly 
initialized to HadoopMapReduceCommitProtocol.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3940) CommitTask fails due to Rename IOException in the Loading processing

2020-08-05 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3940:

Summary: CommitTask fails due to Rename IOException in the Loading 
processing  (was: Fail to commit the output of task due to Rename IOException 
in the Loading processing)

> CommitTask fails due to Rename IOException in the Loading processing
> 
>
> Key: CARBONDATA-3940
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3940
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Xingjun Hao
>Priority: Major
>
> During the load process, commitTask fails with high probability. The 
> exception stack shows that it was thrown by HadoopMapReduceCommitProtocol, 
> not CarbonSQLHadoopMapReduceCommitProtocol, implying that there is a 
> class type error in the initialization of the "Committer", which should have 
> been initialized as CarbonSQLHadoopMapReduceCommitProtocol but was incorrectly 
> initialized to HadoopMapReduceCommitProtocol.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3898) Support Option 'carbon.enable.querywithmv'

2020-07-12 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3898:
---

 Summary: Support Option 'carbon.enable.querywithmv'
 Key: CARBONDATA-3898
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3898
 Project: CarbonData
  Issue Type: New Feature
Reporter: Xingjun Hao


When MV is enabled, SQL rewrite takes a lot of time. A new option 
'carbon.enable.querywithmv' shall be supported, which turns off the SQL rewrite 
when the configured value is false.
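
For illustration, a sketch of how the proposed option could be set through the 
usual CarbonProperties accessor:

{code:scala}
import org.apache.carbondata.core.util.CarbonProperties

// Sketch only: turn the MV-based SQL rewrite off via the proposed property.
CarbonProperties.getInstance()
  .addProperty("carbon.enable.querywithmv", "false")
{code}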



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3879) Filtering Segments Optimization

2020-06-29 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3879:
---

 Summary: Filtering Segments Optimization
 Key: CARBONDATA-3879
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3879
 Project: CarbonData
  Issue Type: Improvement
  Components: data-query
Affects Versions: 2.0.0
Reporter: Xingjun Hao
 Fix For: 2.0.2


During the filter segments flow, there are a lot of LIST.CONTAINS calls, which 
have a heavy time overhead when there are tens of thousands of segments.

For example, with N segments, LIST.CONTAINS is triggered for each segment, and 
the LIST also has about N elements, so the time complexity 
will be O(N * N).
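
A minimal sketch of the usual fix, assuming the valid segment ids arrive as a 
List: build a hash set once so each membership test is O(1) and the whole 
check drops from O(N * N) to O(N).

{code:scala}
// Sketch only: one O(N) set build replaces N linear List.contains scans.
def filterValidSegments(allSegments: Seq[String],
                        validList: List[String]): Seq[String] = {
  val valid = validList.toSet // hash-based membership, O(1) per lookup
  allSegments.filter(valid.contains)
}
{code}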



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (CARBONDATA-3877) Reduce read tablestatus overhead during inserting into partition table

2020-06-29 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao closed CARBONDATA-3877.
---
Resolution: Fixed

> Reduce read tablestatus overhead during inserting into partition table
> --
>
> Key: CARBONDATA-3877
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3877
> Project: CarbonData
>  Issue Type: Improvement
>  Components: spark-integration
>Affects Versions: 2.0.0
>Reporter: Xingjun Hao
>Priority: Major
> Fix For: 2.0.2
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, during insert into a partition table, there are a lot of 
> tablestatus read operations. When the table status file is stored in an 
> object store, reading it may fail (with an IOException or JsonSyntaxException) 
> while the file is being modified, which leads to a high failure rate for 
> concurrent inserts into a partition table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3877) Reduce read tablestatus overhead during inserting into partition table

2020-06-28 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3877:
---

 Summary: Reduce read tablestatus overhead during inserting into 
partition table
 Key: CARBONDATA-3877
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3877
 Project: CarbonData
  Issue Type: Improvement
  Components: spark-integration
Affects Versions: 2.0.0
Reporter: Xingjun Hao
 Fix For: 2.0.2


Currently, during insert into a partition table, there are a lot of 
tablestatus read operations. When the table status file is stored in an object 
store, reading it may fail (with an IOException or 
JsonSyntaxException) while the file is being modified, which leads 
to a high failure rate for concurrent inserts into a partition table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3875) Support show segments include stage

2020-06-27 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3875:
---

 Summary: Support show segments include stage
 Key: CARBONDATA-3875
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3875
 Project: CarbonData
  Issue Type: New Feature
  Components: spark-integration
Affects Versions: 2.0.0, 2.0.1
Reporter: Xingjun Hao
 Fix For: 2.0.2


There is a lack of monitoring of stage information in the current system, so a 
'SHOW SEGMENTS ... INCLUDE STAGE' command shall be supported. It will provide 
monitoring information such as createTime, partition info, etc.
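
For illustration, the command could look like this (the exact syntax is an 
assumption, with a hypothetical table name):

{code:scala}
sql("SHOW SEGMENTS FOR TABLE db.tbl INCLUDE STAGE").show()
{code}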



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3858) Check CDC deltafiles count in the testcase

2020-06-22 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3858:

Description: Currently there is no deltafiles count check in the testcase, 
which shall be supplemented.  (was: In the CDC flow, the parallelism of 
deltafiles processing is the same as the executor number, which reduces the 
parallelism heavily. The insufficient parallelism limits CPU utilization and 
hampers CDC's performance.)
Summary: Check CDC deltafiles count in the testcase  (was: Increase the 
parallelism of CDC intermediate files processing)

> Check CDC deltafiles count in the testcase
> --
>
> Key: CARBONDATA-3858
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3858
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: Xingjun Hao
>Priority: Minor
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Currently there is no deltafiles count check in the testcase, which shall be 
> supplemented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3862) Insert stage performance optimization

2020-06-21 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3862:
---

 Summary: Insert stage performance optimization
 Key: CARBONDATA-3862
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3862
 Project: CarbonData
  Issue Type: New Feature
Reporter: Xingjun Hao


There are two major performance bottlenecks in insert stage:

1) getting the LastModifiedTime of the stage files requires a lot of accesses to OBS;
2) parallelism is not supported.

Both shall be optimized.
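
A minimal sketch addressing both points at once, assuming a Hadoop FileSystem 
handle for OBS and Scala 2.11/2.12-style parallel collections: fetch the 
modification times in parallel so the per-file round trips overlap.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: .par overlaps the per-file OBS round trips instead of
// issuing them one by one.
def stageFileModTimes(fs: FileSystem, stageFiles: Seq[Path]): Seq[(Path, Long)] =
  stageFiles.par
    .map(p => (p, fs.getFileStatus(p).getModificationTime))
    .toList
{code}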



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3860) Fix IndexServer keeps loading some segments index repeatedly

2020-06-20 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3860:
---

 Summary: Fix IndexServer keeps loading some segments index repeatedly
 Key: CARBONDATA-3860
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3860
 Project: CarbonData
  Issue Type: Bug
Reporter: Xingjun Hao


In the current getTableBlockIndexUniqueIdentifiers function, if 
segmentBlockIndexInfo.getSegmentMetaDataInfo() is null, the IndexServer 
keeps loading the index of this segment repeatedly. We shall avoid letting this 
affect query performance, considering that the MetaDataInfo doesn't matter for 
the query processing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3859) Lock and retry to read tablestatus before throwing EOFException or JsonSyntaxException

2020-06-20 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3859:

Summary: Lock and retry to read tablestatus before throwing EOFException or 
JsonSyntaxException  (was: Enhance lock and retry of Reading tablestatus files 
while loading)

> Lock and retry to read tablestatus before throwing EOFException or 
> JsonSyntaxException
> --
>
> Key: CARBONDATA-3859
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3859
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: Xingjun Hao
>Priority: Major
>
> When the table status file is stored in an object store, reading it may fail 
> (with an EOFException or JsonSyntaxException) 
> while the file is being modified.
> We shall retry multiple times and take the lock before throwing the 
> EOFException or JsonSyntaxException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3859) Enhance lock and retry of Reading tablestatus files while loading

2020-06-20 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3859:
---

 Summary: Enhance lock and retry of Reading tablestatus files while 
loading
 Key: CARBONDATA-3859
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3859
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


When the table status file is stored in an object store, reading it may fail 
(with an EOFException or JsonSyntaxException) 
while the file is being modified.
We shall retry multiple times and take the lock before throwing the 
EOFException or JsonSyntaxException.
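
A minimal sketch of that policy, with hypothetical readStatus and 
withTableLock stand-ins: retry the plain read a few times, then take the lock 
for one final read before letting the exception escape.

{code:scala}
// Sketch only: `readStatus` performs the raw read and may throw while a
// writer is mid-modification; `withTableLock` runs the read under the
// table status lock. Both are hypothetical stand-ins.
def readTableStatus(readStatus: () => String,
                    withTableLock: (() => String) => String,
                    maxRetries: Int = 3): String = {
  var remaining = maxRetries
  while (remaining > 0) {
    try return readStatus()
    catch {
      case _: Exception =>
        remaining -= 1
        Thread.sleep(100) // small backoff before retrying
    }
  }
  // last resort: read under the lock so no writer is mid-flight; only if
  // this also fails does the exception propagate to the caller
  withTableLock(readStatus)
}
{code}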



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3858) Increase the parallelism of CDC intermediate files processing

2020-06-19 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3858:

Summary: Increase the parallelism of CDC intermediate files processing  
(was: Increase the parallelism of CDC deltafiles processing)

> Increase the parallelism of CDC intermediate files processing
> -
>
> Key: CARBONDATA-3858
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3858
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: Xingjun Hao
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the CDC flow, the parallelism of deltafiles processing is the same as 
> the executor number, which reduces the parallelism heavily. The insufficient 
> parallelism limits CPU utilization and hampers CDC's performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3858) Increase the parallelism of CDC deltafiles processing

2020-06-18 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3858:
---

 Summary: Increase the parallelism of CDC deltafiles processing
 Key: CARBONDATA-3858
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3858
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


In the CDC flow, the parallelism of deltafiles processing is the same as the 
executor number, which reduces the parallelism heavily. The insufficient 
parallelism limits CPU utilization and hampers CDC's performance.
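
A minimal sketch of the direction, with hypothetical names: repartition the 
intermediate delta-file dataset to the configured shuffle parallelism instead 
of the executor count before processing it.

{code:scala}
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Sketch only: size the partition count from the shuffle parallelism
// setting rather than from the number of executors.
def repartitionDeltaFiles(spark: SparkSession,
                          deltaFiles: Dataset[Row]): Dataset[Row] = {
  val target = spark.conf.get("spark.sql.shuffle.partitions", "200").toInt
  deltaFiles.repartition(target)
}
{code}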



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3856) Support the LIMIT operator for show segments command

2020-06-17 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3856:
---

 Summary: Support the LIMIT operator for show segments command
 Key: CARBONDATA-3856
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3856
 Project: CarbonData
  Issue Type: New Feature
  Components: spark-integration
Affects Versions: 2.0.0
Reporter: Xingjun Hao
 Fix For: 2.0.2


Now, in the 2.0.0 release, CarbonData doesn't support a LIMIT operator in the 
SHOW SEGMENTS command, so the time cost is expensive when there are too many 
segments.
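
For illustration, with the proposed operator the output over a large table 
could be capped like this (hypothetical table name):

{code:scala}
// Fetch only the 10 most relevant segment rows instead of all of them:
sql("SHOW SEGMENTS FOR TABLE db.tbl LIMIT 10").show()
{code}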



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3820) Fix CDC failure when sort columns present in source dataframe

2020-06-16 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3820:

Summary: Fix CDC failure when sort columns present in source dataframe  
(was: Support GlobalSort in the CDC)

> Fix CDC failure when sort columns present in source dataframe
> -
>
> Key: CARBONDATA-3820
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3820
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: Xingjun Hao
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> If there is a GlobalSort table in the CDC flow, the following exception will
> be thrown:
> Exception in thread "main" java.lang.RuntimeException: column: id specified
> in sort columns does not exist in schema
>         at
> org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildTableSchema(CarbonWriterBuilder.java:828)
>         at
> org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildCarbonTable(CarbonWriterBuilder.java:794)
>         at
> org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildLoadModel(CarbonWriterBuilder.java:720)
>         at
> org.apache.spark.sql.carbondata.execution.datasources.CarbonSparkDataSourceUtil$.prepareLoadModel(CarbonSparkDataSourceUtil.scala:281)
>         at
> org.apache.spark.sql.carbondata.execution.datasources.SparkCarbonFileFormat.prepareWrite(SparkCarbonFileFormat.scala:141)
>         at
> org.apache.spark.sql.execution.command.mutation.merge.CarbonMergeDataSetCommand.processIUD(CarbonMergeDataSetCommand.scala:269)
>         at
> org.apache.spark.sql.execution.command.mutation.merge.CarbonMergeDataSetCommand.processData(CarbonMergeDataSetCommand.scala:152)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3820) Support GlobalSort in the CDC

2020-05-12 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3820:
---

 Summary: Support GlobalSort in the CDC
 Key: CARBONDATA-3820
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3820
 Project: CarbonData
  Issue Type: New Feature
Reporter: Xingjun Hao


If there is a GlobalSort table in the CDC flow, the following exception will
be thrown:

Exception in thread "main" java.lang.RuntimeException: column: id specified
in sort columns does not exist in schema
        at
org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildTableSchema(CarbonWriterBuilder.java:828)
        at
org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildCarbonTable(CarbonWriterBuilder.java:794)
        at
org.apache.carbondata.sdk.file.CarbonWriterBuilder.buildLoadModel(CarbonWriterBuilder.java:720)
        at
org.apache.spark.sql.carbondata.execution.datasources.CarbonSparkDataSourceUtil$.prepareLoadModel(CarbonSparkDataSourceUtil.scala:281)
        at
org.apache.spark.sql.carbondata.execution.datasources.SparkCarbonFileFormat.prepareWrite(SparkCarbonFileFormat.scala:141)
        at
org.apache.spark.sql.execution.command.mutation.merge.CarbonMergeDataSetCommand.processIUD(CarbonMergeDataSetCommand.scala:269)
        at
org.apache.spark.sql.execution.command.mutation.merge.CarbonMergeDataSetCommand.processData(CarbonMergeDataSetCommand.scala:152)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3816) Support Float and Decimal in the Merge Flow

2020-05-11 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3816:
---

 Summary: Support Float and Decimal in the Merge Flow
 Key: CARBONDATA-3816
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3816
 Project: CarbonData
  Issue Type: New Feature
  Components: data-load
Affects Versions: 2.0.0
Reporter: Xingjun Hao
 Fix For: 2.0.1


We don't support the FLOAT and DECIMAL datatypes in the CDC flow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3722) Create filterExecuter for each segment instead of blocklet, To improve prune performance

2020-02-25 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3722:
---

 Summary: Create filterExecuter for each segment instead of 
blocklet, To improve prune performance
 Key: CARBONDATA-3722
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3722
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


In pruning, a filterExecuter is created for each blocklet, which causes a huge 
performance degradation when there are several million blocklets.

We shall create the filterExecuter per segment instead of per blocklet.
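
A minimal sketch of the change, over hypothetical simplified types: build the 
executer once per segment and reuse it across all of that segment's blocklets.

{code:scala}
// Sketch only: hypothetical, simplified stand-ins for the real filter
// infrastructure.
trait FilterExecuter { def matches(blockletId: Int): Boolean }

def pruneSegment(blockletIds: Seq[Int],
                 buildExecuter: () => FilterExecuter): Seq[Int] = {
  // one executer per segment, shared by all its blocklets, instead of a
  // fresh executer per blocklet
  val executer = buildExecuter()
  blockletIds.filter(executer.matches)
}
{code}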



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3712) Support insert stage in parallel

2020-02-17 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3712:
---

 Summary: Support insert stage in parallel
 Key: CARBONDATA-3712
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3712
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3703) Support insert stage in parallel

2020-02-14 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3703:
---

 Summary: Support insert stage in parallel
 Key: CARBONDATA-3703
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3703
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3702) Clean temp index files in parallel in merge index flow

2020-02-14 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3702:
---

 Summary: Clean temp index files in parallel in merge index flow
 Key: CARBONDATA-3702
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3702
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


Now, cleaning the temp index files in the merge index flow takes a lot of time, 
sometimes 2~3 minutes, which should be optimized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3700) Optimize prune performance when pruning with multi-threads

2020-02-12 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3700:
---

 Summary: Optimize prune performance when pruning with 
multi-threads
 Key: CARBONDATA-3700
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3700
 Project: CarbonData
  Issue Type: Bug
  Components: data-query
Affects Versions: 2.0.0
Reporter: Xingjun Hao


When pruning with multi-threads, there is a bug that hampers the pruning 
performance heavily.

When the datamap pruning results in no blocklet for the filter, the 
getExtendblocklet function, which gets the extended blocklet metadata, should 
return an empty extended blocklet list directly when its input is an empty 
blocklet list; but currently a bug leads to a hashset add-operation 
overhead (see the sketch below).

Meanwhile, when pruning with multi-threads, the getExtendblocklet function is 
triggered for each blocklet. This should be avoided by triggering the function 
once per segment.
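
A minimal sketch of the empty-input guard, over hypothetical simplified types:

{code:scala}
// Sketch only: hypothetical, simplified types.
case class Blocklet(id: String)
case class ExtendedBlocklet(id: String, detailPath: String)

def toExtendedBlocklets(blocklets: Seq[Blocklet],
                        lookupDetailPath: Blocklet => String): Seq[ExtendedBlocklet] = {
  // return immediately on empty input instead of touching any hash set
  if (blocklets.isEmpty) return Seq.empty
  blocklets.map(b => ExtendedBlocklet(b.id, lookupDetailPath(b)))
}
{code}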

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3683) Support compress offheap data directly in the columnpage in IndexStorageCodec

2020-02-08 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3683:
---

 Summary: Support compress offheap data directly in the columnpage 
in IndexStorageCodec
 Key: CARBONDATA-3683
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3683
 Project: CarbonData
  Issue Type: Sub-task
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3682) Support compress offheap data directly in the columnpage if the dataype is primitive

2020-02-07 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3682:
---

 Summary: Support compress offheap data directly in the columnpage 
if the dataype is primitive
 Key: CARBONDATA-3682
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3682
 Project: CarbonData
  Issue Type: Sub-task
Reporter: Xingjun Hao


If the datatype is primitive, like 
BOOLEAN/BYTE/SHORT/SHORT_INT/INT/LONG/FLOAT/DOUBLE/DECIMAL, the column page 
should be compressed from the direct ByteBuffer on the offheap directly, to 
avoid a copy from offheap to heap and so reduce the GC overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3671) Support compress direct bytebuffer in the SNAPPY/ZSTD/GZIP compressor

2020-01-29 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3671:
---

 Summary: Support compress direct bytebuffer in the 
SNAPPY/ZSTD/GZIP compressor
 Key: CARBONDATA-3671
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3671
 Project: CarbonData
  Issue Type: Sub-task
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3670) Support compress offheap columnpage directly, avoiding a copy of data from offheap to heap when compressed.
2020-01-29 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3670:
---

 Summary: Support compress offheap columnpage directly, avoiding a 
copy of data from offheap to heap when compressed.
 Key: CARBONDATA-3670
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3670
 Project: CarbonData
  Issue Type: Wish
  Components: core
Affects Versions: 2.0.0
Reporter: Xingjun Hao
 Fix For: 2.0.0


When writing data, the column pages are stored on the offheap, and the pages 
are compressed to save storage cost. Now, in the compression processing, the 
data is copied from the offheap to the heap before being compressed, which 
leads to heavier GC overhead compared with compressing the offheap data 
directly.

To sum up, we should support compressing the offheap column page directly, 
avoiding a copy of data from offheap to heap when compressing.
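
A minimal sketch of the direct-buffer path using snappy-java's ByteBuffer API 
(assuming snappy-java is available; both buffers must be direct for the native 
call):

{code:scala}
import java.nio.ByteBuffer
import org.xerial.snappy.Snappy

// Sketch only: compress an offheap (direct) buffer into another direct
// buffer, with no intermediate heap byte[] copy.
def compressOffheap(uncompressed: ByteBuffer): ByteBuffer = {
  val compressed = ByteBuffer.allocateDirect(
    Snappy.maxCompressedLength(uncompressed.remaining()))
  val written = Snappy.compress(uncompressed, compressed) // direct-to-direct
  compressed.limit(written)
  compressed
}
{code}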



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3669) Delete Physical Partition When Drop Partition

2020-01-21 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3669:
---

 Summary: Delete Physical Partition When Drop Partition
 Key: CARBONDATA-3669
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3669
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


When dropping a partition, the data is not cleaned, which is different from 
Hive.

Customers will be confused by that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3667) Insert stage recover processing of the partition table throw exception “the unexpected 0 segment found”

2020-01-19 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3667:
---

 Summary: Insert stage recover processing of the partition table 
throw exception “the unexpected 0 segment found”
 Key: CARBONDATA-3667
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3667
 Project: CarbonData
  Issue Type: Bug
  Components: core
Affects Versions: 2.0.0
Reporter: Xingjun Hao


The insert stage recover processing of a partition table throws the exception 
"the unexpected 0 segment found".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3648) Support Alter Table Compaction Level Threshold

2019-12-31 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3648:
---

 Summary: Support Alter Table Compaction Level Threshold
 Key: CARBONDATA-3648
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3648
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


ALTER TABLE should support changing the compaction level threshold. Also, the 
upper limit is 100, which is too small for scenarios with massive small files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3644) Support Configuration of Complex Delimiters in Carbon Properties

2019-12-31 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3644:
---

 Summary: Support Configuration of Complex Delimiters in Carbon 
Properties
 Key: CARBONDATA-3644
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3644
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao


When inserting into a carbon table selecting from a parquet table, if a binary 
column contains '\001', like 'col1\001col2', the content before '\001' 
will be truncated, as '\001' is the complex delimiter. The problem is that the 
complex delimiter can't be configured in the insert flow, which needs to be 
improved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3643) Insert array('')/array() into Struct column will result in array(null), which is inconsistent with Parquet

2019-12-30 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3643:

Description: 
sql("create table datatype_struct_parquet(price struct>) stored 
as parquet")
 sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))")
 sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata")
 sql("insert into datatype_struct_carbondata select * from 
datatype_struct_parquet")

 
{code:java}
//

sql("create table datatype_struct_parquet(price struct>) stored 
as parquet") 
sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))") 
sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata") sql("insert into datatype_struct_carbondata select * 
from datatype_struct_parquet")

checkAnswer( sql("SELECT * FROM datatype_struct_carbondata"), sql("SELECT * 
FROM datatype_struct_parquet"))

!== Correct Answer - 1 == == Spark Answer - 1 == 
![[WrappedArray()]] [[WrappedArray(null)]]
{code}
 

  was:
sql("create table datatype_struct_parquet(price struct>) stored 
as parquet")
 sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))")
 sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata")
 sql("insert into datatype_struct_carbondata select * from 
datatype_struct_parquet")

 
{code:java}
//
checkAnswer( sql("SELECT * FROM datatype_struct_carbondata"), sql("SELECT * 
FROM datatype_struct_parquet"))

!== Correct Answer - 1 == == Spark Answer - 1 == 
![[WrappedArray()]] [[WrappedArray(null)]]
{code}
 


> Insert array('')/array() into Struct column will result in 
> array(null), which is inconsistent with Parquet
> --
>
> Key: CARBONDATA-3643
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3643
> Project: CarbonData
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xingjun Hao
>Priority: Minor
> Fix For: 2.0.0
>
>
> sql("create table datatype_struct_parquet(price struct>) 
> stored as parquet")
>  sql("insert into table datatype_struct_parquet values(named_struct('b', 
> array('')))")
>  sql("create table datatype_struct_carbondata(price struct>) 
> stored as carbondata")
>  sql("insert into datatype_struct_carbondata select * from 
> datatype_struct_parquet")
>  
> {code:java}
> //
> sql("create table datatype_struct_parquet(price struct>) 
> stored as parquet") 
> sql("insert into table datatype_struct_parquet values(named_struct('b', 
> array('')))") 
> sql("create table datatype_struct_carbondata(price struct>) 
> stored as carbondata") sql("insert into datatype_struct_carbondata select * 
> from datatype_struct_parquet")
> checkAnswer( sql("SELECT * FROM datatype_struct_carbondata"), sql("SELECT * 
> FROM datatype_struct_parquet"))
> !== Correct Answer - 1 == == Spark Answer - 1 == 
> ![[WrappedArray()]] [[WrappedArray(null)]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3643) Insert array('')/array() into Struct column will result in array(null), which is inconsistent with Parquet

2019-12-30 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3643:

Description: 
sql("create table datatype_struct_parquet(price struct>) stored 
as parquet")
 sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))")
 sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata")
 sql("insert into datatype_struct_carbondata select * from 
datatype_struct_parquet")

 
{code:java}
//
checkAnswer( sql("SELECT * FROM datatype_struct_carbondata"), sql("SELECT * 
FROM datatype_struct_parquet"))

!== Correct Answer - 1 == == Spark Answer - 1 == 
![[WrappedArray()]] [[WrappedArray(null)]]
{code}
 

  was:
sql("create table datatype_struct_parquet(price struct>) stored 
as parquet")
sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))")
sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata")
sql("insert into datatype_struct_carbondata select * from 
datatype_struct_parquet")

 
{code:java}
//
!== Correct Answer - 1 == == Spark Answer - 1 == 
![[WrappedArray()]] [[WrappedArray(null)]]
{code}
 


> Insert array('')/array() into Struct column will result in 
> array(null), which is inconsistent with Parquet
> --
>
> Key: CARBONDATA-3643
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3643
> Project: CarbonData
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xingjun Hao
>Priority: Minor
> Fix For: 2.0.0
>
>
> sql("create table datatype_struct_parquet(price struct>) 
> stored as parquet")
>  sql("insert into table datatype_struct_parquet values(named_struct('b', 
> array('')))")
>  sql("create table datatype_struct_carbondata(price struct>) 
> stored as carbondata")
>  sql("insert into datatype_struct_carbondata select * from 
> datatype_struct_parquet")
>  
> {code:java}
> //
> checkAnswer( sql("SELECT * FROM datatype_struct_carbondata"), sql("SELECT * 
> FROM datatype_struct_parquet"))
> !== Correct Answer - 1 == == Spark Answer - 1 == 
> ![[WrappedArray()]] [[WrappedArray(null)]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3643) Insert array('')/array() into Struct column will result in array(null), which is inconsistent with Parquet

2019-12-30 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3643:

Description: 
 
{code:java}
//
sql("create table datatype_struct_parquet(price struct>) stored 
as parquet") 
sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))") 
sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata") 
sql("insert into datatype_struct_carbondata select * from 
datatype_struct_parquet")

checkAnswer( sql("SELECT * FROM datatype_struct_carbondata"), sql("SELECT * 
FROM datatype_struct_parquet"))

!== Correct Answer - 1 == == Spark Answer - 1 == 
![[WrappedArray()]] [[WrappedArray(null)]]
{code}
 

  was:
sql("create table datatype_struct_parquet(price struct>) stored 
as parquet")
 sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))")
 sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata")
 sql("insert into datatype_struct_carbondata select * from 
datatype_struct_parquet")

 
{code:java}
//

sql("create table datatype_struct_parquet(price struct>) stored 
as parquet") 
sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))") 
sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata") sql("insert into datatype_struct_carbondata select * 
from datatype_struct_parquet")

checkAnswer( sql("SELECT * FROM datatype_struct_carbondata"), sql("SELECT * 
FROM datatype_struct_parquet"))

!== Correct Answer - 1 == == Spark Answer - 1 == 
![[WrappedArray()]] [[WrappedArray(null)]]
{code}
 


> Insert array('')/array() into Struct column will result in 
> array(null), which is inconsistent with Parquet
> --
>
> Key: CARBONDATA-3643
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3643
> Project: CarbonData
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xingjun Hao
>Priority: Minor
> Fix For: 2.0.0
>
>
>  
> {code:java}
> //
> sql("create table datatype_struct_parquet(price struct>) 
> stored as parquet") 
> sql("insert into table datatype_struct_parquet values(named_struct('b', 
> array('')))") 
> sql("create table datatype_struct_carbondata(price struct>) 
> stored as carbondata") 
> sql("insert into datatype_struct_carbondata select * from 
> datatype_struct_parquet")
> checkAnswer( sql("SELECT * FROM datatype_struct_carbondata"), sql("SELECT * 
> FROM datatype_struct_parquet"))
> !== Correct Answer - 1 == == Spark Answer - 1 == 
> ![[WrappedArray()]] [[WrappedArray(null)]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3643) Insert array('')/array() into Struct column will result in array(null), which is inconsistent with Parquet

2019-12-30 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3643:

Fix Version/s: 2.0.0
Affects Version/s: 2.0.0
   1.6.1
  Description: 
sql("create table datatype_struct_parquet(price struct>) stored 
as parquet")
sql("insert into table datatype_struct_parquet values(named_struct('b', 
array('')))")
sql("create table datatype_struct_carbondata(price struct>) 
stored as carbondata")
sql("insert into datatype_struct_carbondata select * from 
datatype_struct_parquet")

 
{code:java}
//
!== Correct Answer - 1 == == Spark Answer - 1 == 
![[WrappedArray()]] [[WrappedArray(null)]]
{code}
 

> Insert array('')/array() into Struct column will result in 
> array(null), which is inconsistent with Parquet
> --
>
> Key: CARBONDATA-3643
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3643
> Project: CarbonData
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xingjun Hao
>Priority: Minor
> Fix For: 2.0.0
>
>
> sql("create table datatype_struct_parquet(price struct>) 
> stored as parquet")
> sql("insert into table datatype_struct_parquet values(named_struct('b', 
> array('')))")
> sql("create table datatype_struct_carbondata(price struct>) 
> stored as carbondata")
> sql("insert into datatype_struct_carbondata select * from 
> datatype_struct_parquet")
>  
> {code:java}
> //
> !== Correct Answer - 1 == == Spark Answer - 1 == 
> ![[WrappedArray()]] [[WrappedArray(null)]]
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3643) Insert array('')/array() into Struct column will result in array(null), which is inconsistent with Parquet

2019-12-30 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3643:
---

 Summary: Insert array('')/array() into Struct column will 
result in array(null), which is inconsistent with Parquet
 Key: CARBONDATA-3643
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3643
 Project: CarbonData
  Issue Type: Bug
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3635) [Carbon-Flink] Reduce the time interval at which data is visible

2019-12-28 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3635:
---

 Summary: [Carbon-Flink] Reduce the time interval at which data is 
visible
 Key: CARBONDATA-3635
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3635
 Project: CarbonData
  Issue Type: Improvement
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3634) Flink Loading support Complex\Array\Map\Binary

2019-12-28 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3634:
---

 Summary: Flink Loading support Complex\Array\Map\Binary 
 Key: CARBONDATA-3634
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3634
 Project: CarbonData
  Issue Type: New Feature
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3633) Support custom CHARSET for encode and decode binary

2019-12-28 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3633:
---

 Summary: Support custom CHARSET for encode and decode binary 
 Key: CARBONDATA-3633
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3633
 Project: CarbonData
  Issue Type: New Feature
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3632) Support configure ComplexDelimiters when INSERT

2019-12-28 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3632:
---

 Summary: Support configure ComplexDelimiters when INSERT 
 Key: CARBONDATA-3632
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3632
 Project: CarbonData
  Issue Type: New Feature
Reporter: Xingjun Hao






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3631) StringIndexOutOfBoundsException When Inserting Select From a Parquet Table with Empty array/map

2019-12-27 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3631:
---

 Summary: StringIndexOutOfBoundsException When Inserting Select 
From a Parquet Table with Empty array/map
 Key: CARBONDATA-3631
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3631
 Project: CarbonData
  Issue Type: Bug
Affects Versions: 1.6.1, 2.0.0
Reporter: Xingjun Hao
 Fix For: 2.0.0


sql("insert into datatype_array_parquet values(array())")
sql("insert into datatype_array_carbondata select f from 
datatype_array_parquet")

 
{code:java}
java.lang.StringIndexOutOfBoundsException: String index out of range: -1

at java.lang.AbstractStringBuilder.substring(AbstractStringBuilder.java:935)
at java.lang.StringBuilder.substring(StringBuilder.java:76)
at scala.collection.mutable.StringBuilder.substring(StringBuilder.scala:166)
at 
org.apache.carbondata.streaming.parser.FieldConverter$.objectToString(FieldConverter.scala:77)
at 
org.apache.carbondata.spark.util.CarbonScalaUtil$.getString(CarbonScalaUtil.scala:71)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (CARBONDATA-3619) NoSuchMethodError(registerCurrentOperationLog) While Creating Table

2019-12-16 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3619:

Description: 
ExecuteStatementOperation.java exists in both the hive-service module and the 
spark-hive-thriftserver module, leading to "NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V"
{code:java}
Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V
 
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.protected$registerCurrentOperationLog(SparkExecuteStatementOperation.scala:173)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:173)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
 
at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:185)
 
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 3 more
{code}

  was:
ExecuteStatementOperation.java exists in both the hive-service module and the 
spark-hive-thriftserver module, leading to "NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V"
{code:java}
/2019-12-17 11:18:00 WARN  CLIService:396 - OperationHandle 
[opType=EXECUTE_STATEMENT, 
getHandleIdentifier()=29bea7eb-0638-47a9-b177-23d50cc5676a]: The background 
operation was aborted2019-12-17 11:18:00 WARN  CLIService:396 - OperationHandle 
[opType=EXECUTE_STATEMENT, 
getHandleIdentifier()=29bea7eb-0638-47a9-b177-23d50cc5676a]: The background 
operation was abortedjava.util.concurrent.ExecutionException: 
java.lang.NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V
 at java.util.concurrent.FutureTask.report(FutureTask.java:122) at 
java.util.concurrent.FutureTask.get(FutureTask.java:206) at 
org.apache.hive.service.cli.CLIService.getOperationStatus(CLIService.java:387) 
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.GetOperationStatus(ThriftCLIService.java:610)
 at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1473)
 at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1458)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at 
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
 at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.protected$registerCurrentOperationLog(SparkExecuteStatementOperation.scala:173)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:173)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
 at java.security.AccessController.doPrivileged(Native Method) at 
javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:185)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) ... 3 more
{code}


> NoSuchMethodError(registerCurrentOperationLog) While Creating Table
> ---
>
> Key: CARBONDATA-3619
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3619
> Project: CarbonData
>  Issue Type: Bug
>  Components: spark-integration
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Xingjun Hao
>Priority: Minor
> Fix 

[jira] [Updated] (CARBONDATA-3619) NoSuchMethodError(registerCurrentOperationLog) While Creating Table

2019-12-16 Thread Xingjun Hao (Jira)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingjun Hao updated CARBONDATA-3619:

  Docs Text:   (was: 2019-12-17 11:18:00 WARN  CLIService:396 - 
OperationHandle [opType=EXECUTE_STATEMENT, 
getHandleIdentifier()=29bea7eb-0638-47a9-b177-23d50cc5676a]: The background 
operation was aborted
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)
at 
org.apache.hive.service.cli.CLIService.getOperationStatus(CLIService.java:387)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.GetOperationStatus(ThriftCLIService.java:610)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1473)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1458)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.protected$registerCurrentOperationLog(SparkExecuteStatementOperation.scala:173)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:173)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:185)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more)
Description: 
ExecuteStatementOperation.java exists in both the hive-service module and the 
spark-hive-thriftserver module, leading to "NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V"
{code:java}
/2019-12-17 11:18:00 WARN  CLIService:396 - OperationHandle 
[opType=EXECUTE_STATEMENT, 
getHandleIdentifier()=29bea7eb-0638-47a9-b177-23d50cc5676a]: The background 
operation was aborted2019-12-17 11:18:00 WARN  CLIService:396 - OperationHandle 
[opType=EXECUTE_STATEMENT, 
getHandleIdentifier()=29bea7eb-0638-47a9-b177-23d50cc5676a]: The background 
operation was abortedjava.util.concurrent.ExecutionException: 
java.lang.NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V
 at java.util.concurrent.FutureTask.report(FutureTask.java:122) at 
java.util.concurrent.FutureTask.get(FutureTask.java:206) at 
org.apache.hive.service.cli.CLIService.getOperationStatus(CLIService.java:387) 
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.GetOperationStatus(ThriftCLIService.java:610)
 at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1473)
 at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1458)
 at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at 
org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
 at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V
 at 

[jira] [Created] (CARBONDATA-3619) NoSuchMethodError(registerCurrentOperationLog) While Creating Table

2019-12-16 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3619:
---

 Summary: NoSuchMethodError(registerCurrentOperationLog) While 
Creating Table
 Key: CARBONDATA-3619
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3619
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 1.6.1, 2.0.0
Reporter: Xingjun Hao
 Fix For: 2.0.0


ExecuteStatementOperation.java exists in both the hive-service module and the 
spark-hive-thriftserver module, leading to "NoSuchMethodError: 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.registerCurrentOperationLog()V"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (CARBONDATA-3617) loadDataUsingGlobalSort should be based on SortColumns Instead Of Whole CarbonRow

2019-12-11 Thread Xingjun Hao (Jira)
Xingjun Hao created CARBONDATA-3617:
---

 Summary: loadDataUsingGlobalSort should be based on SortColumns 
Instead Of Whole CarbonRow
 Key: CARBONDATA-3617
 URL: https://issues.apache.org/jira/browse/CARBONDATA-3617
 Project: CarbonData
  Issue Type: Improvement
  Components: data-load
Affects Versions: 1.6.1, 2.0.0
Reporter: Xingjun Hao
 Fix For: 1.6.1, 2.0.0


During data loading using global sort, the sort-by processing is based on the 
whole CarbonRow, so the GC overhead is huge when there are many columns. 
Theoretically, the sort-by processing can work just as well on the sort 
columns only, which brings lower time and GC overhead.
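
A minimal sketch of the idea in Spark RDD terms, with rows simplified to 
arrays and the sort columns to a set of indices: sort by a small projected key 
instead of comparing the full-width row.

{code:scala}
import org.apache.spark.rdd.RDD

// Sketch only: sorting by the projected sort-column key keeps the
// comparator (and its allocations) away from the full-width row.
def globalSortBySortColumns(rows: RDD[Array[AnyRef]],
                            sortColumnIndices: Seq[Int]): RDD[Array[AnyRef]] = {
  rows.sortBy(row => sortColumnIndices.map(i => String.valueOf(row(i))).mkString("\u0001"))
}
{code}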



--
This message was sent by Atlassian Jira
(v8.3.4#803005)