[DISCUSSION] Proposing new PMC member

2022-08-05 Thread Jacky Li
carbondata in data analytics domain. So I'd like to promote him as a new PMC member. Regards, Jacky Li

Re: [DISCUSSION]Carbondata Streamer tool and Schema change capture in CDC merge

2021-09-01 Thread Jacky Li
+1 It is a really good feature, looking forward to it. I suggest breaking it down into small tasks so that it is easy to review. Regards, Jacky On 2021/08/31 17:47:35, Akash Nilugal wrote: > Hi Community, > > OLTP systems like Mysql are used heavily for storing transactional data in > real-time

Re: [VOTE] Apache CarbonData 2.2.0(RC1) release

2021-07-08 Thread Jacky Li
-1, I suggest the following PRs be merged before release: #4148 #4157 #4158 #4162 Regards, Jacky Li > On Jul 6, 2021 at 3:14 PM, Akash Nilugal wrote: > > Hi All, > > I submit the *Apache CarbonData 2.2.0(RC1) *for your vote. > > > > *1. Release Notes:* > https:/

Re: [VOTE] Apache CarbonData 2.0.1(RC1) release

2020-06-01 Thread Jacky Li
+1 (binding) Regards, Jacky > On Jun 1, 2020 at 7:03 PM, Kunal Kapoor wrote: > > Hi All, > > I submit the Apache CarbonData 2.0.1(RC1) for your vote. > > > *1.Release Notes:* > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12347870 > > *2. The tag to be voted upon* : >

[DISCUSSION] About global sort in 2.0.0

2020-05-31 Thread Jacky Li
Hi All, In CarbonData version 2.0.0, there is a bug that makes global-sort use an incorrect sort value when the sorting column is String. This impacts all existing global-sort tables when doing a new load or insert into. So I suggest the community should have a patch release to fix this bug ASAP.

Re: [VOTE] Apache CarbonData 2.0.0(RC3) release

2020-05-17 Thread Jacky Li
+1 Regards, Jacky > On May 17, 2020 at 4:50 PM, Kunal Kapoor wrote: > > Hi All, > > I submit the Apache CarbonData 2.0.0(RC3) for your vote. > > > *1.Release Notes:* > https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346046=Html=12320220 > > *Some key features and improvements in this

Re: [Disscussion] Remove 'Create Stream'

2020-05-16 Thread Jacky Li
+1, we should improve it in the community Regards, Jacky > On May 13, 2020 at 9:20 AM, David CaiQiang wrote: > > How about marking the stream SQL as experimental? > > Now in some cases, it is an easy way for the user to understand the > streaming table. > > We can improve it in the future. > > > - > Best

Re: [Discussion]Float and Double compatibility issue with external segments to Carbon

2020-05-10 Thread Jacky Li
Hi, Yes, I think we should correct it. In the schema, it should be float for float type. In the internal store, it is using adaptive encoding, so I think it is ok anyway. Regards, Jacky > On May 8, 2020 at 10:23 AM, David CaiQiang wrote: > > It is a historical legacy issue and easy to reuse the solution of

Re: [VOTE] Apache CarbonData 2.0.0(RC2) release

2020-05-01 Thread Jacky Li
I have to give -1 to this RC. Regards, Jacky Li > On May 1, 2020 at 1:44 AM, Kunal Kapoor wrote: > > Hi All, > > I submit the Apache CarbonData 2.0.0(RC2) for your vote. > > > *1.Release Notes:* > https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346046=Html

Re: Propose to upgrade hive version to 3.1.0

2020-02-22 Thread Jacky Li
+1 Regards, Jacky > On Feb 21, 2020 at 7:09 PM, Kunal Kapoor wrote: > > Hi All, > > The hive community has already released version 3.1.0 which has a lot of > bug fixes and new features. > Many of the users have already migrated to 3.1.0 in their production > environment and I think it's time we should

Re: Discussion: change default compressor to ZSTD

2020-02-20 Thread Jacky Li
we do for every version and decide based on that. >> >> Regards, >> Ravindra. >> >> On Fri, 7 Feb 2020 at 10:40 AM, Jacky Li wrote: >> >>> Hi Ajantha, >>> >>> >>> Yes, decoder will use the compressorName stored in C

Re: Improving show segment info

2020-02-17 Thread Jacky Li
> On Feb 17, 2020 at 2:00 PM, akashrn5 wrote: > > Hi, > >>> *1. How about creating a "tableName.segmentInfo" child table for each main >>> table?* user can query this table and easy to support filter, group by. we >>> just have to finalize the schema of this table. > We already have many things like

Re: Improving show segment info

2020-02-17 Thread Jacky Li
> On Feb 17, 2020 at 2:03 PM, akashrn5 wrote: > > Hi Ajantha, > > I think event time comes into picture when the user has the timestamp > column, like in timeseries. So only in that case, this column makes sense. > > Else it won't be there. > > @Likun, correct me if my understanding is wrong. > Yes,

Re: Improving show segment info

2020-02-16 Thread Jacky Li
> On Feb 16, 2020 at 4:58 PM, akashrn5 wrote: > > Hi likun, > > Thanks for proposing this > > +1, its a good way and its better to provide user more info about segment. > > I have following doubts and suggestions. > > 1. You have mentioned DDL as Show segments On table, but currently it is > show

Improving show segment info

2020-02-16 Thread Jacky Li
Hi community, Currently for the SHOW SEGMENT command, carbon will print a table with the columns: |SegmentSequenceId |Status |Load Start Time|Load End Time |Merged To|File Format|Data

Re: Regarding presto carbondata integration

2020-02-15 Thread Jacky Li
> On Feb 12, 2020 at 1:33 PM, Ajantha Bhat wrote: > > Hi all, > > Currently master code of carbondata works with *prestodb 0.217* > We all know about the competing *presto-sql* also. > Some of the users don't want to migrate to *presto-sql* as their cloud > vendor doesn't support presto sql (Example, AWS

Re: [DISCUSSION] Multi-tenant support by refactoring datamaps

2020-02-15 Thread Jacky Li
Hi, +1 for moving the DataMapSchema json file to the database folder, for supporting multi-tenancy. Furthermore, I suggest we further refactor the datamap. The reason is that now the Secondary Index feature has been introduced into CarbonData, and it stores the index metadata as the table

Re: Discussion: change default compressor to ZSTD

2020-02-06 Thread Jacky Li
Hi Ajantha, Yes, the decoder will use the compressorName stored in ChunkCompressionMeta from the file header, but I think it is better to put it in the file name so that the user can see the compressor in the shell without launching an engine to read it. In spark, for parquet/orc the file name written
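The naming idea above can be sketched in plain Java. The scheme (compressor name as a suffix before the extension) is illustrative only, not the actual CarbonData file naming convention:

```java
// Illustrative sketch: embed the compressor name in a data file name, as
// spark does for parquet/orc, so the compressor is visible from the shell.
public class CompressorFileName {
    // Build a file name like "part-0-0.zstd.carbondata" (hypothetical scheme).
    public static String fileName(String prefix, String compressor) {
        return prefix + "." + compressor + ".carbondata";
    }

    // Recover the compressor name from such a file name without opening the file.
    public static String compressorOf(String fileName) {
        String[] parts = fileName.split("\\.");
        return parts[parts.length - 2];
    }

    public static void main(String[] args) {
        String f = fileName("part-0-0", "zstd");
        System.out.println(f);               // part-0-0.zstd.carbondata
        System.out.println(compressorOf(f)); // zstd
    }
}
```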

Discussion: change default compressor to ZSTD

2020-02-06 Thread Jacky Li
Hi, I compared the snappy and zstd compressors using TPCH for carbondata. For the TPCH lineitem table:

              carbon-zstd   carbon-snappy
loading (s)   53            51
size          795MB         1.2GB

TPCH query times (s):
Q1   4.289    8.29
Q2   12.609   12.986
Q3   14.902   14.458
Q4   6.276    5.954
Q5   23.147   21.946
Q6   1.12     0.945
Q7   23.017   28.007
Q8   14.554   15.077
Q9   28.472   27.473

Re: [Discussion] Support Secondary Index on Carbon Table

2020-02-05 Thread Jacky Li
+1 Thanks for proposing this :) Regards, Jacky ---- From: "Kunal Kapoor" https://issues.apache.org/jira/browse/CARBONDATA-3680 Thanks Regards, Indhumathi M

Re: [Discussion] Support SegmentLevel MinMax for better Pruning and less driver memory usage

2020-01-14 Thread Jacky Li
+1 This can reduce the memory footprint in the spark driver; it is great for ultra big data Regards, Jacky > On Jan 14, 2020 at 4:38 PM, Indhumathi wrote: > > Hello all, > > In Cloud scenarios, index is too big to store in SparkDriver, since VM may > not have so much memory. > Currently in Carbon, we will

[DISCUSSION] Deprecating V1 and V2 file format

2019-12-27 Thread Jacky Li
Hi all, CarbonData introduced the V3 file format in 2017, almost 3 years ago. I have not seen issues reported for the V1 and V2 formats for a very long time, so I guess all users have migrated to the V3 format. So, to reduce the community's maintenance effort, I suggest keeping V3 format support only since

Re: Optimize and refactor insert into command

2019-12-27 Thread Jacky Li
Definitely +1, please feel free to create a JIRA issue and PR Regards, Jacky > On Dec 20, 2019 at 7:55 AM, Ajantha Bhat wrote: > > Currently carbondata "insert into" uses the CarbonLoadDataCommand itself. > Load process has steps like parsing and converter step with bad record > support. > Insert into

DISCUSSION: removing support for Spark 2.1 and 2.2 in CarbonData 2.0

2019-12-21 Thread Jacky Li
Hi, For the following reasons, I suggest deprecating the support of spark 2.1 and 2.2 integration in CarbonData 2.0: 1. As the spark community is moving to 3.0, versions below 2.3 will be out of the maintenance lifecycle in the spark community. 2. In PR #3378 (Integrating with Spark 2.4.4), the

Re: Apply to open 'Issues' tab in Apache CarbonData github

2019-12-21 Thread Jacky Li
+1 It is easier for developers and users to discuss and track the progress of issues Regards, Jacky -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Changing default spark dependency to 2.3

2019-12-18 Thread Jacky Li
Yes, it applies for CarbonData 2.0 Regards, Jacky -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion]Gson version problem

2019-12-08 Thread Jacky Li
Hi Akash, I checked that the "alternate" annotation is already used in DataMapSchema line 56, and TableSchema line 70. So I am wondering how this worked in a hadoop cluster earlier? Regards, Jacky -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

gson dependency problem

2019-12-08 Thread Jacky Li
Hi Akash, Currently in the master branch, we have introduced "alternate" from gson 2.4 to make the table status file smaller, and I got a message from you that gson 2.4 is conflicting with the gson 2.2.4 dependency in hadoop-common, so there is a failure in the hadoop cluster scenario. However, I checked the
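For context, gson 2.4's `@SerializedName(value = ..., alternate = ...)` lets deserialization accept older field names, which is how the table status file can shrink its field names without breaking old files. Conceptually the name resolution works like this plain-Java sketch (no gson dependency; field names here are hypothetical):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Conceptual sketch of gson's alternate-name resolution: try the primary
// (new, shorter) name first, then fall back to the alternates (old names).
public class AlternateNameResolver {
    public static Object resolve(Map<String, Object> json, String primary, List<String> alternates) {
        if (json.containsKey(primary)) {
            return json.get(primary);
        }
        for (String alt : alternates) {
            if (json.containsKey(alt)) {
                return json.get(alt);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // An old table status file may still use the long field name.
        Map<String, Object> oldFile = Map.of("loadStatus", "Success");
        Object v = resolve(oldFile, "ls", Arrays.asList("loadStatus"));
        System.out.println(v); // Success
    }
}
```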

Re: Propose feature change in CarbonData 2.0

2019-12-05 Thread Jacky Li
Hi, Thanks for all your input, the voting summary is as below:
1. Global dictionary: no -1
2. Bucket: two -1
3. Carbon custom partition: no -1
4. BATCH_SORT: no -1
5. Page level inverse index: one -1
5. old preaggregate and time series datamap implementation: no -1
6. Lucene datamap: Five

Propose feature change in CarbonData 2.0

2019-11-28 Thread Jacky Li
Hi Community, As we are moving to CarbonData 2.0, in order to keep the project moving forward fast and stable, it is necessary to do some refactoring and clean up obsolete features before introducing new features. To do that, I propose making the following features obsolete and not supported

Re: [DISCUSSION] Page Level Bloom Filter

2019-11-28 Thread Jacky Li
Hi, Since the new bloom filter integration is designed to work on the executor side, in terms of keeping it simple, I actually prefer to keep it inside the data file itself instead of in a separate index file. So on the executor side, reading only one file is enough. And it is always better if

Re: [DISCUSSION]Support for Geospatial indexing

2019-11-11 Thread Jacky Li
I am not familiar with Apache SIS; is it already integrated with other storage systems? Is there any pointer to learn about this? In my opinion, this thread was discussing the indexing part in CarbonData to accelerate geospatial related queries. If Apache SIS offers an integration framework and

[DISCUSSION] Changing default spark dependency to 2.3

2019-11-11 Thread Jacky Li
Hi, The current carbondata jar release put in the maven repo is built with spark-2.2. As the spark-2.2 release is almost 2 years old, I suggest changing this dependency to spark-2.3, so future releases of CarbonData will by default depend on spark-2.3. What do you think? Regards, Jacky

Re: Data Load performance degrade when number of segment increase

2019-11-11 Thread Jacky Li
Hi, My suggestion is: 1. Reduce the number of calls to readTableStatusFile as much as possible in both loading and query. 2. A cache may be added inside SegmentStatusManager for LoadMetadataDetails, and cache invalidation should be carefully done, e.g. for the case of dropping a table. 3. Do compaction
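Suggestion 2 above (a cache with careful invalidation) can be sketched as follows; the class and method names are hypothetical, not the actual SegmentStatusManager API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: cache table status reads keyed by table path, with
// explicit invalidation for events like drop table or a new load.
public class TableStatusCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> reader; // the real file read
    private int reads = 0;                          // counts physical reads

    public TableStatusCache(Function<String, String> reader) {
        this.reader = reader;
    }

    // Read through the cache; only the first access hits the file system.
    public String get(String tablePath) {
        return cache.computeIfAbsent(tablePath, p -> {
            reads++;
            return reader.apply(p);
        });
    }

    // Must be called on drop table / new load so stale status is not served.
    public void invalidate(String tablePath) {
        cache.remove(tablePath);
    }

    public int physicalReads() {
        return reads;
    }

    public static void main(String[] args) {
        TableStatusCache c = new TableStatusCache(p -> "status-of-" + p);
        c.get("/t1");
        c.get("/t1");                          // served from cache
        System.out.println(c.physicalReads()); // 1
        c.invalidate("/t1");                   // e.g. table dropped and recreated
        c.get("/t1");
        System.out.println(c.physicalReads()); // 2
    }
}
```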

Re: [DISCUSSION] Page Level Bloom Filter

2019-11-11 Thread Jacky Li
s. We should leverage the datachunk3 and check whether the column pages are needed before reading. This can reduce the IO dramatically for some use cases, for example, high selectivity filter queries. > > Anyone interested in this part is welcome to share your ideas also. > > Thanks. >

Re: [DISCUSSION] Page Level Bloom Filter

2019-11-04 Thread Jacky Li
Hi Manhua, +1 for this feature. One question: Since one column chunk in one blocklet is carbon's minimum IO unit, why not create the bloom filter at blocklet level? If it is page level, we still need to read page data into memory; the saving is only for decompression. Regards, Jacky -- Sent

Re: [DISCUSSION] Support write Flink streaming data to Carbon

2019-10-30 Thread Jacky Li
+1 for this feature; in my opinion, flink-carbon is a good fit for near real-time analytics. One doubt is that in your design, the Collect Segment command and Compaction command are two separate commands, right? The Collect Segment command modifies the metadata files (tablestatus file and segment

Re: [DISCUSSION]Support for Geospatial indexing

2019-10-24 Thread Jacky Li
Thanks for the analysis. Please be careful about reusing code from other "opensource" repos, especially regarding the License. Regards, Jacky On 2019/10/24 06:25:40, Ajantha Bhat wrote: > Hi Jacky, > > we have checked about geomesa > > [image: Screenshot from 2019-10-23 16-25-23.png] > > a.

Re: [DISCUSSION]Support for Geospatial indexing

2019-10-17 Thread Jacky Li
definitely +1. Before going through the design doc, I have two questions: 1. In this domain, there are some opensource solutions with SQL extensions or DSLs designed for geospatial analytics, such as geomesa (it also works with spark). So are there considerations to integrate with geomesa also?

Re: [DISCUSSION] Support heterogeneous format segments in carbondata

2019-09-28 Thread Jacky Li
IMHO On 2019/09/11 06:46:21, chetan bhat wrote: > Hi Ravi, > > 1. What are the data formats that shall be supported to add segment. ? I think for the first phase we can target the tables that user may want to migrate to carbon, like orc and parquet tables. In future, we can consider CSV

Re: Adapt to SparkSessionExtensions

2019-08-26 Thread Jacky Li
I have created branch-2.0, let's work on this feature in this branch. Regards, Jacky On 2019/08/22 04:58:53, Ajith shetty wrote: > Hi Community > > From https://issues.apache.org/jira/browse/SPARK-18127 Spark provides > SparkSessionExtensions in order to extended capabilities of spark.

Re: Adapt to SparkSessionExtensions

2019-08-22 Thread Jacky Li
+1 And since we are starting this refactoring for CarbonData 2.0, which is a major version upgrade, I suggest considering optimizing the following features: 1. make global dictionary obsolete so that the planning phase is cleaner. After the spark tungsten project, actually the benefit gained from global dictionary

Re: [VOTE] Apache CarbonData 1.6.0(RC3) release

2019-08-15 Thread Jacky Li
+1 Regards, Jacky On 2019/08/13 11:41:45, Raghunandan S wrote: > Hi > > > I submit the Apache CarbonData 1.6.0 (RC3) for your vote. > > > 1.Release Notes: > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344965 > > > Some key features and improvements

Re: [VOTE] Apache CarbonData 1.6.0(RC2) release

2019-08-08 Thread Jacky Li
Hi, I'd suggest the NPE-related fixes be merged into this version, like CARBONDATA-3466, CARBONDATA-3482, CARBONDATA-3452. So -1. Regards, Jacky On 2019/08/03 07:10:24, Raghunandan S wrote: > Hi > > > I submit the Apache CarbonData 1.6.0 (RC2) for your vote. > > > 1.Release Notes: > >

Re: Query About Carbon Write Process : why always 10 Task get created when we write dataframe or rdd in carbon format in a write job or save job

2019-05-25 Thread Jacky Li
one correction for my last reply, the property to control the number of threads for sorting during data load is: "carbon.number.of.cores.while.loading" You can set it like CarbonProperties.getInstance().addProperty("carbon.number.of.cores.while.loading", 8) Regards, Jacky -- Sent from:

Re: Query About Carbon Write Process : why always 10 Task get created when we write dataframe or rdd in carbon format in a write job or save job

2019-05-25 Thread Jacky Li
Hi Anshul Jain, If you have specified the SORT_COLUMNS table property when creating table, by default carbon will sort the input data during data loading (to build index). The sorting is controlled by a table property called SORT_SCOPE, by default it is LOCAL_SORT, it means it will sort the data

Re: [Collection] Collect the requirement of hive + CarbonData

2019-05-25 Thread Jacky Li
+1 for Dhatchayani's advice. It is better that carbon sets those properties instead of the user Regards, Jacky -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [VOTE] Apache CarbonData 1.5.4(RC1) release

2019-05-25 Thread Jacky Li
+1 This version adds some very good features! Regards, Jacky -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Support Apache arrow vector filling from carbondata SDK

2019-05-09 Thread Jacky Li
+1 Regards, Jacky -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [Discussion] Migrate CarbonData to support PrestoSQL

2019-05-09 Thread Jacky Li
+1 for carbondata to support PrestoSQL, which has an active Presto community. Regards, Jacky -- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Add new compaction type for compacting delta data file

2019-03-30 Thread Jacky Li
I guess your intention is to rewrite a single segment by merging the base file and delta files to improve the query performance of that segment, right? I think this is doable; note that this operation may be time consuming since it is rewriting the whole segment. Regards, Jacky

Re: [VOTE] Apache CarbonData 1.5.2(RC2) release

2019-01-31 Thread Jacky Li
+1 Regards, Jacky > On Jan 31, 2019 at 1:23 AM, Raghunandan S wrote: > > Hi > > > I submit the Apache CarbonData 1.5.2 (RC2) for your vote. > > > 1.Release Notes: > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344321 > > >Some key features and improvements in

Re: [DISCUSS] Move to gitbox as per ASF infra team mail

2019-01-10 Thread Jacky Li
+1 Regards, Jacky > On Jan 5, 2019 at 10:08 AM, Liang Chen wrote: > > Hi all, > > Background : > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/NOTICE-Mandatory-migration-of-git-repositories-to-gitbox-apache-org-td72614.html > > Apache Hadoop git repository is in git-wip-us

Re: [carbondata-presto enhancements] support reading carbon SDK writer output in presto

2018-12-10 Thread Jacky Li
Thanks. Can we do the same for the spark integration also? I see there are two datasources now: “carbon” and “carbondata”. It is not easy for users to differentiate when to use which one. Since we are discussing “support transactional table in SDK”, I think we can unify “carbon” and

Re: [DISCUSS] Support transactional table in SDK

2018-12-10 Thread Jacky Li
> On Dec 8, 2018 at 3:53 PM, Liang Chen wrote: > > Hi > > Good idea, thank you for starting this discussion. > > Agree with Ravi's comments, we need to double-check some limitations after > introducing the feature. > > Flink and Kafka integration can be discussed later. > For using SDK to write new data

Re: [carbondata-presto enhancements] support reading carbon SDK writer output in presto

2018-12-10 Thread Jacky Li
Hi Ajantha, Currently for the carbon-presto integration, there is a plugin called “carbondata”. I wonder, will you introduce a new plugin into the project? I suggest we re-use the same plugin and decide the read path within the plugin. What do you think? Regards, Jacky > On Dec 10, 2018 at 2:31 PM, Ajantha

Re: [DISCUSS] Support transactional table in SDK

2018-12-10 Thread Jacky Li
Hi Nicholas, Yes, this is a feature required for flink-carbon to write to transactional tables. You are welcome to participate in this. I think you can contribute by reviewing the design doc in CARBONDATA-3152 first; after we settle on the API we can open sub-tasks for this ticket.

Re: [DISCUSS] Support transactional table in SDK

2018-12-10 Thread Jacky Li
> On Dec 7, 2018 at 11:05 PM, ravipesala wrote: > > Hi Jacky, > > It's a good idea to support writing transactional tables from SDK. But we need > to add the following limitations as well > 1. It can work on file systems which can take an append lock like HDFS. Likun: yes, since we need to overwrite the table

[DISCUSS] Support transactional table in SDK

2018-12-06 Thread Jacky Li
Hi All, In order to support application integration without a central coordinator, like Flink and Kafka Streams, transactional tables need to be supported in the SDK, and a new type of segment called Online Segment is proposed. Since it is hard to describe the motivation and design in a good format in

Re: [Discussion]Alter table column rename feature

2018-12-06 Thread Jacky Li
+1 Wondering whether you will add a new command or re-use the ALTER TABLE change datatype command, since it is also for a specified column Regards, Jacky > On Dec 5, 2018 at 7:03 PM, Akash Nilugal wrote: > > Hi community, > > Currently carbon supports alter table rename, add column, drop column and >

Re: Carbondata support flink feature

2018-12-06 Thread Jacky Li
Yes, I think we can integrate like this by using the SDK. But before that Carbon should support transactional tables in the SDK. I have sent another mail to describe the idea (Support transactional table in SDK). Regards, Jacky > On Dec 6, 2018 at 8:44 AM, Nicholas wrote: > > Hi, flink integration designs

[Discussion] Support transactional table in SDK

2018-12-06 Thread Jacky Li
Hi All, In order to support application integration without a central coordinator, like Flink and Kafka Streams, transactional tables need to be supported in the SDK, and a new type of segment called Online Segment is proposed. Since it is hard to describe the motivation and design in a good format in

Re: [VOTE] Apache CarbonData 1.5.1(RC2) release

2018-12-02 Thread Jacky Li
I think there are other places that are using apache-common-log, like https://github.com/apache/carbondata/blob/382ce430a18ca3d7d0b444777c66591e2c2e705f/hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonInputFormat.java#L103 Since this is not introduced in this version, I think it is

Re: Enhancement on compaction performance

2018-11-08 Thread Jacky Li
Hi Xuchuanyin, This feature is great for compaction. I wonder, do you observe more memory being used since it prefetches data into memory? Do you have any numbers? Regards, Jacky > On Nov 7, 2018 at 11:54 PM, xuchuanyin wrote: > > Hi all: > I am raising a PR to enhance the performance of compaction. The PR

Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-11-07 Thread Jacky Li
The example is missing in my last mail; now I have put the example in CARBONDATA-3087, please go to the JIRA and reply if you have any comments Regards, Jacky -- Sent from:

Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-11-07 Thread Jacky Li
Hi, I revisited this discussion and suggest changing the DESC FORMATTED output to the following. The information is outlined in 6 sections: 1. Table basic information 2. Index information 3. Encoding information 4. Compaction information 5. Partition information (only for partition table) 6.

Re: Proposal to integrate QATCodec into Carbondata

2018-11-01 Thread Jacky Li
Hi, Good to know about QATCodec. I have a quick question. Is the QATCodec an independent compression/decompression library or does it depend on any hardware to achieve the performance improvement you have mentioned? Is there any link to the QATCodec project or source code? Regards, Jacky >

Re: [Discussion] Refactor dynamic configuration

2018-10-31 Thread Jacky Li
Hi Xubo, Since you have modified so many places for this feature, I think instead of adding more annotations, it is better to make CarbonProperty more object-oriented. For example, we can change CarbonProperty to class Property<T> { String name; T value; T defaultValue; String doc;
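A fleshed-out version of the Property class sketched above might look like the following; field and method names are illustrative, not the actual CarbonProperties API:

```java
// Minimal sketch of an object-oriented property: name, typed value,
// typed default, and documentation string live together in one object.
public class PropertySketch {
    static class Property<T> {
        final String name;
        final T defaultValue;
        final String doc;
        private T value;

        Property(String name, T defaultValue, String doc) {
            this.name = name;
            this.defaultValue = defaultValue;
            this.doc = doc;
        }

        // Fall back to the default when no value has been set.
        T get() {
            return value != null ? value : defaultValue;
        }

        void set(T v) {
            value = v;
        }
    }

    public static void main(String[] args) {
        Property<Integer> loadingCores = new Property<>(
            "carbon.number.of.cores.while.loading", 2, "threads used while loading");
        System.out.println(loadingCores.get()); // 2 (the default)
        loadingCores.set(8);
        System.out.println(loadingCores.get()); // 8
    }
}
```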

Re: [Discussion] Provide separate audit log

2018-10-31 Thread Jacky Li
I am planning to do it by adding a small framework in AtomicRunnableCommand so that all commands will be audited automatically. After that I will remove all the old audit logs in each command. OK, I will change the tableId to table. Regards, Jacky > On Oct 31, 2018 at 3:49 PM, xuchuanyin wrote: > > +1

[Discussion] Provide separate audit log

2018-10-31 Thread Jacky Li
Hi, Currently, CarbonData outputs the audit log together with other log levels in one log file, so it is not easy for the user to check the audit. And sometimes the audit information is not complete since it depends on each Command to invoke the Logger in its run function. To improve it, I propose a new
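One way to make auditing automatic, rather than depending on each Command to invoke the Logger itself, is a template method in an abstract command base class. This is a sketch with hypothetical names, not the actual AtomicRunnableCommand implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Template-method sketch: run() audits around the command body, so every
// subclass is audited automatically without calling the logger itself.
public abstract class AuditedCommand {
    // Stands in for a dedicated audit log appender.
    static final List<String> AUDIT_LOG = new ArrayList<>();

    protected abstract String name();
    protected abstract void processData();

    public final void run(String user, String table) {
        AUDIT_LOG.add(String.format("start: %s user=%s table=%s", name(), user, table));
        processData();
        AUDIT_LOG.add(String.format("end:   %s user=%s table=%s", name(), user, table));
    }

    public static void main(String[] args) {
        AuditedCommand load = new AuditedCommand() {
            protected String name() { return "LoadDataCommand"; }
            protected void processData() { /* actual command work */ }
        };
        load.run("root", "db.sales");
        AUDIT_LOG.forEach(System.out::println);
    }
}
```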

Re: [1.5.2] Gzip Compression Support

2018-10-18 Thread Jacky Li
maybe we can move the position of the ByteBuffer and Gzip can uncompress starting from the position we give? I remember ZSTD supports this. > > Regards, > Shardul > > On Thu, Oct 18, 2018 at 9:09 AM Jacky Li wrote: > >> +1 >> >> I have som

Re: [Proposal] Proposal to change default value of two parameters for data loading

2018-10-17 Thread Jacky Li
+1 > On Oct 15, 2018 at 9:03 PM, xuchuanyin wrote: > > Hi, all: > > About a year ago, we introduced 'multiple dirs for temp data' to solve the disk > hotspot problem in data loading. > > This feature enables carbon to randomly pick one of the local directories > configured in yarn-local-dirs when it writes

Re: [1.5.2] Gzip Compression Support

2018-10-17 Thread Jacky Li
+1 I have some questions: 1. Other than uncompressByteArray, does Gzip offer uncompressShortArray, uncompressIntArray? 2. Does Gzip need the uncompressed size to allocate the target array before uncompressing? 3. Does your solution require a data copy? Regards, Jacky > On Oct 12, 2018 at 6:49 PM, shardul
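On question 2: the JDK's streaming GZIP API does not need the uncompressed size up front, because the output can be grown while reading the stream (though it offers no short/int array variants like the ones asked about in question 1). A minimal round-trip sketch using only java.util.zip:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Compress a byte array with GZIP.
    public static byte[] compress(byte[] input) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(input);
        }
        return bos.toByteArray();
    }

    // Decompress without knowing the uncompressed size up front:
    // GZIPInputStream is stream-based, so we grow the output as we read.
    public static byte[] decompress(byte[] compressed) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) > 0) {
                bos.write(buffer, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "carbondata page data".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = decompress(compress(original));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```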

Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-10-08 Thread Jacky Li
value for these properties may change; if they are changed, the user will not know what table property was used when writing the file. Regards, Jacky Li > On Oct 8, 2018 at 12:20 AM, xm_zzc <441586...@qq.com> wrote: > > Hi: > I agree with Jacky. > Currently if I use the default v

Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-10-07 Thread Jacky Li
Looking at the DESC FORMATTED command again, I still feel it is not very clear for the table property section. For table properties, I think it is not very good for DESC command to print the default value if the user does not specify when creating the table. Because the default value in

Re: CarbonData Performance Optimization

2018-09-26 Thread Jacky Li
+1 > On Sep 21, 2018 at 10:20 AM, Ravindra Pesala wrote: > > Hi, > > In case of querying data using Spark or Presto, carbondata is not well > optimized for reading data and filling the vector. The major issues are as > follows. > 1. CarbonData has a long method stack for reading and filling out the data to >

Re: [VOTE] Apache CarbonData 1.5.0(RC1) release

2018-09-26 Thread Jacky Li
There is a bug related to global dictionary reported by Aaron in the mailing list; I suggest including this bug fix in this version. Regards, Jacky > On Sep 26, 2018 at 1:52 PM, Ravindra Pesala wrote: > > Hi > > I submit the Apache CarbonData 1.5.0 (RC1) for your vote. > > 1.Release Notes: >

Re: [DISCUSSION] Support Binary DataType

2018-09-14 Thread Jacky Li
There is an existing PR 2665 that works on the binary data type; is your work based on that PR or is it a new one? Regards, Jacky > On Sep 14, 2018 at 2:30 PM, Indhumathi wrote: > > Hello All, > > I am working on supporting Binary DataType. Please find below > the scope and design approach for the same. > >

Re: [Discussion] Support for Float and Byte data types

2018-09-14 Thread Jacky Li
I think your proposal will support CarbonSession also, not only SDK and FileFormat, right? Regards, Jacky > On Sep 14, 2018 at 12:34 PM, Kunal Kapoor wrote: > > Hi xuchuanyin, > Yes your understanding is correct and I agree that the documentation has to be > updated to mention that for the old store double

CarbonWriterBuild issue

2018-09-12 Thread Jacky Li
Hi Dev, I observed that we have added many buildXXXWriter methods in CarbonWriterBuilder in the SDK, but all of them are just different parameter combinations. I think it is better to add those parameters by withXXX functions, which complies with the Builder pattern. Otherwise it is hard to maintain if we keep
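The withXXX style suggested above follows the standard Builder pattern: each setter returns the builder so calls chain, and a single build() validates and assembles the result. A minimal sketch with hypothetical parameter names (not the actual CarbonWriterBuilder API):

```java
// Hypothetical sketch of the proposed withXXX builder style; the fields and
// the string returned by build() are illustrative stand-ins for a writer.
public class WriterBuilderSketch {
    private String outputPath;          // mandatory
    private String tableName = "_temp"; // optional, with a default
    private long blockSize = 1024L;     // optional, with a default

    public WriterBuilderSketch withOutputPath(String path) {
        this.outputPath = path;
        return this;
    }

    public WriterBuilderSketch withTableName(String name) {
        this.tableName = name;
        return this;
    }

    public WriterBuilderSketch withBlockSize(long size) {
        this.blockSize = size;
        return this;
    }

    // Validation happens once, here, instead of in every buildXXXWriter variant.
    public String build() {
        if (outputPath == null) {
            throw new IllegalStateException("output path is mandatory");
        }
        return tableName + "@" + outputPath + ":" + blockSize;
    }

    public static void main(String[] args) {
        String writer = new WriterBuilderSketch()
            .withOutputPath("/tmp/carbon")
            .withTableName("t1")
            .build();
        System.out.println(writer); // t1@/tmp/carbon:1024
    }
}
```

The advantage over many buildXXXWriter overloads is that adding a new optional parameter means adding one withXXX method instead of doubling the number of build variants.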

Re: Support Zstd as Column Compressor

2018-09-12 Thread Jacky Li
Great, thanks for your effort. For the lz4 task, I checked the lz4 compressor (lz4-java), and found it needs the decompressed size before decompressing the data. In CarbonData V3 format, we have stored the uncompressed size of the data page in ChunkCompressionMeta.total_uncompressed_size in the data

Re: Feature Proposal: CarbonCli tool

2018-09-05 Thread Jacky Li
For the “summary” command, I just pick the first carbondata file and read the schema from its header. The intention here is just to show one schema, assuming all schemas in all data files in this folder are the same. If there is a need to validate the schema in all files, we can add a “validate” command.

Feature Proposal: CarbonCli tool

2018-09-04 Thread Jacky Li
Hi All, When I am tuning carbon performance, very often that I want to check the metadata in carbon files without launching spark shell or sql. In order to do that, I am writing a tool to print metadata information of a given data folder. Currently, I am planning to do like this: usage:

Re: [DISCUSSION] Remove BTree related code

2018-08-31 Thread Jacky Li
+1 Better to clean it up if it is not used Regards, Jacky > On Aug 24, 2018 at 6:01 PM, Kunal Kapoor wrote: > > +1 for removing unused code > > > > Regards > Kunal Kapoor > > > On Fri, Aug 24, 2018, 2:09 PM Ravindra Pesala wrote: > >> +1 >> We can remove unused code >> >> Regards, >> Ravindra >> >>

Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-08-21 Thread Jacky Li
Hi ZZC, I have checked the doc in CARBONDATA-2595. I have following comments: 1. In the Table Basic Information section, it is better to print the Table Path instead of "CARBON Store Path” 2. For the Table Data Size and Index Size, can you format the output in GB, MB, KB, etc 3. For the Last

Re: Change the 'comment' content for column when execute command 'desc formatted table_name'

2018-08-20 Thread Jacky Li
Hi ZZC, Can you create a JIRA ticket and upload the design doc? In the mailing list we cannot get the attachment Regards, Jacky > On Aug 20, 2018 at 11:20 AM, xm_zzc <441586...@qq.com> wrote: > > Hi dev: > Now I am working on this, the new format is shown in the attachment, please > give me some feedback. >

Re: [Discussion] Refactor Segment Management Interface.

2018-08-05 Thread Jacky Li
+1 on the idea. Since the segment manager is very important for CarbonData, and there were multiple targets we wanted to achieve, let's first make sure we have the same understanding of the refactoring goal. Let me first describe what is in my mind: 1. Segment metadata (TableStatus) needs to be read

Re: [DISCUSS] Distributed CarbonStore

2018-08-05 Thread Jacky Li
+1 I think it is a good new feature to have, but the effort to develop it is quite high. I am worried about the release cycle getting longer. Can you define a roadmap for this new feature, so it can be delivered in phases across future versions? Do you have anything in mind for the roadmap?

Re: [VOTE] Apache CarbonData 1.4.1(RC1) release

2018-08-01 Thread Jacky Li
Besides these, there is also PR 2579 which is a correctness fix for MV, and it is ready to merge. I think we'd better merge it into 1.4.1. So, -1 Regards, Jacky > On Jul 31, 2018 at 4:02 PM, Liang Chen wrote: > > Hi > > These PRs, it is better to merge in 1.4.1 also >

Re: [Discussion] Propose to upgrade the version of integration/presto from 0.187 to 0.206

2018-07-25 Thread Jacky Li
+1 > On Jul 25, 2018 at 1:14 PM, Ajantha Bhat wrote: > > Hi Dev, > > +1 for the 0.206 upgrade of Presto. > Currently, while running some test (TPCH) queries on 500GB data, we > observed some memory issues in the presto cluster. > Couldn't narrow down the exact memory problem in presto. We restart >

Re: Why not support global sort in partition table?

2018-07-23 Thread Jacky Li
I think there is no technical reason that it can't be supported; it is just not implemented yet. I think it is not implemented because: 1. In the case of partition plus sorting, it will be like global sort when the query leverages partition pruning, if you give the partition column in

Re: Use RowStreamParserImp as default value of config 'carbon.stream.parser'

2018-06-08 Thread Jacky Li
+1. I think this change is fine. Regards, Jacky > On Jun 8, 2018 at 3:10 PM, David CaiQiang wrote: > > +1, I agree with using RowStreamParserImpl by default. > > > - > Best Regards > David Cai > -- > Sent from: > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/ >

Re: carbondata partitioned by date generate many small files

2018-06-08 Thread Jacky Li
then automatically after jobs are done. > > > thanks > > ChenXingYu > > > ------ Original Message ------ > From: "Jacky Li" > Date: Tue, Jun 5, 2018 08:43 PM > To: "dev" > Subject: Re: carbondata partit

Re: [Discussion] Carbon Local Dictionary Support

2018-06-05 Thread Jacky Li
+1 Good feature to add in CarbonData Regards, Jacky > On Jun 4, 2018 at 11:10 PM, Kumar Vishal wrote: > > Hi Community, Currently CarbonData supports global dictionary or > No-Dictionary (Plain-Text stored in LV format) for storing dimension column > data. > > *Bottleneck with Global Dictionary* > >

Re: [VOTE] Apache CarbonData 1.4.0(RC2) release

2018-05-25 Thread Jacky Li
+1 > On May 24, 2018 at 11:32 AM, Lu Cao wrote: > > +1 > > Regards, > Lionel > > On Wed, May 23, 2018 at 10:45 PM, xuchuanyin wrote: > >> +1 FROM MOBILE EMAIL CLIENT On May 23, 2018 at 03:41, Ravindra Pesala wrote: Hi I >> submit the Apache CarbonData 1.4.0 (RC2) for

Re: [VOTE] Apache CarbonData 1.4.0(RC1) release

2018-04-28 Thread Jacky Li
Hi, I am afraid I have to vote -1. The following issues should be taken care of, otherwise they will impact user understanding and usage of datamap: CARBONDATA-2415 CARBONDATA-2416 Regards, Jacky > On Apr 27, 2018 at 6:28 PM, Ravindra Pesala wrote: > > Hi > > I submit the Apache CarbonData

Re: Memory leak issue when using DataFrame.coalesce

2018-03-30 Thread Jacky Li
Hi, Good catch! I think proposal 1 is ok; please feel free to open a jira ticket and submit a PR. Let the CI and SDV test suites run and see whether it is ok. Regards, Jacky > On Mar 31, 2018 at 11:35 AM, yaojinguo wrote: > > Hi, > I am using CarbonData1.3 + Spark2.1. My code is: >

Re: CarbonData

2018-03-24 Thread Jacky Li
Thanks, we will correct it Regards, Jacky > On Mar 22, 2018 at 2:06 PM, Flying <1295182...@qq.com> wrote: > On http://carbondata.apache.org/timeseries-datamap-guide.html the word "granualrity" is wrong; the correct one should be "granularity"

Re: The size of the tablestatus file is getting larger, does it impact the performance of reading this file?

2018-03-15 Thread Jacky Li
Hi, I think this approach (maintaining a history tablestatus file) is good. Xm_zzc, please continue with this approach if you want to work on it. Regards, Jacky > On Mar 15, 2018 at 1:47 PM, manish gupta wrote: > > I think maintaining a tablestatus backlog file is a good idea.
