Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata-site/pull/46
hi, @jatin9896
I only changed this file (in the commit), but after running
`carbonscript.sh`, I found that other files changed as well. Should I commit
only the file that I changed?
If adding a new statement, I suggest learning from hive:
desc formatted table_name;
vs
desc table_name;
Likewise:
Show segments ...
vs
Show formatted segments ...
On 09/21/2017 14:02, Ravindra Pesala wrote:
Hi,
I agree with Jacky and David.
But it is suggested to keep the current 'show segments' command
Both options prefer to make the sort scope the same in all segments (loads).
Since carbondata supports a different sort scope in each segment (load), I
think there should be a third option.
Option 3: The sort scope in the LOAD DATA command takes higher priority than
the one specified in
Hi, dev: Recently I found a bug in compressing sort temp files and tried to
fix it in PR#1632 (https://github.com/apache/carbondata/pull/1632). In this
PR, carbondata will compress the records in batches and write the compressed
content to file if we turn on this feature. However, I found
+1 (from mobile email client) On 2018-05-23 03:41, Ravindra Pesala wrote: Hi, I
submit the Apache CarbonData 1.4.0 (RC2) for your vote. 1. Release Notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=1234100
Hi, Kumar:
Can you raise a Jira and provide the document as an attachment? I cannot open
the links since they are blocked.
About query filtering:
1. “during filter, actual filter values will be generated using column local
dictionary values...then filter will be applied on the dictionary encoded
data”
---
If the filter is not 'equal' but 'like' or 'greater than', can it also run on
the encoded data?
2. "As dictionary data
Hi, Kumar:
A local dictionary will be a nice feature, and other formats like parquet all
support this.
My concern is: how will you implement this feature?
1. What's the scope of the `local`? Page level (for all containing rows),
Blocklet level (for all containing pages), Block level (for
Hi, community:
I'm implementing support for strings longer than 32000 characters in carbondata
and have a question about the grammar of this feature. Here I'd like to explain
it and hear your feedback.
DESCRIPTION:
In the previous implementation, carbondata internally uses a short to
Maybe you can try `hive.server2.thrift.max.worker.threads` and set a smaller
value for it.
You can configure it in hive-site.xml or pass the configuration through
--hiveconf when you start the thrift-server.
At last, you need to find out the root cause of the failed SQLs. 60 concurrent
In a traditional RDBMS, varchar(N) means the value contains at most N
characters, and the DBMS will truncate the value if its length is longer than
N. Will we implement it like this too? Truncate the string value to N if its
length is longer than N?
Yes, it is really a bug. You can raise a jira for this problem.
I tried the following queries and they are OK. Hope this helps you bypass the
bug.
```
-- element type `double` is assumed; the array element types were lost in the archive
create table IF NOT EXISTS test.Account(CAP_CHARGE array<double>, CAP_CR_INT
array<double>) partitioned by (current_dt DATE) STORED BY 'carbondata'
```
Will delete/update affect the schema?
What's the meaning of 'schema' here?
I think you can start by reviewing the project docs.
If there are any problems, you can raise a jira to fix them.
Or if you have problems understanding the docs, you can ask
questions on the mailing list. If it is really a problem, you can raise a jira
to fix it.
1. I think it's OK to query one table through two sessions.
2. You can refer to Carbon-SDK. The performance is undocumented now. You can
try it and give feedback. If it is below expectation, we may try to
optimize it later.
Hi, did you create the dataframe through SparkSession?
If so, you had better create it with CarbonSession, which extends
SparkSession.
You may need to refer to
https://github.com/apache/carbondata/blob/master/docs/quick-start-guide.md
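A minimal sketch following the quick-start guide linked above; the master, app name, and store path are illustrative:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// getOrCreateCarbonSession comes from the CarbonSession implicits and
// returns a SparkSession backed by CarbonSession.
val carbon = SparkSession
  .builder()
  .master("local")
  .appName("CarbonSessionExample")
  .getOrCreateCarbonSession("/tmp/carbon.store")
```
A dataframe created through `carbon` can then be written to a carbon table directly.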
This means, no need to keep the actual data along with encoded data in
encoded column page.
---
A problem is that currently the index datamap needs the actual data to generate
the index. You may affect this procedure if you do not keep the actual data.
I think even if we split the carbondata commands into DDL and DML, each is
still too large for one document.
For example, there are many TBLProperties for creating a table in the DDL. Some
descriptions of the TBLProperties are long, and we currently have no TOC for
them. It's difficult to locate one property in
I found the PR on github and left a comment. Here I copy the comment:
I have a doubt about the scenario below:
For sort_columns, the minmax is ordered across all the blocks/blocklets in one
segment.
Suppose that we are filtering on sort_columns and the filter looks
like Col1='bb'.
If the
In the above example, you specify one directory and get two segments.
But it only shows one schema info. I thought the number of schemas was the
same as the number of data directories. Since you mentioned that we can
support nested folders, what if the schemas in these files are not the same?
Another
Then what does the final output look like?
Hi, jacky, please check the following comments:
1. Do we need to provide other interfaces, such as `listTable`,
`renameTable`...
2. What's the difference between the functions of 'Carbon-SDK' and
'CarbonStore'?
As for the CarbonStore API `createTable`:
3. Will it make use of the existing
Hi, kumarvishal:
As the local dictionary feature will be released in 1.4.1, is there any
difference between the implementation and the previous design document?
I'm trying to understand the implementation of local dictionary. If there
is any difference, please help to update the document in
Hi, liang, I think it may be a problem. The segment with LOAD_FAILED should
not affect the query on the normal segment.
In the previous mail, the second data load was successful, and a query on
this segment should use the index file cache.
Besides, if the data loading fails, will the failed
Hi, all:
Here I'd like to summarize my opinion and provide option 4.
Option 4:
4) Extend the existing SQL syntax of Major and Minor compaction based on the
syntax of delete segment:
ALTER TABLE tablename COMPACT 'MAJOR' WHERE SEGMENT.ID IN (1,2,3,4)
ALTER TABLE tablename COMPACT 'MINOR'
I think the problem may be metadata related. What's your thrift version? Have
you updated the carbon version recently after the data was loaded? (from mobile
email client) On 04/16/2018 15:51, Liang Chen wrote: Hi, from the log message,
it seems it can't find the data files. Can you provide more detail
emm, if it only needs to extend another compressor for a software
implementation, I think it will be quite easy to integrate.
Actually a PR was already raised weeks ago to support customized
compressors in carbondata; you can refer to this link:
https://github.com/apache/carbondata/pull/2715.
Hi, aaron:
A PR has been raised for this issue
https://github.com/apache/carbondata/pull/2812, please check.
Hi, all:
About a year ago, we introduced 'multiple dirs for temp data' to solve the disk
hotspot problem in data loading.
This feature enables carbon to randomly pick one of the local directories
configured in yarn-local-dirs when it writes any temp files to disk (for
example: sort temp files and fact
Yes, it needs further modification to meet the requirement -- an additional
property is needed to handle this; we can configure multiple directories
there.
Hi, ajantha.
I just went through your PR and think we may need to rethink this
feature, especially its impact. I left a comment under your PR and will
paste it here for further discussion in the community.
I'm afraid that in common scenarios, even if we do not face the page size
problems and
OK, anyway please take care of the loading performance. The validation only
needs to be performed for fields that may cross the boundary (e.g. varchar
and complex); for ordinary fields, just skip the validation.
Hi, aaron, I went through the code and found the root cause.
While writing a dataframe to a carbon table, we have to keep the order of the
fields in the dataframe the same as that in the carbon table. The code lies in
`NewCarbonDataLoadRDD.scala#486`. This is because we judge whether the field
is a
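A minimal sketch of the workaround, not the code from `NewCarbonDataLoadRDD.scala` itself; `df`, the table name, and the field order are illustrative:
```scala
import org.apache.spark.sql.functions.col

// Reorder the dataframe fields to match the carbon table before writing.
val tableFieldOrder = Seq("id", "name", "city") // order of fields in the carbon table
val aligned = df.select(tableFieldOrder.map(col): _*)
aligned.write
  .format("carbondata")
  .option("tableName", "target_table")
  .mode("append")
  .save()
```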
Hi all,
I went through the code and derived another formula to estimate the unsafe
working memory. It is inaccurate too, but we can use this thread to optimize it.
# Memory Required For Data Loading per Table
## version from Community
(carbon.number.of.cores.while.loading) *
+1
I've a few questions about this:
1. Is it OK to call it 'tableId' or 'table'?
2. For what kinds of statements will you audit the operations?
Do the annotations have any effects other than just providing a literal
coding contract?
For example, we can use these annotations to:
1. generate docs
2. restrict some operations (for example some configurations should not
support SET command)
3. limit scope for usage (for example some
Instead of supporting encryption, I think carbondata can provide another
common feature:
a framework that supports hooks while reading/writing column chunks.
Users can specify the hooks while creating a table and implement the encryption
feature as a particular instance as they need.
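A hedged sketch of what such a hook framework could look like; `ColumnChunkHook` and `XorEncryptionHook` are hypothetical names, not an existing carbondata API:
```scala
// Hypothetical hook interface: intercept a column chunk around I/O.
trait ColumnChunkHook {
  def beforeWrite(chunk: Array[Byte]): Array[Byte] // e.g. encrypt
  def afterRead(chunk: Array[Byte]): Array[Byte]   // e.g. decrypt
}

// A user-supplied instance implementing encryption (XOR only for brevity).
class XorEncryptionHook(key: Byte) extends ColumnChunkHook {
  override def beforeWrite(chunk: Array[Byte]): Array[Byte] = chunk.map(b => (b ^ key).toByte)
  override def afterRead(chunk: Array[Byte]): Array[Byte] = chunk.map(b => (b ^ key).toByte)
}
```
The hook class name could then be passed as a table property when creating the table.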
A question here:
"""
3. Add concurrent reading functionality to Carbon Reader. This can be
enabled by passing the number of splits required by the user. If the user
passes 2 as the split for reader then the user would be returned 2
CarbonReaders with equal number of RecordReaders in each.
The
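A hedged usage sketch of the proposal quoted above; `split` reflects the behavior described in the proposal text, not a confirmed released API, and the paths are illustrative:
```scala
import org.apache.carbondata.sdk.file.CarbonReader

val reader = CarbonReader.builder("/path/to/table", "_temp").build()
// As proposed: asking for 2 splits returns 2 CarbonReaders, each holding an
// equal share of the underlying RecordReaders.
val readers = reader.split(2)
```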
+1
Hi, all:
The previous experiment used 3 Huawei ECS instances as workers, each with 16
cores and 32GB; the Spark executors used 12 cores and 24GB, loading the 74GB
LineItem table from the 100GB TPCH dataset.
Today I ran another experiment using 1 Huawei RH2288 machine with 32 cores
and 128GB. The Spark executors used 30 cores and
Hi, aaron.
Regarding the wrong pruning statistics in the query plan: did you
execute the queries concurrently?
I noticed that the pruning collector is single threaded; if you run queries
concurrently, the pruning statistics will be incorrect.
+1
Q1: When will we start and finish the optimization of the carbon-presto
integration? Is there any plan for this?
Another question:
Q2: Is it possible to use the carbon reader to implement a function similar
to search mode?
Oh, I didn't notice the memory consumption at that time.
We all know that the resource utilization is low during compaction.
Using prefetch means that we are running queries in the background, which will
surely consume more resources.
The current prefetch size is controlled by 'carbon.detail.batch.size'
In addition to the last mail, for the numCoresOfAlterPartition, you can
handle it similarly.
Please remember to fix these in another PR, not in PR#2907.
Hi all:
I am raising a PR to enhance the performance of compaction. The PR number is
#2906.
Based on my experiments using about 72GB of LineItem data (in the 100GB TPCH
dataset), I got the following results.
Code Branch | Prefetch | Batch Size (default 100) | Load1 (s) | Load2 (s)
Yeah, aaron, the problem may lie in the dataframe and long_string_columns.
Can you try the following statement? It is from the test code in
'VarcharDataTypesBasicTestCase', which suggests that you specify
'long_string_columns' while writing the dataframe.
```scala
// Sketch completing the truncated snippet; see 'VarcharDataTypesBasicTestCase'
// in the carbondata repo for the full test. Table/column names are illustrative.
test("write from dataframe with long string datatype") {
  df.write
    .format("carbondata")
    .option("tableName", "long_string_table")
    .option("long_string_columns", "description")
    .mode("overwrite")
    .save()
}
```
Did you build carbon with -Pbuild-with-format? It introduced the Map datatype
and changed the thrift, so you need to add it. On 09/04/2018 09:10, aaron wrote:
Compile failed. My env is: aaron:carbondata aaron$ java -version java version
"1.8.0_144" Java(TM) SE Runtime Environment (build
Hi, aaron.
Actually your query will not use the time series datamap since the filter
uses the field 'product_id', which is not contained in your preagg datamap.
Even if I remove the preagg datamap, the query with the bloomfilter datamap
still fails with the same error logs as those in your post.
Then I add
Enable/disable datamap only works for index datamaps; we do not support other
types of datamap such as preagg/MV/default Block/Blocklet datamaps yet.
If you find any confusing documentation about datamaps, you can help revise it.
More details about this issue. I've added some logs in
`BloomCoarseGrainDataMap.createQueryModel` to print the input parameter
'expression'.
# Before applying PR2665
```
XU expression:
org.apache.carbondata.core.scan.expression.logical.AndExpression@3b035d0c
XU expression
```
Yeah, I am able to reproduce this problem using current master code. I'll
look into it.
hi, aaron, thanks for your feedback.
Which version of carbondata are you using?
Did you use the query in the first post? I tested it and it's OK; we can see
the bloomfilter in the explain output.
1. If the bloomfilter is not there, the reason may be that the main datamap has
already pruned all the blocklets. In this case, the following index datamap
will be skipped as a shortcut.
You can download the patch and apply it to master, then you can rebuild the jar
and perform testing.
On Tue, Sep 25, 2018 at 5:02 PM +0800, "aaron" <949835...@qq.com> wrote:
Great! thanks for your so quick response! I will have a try. Do you mean
that I merge
+1
It seems the committers only need to change the URL for the ASF repo; that's OK.
On 5/1/2019 10:08, Liang Chen wrote:
Hi all,
Background :
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/NOTICE-Mandatory-migration-of-git-repositories-to-gitbox-apache-org-td72614.html
I think we can just rephrase the proposal.
We want to make `sort_columns` empty by default; that is to say, if
the user does not explicitly specify sort_columns, the corresponding
property will be 'sort_columns'=''.
And when sort_columns is empty, carbondata will use no_sort for it
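To make the rephrased proposal concrete, a hedged illustration; the table names are made up, and under the proposal t2 would behave exactly like t1:
```scala
// With the proposed default, omitting sort_columns (t2) is equivalent to
// declaring it empty (t1), and both tables are loaded with no_sort.
spark.sql("CREATE TABLE t1 (id INT, name STRING) STORED BY 'carbondata' " +
  "TBLPROPERTIES ('sort_columns'='')")
spark.sql("CREATE TABLE t2 (id INT, name STRING) STORED BY 'carbondata'")
```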
I think no_sort is the default only in case the user does not specify
sort_columns explicitly, not for all scenarios, right?
+1 for keeping 'sort_columns' unchanged, because the fields in sort_columns
have a different encoding strategy compared with the others.
@Ajantha, please make a
Hi, Ravindra
Using a hierarchical index was in our previous plan too. We wanted to build a
Block/task level index at the same time, but we postponed this feature due to
the following reasons:
1. It requires different configurations (bloom_size, bloom_fpp) for each
index level, and it will
Hi, what's the number of cores in your executor?
And is there only one load running when you encounter this failure?
Besides, can you check whether the local dictionary is enabled for your table
using 'desc formatted table_name'? If it is enabled, more memory will be
needed and the provided formula does
Each time we introduce a new feature, I'd like to know the final usage for the
user. So what's the grammar to load a JSON file into carbon?
Moreover, there may be more and more kinds of datasources in the future, so can
we just keep the integration simple by
1. Reading the input files using spark (see the sketch below)
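A minimal sketch of what step 1 could look like, assuming the target carbon table `json_target` already exists; the path and table name are illustrative:
```scala
// Read the JSON input with Spark, then append it into a carbon table.
val jsonDf = spark.read.json("/path/to/input.json")
jsonDf.write
  .format("carbondata")
  .option("tableName", "json_target")
  .mode("append")
  .save()
```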
What's your proposal for the corresponding grammar to do that?
Besides, if we only sort after compaction, will it be proper to keep
sort_scope at the table level? It should be at the segment level in this
situation, and keeping it at the table level will confuse the user. What do
you think?
So what’s your proposal for the grammar of this feature?
Do you want carbon to do it silently, without any configuration or choice from
the user?
What I am concerned about is the performance of compaction. If the user uses
auto-compaction, the loading will be delayed further if we do compaction using
create task level bloom with the same configuration along with blocklet
bloom.
===
Since the number of distinct values at the task level is much bigger than
that at the blocklet level, using the same configuration may cause the task
level bloomfilter to work inefficiently.
This is just what I’m
Hi, please consider this line of code:
https://github.com/apache/carbondata/blob/master/core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java#L78
It uses apache-common-log directly instead of the carbondata logger. I’m not
sure about the impact of this.
Please take care of this
'Parallelize pruning' has been in my plan for a long time; nice to see your
proposal here.
While implementing this, I'd like you to make it common, that is to say, not
only the default datamap but also the other index datamaps can use parallel
pruning.
Hi ravin, very nice to see this proposal in the community!
Guidelines are better if they are easy to follow. Even though I care
more about the code quality, I also care about how convenient it is for
developers to contribute.
After going through the points, I think:
1,3,5,8,9,10 : +1
2,4,6:
I think it's a good proposal, but it will introduce too many changes.
In my opinion, the different order between Java and Scala files is
acceptable since it will not cause serious (even minor) problems.
Anyway, thanks for your investigation on this. But it's hard to tell
whether we should
Thanks ALL!
Snappy and Zstd both know the decompressed size of the content since they store
that size along with the compressed content. But LZ4 doesn't do this; you can
refer to issue #26 on the lz4-java GitHub page.
To work around this, you can store the original size in the metadata for
decompression.
Yeah, Zstd and Snappy know the decompressed size from the compressed
data, but LZ4 doesn't. I found a link describing this:
https://github.com/lz4/lz4-java/issues/26
To work around this with LZ4, you can go with your proposal and save the
decompressed size in the meta.
But I'd like to wrap the LZ4
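A sketch of that workaround using the lz4-java API mentioned above; the byte array is illustrative, and where the length is stored is up to the format's metadata:
```scala
import net.jpountz.lz4.LZ4Factory

val factory = LZ4Factory.fastestInstance()
val original: Array[Byte] = "some column page bytes".getBytes("UTF-8")
val compressed = factory.fastCompressor().compress(original)
// Persist original.length alongside `compressed` (e.g. in the page meta),
// because LZ4 needs the exact destination length to decompress.
val restored = factory.fastDecompressor().decompress(compressed, original.length)
assert(restored.sameElements(original))
```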
Yeah, it actually belongs to the 'Builder Pattern'. We should simplify these
builders before they are widely used.
The actual storage datatype for that column is stored at the ColumnPage level.
In the previous implementation, columns with the literal datatypes 'float' and
'double' shared the same storage datatype 'double', and you want to
distinguish them by adding support for the storage datatype 'float'.
Is my understanding
In the latest implementation, I store the compressor name in the thrift, and
the old enum for compression_codec has been deprecated. This makes it easier
to support other compressors. Take LZ4 for example; the following changes are
required:
1. Implement Lz4Compressor
2. Add
Hi ManishNalla:
"""
merging the overlapping intervals and getting new intervals(ranges) out of
them
"""
===
What do you mean by this? Can you give an example of it?
+1 for the advice from manish
I think there is still a misunderstanding between us.
Here I am only concerned about the lazy build for the index datamap.
I think each segment should have its own datamap status, and based on this we
can support pruning by index datamap for each segment. After this, even if the
datamap is lazy, during a query we
+1 for ravin's advice.
We only support lazy/incremental load/rebuild for OLAP datamaps (MV/preagg),
not for index datamaps currently.
Hi kunal,
Lastly, I'd suggest again that the code for the pruning procedure be
moved to a separate module.
The earlier we do this, the easier it will be to implement other
types of IndexServer later.
+1
Looking forward to the PRs for that.
Hi Kunal,
IndexServer is quite an efficient method to solve the problem of the index
cache, and it's great that someone is finally trying to implement it. However,
after I went through your design document, I have some questions, which
I'll explain as follows:
1. For the 'background'
Hi kunal, can you attach the document directly to the jira? I cannot access
the doc on google drive. Thanks.
This reply is just for testing the functionality of the mailing list.
Hi, I have two questions about the current index server implementation:
1. While doing a filter query, do we currently need to load the index data of
all segments into the cache server, OR only that of the segments required by
this query?
2. When do we trigger the cache loading action during the query?
As
Hi, so glad to see carbondata entering the 2.x stage. I have the following
suggestions for your consideration:
1. Evolution of the carbondata file format.
I have always thought one of the key highlights of carbondata is the
carbondata file format; is there any evolution planned for it?
While
Yeah, please feel free to correct it; do not forget to correct all the 'show
datamaps' occurrences (at least 5) in the project.
Hi, ravipesala, I previously made a similar proposal; please check whether it
can be of any help:
https://gist.github.com/xuchuanyin/cb264f2d7e94d6e185a55ea962e91ce1
Besides, for the problem in your proposal, the user can create a
`table_with_old_format_data` and create another
+1 with ravipesala; please use the corresponding hive grammar and take the
delta grammar as a reference.
Hi akash, glad to see this feature proposed; I have some questions about
it. Please note that some of the following items quote the design document
attached to the corresponding jira, with my comments after '==='.
1.
"Currently carbondata supports timeseries on preaggregate
Sorry that I cannot access the document in the jira.
In my opinion, both for SORT_COLUMNS in the current implementation and for
LOCATION_COLUMNS in the proposal, carbondata tries to organize the data
in some order.
So the core of the proposal is that, for SORT_COLUMNS, we can specify
a
Hi, concurrent load will not cause the problem; I tried that months
ago.
From the log, it seems that the problem lies in the compaction that is
automatically triggered after loading.
To solve the problem, I think you can:
1. first turn off auto-compaction to increase loading performance (see the
sketch below),
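A minimal sketch of step 1, assuming the `carbon.enable.auto.load.merge` property is what controls auto compaction in your version:
```scala
import org.apache.carbondata.core.util.CarbonProperties

// Disable auto compaction before running the heavy loads.
CarbonProperties.getInstance()
  .addProperty("carbon.enable.auto.load.merge", "false")
```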
+1 for this feature.
Additionally, your draft describes "specify the bloom columns using table
properties". I recommend that, for the first phase, we should not use this
information from the table properties while querying.
We can store the index information in the blocklet (or page)
Glad to see you making this proposal! The features you mentioned are really
not popular; even heavy users neither try them nor know their usage.
For 1/2/3/4/5.1/5.2/7, we can remove these features along with their code. But
if we consider compatibility, the query processing will still be complex. How
xuchuanyin created CARBONDATA-1281:
--
Summary: Disk hotspot found during data loading
Key: CARBONDATA-1281
URL: https://issues.apache.org/jira/browse/CARBONDATA-1281
Project: CarbonData
xuchuanyin created CARBONDATA-1267:
--
Summary: Failure in data loading due to bugs in delta-integer-codec
Key: CARBONDATA-1267
URL: https://issues.apache.org/jira/browse/CARBONDATA-1267
Project
xuchuanyin created CARBONDATA-1114:
--
Summary: Failed to run tests in windows env
Key: CARBONDATA-1114
URL: https://issues.apache.org/jira/browse/CARBONDATA-1114
Project: CarbonData
Issue
xuchuanyin created CARBONDATA-1167:
--
Summary: Mismatched between class name and logger class name
Key: CARBONDATA-1167
URL: https://issues.apache.org/jira/browse/CARBONDATA-1167
Project: CarbonData