[DISCUSS] add more index for sort columns

2017-03-14 Thread bill.zhou
hi all 

  Carbon adds a min/max index for sort columns, which is used to speed up
filter queries. Can we add more indexes for the sort columns to make
filtering even faster?

  This is an idea I got from another database's design.
   For example, consider a student table in which the column "score" is a
sort column, and the score range is from 1 to 100.
The table is as follows: 

id  name    score
1   bill001 83
2   bill002 84
3   bill003 90
4   bill004 89
5   bill005 93
6   bill006 76
7   bill007 87
8   bill008 90
9   bill009 89
10  bill010 96
11  bill011 96
12  bill012 100
13  bill013 84
14  bill014 90
15  bill015 79
16  bill016 1
17  bill017 97
18  bill018 79
19  bill019 88
20  bill068 95
  
 After loading the data into CarbonData, the score column is sorted as
follows:
1   76  79  79  83  84  84  87  88  89  
89  90  90  90  93  95  96  96  97  100

The min/max index is 1/100.
So both of the following queries will scan all of the block's data:
query1: select sum(score) from student where score > 90 
query2: select sum(score) from student where score > 60 and score < 70 

Following are two suggestions to reduce the block scan. 
Suggestion 1: divide the score range into multiple smaller sub-ranges, for
example 4, and keep one bit per sub-range:

0: means this block does not contain values in that score sub-range
1: means this block contains values in that score sub-range
If we add this index, query1 only needs to scan 1/4 of the block's data,
and query2 needs to scan no data at all; the block is skipped directly.
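Suggestion 1 can be sketched as follows; this is a minimal illustration with hypothetical helper names (`build_range_bitmap`, `may_contain`), not CarbonData's actual index format:

```python
# Sketch: a per-block bitmap over score sub-ranges [1,25], [26,50], [51,75], [76,100].
def build_range_bitmap(values, lo=1, hi=100, buckets=4):
    width = (hi - lo + 1) / buckets
    bits = [0] * buckets
    for v in values:
        idx = min(int((v - lo) // width), buckets - 1)
        bits[idx] = 1
    return bits

def may_contain(bits, q_lo, q_hi, lo=1, hi=100):
    """True if any sub-range overlapping [q_lo, q_hi] has its bit set."""
    width = (hi - lo + 1) / len(bits)
    for i, bit in enumerate(bits):
        b_lo = lo + i * width
        b_hi = b_lo + width - 1
        if bit and b_lo <= q_hi and q_lo <= b_hi:
            return True
    return False

scores = [83, 84, 90, 89, 93, 76, 87, 90, 89, 96,
          96, 100, 84, 90, 79, 1, 97, 79, 88, 95]
bits = build_range_bitmap(scores)     # [1, 0, 0, 1]: only [1,25] and [76,100] occur
print(may_contain(bits, 91, 100))     # query1 (score > 90): True, block must be scanned
print(may_contain(bits, 61, 69))      # query2 (60 < score < 70): False, skip the block
```

The bitmap costs only a few bits per block, and query2 is answered without touching the data at all.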

Suggestion 2: record more min/max pairs for the score column, for example
one min/max for every 5 rows.

If we add this index, query1 only needs to scan 1/2 of the block's data and
query2 only needs to scan 1/4 of the block's data.
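Suggestion 2 can be sketched the same way: keep one min/max pair per group of rows (every 5 sorted values here) and scan only the groups whose range overlaps the predicate. The names are hypothetical; this is only an illustration of the idea:

```python
def group_minmax(sorted_vals, group=5):
    # One (min, max) pair per group of `group` consecutive sorted values.
    return [(min(sorted_vals[i:i + group]), max(sorted_vals[i:i + group]))
            for i in range(0, len(sorted_vals), group)]

def groups_to_scan(index, q_lo, q_hi):
    # A group must be scanned only if its [min, max] overlaps [q_lo, q_hi].
    return [i for i, (mn, mx) in enumerate(index) if mn <= q_hi and q_lo <= mx]

sorted_scores = [1, 76, 79, 79, 83, 84, 84, 87, 88, 89,
                 89, 90, 90, 90, 93, 95, 96, 96, 97, 100]
index = group_minmax(sorted_scores)   # [(1, 83), (84, 89), (89, 93), (95, 100)]
print(groups_to_scan(index, 91, 100)) # query1 (score > 90): [2, 3] -> 1/2 of the block
print(groups_to_scan(index, 61, 69))  # query2 (60 < score < 70): [0] -> 1/4 of the block
```

On this data the sketch reproduces the numbers above: query1 touches 2 of 4 groups and query2 touches 1 of 4 (group 0 spans 1..83, which overlaps 61..69).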

This is a rough idea. Jacky, Ravindra and Liang, please comment on whether
we can add this feature. Thanks. 

Regards
Bill 




--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-add-more-index-for-sort-columns-tp8891.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


CarbonDictionaryDecoder should support codegen

2017-03-10 Thread bill.zhou
hi All
   The Carbon scan now supports codegen, but CarbonDictionaryDecoder
doesn't; I think it should. 
   For example, today I ran one test and the query plan was as shown on the
left below; if CarbonDictionaryDecoder supported codegen, the plan would
change to the one on the right. I think that would improve performance. 


(query plan screenshots not preserved in the archive)

Please Ravindra,Jacky,Liang correct it.  thank you. 

Regards
Bill 



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/CarbonDictionaryDecoder-should-support-codegen-tp8600.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: Improving Non-dictionary storage & performance.

2017-03-07 Thread bill.zhou
hi Jacky 
    I think this is not easy for the user to control if Carbon is running
online. For one table, two different loads may have different cardinality
for the same column, but the user cannot give different dictionary columns
for one table.

Regards


Jacky Li wrote
> Hi Ravindra,
> 
> Another suggestion is that, to avoid creating trouble for user while
> loading, for single-pass, if dictionary key generated for certain column
> is more than the configured value, then the loading process should stop
> and log this error explicitly telling the cardinality of all columns. 
> By doing this, user should know what is the reason causing data load
> failure.
> How about this idea?
> 
> Regards,
> Jacky
> 
>> On 3 March 2017 at 01:26, Ravindra Pesala <ravi.pesala@> wrote:
>> 
>> Hi Likun,
>> 
>> Yes, Likun, we had better keep dictionary as the default until we
>> optimize no-dictionary columns.
>> As you mentioned we can suggest 2-pass for first load and subsequent
>> loads
>> will use single-pass to improve the performance.
>> 
>> Regards,
>> Ravindra.
>> 
>> On 2 March 2017 at 06:48, Jacky Li <jacky.likun@> wrote:
>> 
>>> Hi Ravindra & Vishal,
>>> 
>>> Yes, I think these works need to be done before switching no-dictionary
>>> as
>>> default. So as of now, we should use dictionary as default.
>>> I think we can suggest user to do loading as:
>>> 1. First load: use 2-pass mode to load, the first scan should discover
>>> the
>>> cardinality, and check with user specified option. We should define
>>> rules
>>> to pass or fail the validation, and finalize the load option for
>>> subsequent
>>> load.
>>> 2. Subsequent load: use single-pass mode to load, use the options
>>> defined
>>> by first load
>>> 
>>> What is your idea?
>>> 
>>> Regards,
>>> Jacky
>>> 
 On 1 March 2017 at 23:31, Ravindra Pesala <ravi.pesala@> wrote:
 
 Hi Vishal,
 
 You are right, that's why we can do no-dictionary only for the String
 datatype. Please look at my first point: we can always use direct
 dictionary for possible data types like short, int, long, double & float
 for sort_columns.
 
 Regards,
 Ravindra.
 
 On 1 March 2017 at 18:18, Kumar Vishal <kumarvishal1802@> wrote:
 
> Hi Ravi,
> Sorting of data for no-dictionary columns should be based on the data
> type, and the same for filters. Please add this point.
> 
> -Regards
> Kumar Vishal
> 
> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala <ravi.pesala@> wrote:
> 
>> Hi,
>> 
>> In order to make non-dictionary columns storage and performance more
>> efficient, I am suggesting following improvements.
>> 
>> 1. Make SHORT, INT, BIGINT, DOUBLE & FLOAT always direct dictionary.
>>  Right now only date and timestamp are direct dictionary columns. We
> can
>> make SHORT, INT, BIGINT, DOUBLE & FLOAT Direct dictionary if these
> columns
>> are included in SORT_COLUMNS
>> 
>> 2. Consider delta/value compression while storing direct dictionary
> values.
>> Right now it always uses INT datatype to store direct dictionary
>>> values.
> So
>> we can consider value/Delta compression to compact the storage.
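The value/delta compression in point 2 can be illustrated with a toy sketch (this is not CarbonData's actual encoding; the helper names are made up for illustration): store each surrogate as its delta from the block minimum, so the deltas fit in a much narrower type than INT.

```python
def delta_encode(values):
    # Store the block minimum once, then only the (small) offsets from it.
    base = min(values)
    return base, [v - base for v in values]

def delta_decode(base, deltas):
    return [base + d for d in deltas]

# Direct-dictionary surrogates for nearby dates are large but close together...
surrogates = [17000, 17001, 17003, 17010]
base, deltas = delta_encode(surrogates)  # deltas [0, 1, 3, 10] fit in one byte each
assert delta_decode(base, deltas) == surrogates
```

Here four 4-byte INTs shrink to one base value plus four 1-byte deltas, which is the kind of saving the point above is after.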
>> 
>> 3. Use the Separator instead of LV format to store String value in
>> no-dictionary format.
>> Currently String datatypes for non-dictionary columns are stored in
>> LV (length-value) format, where we always use a Short (2 bytes) as the
>> length. To keep the storage compact we can use a separator (a 0 byte),
>> which takes just a single byte. And while reading we can traverse the
>> data and get the offsets as we do now.
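A rough sketch of the storage difference described in point 3, assuming UTF-8 data with no embedded 0 bytes (an assumption the separator scheme requires; this is not the real on-disk layout):

```python
import struct

def encode_lv(strings):
    # Length-Value: a 2-byte big-endian length before every value.
    out = b""
    for s in strings:
        raw = s.encode("utf-8")
        out += struct.pack(">H", len(raw)) + raw
    return out

def encode_sep(strings):
    # Separator format: a single 0x00 byte between values.
    return b"\x00".join(s.encode("utf-8") for s in strings)

names = ["bill001", "bill002", "bill003"]
lv = encode_lv(names)      # 3 * (2 + 7) = 27 bytes
sep = encode_sep(names)    # 3 * 7 + 2 separators = 23 bytes
print(len(lv), len(sep))   # 27 23

# Reading back the separator format: split and record offsets while scanning.
assert sep.split(b"\x00") == [s.encode("utf-8") for s in names]
```

Each value saves one byte (1-byte separator instead of a 2-byte length), at the cost of forbidding 0 bytes inside values.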
>> 
>> 4. Add Range filters for no-dictionary columns.
>> Currently range filters like greater/ less than filters are not
> implemented
>> for no-dictionary columns. So we should implement them to avoid row
>>> level
>> filter and improve the performance.
>> 
>> Regards,
>> Ravindra.
>> 
> 
 
 
 --
 Thanks & Regards,
 Ravi
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Thanks & Regards,
>> Ravi





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Improving-Non-dictionary-storage-performance-tp8146p8402.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: [DISCUSS] For the dimension default should be no dictionary

2017-03-02 Thread bill.zhou
hi All
 Let me summarize this discussion.
1. To keep CarbonData compatible with older versions, keep
DICTIONARY_INCLUDE and DICTIONARY_EXCLUDE, with no dictionary as the
default; do not suggest changing these properties to table_dictionary. 
2. Suggest keeping the sort-column properties in the same style as the
dictionary ones, so the new properties would be SORT_INCLUDE and
SORT_EXCLUDE, with no sort as the default.

Regards
Bill 


ravipesala wrote
> Hi All,
> 
> In order to make no-dictionary columns as default we should improve the
> storage and performance for these columns. I have sent another mail to
> discuss the improvement points. Please comment on it.
> 
> Regards,
> Ravindra
> 
> On 1 March 2017 at 10:12, Ravindra Pesala <ravi.pesala@> wrote:
> 
>> Hi Likun,
>>
>> It would be the same case if we used all non-dictionary columns by
>> default: it will increase the store size and decrease the performance,
>> and poor performance does not encourage more users either.
>>
>> If we need to make no-dictionary columns the default then we should
>> first focus on reducing the store size and improving the filter queries
>> on non-dictionary columns. Even memory usage is higher while querying
>> non-dictionary columns.
>>
>> Regards,
>> Ravindra.
>>
>> On 1 March 2017 at 06:00, Jacky Li <jacky.likun@> wrote:
>>
>>> Yes, I agree with your point. The only concern I have is for loading:
>>> I have seen many users accidentally put a high-cardinality column into
>>> the dictionary columns, and then the loading failed because of
>>> out-of-memory or was very slow. I guess they just do not know to use
>>> DICTIONARY_EXCLUDE for these columns, or they do not have an easy way
>>> to identify the high-cardinality columns. I feel preventing such
>>> misusage is important in order to encourage more users to use
>>> CarbonData.
>>>
>>> Any suggestion on solving this issue?
>>>
>>>
>>> Regards,
>>> Likun
>>>
>>>
>>> > On 28 February 2017 at 22:20, Ravindra Pesala <ravi.pesala@> wrote:
>>> >
>>> > Hi Likun,
>>> >
>>> > You mentioned that if the user does not specify dictionary columns,
>>> > then by default those are chosen as no-dictionary columns.
>>> > But keeping no-dictionary as the default has the many disadvantages
>>> > I mentioned in the mail above. We initially introduced no-dictionary
>>> > columns to handle high-cardinality dimensions, but making everything
>>> > a no-dictionary column by default loses our unique feature compared
>>> > to Parquet.
>>> > Dictionary columns were introduced not only for aggregation queries
>>> > but for better compression and better filter queries as well.
>>> > Without a dictionary the store size will increase a lot.
>>> >
>>> > Regards,
>>> > Ravindra.
>>> >
>>> > On 28 February 2017 at 18:05, Liang Chen <chenliang6136@> wrote:
>>> >
>>> >> Hi
>>> >>
>>> >> A couple of questions:
>>> >>
>>> >> 1) For SORT_KEY option: only build "MDK index, inverted index, minmax
>>> >> index" for these columns which be specified into the option(SORT_KEY)
>>> ?
>>> >>
>>> >> 2) If users don't specify TABLE_DICTIONARY,  then all columns don't
>>> make
>>> >> dictionary encoding, and all shuffle operations are based on fact
>>> value, is
>>> >> my understanding right ?
>>> >> 
>>> >> ---
>>> >> If this option is not specified by the user, it means all columns
>>> >> are encoded without global dictionary support. A normal shuffle on
>>> >> decoded values will be applied when doing a group-by operation.
>>> >>
>>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>>> >> supposed  if "C2" be specified into SORT_KEY, but not be specified
>>> into
>>> >> TABLE_DICTIONARY, then system how to handle this case ?
>>> >> 
>>> >> ---
>>> >> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and
>>> encoded as
>>> >> Inverted Index and with Minmax Index
>>> >>
>>> >> Regards
>>> >> Liang
>>> >>
>>> >> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.likun@>:
>>> >>
>>> >>> Yes, first we should simplify the DDL options. I propose the
>>> >>> following options; please check whether they miss any scenario.
>>> >>>
>>> >>> 1. SORT_COLUMNS, or SORT_KEY
>>> >>> This indicates three things:
>>> >>> 1) All columns specified in options will be used to construct
>>> >>> Multi-Dimensional Key, which will be sorted along this key
>>> >>> 2) They will be encoded as Inverted Index and thus again sorted
>>> within
>>> >>> column chunk in one blocklet
>>> >>> 3) Minmax index will also be created for these columns
>>> >>>
>>> >>> When to use: This option is designed for accelerating filter query,
>>> so
>>> >> put
>>> >>> all filter columns into this option. The order of it can be:
>>> >>> 1) From low cardinality to high cardinality, this will make most
>>> >>> compression
>>> >>> and fit for 

Re: [DISCUSS] For the dimension default should be no dictionary

2017-02-28 Thread bill.zhou
hi Ravindra 

That is a good idea, to consider the sort columns and dictionary columns
together. 
For DDL usability I have the following suggestions; please share yours:
1. The sort-column properties should keep the same style as the dictionary
   ones, so I suggest changing the keywords to SORT_INCLUDE and
   SORT_EXCLUDE.
   
2. The user may be confused if DICTIONARY_EXCLUDE='ALL' and
DICTIONARY_INCLUDE='C3' appear together. 

3. The values in the sort and dictionary properties should only allow
   column names. If DICTIONARY_EXCLUDE='ALL' were allowed, "ALL" could
   conflict with an actual table column name.
  
So I think the key point is how to decide the default for columns that are
not set in INCLUDE or EXCLUDE, because for the end user, putting a column
in INCLUDE or EXCLUDE means that column is important and of concern to him.

So my suggestion is as follows: add one more property called xxx_DEFAULT.
For example, if we have 6 columns, we can write the DDL as below. 
case 1 : 
SORT_INCLUDE="C1,C2,C3" 
SORT_EXCLUDE="C4,C5,C6" 
In the above case C1, C2, C3 are sort columns and part of the MDK key, and
C4, C5, C6 become non-sort columns (measure/complex). 

*DICTIONARY_DEFAULT*='EXCLUDE' 
DICTIONARY_INCLUDE='C3' 
In the above case all sort columns (C1,C2,C3) are non-dictionary columns
except C3; here C3 is a dictionary column. 

case 2: 
*SORT_DEFAULT*="INCLUDE" 
SORT_EXCLUDE="C6" 
In this case all columns are sort columns except C6. 

DICTIONARY_EXCLUDE= 'C2' 
*DICTIONARY_DEFAULT*='INCLUDE' 
In the above case all sort columns (C1,C2,C3,C4,C5) are dictionary columns
except C2; here C2 is a no-dictionary column. 
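The INCLUDE/EXCLUDE/DEFAULT precedence proposed above can be made explicit with a small sketch (the `resolve` helper is hypothetical: an explicit INCLUDE or EXCLUDE wins, otherwise the xxx_DEFAULT applies):

```python
def resolve(columns, include=(), exclude=(), default="EXCLUDE"):
    """Return the set of columns the property applies to."""
    chosen = set()
    for c in columns:
        if c in include:          # explicit INCLUDE always wins
            chosen.add(c)
        elif c in exclude:        # explicit EXCLUDE always loses
            continue
        elif default == "INCLUDE":
            chosen.add(c)         # unlisted columns follow xxx_DEFAULT
    return chosen

cols = ["C1", "C2", "C3", "C4", "C5", "C6"]

# Case 1: DICTIONARY_DEFAULT='EXCLUDE', DICTIONARY_INCLUDE='C3'
print(sorted(resolve(cols, include={"C3"}, default="EXCLUDE")))   # ['C3']

# Case 2: SORT_DEFAULT='INCLUDE', SORT_EXCLUDE='C6'
print(sorted(resolve(cols, exclude={"C6"}, default="INCLUDE")))   # ['C1', 'C2', 'C3', 'C4', 'C5']
```

Both cases from the proposal above fall out of the same three-step precedence, which is the appeal of the xxx_DEFAULT idea.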




ravipesala wrote
> Hi Bill,
> 
> I got your point, but making no-dictionary the default may not be a
> perfect solution. Basically, no-dictionary columns are only meant for
> high-cardinality dimensions, so the usage may change from user to user
> or from scenario to scenario.
> This is basically an issue of DDL usability, so please first focus on
> simplifying the DDL.
> 
> For example we have 6 columns , we can mention DDL as below.
> case 1 :
> SORT_COLUMNS="C1,C2,C3"
> NON_SORT_COLUMNS="C4,C5,C6"
> In above case C1, C2 , C3 are sort columns and part of MDK key. And
> C4,C5,C6 are become non sort columns(measure/complex)
> 
> DICTIONARY_EXCLUDE= 'ALL'
> DICTIONARY_INCLUDE='C3'
> In the above case all sort columns((C1,C2,C3) are non-dictionary columns
> except C3, here C3 is dictionary column.
> 
> case 2:
> SORT_COLUMNS="ALL"
> NON_SORT_COLUMNS="C6"
> In this case all columns are sort columns except C6.
> 
> DICTIONARY_EXCLUDE= 'C2'
> DICTIONARY_INCLUDE='ALL'
> In the above case all sort columns(C1,C2,C3,C4,C5) are dictionary columns
> except C2, here C2 is no-dictionary column.
> 
> Above mentioned are just my idea of how to simplify DDL to handle all
> scenarios. We can have more discussion towards it to simplify the DDL.
> 
> Regards,
> Ravindra.
> 
> On 27 February 2017 at 12:38, bill.zhou <zgcsky08@> wrote:
> 
>> Dear Vishal & Ravindra
>>
>>   Thanks for your reply. I think I didn't describe it clearly, so you
>> didn't get the full idea.
>> 1. Dictionary is an important feature in CarbonData; we introduce it to
>> every new customer. So a new customer will know it clearly and will set
>> the dictionary columns when creating a table.
>> 2. Customers such as bank, telecom and traffic customers share the same
>> scenario: they have many columns but set only a few as dictionary
>> columns.
>> For a telecom customer, out of 300 columns only 5 are set as
>> dictionary; the other dimensions are not.
>> For a bank customer, out of 100 columns only about 5 are set as
>> dictionary; the other dimensions are not.
>> *In current customers' actual usage scenarios, they only set the
>> dimensions used for filters and group-by as dictionary columns.*
>> 3. My suggestion is that a dimension defaults to no dictionary only if
>> it is not put into the dictionary_include property, not for all
>> dimension columns. If a customer always adds the 5 needed columns to
>> dictionary_include and leaves the other columns as no dictionary, this
>> will not impact query performance.
>>
>> So I suggest that dimension columns which are not added to the
>> dictionary_include property default to no dictionary.
>>
>> Regards
>> Bill
>>
>>
>>
>> kumarvishal09 wrote
>> > Hi,
>> > I completely agree with Ravindra's points, more number of no
>> > dictionary
>> > column will impact the IO reading+writing both as in ca

Re: [DISCUSS] For the dimension default should be no dictionary

2017-02-26 Thread bill.zhou
Dear Vishal & Ravindra 
 
  Thanks for your reply. I think I didn't describe it clearly, so you
didn't get the full idea. 
1. Dictionary is an important feature in CarbonData; we introduce it to
every new customer. So a new customer will know it clearly and will set the
dictionary columns when creating a table.
2. Customers such as bank, telecom and traffic customers share the same
scenario: they have many columns but set only a few as dictionary columns.
    For a telecom customer, out of 300 columns only 5 are set as
dictionary; the other dimensions are not. 
    For a bank customer, out of 100 columns only about 5 are set as
dictionary; the other dimensions are not.
*In current customers' actual usage scenarios, they only set the dimensions
used for filters and group-by as dictionary columns.*
3. My suggestion is that a dimension defaults to no dictionary only if it
is not put into the dictionary_include property, not for all dimension
columns. If a customer always adds the 5 needed columns to
dictionary_include and leaves the other columns as no dictionary, this will
not impact query performance. 

So I suggest that dimension columns which are not added to the
dictionary_include property default to no dictionary. 

Regards
Bill



kumarvishal09 wrote
> Hi,
> I completely agree with Ravindra's points. More no-dictionary columns
> will impact IO for both reading and writing, since without a dictionary
> the data size increases. Late decoding is one of the main advantages, so
> aggregation on no-dictionary columns will be slower. Filter queries will
> suffer as well: for a dictionary column we compare on the byte-packed
> value, while for a no-dictionary column it is on the actual value.
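The point about filtering on encoded values can be sketched like this (a toy example, not CarbonData's dictionary implementation): the filter literal is translated to a surrogate id once, and the scan then compares small integers instead of strings.

```python
def build_dictionary(values):
    # Sorted distinct values -> surrogate ids, so id order follows value order.
    return {v: i for i, v in enumerate(sorted(set(values)))}

cities = ["delhi", "banglore", "delhi", "pune", "banglore", "delhi"]
dictionary = build_dictionary(cities)       # {'banglore': 0, 'delhi': 1, 'pune': 2}
encoded = [dictionary[c] for c in cities]   # column stored as small ints: [1, 0, 1, 2, 0, 1]

# Filter city = 'delhi': one string lookup, then integer comparisons only.
target = dictionary["delhi"]
matches = [i for i, code in enumerate(encoded) if code == target]
print(matches)  # [0, 2, 5]
```

Because ids preserve value order here, range predicates can also be evaluated on the ids, which is the advantage described above.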
> 
> -Regards
> Kumar Vishal
> 
> On Mon, Feb 27, 2017 at 12:34 AM, Ravindra Pesala <ravi.pesala@> wrote:
> 
>> Hi,
>>
>> I feel there are more disadvantages than advantages in this approach. In
>> your current scenario you want to set dictionary only for columns which
>> are
>> used as filters, but the usage of dictionary is not only limited for
>> filters, it can reduce the store size and improve the aggregation
>> queries.
>> I think you should set no_inverted_index false on non filtered columns to
>> reduce the store size and improve the performance.
>>
>> If we make no dictionary the default, then the user need not set those
>> columns in the DDL but still needs to set the dictionary columns. If the
>> user wants to set more dictionary columns, then the same problem you
>> mentioned arises again, so it does not solve the problem. I feel we
>> should give more flexibility in our DDL to simplify these scenarios, and
>> we should have more discussion on it.
>>
>> Pros & Cons of your suggestion.
>> Advantages :
>> 1. Decoding/Encoding of dictionary could be avoided.
>>
>> Disadvantages :
>> 1. Store size will increase drastically.
>> 2. IO will increase so query performance will come down.
>> 3. Aggregation queries performance will suffer.
>>
>>
>>
>> Regards,
>> Ravindra.
>>
>> On 26 February 2017 at 20:04, bill.zhou <zgcsky08@> wrote:
>>
>> > hi All
>> >     Now when creating a CarbonData table, if a dimension is not added
>> > to the dictionary_exclude property, the dimension is considered a
>> > dictionary column by default. I think the default should be no
>> > dictionary.
>> >
>> >     For example, when I did a POC for one customer, the table had 300
>> > columns and 200 dimensions, but only 5 columns were used for filters,
>> > so he only needed to set those 5 columns as dictionary and leave the
>> > other 195 columns as no dictionary. But now he has to list the 195
>> > columns in the dictionary_exclude property, which wastes time, makes
>> > the create table command huge, and also impacts load performance.
>> >
>> >     So I suggest that dimensions default to no dictionary; this can
>> > also help the customer easily know which dictionary columns are
>> > useful.
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-carbondata-
>> > mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-
>> > dimension-default-should-be-no-dictionary-tp8010.html
>> > Sent from the Apache CarbonData Mailing List archive at Nabble.com.
>> >
>>
>>
>>
>> --
>> Thanks & Regards,
>> Ravi
>>





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8027.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


[DISCUSS] For the dimension default should be no dictionary

2017-02-26 Thread bill.zhou
hi All 
    Now when creating a CarbonData table, if a dimension is not added to
the dictionary_exclude property, the dimension is considered a dictionary
column by default. I think the default should be no dictionary. 

    For example, when I did a POC for one customer, the table had 300
columns and 200 dimensions, but only 5 columns were used for filters, so he
only needed to set those 5 columns as dictionary and leave the other 195
columns as no dictionary. But now he has to list the 195 columns in the
dictionary_exclude property, which wastes time, makes the create table
command huge, and also impacts load performance.

    So I suggest that dimensions default to no dictionary; this can also
help the customer easily know which dictionary columns are useful.



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


compile error from spark project: scala.reflect.internal.MissingRequirmentError: object scala.runtime in compiler mirror not found

2016-11-29 Thread bill.zhou
hi all 

 Today I fetched the latest code from the master branch, then compiled the
project. 
When compiling the spark project it gives the following issue. Does anyone
know about this issue?


(error screenshot not preserved in the archive)



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/compile-error-from-spark-project-scala-reflect-internal-MissingRequirmentError-object-scala-runtime-d-tp3399.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-25 Thread bill.zhou
+1 
Regards
Bill

Venkata Gollamudi wrote
> Hi All,
> 
> CarbonData 0.2.0 has been good work and a stable release, with a lot of
> defects fixed and a number of performance improvements.
> https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> 
> Next version has many major and new value added features are planned,
> taking CarbonData capability to next level.
> Like
> - IUD(Insert-Update-Delete) support,
> - complete rewrite of data load flow with out Kettle,
> - Spark 2.x support,
> - Standardize CarbonInputFormat and CarbonOutputFormat,
> - alluxio(tachyon) file system support,
> - Carbon thrift format optimization for fast query,
> - Data loading performance improvement and In memory off heap sorting,
> - Query performance improvement using off heap,
> - Support Vectorized batch reader.
> 
> https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> 
> I think it makes sense to change CarbonData Major version in next version
> to 1.0.0.
> Please comment and vote on this.
> 
> Thanks,
> Ramana





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/CarbonData-propose-major-version-number-increment-for-next-version-to-1-0-0-tp3131p3219.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: As planed, we are ready to make Apache CarbonData 0.2.0 release:

2016-11-10 Thread bill.zhou
+1 
Regards 
bill.zhou 

Liang Chen wrote
> Hi all
> 
> In the 0.2.0 version of CarbonData, there are major performance
> improvements, like blocklet distribution, support for BZIP2 compressed
> files, and so on, added to enhance CarbonData performance significantly.
> Along with the performance improvements, there are new features added to
> enhance the compatibility and usability of CarbonData, like removing the
> thrift compiler dependency.
> 
> 
> I can be the release manager for this release; can JB guide me through
> finishing it?
> 
> Thanks.
> 
> 
> Regards
> Liang





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/As-planed-we-are-ready-to-make-Apache-CarbonData-0-2-0-release-tp2738p2861.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: In load data, CSV row contains invalid quote char and results are invalid

2016-10-29 Thread bill.zhou
hi Singh

   The quote char in the CSV must appear in pairs, like: 
name, description, salary, age, dob 
tammy,$my name$,$90$,22,19/10/2019 
tammy1,$delhi$,$32345%$,22,19/10/2019
tammy2,$banglore$,$543$,$44$,19/10/2019
tammy3,$city$,$343$,$22$,12/10/2019 
tammy4,$punjab$,$23423$,$55$,19/10/2019 
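The pairing rule can be checked with any standard CSV parser; here is a sketch using Python's csv module with '$' as the quote char (assuming the loader follows similar quote semantics):

```python
import csv
import io

good = "tammy,$my name$,$90$,22,19/10/2019\n"
bad = "tammy1,$delhi$,$32345%,22,19/10/2019\n"   # opening '$' is never closed

rows_good = list(csv.reader(io.StringIO(good), quotechar="$"))[0]
rows_bad = list(csv.reader(io.StringIO(bad), quotechar="$"))[0]

print(rows_good)      # ['tammy', 'my name', '90', '22', '19/10/2019'] -- 5 fields
print(len(rows_bad))  # 3: the unclosed quote swallows everything after it into one field
```

This reproduces the reported symptom: an unclosed quote makes the parser merge the rest of the record (and potentially following rows) into a single field.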


Harmeet Singh wrote
> Hi Team,
> 
> I am trying to load the CSV fie, which contains invalid quote char. But in
> the results the row is inserted and values are mix with next row without
> any waring and error. Following are the details: 
> CSV File: 
> 
> name, description, salary, age, dob
> tammy,$my name$,$90$,22,19/10/2019
> tammy1,$delhi$,$32345%,22,19/10/2019
> 
> tammy2,$banglore$,$543$,$44$,19/10/2019
> tammy3,$city$,$343$,$22$,12/10/2019
> tammy4,$punjab$,$23423$,$55$,19/10/2019
> 
> In the CSV, row 2 contains an invalid quote char.
> Table :
> 
> create table one (name string, description string, salary double, age int,
> dob timestamp) stored by 'carbondata';
> Load data: 
> 
> load data local inpath
> 'hdfs://localhost:54310/home/harmeet/dollarquote3.csv' into table one
> OPTIONS('QUOTECHAR'="$");
> Actual Results: 
> 
> +-+--+---+---+--+--+
> |  name   | description  |  dob  |  salary   | age  |
> +-+--+---+---+--+--+
> | tammy   | my name  | NULL  | 90.0  | 22   |
> | tammy1  | delhi| NULL  | NULL  | 543  |
> | tammy3  | city | NULL  | 343.0 | 22   |
> | tammy4  | punjab   | NULL  | 23423.0   | 55   |
> +-+--+---+---+--+--+
> 
> In the result, the tammy1 row contains the salary of the tammy2 record.
> Expected Result: 
> 
> Maybe raise an error while reading the invalid quote char, OR just
> ignore the row with the invalid quote char.





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/In-load-data-CSV-row-contains-invalid-quote-char-and-results-are-invalid-tp2347p2463.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: [Discussion] Support Date/Time format for Timestamp columns to be defined at column level

2016-09-29 Thread bill.zhou
+1, I agree with Vimal's opinion.
If a user wants another format, he can use a function to convert it. 

Regards
Bill



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-Support-Date-Time-format-for-Timestamp-columns-to-be-defined-at-column-level-tp1422p1560.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: [discussion]When table properties is repeated it only set the last one

2016-09-29 Thread bill.zhou
+1



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/discussion-When-table-properties-is-repeated-it-only-set-the-last-one-tp1539p1559.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: Open discussion and Vote: What kind of JIRA issue events need send mail to dev@carbondata.incubator.apache.org

2016-08-18 Thread bill.zhou
Option 2; it would be better to also add the issue-closed event.



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Open-discussion-and-Vote-What-kind-of-JIRA-issue-events-need-send-mail-to-dev-carbondata-incubator-ag-tp321p325.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.