[jira] [Created] (CARBONDATA-449) Remove unnecessary log property

2016-11-24 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-449:
---

 Summary: Remove unnecessary log property
 Key: CARBONDATA-449
 URL: https://issues.apache.org/jira/browse/CARBONDATA-449
 Project: CarbonData
  Issue Type: Improvement
Affects Versions: 0.2.0-incubating
Reporter: Jacky Li
 Fix For: 0.3.0-incubating


When creating the Log object, there are some unnecessary properties.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-24 Thread Liang Chen
Hi xiaoqiao

OK, I look forward to seeing your test results.
Can you take on this improvement task? Please let me know if you need
any support :)

Regards
Liang


hexiaoqiao wrote
> Hi Kumar Vishal,
> 
> Thanks for your suggestions. As you said, choosing a Trie in place of
> HashMap gives a better memory footprint along with good performance. Of
> course, DAT is not the only choice; I will run tests on DAT vs Radix Trie
> and release the results as soon as possible. Thanks again for your
> suggestions.
> 
> Regards,
> Xiaoqiao
> 
> On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <kumarvishal1802@...> wrote:
> 
>> Hi Xiaoqiao He,
>> +1,
>> For the forward dictionary case it will be a very good optimisation, as
>> our case is very specific: storing a byte-array-to-int mapping [data to
>> surrogate key mapping]. I think we will get a much better memory
>> footprint, and performance will also be good (2x). We can also try a
>> radix tree (radix trie); it is more optimised for storage.
>>
>> -Regards
>> Kumar Vishal
>>
>> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <chenliang6136@...> wrote:
>>
>> > Hi xiaoqiao
>> >
>> > For the below example, 600K dictionary data:
>> > Is it to say that using "DAT" can save 36MB of memory against
>> > "ConcurrentHashMap", whereas the performance loss is small (1718 ms)?
>> >
>> > One more question: if the dictionary data size increases, what are the
>> > comparison results for "ConcurrentHashMap" vs "DAT"?
>> >
>> > Regards
>> > Liang
>> > 
>> > --
>> > a. memory footprint (approximate quantity) in 64-bit JVM:
>> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>> >
>> > b. retrieval performance: total time(ms) of 500 million query:
>> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>> >
>> > Regards
>> > Liang
>> >
>> > hexiaoqiao wrote
>> > > hi Liang,
>> > >
>> > > Thanks for your reply. I need to correct the experiment result,
>> > > because the first (NO.1) column of the result table was in the wrong
>> > > order.
>> > >
>> > > In order to compare the performance of Trie and HashMap, two
>> > > different structures were constructed using the same dictionary data,
>> > > whose size is 600K and whose items are between 2 and 50 bytes long.
>> > >
>> > > ConcurrentHashMap (the structure currently used in CarbonData) vs
>> > > Double Array Trie (one implementation of Trie structures)
>> > >
>> > > a. memory footprint (approximate quantity) in 64-bit JVM:
>> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>> > >
>> > > b. retrieval performance: total time(ms) of 500 million query:
>> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>> > >
>> > > Regards,
>> > > He Xiaoqiao
>> > >
>> > >
>> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <chenliang6136@...> wrote:
>> > >
>> > >> Hi xiaoqiao
>> > >>
>> > >> This improvement looks great!
>> > >> Can you please explain the below data, what does it mean?
>> > >> --
>> > >> ConcurrentHashMap
>> > >> ~68MB 14543
>> > >> Double Array Trie
>> > >> ~104MB 12825
>> > >>
>> > >> Regards
>> > >> Liang
>> > >>
>> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <xq.he2009@...>:
>> > >>
>> > >> >  Hi All,
>> > >> >
>> > >> > I would like to propose a Dictionary improvement which uses a Trie
>> > >> > in place of HashMap.
>> > >> >
>> > >> > In order to speed up aggregation, reduce run-time memory footprint,
>> > >> > enable fast distinct count, etc., CarbonData encodes data using a
>> > >> > dictionary at file level or table level based on cardinality. This
>> > >> > is a general and efficient approach in many big data systems, but
>> > >> > with the ConcurrentHashMap currently used to maintain the
>> > >> > Dictionary in CarbonData, the memory overhead of the Driver is very
>> > >> > large, since it has to load the whole Dictionary to decode actual
>> > >> > data values, especially when column cardinality is a large number.
>> > >> > Also, by default CarbonData will not build a dictionary if
>> > >> > cardinality > 1 million.
>> > >> >
>> > >> > I propose using Trie in place of HashMap for the following three
>> > >> > reasons:
>> > >> > (1) a Trie is a natural structure for a Dictionary,
>> > >> > (2) it reduces the memory footprint,
>> > >> > (3) it does not impact retrieval performance.
>> > >> >
>> > >> > The experimental results show that Trie is able to meet the
>> > >> > requirement.
>> > >> > a. ConcurrentHashMap vs Double Array Trie (one implementation of
>> > >> > Trie structures)
>> > >> > b. Dictionary size: 600K
>> > >> > c. Memory footprint and query time:
>> > >> > - structure: memory footprint (64-bit JVM) / 500 million query time (ms)
>> > >> > - ConcurrentHashMap: ~68MB / 14543
>> > >> > - Double Array Trie: ~104MB / 12825
>> > >> >
>> > >> > Please share your suggestions about the proposed improvement of
>> > >> > Dictionary.
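
As an illustration of the mapping this thread discusses, here is a minimal
sketch of a byte-wise trie that maps a dictionary value (byte[]) to its int
surrogate key. The class and method names are hypothetical, not CarbonData
code, and a node-object trie like this mainly shows the lookup structure;
the real memory saving of a DAT comes from packing nodes into flat arrays.

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class DictionaryTrieSketch {

        private static final class Node {
            final Map<Byte, Node> children = new HashMap<>();
            int surrogate = -1; // -1 means no dictionary entry ends here
        }

        private final Node root = new Node();

        // Insert one dictionary value with its surrogate key; shared
        // prefixes are stored only once, which is the memory argument
        // made in this thread.
        public void put(byte[] value, int surrogate) {
            Node cur = root;
            for (byte b : value) {
                cur = cur.children.computeIfAbsent(b, k -> new Node());
            }
            cur.surrogate = surrogate;
        }

        // Look up the surrogate key for a value; returns -1 if absent.
        public int get(byte[] value) {
            Node cur = root;
            for (byte b : value) {
                cur = cur.children.get(b);
                if (cur == null) {
                    return -1;
                }
            }
            return cur.surrogate;
        }

        public static void main(String[] args) {
            DictionaryTrieSketch dict = new DictionaryTrieSketch();
            dict.put("dept1".getBytes(StandardCharsets.UTF_8), 1);
            dict.put("dept2".getBytes(StandardCharsets.UTF_8), 2);
            // prints 2
            System.out.println(dict.get("dept2".getBytes(StandardCharsets.UTF_8)));
        }
    }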

Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-24 Thread Aniket Adnaik
+1 for changing to major version given the list of items being covered in
the release.

Agree with Jacky's comment on IUD; let's correct it to Update/Delete support
instead of IUD.

Best Regards,
Aniket


On Thu, Nov 24, 2016 at 6:36 PM, Jacky Li  wrote:

> +1,
> and comments inline
>
> > On Nov 24, 2016, at 12:09 AM, Venkata Gollamudi wrote:
> >
> > Hi All,
> >
> > CarbonData 0.2.0 has been a good and stable release, with a lot of
> > defects fixed and a number of performance improvements.
> > https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%
> 20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%
> 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> >
> > Many major new value-added features are planned for the next version,
> > taking CarbonData's capability to the next level.
> > Like
> > - IUD(Insert-Update-Delete) support,
>
> Actually, the design doc Aniket has shared is for Update and Delete only;
> Insert is not covered. I think Insert is a feature that needs to be
> designed in the future.
>
> > - complete rewrite of the data load flow without Kettle,
> > - Spark 2.x support,
>
> Since Spark 2.x has made changes to the user-level API and SQL, it will
> also make some of CarbonData's SQL commands incompatible with earlier
> versions. So I think upgrading the CarbonData version to 1.0.0 also
> indicates this incompatibility.
>
> > - Standardize CarbonInputFormat and CarbonOutputFormat,
> > - alluxio(tachyon) file system support,
> > - Carbon thrift format optimization for fast query,
> > - Data loading performance improvement and In memory off heap sorting,
> > - Query performance improvement using off heap,
> > - Support Vectorized batch reader.
> >
> > https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%
> 20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%
> 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> >
> > I think it makes sense to change CarbonData Major version in next version
> > to 1.0.0.
> > Please comment and vote on this.
> >
> > Thanks,
> > Ramana
>
>
>
>


Re: [Feature ]Design Document for Update/Delete support in CarbonData

2016-11-24 Thread Aniket Adnaik
Hi Kumar Vishal,

Yes, valid point, and there have been thoughts about it; there is a lot of
scope for optimizing compaction strategies. We may even consider a
background monitor process (or a cron job or similar) to monitor and
trigger compaction automatically in the future.

Best Regards,
Aniket

On Thu, Nov 24, 2016 at 1:32 AM, Kumar Vishal 
wrote:

> Hi Aniket,
>
> I think if update/delete touches only a little data, then horizontal
> compaction can run based on user configuration; but if more data is
> getting updated, then it is better to start vertical compaction
> immediately. This is because we are not physically deleting the data from
> disk: if more data is getting updated (more than 60%), then during a query
> we will first query the older data + exclude the deleted records + include
> the update delta file data. So in this case more data will come into
> memory; we can avoid this by starting vertical compaction immediately
> after update/delete.
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 2:43 PM, Kumar Vishal 
> wrote:
>
> > Hi Aniket,
> >
> > I agree with Vimal's opinion, but that use case will be very rare.
> >
> > I have one query about this update and delete feature.
> > When will we start compaction after each update or delete operation?
> >
> > -Regards
> > Kumar Vishal
> >
> >
> >
> > On Thu, Nov 24, 2016 at 12:05 AM, Aniket Adnaik wrote:
> >
> >> Hi Vimal,
> >>
> >> Thanks for your suggestions.
> >> For the 1st point, I tend to agree with Manish's comments. But it's
> >> worth looking into different ways to optimize the performance.
> >> I guess query performance may take priority over update performance.
> >> Basically, we may need a better compaction approach to merge delta
> >> files into regular carbon files to maintain adequate performance.
> >> For the 2nd point, CarbonData will support updating multiple rows, but
> >> not the same row multiple times in a single update operation. It is
> >> possible that the join condition in the sub-select of the original
> >> update statement can result in multiple rows from the source table for
> >> the same row in the target table. This is an ambiguous condition, and
> >> the common ways to solve it are to error out, to apply the first
> >> matching row, or to apply the last matching row. CarbonData will choose
> >> to error out and let the user resolve the ambiguity, which is the
> >> safer/standard choice.
> >>
> >> Best Regards,
> >> Aniket
> >>
> >> On Wed, Nov 23, 2016 at 4:54 AM, manish gupta <tomanishgupt...@gmail.com> wrote:
> >>
> >> > Hi Vimal,
> >> >
> >> > I have a few queries regarding the 1st suggestion.
> >> >
> >> > 1. Dimensions can be both dictionary and no-dictionary. If we update
> >> > the dictionary file, then we will have to maintain two flows, one for
> >> > dictionary columns and one for no-dictionary columns. Will that be ok?
> >> >
> >> > 2. We write dictionary files in append mode. Updating dictionary
> >> > files would mean completely rewriting the dictionary file, which will
> >> > also modify the dictionary metadata and sort index file; OR some
> >> > other approach needs to be followed, like maintaining an update delta
> >> > mapping for the dictionary file.
> >> >
> >> > Regards
> >> > Manish Gupta
> >> >
> >> > On Wed, Nov 23, 2016 at 10:47 AM, Vimal Das Kammath <
> >> > vimaldas.kamm...@gmail.com> wrote:
> >> >
> >> > > Hi Aniket,
> >> > >
> >> > > The design looks sound and the documentation is great.
> >> > > I have few suggestions.
> >> > >
> >> > > 1) Measure update vs dimension update: in the case of a dimension
> >> > > update, for example, the user wants to change dept1 to dept2 for
> >> > > all users who are under dept1. Can we just update the dictionary
> >> > > for faster performance?
> >> > > 2) Update semantics (one matching record vs multiple matching
> >> > > records): I could not understand this section. I wanted to confirm
> >> > > whether we will support one update statement updating multiple rows.
> >> > >
> >> > > -Vimal
> >> > >
> >> > > On Tue, Nov 22, 2016 at 2:30 PM, Liang Chen <
> chenliang6...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi  Aniket
> >> > > >
> >> > > > Thank you for finishing the good design documents. A couple of
> >> > > > inputs from my side:
> >> > > >
> >> > > > 1. Please add the below mentioned info (RowId definition etc.) to
> >> > > > the design documents also.
> >> > > > 2. On page 6: "Schema change operation can run in parallel with
> >> > > > Update or Delete operations, but not with another schema change
> >> > > > operation"; can you explain this item?
> >> > > > 3. Please unify the description: use "CarbonData" to replace
> >> > > > "Carbon", and unify the description of "destination table" and
> >> > > > "target table".
> >> > > > 4. Is the Update operation's delete delta the same as the Delete
> >> > > > operation's delete delta?
> >> > > >
> >> > > > BTW, it would be much better if you could provide Google Docs for
> >> > > > review
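
To make the update semantics above concrete, here is a minimal sketch
(assumed semantics and hypothetical names, not CarbonData source) of
detecting the ambiguous case where the update's sub-select join yields
several source rows for one target row, and erroring out as Aniket
describes.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class UpdateAmbiguityCheck {

        // matches: (targetRowId, newValue) pairs produced by the join in
        // the update's sub-select. Returns the per-target-row assignment,
        // or throws if any target row is matched more than once (the
        // "error out" policy described above).
        static Map<Long, String> validate(List<Map.Entry<Long, String>> matches) {
            Map<Long, String> assignment = new HashMap<>();
            for (Map.Entry<Long, String> m : matches) {
                if (assignment.putIfAbsent(m.getKey(), m.getValue()) != null) {
                    throw new IllegalStateException(
                        "Multiple source rows match target row " + m.getKey()
                        + "; resolve the ambiguity in the update statement");
                }
            }
            return assignment;
        }
    }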

Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-24 Thread Jacky Li
+1,
and comments inline

> On Nov 24, 2016, at 12:09 AM, Venkata Gollamudi wrote:
> 
> Hi All,
> 
> CarbonData 0.2.0 has been a good and stable release, with a lot of
> defects fixed and a number of performance improvements.
> https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> 
> Many major new value-added features are planned for the next version,
> taking CarbonData's capability to the next level.
> Like
> - IUD(Insert-Update-Delete) support,

Actually, the design doc Aniket has shared is for Update and Delete only;
Insert is not covered. I think Insert is a feature that needs to be designed
in the future.

> - complete rewrite of the data load flow without Kettle,
> - Spark 2.x support,

Since Spark 2.x has made changes to the user-level API and SQL, it will also
make some of CarbonData's SQL commands incompatible with earlier versions. So
I think upgrading the CarbonData version to 1.0.0 also indicates this
incompatibility.

> - Standardize CarbonInputFormat and CarbonOutputFormat,
> - alluxio(tachyon) file system support,
> - Carbon thrift format optimization for fast query,
> - Data loading performance improvement and In memory off heap sorting,
> - Query performance improvement using off heap,
> - Support Vectorized batch reader.
> 
> https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> 
> I think it makes sense to change CarbonData Major version in next version
> to 1.0.0.
> Please comment and vote on this.
> 
> Thanks,
> Ramana





[jira] [Created] (CARBONDATA-448) Solve compilation error in core for spark2

2016-11-24 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-448:
---

 Summary: Solve compilation error in core for spark2
 Key: CARBONDATA-448
 URL: https://issues.apache.org/jira/browse/CARBONDATA-448
 Project: CarbonData
  Issue Type: Bug
Affects Versions: 0.2.0-incubating
Reporter: Jacky Li
 Fix For: 0.3.0-incubating


Currently, the carbon-core module fails to compile with -Pspark-2.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-24 Thread Naresh P R
+1

Regards,
Naresh

On Nov 24, 2016 11:30 PM, "Ravindra Pesala"  wrote:

> +1
>
> On Thu, Nov 24, 2016, 10:37 PM manish gupta 
> wrote:
>
> > +1
> >
> > Regards
> > Manish Gupta
> >
> > On Thu, Nov 24, 2016 at 7:30 PM, Kumar Vishal wrote:
> >
> > > +1
> > >
> > > -Regards
> > > Kumar Vishal
> > >
> > > On Thu, Nov 24, 2016 at 2:41 PM, Raghunandan S <
> > > carbondatacontributi...@gmail.com> wrote:
> > >
> > > > +1
> > > > On Thu, 24 Nov 2016 at 2:30 PM, Liang Chen 
> > > > wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > Ya, good proposal.
> > > > > CarbonData 0.x versions integrate with Spark 1.x, and the data
> > > > > load solution of 0.x versions uses Kettle.
> > > > > CarbonData 1.x versions integrate with Spark 2.x, and the data
> > > > > load solution of 1.x versions will not use Kettle.
> > > > >
> > > > > That would help reduce maintenance cost by distinguishing the
> > > > > major versions.
> > > > >
> > > > > +1 for the proposal.
> > > > >
> > > > > Regards
> > > > > Liang
> > > > >
> > > > >
> > > > > Venkata Gollamudi wrote
> > > > > > Hi All,
> > > > > >
> > > > > > CarbonData 0.2.0 has been a good and stable release, with a lot
> > > > > > of defects fixed and a number of performance improvements.
> > > > > >
> > > > > https://issues.apache.org/jira/browse/CARBONDATA-320?
> > > jql=project%20%3D%
> > > > 20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-
> incubating%20ORDER%20BY%
> > > > 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> > > > > >
> > > > > > Many major new value-added features are planned for the next
> > > > > > version, taking CarbonData's capability to the next level.
> > > > > > Like
> > > > > > - IUD(Insert-Update-Delete) support,
> > > > > > - complete rewrite of the data load flow without Kettle,
> > > > > > - Spark 2.x support,
> > > > > > - Standardize CarbonInputFormat and CarbonOutputFormat,
> > > > > > - alluxio(tachyon) file system support,
> > > > > > - Carbon thrift format optimization for fast query,
> > > > > > - Data loading performance improvement and In memory off heap
> > > sorting,
> > > > > > - Query performance improvement using off heap,
> > > > > > - Support Vectorized batch reader.
> > > > > >
> > > > > >
> > > > > https://issues.apache.org/jira/browse/CARBONDATA-301?
> > > jql=project%20%3D%
> > > > 20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-
> incubating%20ORDER%20BY%
> > > > 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> > > > > >
> > > > > > I think it makes sense to change CarbonData Major version in next
> > > > version
> > > > > > to 1.0.0.
> > > > > > Please comment and vote on this.
> > > > > >
> > > > > > Thanks,
> > > > > > Ramana
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > View this message in context:
> > > > > http://apache-carbondata-mailing-list-archive.1130556.
> > > > n5.nabble.com/CarbonData-propose-major-version-number-
> > > > increment-for-next-version-to-1-0-0-tp3131p3157.html
> > > > > Sent from the Apache CarbonData Mailing List archive mailing list
> > > archive
> > > > > at Nabble.com.
> > > > >
> > > >
> > >
> >
>


Re: [jira] [Created] (CARBONDATA-440) Provide Update/Delete functionality support in CarbonData

2016-11-24 Thread sujith chacko
Hi Aniket,

I think it would be better if we could also consider the delete and update
compaction feature in the high-level design list.

Thanks,
Sujith

On Nov 23, 2016 4:29 AM, "Aniket Adnaik (JIRA)"  wrote:

> Aniket Adnaik created CARBONDATA-440:
> 
>
>  Summary: Provide Update/Delete functionality support in
> CarbonData
>  Key: CARBONDATA-440
>  URL: https://issues.apache.org/jira/browse/CARBONDATA-440
>  Project: CarbonData
>   Issue Type: New Feature
>   Components: core, data-query, file-format, spark-integration, sql
> Affects Versions: 0.1.1-incubating, 0.1.0-incubating, 0.2.0-incubating
> Reporter: Aniket Adnaik
>  Fix For: 0.3.0-incubating
>
>
> Currently, CarbonData does not support modification of existing rows in
> the table. This is a major limitation for many desirable real-world use
> cases in data warehousing, such as slowly changing dimension tables, data
> correction of fact tables, or data cleanup. Many users want to be able
> to update and delete rows from the CarbonData table.
>
> Following are some high level design goals to support this functionality,
> 1. Provide a standard SQL interface to perform Update and Delete
> operations.
> 2. Perform Update and Delete operations on CarbonData table without having
> to rewrite the entire CarbonData block (file) by making use of differential
> files (a.k.a delta files).
> 3. After Update or Delete operation, CarbonData readers should skip
> deleted records and read updated records seamlessly without having to
> modify user applications.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
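
A minimal sketch of the differential-file idea in goals 2 and 3 above
(hypothetical names, with an in-memory BitSet standing in for a persisted
delete delta file; this is not CarbonData's actual reader code): the base
block is left untouched, and the reader filters out the row ids marked in
the delta.

    import java.util.BitSet;
    import java.util.stream.IntStream;

    public class DeleteDeltaSketch {

        // Returns only the rows of one block whose ids are not marked
        // deleted in that block's delete delta.
        static String[] applyDeleteDelta(String[] blockRows, BitSet delta) {
            return IntStream.range(0, blockRows.length)
                    .filter(rowId -> !delta.get(rowId))
                    .mapToObj(rowId -> blockRows[rowId])
                    .toArray(String[]::new);
        }

        public static void main(String[] args) {
            String[] rows = {"r0", "r1", "r2", "r3"};
            BitSet deleted = new BitSet();
            deleted.set(1); // a DELETE marked row 1...
            deleted.set(3); // ...and row 3; the base file is unchanged
            for (String r : applyDeleteDelta(rows, deleted)) {
                System.out.println(r); // prints r0, r2
            }
        }
    }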


Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-24 Thread Ravindra Pesala
+1

On Thu, Nov 24, 2016, 10:37 PM manish gupta 
wrote:

> +1
>
> Regards
> Manish Gupta
>
> On Thu, Nov 24, 2016 at 7:30 PM, Kumar Vishal 
> wrote:
>
> > +1
> >
> > -Regards
> > Kumar Vishal
> >
> > On Thu, Nov 24, 2016 at 2:41 PM, Raghunandan S <
> > carbondatacontributi...@gmail.com> wrote:
> >
> > > +1
> > > On Thu, 24 Nov 2016 at 2:30 PM, Liang Chen 
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > Ya, good proposal.
> > > > CarbonData 0.x versions integrate with Spark 1.x, and the data load
> > > > solution of 0.x versions uses Kettle.
> > > > CarbonData 1.x versions integrate with Spark 2.x, and the data load
> > > > solution of 1.x versions will not use Kettle.
> > > >
> > > > That would help reduce maintenance cost by distinguishing the major
> > > > versions.
> > > >
> > > > +1 for the proposal.
> > > >
> > > > Regards
> > > > Liang
> > > >
> > > >
> > > > Venkata Gollamudi wrote
> > > > > Hi All,
> > > > >
> > > > > CarbonData 0.2.0 has been a good and stable release, with a lot
> > > > > of defects fixed and a number of performance improvements.
> > > > >
> > > > https://issues.apache.org/jira/browse/CARBONDATA-320?
> > jql=project%20%3D%
> > > 20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%
> > > 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> > > > >
> > > > > Many major new value-added features are planned for the next
> > > > > version, taking CarbonData's capability to the next level.
> > > > > Like
> > > > > - IUD(Insert-Update-Delete) support,
> > > > > - complete rewrite of the data load flow without Kettle,
> > > > > - Spark 2.x support,
> > > > > - Standardize CarbonInputFormat and CarbonOutputFormat,
> > > > > - alluxio(tachyon) file system support,
> > > > > - Carbon thrift format optimization for fast query,
> > > > > - Data loading performance improvement and In memory off heap
> > sorting,
> > > > > - Query performance improvement using off heap,
> > > > > - Support Vectorized batch reader.
> > > > >
> > > > >
> > > > https://issues.apache.org/jira/browse/CARBONDATA-301?
> > jql=project%20%3D%
> > > 20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%
> > > 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> > > > >
> > > > > I think it makes sense to change CarbonData Major version in next
> > > version
> > > > > to 1.0.0.
> > > > > Please comment and vote on this.
> > > > >
> > > > > Thanks,
> > > > > Ramana
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > View this message in context:
> > > > http://apache-carbondata-mailing-list-archive.1130556.
> > > n5.nabble.com/CarbonData-propose-major-version-number-
> > > increment-for-next-version-to-1-0-0-tp3131p3157.html
> > > > Sent from the Apache CarbonData Mailing List archive mailing list
> > archive
> > > > at Nabble.com.
> > > >
> > >
> >
>


Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-24 Thread manish gupta
+1

Regards
Manish Gupta

On Thu, Nov 24, 2016 at 7:30 PM, Kumar Vishal 
wrote:

> +1
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 2:41 PM, Raghunandan S <
> carbondatacontributi...@gmail.com> wrote:
>
> > +1
> > On Thu, 24 Nov 2016 at 2:30 PM, Liang Chen 
> > wrote:
> >
> > > Hi
> > >
> > > Ya, good proposal.
> > > CarbonData 0.x versions integrate with Spark 1.x, and the data load
> > > solution of 0.x versions uses Kettle.
> > > CarbonData 1.x versions integrate with Spark 2.x, and the data load
> > > solution of 1.x versions will not use Kettle.
> > >
> > > That would help reduce maintenance cost by distinguishing the major
> > > versions.
> > >
> > > +1 for the proposal.
> > >
> > > Regards
> > > Liang
> > >
> > >
> > > Venkata Gollamudi wrote
> > > > Hi All,
> > > >
> > > > CarbonData 0.2.0 has been a good and stable release, with a lot of
> > > > defects fixed and a number of performance improvements.
> > > >
> > > https://issues.apache.org/jira/browse/CARBONDATA-320?
> jql=project%20%3D%
> > 20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%
> > 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> > > >
> > > > Many major new value-added features are planned for the next
> > > > version, taking CarbonData's capability to the next level.
> > > > Like
> > > > - IUD(Insert-Update-Delete) support,
> > > > - complete rewrite of the data load flow without Kettle,
> > > > - Spark 2.x support,
> > > > - Standardize CarbonInputFormat and CarbonOutputFormat,
> > > > - alluxio(tachyon) file system support,
> > > > - Carbon thrift format optimization for fast query,
> > > > - Data loading performance improvement and In memory off heap
> sorting,
> > > > - Query performance improvement using off heap,
> > > > - Support Vectorized batch reader.
> > > >
> > > >
> > > https://issues.apache.org/jira/browse/CARBONDATA-301?
> jql=project%20%3D%
> > 20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%
> > 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> > > >
> > > > I think it makes sense to change CarbonData Major version in next
> > version
> > > > to 1.0.0.
> > > > Please comment and vote on this.
> > > >
> > > > Thanks,
> > > > Ramana
> > >
> > >
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > > http://apache-carbondata-mailing-list-archive.1130556.
> > n5.nabble.com/CarbonData-propose-major-version-number-
> > increment-for-next-version-to-1-0-0-tp3131p3157.html
> > > Sent from the Apache CarbonData Mailing List archive mailing list
> archive
> > > at Nabble.com.
> > >
> >
>


[jira] [Created] (CARBONDATA-447) Use Carbon log service instead of spark Logging

2016-11-24 Thread Jacky Li (JIRA)
Jacky Li created CARBONDATA-447:
---

 Summary: Use Carbon log service instead of spark Logging
 Key: CARBONDATA-447
 URL: https://issues.apache.org/jira/browse/CARBONDATA-447
 Project: CarbonData
  Issue Type: Improvement
Reporter: Jacky Li


Use Carbon log service instead of spark Logging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-24 Thread Kumar Vishal
+1

-Regards
Kumar Vishal

On Thu, Nov 24, 2016 at 2:41 PM, Raghunandan S <
carbondatacontributi...@gmail.com> wrote:

> +1
> On Thu, 24 Nov 2016 at 2:30 PM, Liang Chen 
> wrote:
>
> > Hi
> >
> > Ya, good proposal.
> > CarbonData 0.x versions integrate with Spark 1.x, and the data load
> > solution of 0.x versions uses Kettle.
> > CarbonData 1.x versions integrate with Spark 2.x, and the data load
> > solution of 1.x versions will not use Kettle.
> >
> > That would help reduce maintenance cost by distinguishing the major
> > versions.
> >
> > +1 for the proposal.
> >
> > Regards
> > Liang
> >
> >
> > Venkata Gollamudi wrote
> > > Hi All,
> > >
> > > CarbonData 0.2.0 has been a good and stable release, with a lot of
> > > defects fixed and a number of performance improvements.
> > >
> > https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%
> 20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%
> 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> > >
> > > Many major new value-added features are planned for the next version,
> > > taking CarbonData's capability to the next level.
> > > Like
> > > - IUD(Insert-Update-Delete) support,
> > > - complete rewrite of the data load flow without Kettle,
> > > - Spark 2.x support,
> > > - Standardize CarbonInputFormat and CarbonOutputFormat,
> > > - alluxio(tachyon) file system support,
> > > - Carbon thrift format optimization for fast query,
> > > - Data loading performance improvement and In memory off heap sorting,
> > > - Query performance improvement using off heap,
> > > - Support Vectorized batch reader.
> > >
> > >
> > https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%
> 20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%
> 20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> > >
> > > I think it makes sense to change CarbonData Major version in next
> version
> > > to 1.0.0.
> > > Please comment and vote on this.
> > >
> > > Thanks,
> > > Ramana
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-carbondata-mailing-list-archive.1130556.
> n5.nabble.com/CarbonData-propose-major-version-number-
> increment-for-next-version-to-1-0-0-tp3131p3157.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>


Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-24 Thread Xiaoqiao He
Hi Kumar Vishal,

Thanks for your suggestions. As you said, choosing a Trie in place of
HashMap gives a better memory footprint along with good performance. Of
course, DAT is not the only choice; I will run tests on DAT vs Radix Trie
and release the results as soon as possible. Thanks again for your
suggestions.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal 
wrote:

> Hi Xiaoqiao He,
> +1,
> For the forward dictionary case it will be a very good optimisation, as
> our case is very specific: storing a byte-array-to-int mapping [data to
> surrogate key mapping]. I think we will get a much better memory
> footprint, and performance will also be good (2x). We can also try a
> radix tree (radix trie); it is more optimised for storage.
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen 
> wrote:
>
> > Hi xiaoqiao
> >
> > For the below example, 600K dictionary data:
> > Is it to say that using "DAT" can save 36MB of memory against
> > "ConcurrentHashMap", whereas the performance loss is small (1718 ms)?
> >
> > One more question: if the dictionary data size increases, what are the
> > comparison results for "ConcurrentHashMap" vs "DAT"?
> >
> > Regards
> > Liang
> > 
> > --
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards
> > Liang
> >
> > hexiaoqiao wrote
> > > hi Liang,
> > >
> > > Thanks for your reply. I need to correct the experiment result,
> > > because the first (NO.1) column of the result table was in the wrong
> > > order.
> > >
> > > In order to compare the performance of Trie and HashMap, two different
> > > structures were constructed using the same dictionary data, whose size
> > > is 600K and whose items are between 2 and 50 bytes long.
> > >
> > > ConcurrentHashMap (the structure currently used in CarbonData) vs
> > > Double Array Trie (one implementation of Trie structures)
> > >
> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >
> > > b. retrieval performance: total time(ms) of 500 million query:
> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >
> > > Regards,
> > > He Xiaoqiao
> > >
> > >
> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <chenliang6136@...> wrote:
> > >
> > >> Hi xiaoqiao
> > >>
> > >> This improvement looks great!
> > >> Can you please explain the below data, what does it mean?
> > >> --
> > >> ConcurrentHashMap
> > >> ~68MB 14543
> > >> Double Array Trie
> > >> ~104MB 12825
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <xq.he2009@...>:
> > >>
> > >> >  Hi All,
> > >> >
> > >> > I would like to propose a Dictionary improvement which uses a Trie
> > >> > in place of HashMap.
> > >> >
> > >> > In order to speed up aggregation, reduce run-time memory footprint,
> > >> > enable fast distinct count, etc., CarbonData encodes data using a
> > >> > dictionary at file level or table level based on cardinality. This
> > >> > is a general and efficient approach in many big data systems, but
> > >> > with the ConcurrentHashMap currently used to maintain the Dictionary
> > >> > in CarbonData, the memory overhead of the Driver is very large,
> > >> > since it has to load the whole Dictionary to decode actual data
> > >> > values, especially when column cardinality is a large number. Also,
> > >> > by default CarbonData will not build a dictionary if cardinality >
> > >> > 1 million.
> > >> >
> > >> > I propose using Trie in place of HashMap for the following three
> > >> > reasons:
> > >> > (1) a Trie is a natural structure for a Dictionary,
> > >> > (2) it reduces the memory footprint,
> > >> > (3) it does not impact retrieval performance.
> > >> >
> > >> > The experimental results show that Trie is able to meet the
> > >> > requirement.
> > >> > a. ConcurrentHashMap vs Double Array Trie (one implementation of
> > >> > Trie structures)
> > >> > b. Dictionary size: 600K
> > >> > c. Memory footprint and query time:
> > >> > - structure: memory footprint (64-bit JVM) / 500 million query time (ms)
> > >> > - ConcurrentHashMap: ~68MB / 14543
> > >> > - Double Array Trie: ~104MB / 12825
> > >> >
> > >> > Please share your suggestions about the proposed improvement of
> > >> Dictionary.
> > >> >
> > >> > Regards
> > >> > He Xiaoqiao
> > >> >
> > >>
> > >>
> > >>
> > >> --
> > >> Regards
> > >> Liang
> > >>
> >
> >
> >
> >
> >
> > --
> > View this message in context: http://apache-carbondata-
> > mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> > Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> > Dictionary-tp3132p3143.html
> > Sent from the Apache CarbonData Mailing List 

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-24 Thread Xiaoqiao He
Hi Liang,

Generally, yes, because items in the dictionary that share the same prefix
do not need to be repeated in a DAT, and more data gives a better result.

Actually, the cost of a DAT is building the tree, and I don't think we need
to consider it, since this cost is paid only once, when loading data.

FYI.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 2:42 PM, Liang Chen  wrote:

> Hi xiaoqiao
>
> For the below example, 600K dictionary data:
> Is it to say that using "DAT" can save 36MB of memory against
> "ConcurrentHashMap", whereas the performance loss is small (1718 ms)?
>
> One more question: if the dictionary data size increases, what are the
> comparison results for "ConcurrentHashMap" vs "DAT"?
>
> Regards
> Liang
> 
> --
> a. memory footprint (approximate quantity) in 64-bit JVM:
> ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>
> b. retrieval performance: total time(ms) of 500 million query:
> 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > hi Liang,
> >
> > Thanks for your reply. I need to correct the experiment result, because
> > the first (NO.1) column of the result table was in the wrong order.
> >
> > In order to compare the performance of Trie and HashMap, two different
> > structures were constructed using the same dictionary data, whose size
> > is 600K and whose items are between 2 and 50 bytes long.
> >
> > ConcurrentHashMap (the structure currently used in CarbonData) vs
> > Double Array Trie (one implementation of Trie structures)
> >
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards,
> > He Xiaoqiao
> >
> >
> > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <chenliang6136@...> wrote:
> >
> >> Hi xiaoqiao
> >>
> >> This improvement looks great!
> >> Can you please explain the below data, what does it mean?
> >> --
> >> ConcurrentHashMap
> >> ~68MB 14543
> >> Double Array Trie
> >> ~104MB 12825
> >>
> >> Regards
> >> Liang
> >>
> >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <xq.he2009@...>:
> >>
> >> >  Hi All,
> >> >
> >> > I would like to propose a Dictionary improvement which uses a Trie
> >> > in place of HashMap.
> >> >
> >> > In order to speed up aggregation, reduce run-time memory footprint,
> >> > enable fast distinct count, etc., CarbonData encodes data using a
> >> > dictionary at file level or table level based on cardinality. This is
> >> > a general and efficient approach in many big data systems, but with
> >> > the ConcurrentHashMap currently used to maintain the Dictionary in
> >> > CarbonData, the memory overhead of the Driver is very large, since it
> >> > has to load the whole Dictionary to decode actual data values,
> >> > especially when column cardinality is a large number. Also, by
> >> > default CarbonData will not build a dictionary if cardinality > 1
> >> > million.
> >> >
> >> > I propose using Trie in place of HashMap for the following three
> >> > reasons:
> >> > (1) a Trie is a natural structure for a Dictionary,
> >> > (2) it reduces the memory footprint,
> >> > (3) it does not impact retrieval performance.
> >> >
> >> > The experimental results show that Trie is able to meet the
> >> > requirement.
> >> > a. ConcurrentHashMap vs Double Array Trie (one implementation of Trie
> >> > structures)
> >> > b. Dictionary size: 600K
> >> > c. Memory footprint and query time:
> >> > - structure: memory footprint (64-bit JVM) / 500 million query time (ms)
> >> > - ConcurrentHashMap: ~68MB / 14543
> >> > - Double Array Trie: ~104MB / 12825
> >> >
> >> > Please share your suggestions about the proposed improvement of
> >> Dictionary.
> >> >
> >> > Regards
> >> > He Xiaoqiao
> >> >
> >>
> >>
> >>
> >> --
> >> Regards
> >> Liang
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-maili
> ng-list-archive.1130556.n5.nabble.com/Improvement-Use-Trie-
> in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary
> -tp3132p3143.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>
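
Since the thread contrasts a node-based trie with a Double Array Trie, here
is a minimal sketch of the DAT transition step (illustrative only; the
arrays would be produced by a builder that is not shown): transitions are
packed into two parallel int arrays, BASE and CHECK, so a lookup is plain
array arithmetic. This is where the memory saving over a node-object trie
comes from, at the price of the one-time build cost at data-load time noted
above.

    public class DoubleArrayTrieSketch {

        // One transition: next = BASE[state] + inputByte, valid only if
        // CHECK[next] == state; returns -1 when no such transition exists.
        static int step(int[] base, int[] check, int state, int inputByte) {
            int next = base[state] + inputByte;
            if (next < 0 || next >= check.length || check[next] != state) {
                return -1;
            }
            return next;
        }

        // Walk a whole key from the root state; a real DAT would also map
        // accepting states to surrogate keys.
        static int walk(int[] base, int[] check, byte[] key) {
            int state = 0;
            for (byte b : key) {
                state = step(base, check, state, b & 0xFF);
                if (state < 0) {
                    return -1;
                }
            }
            return state;
        }
    }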


Re: [Feature ]Design Document for Update/Delete support in CarbonData

2016-11-24 Thread Kumar Vishal
Hi Aniket,

I think if update/delete touches only a little data, then horizontal
compaction can run based on user configuration; but if more data is getting
updated, then it is better to start vertical compaction immediately. This is
because we are not physically deleting the data from disk: if more data is
getting updated (more than 60%), then during a query we will first query the
older data + exclude the deleted records + include the update delta file
data. So in this case more data will come into memory; we can avoid this by
starting vertical compaction immediately after update/delete.

-Regards
Kumar Vishal

On Thu, Nov 24, 2016 at 2:43 PM, Kumar Vishal 
wrote:

> Hi Aniket,
>
> I agree with Vimal's opinion, but that use case will be very rare.
>
> I have one query about this update and delete feature.
> When will we start compaction after each update or delete operation?
>
> -Regards
> Kumar Vishal
>
>
>
> On Thu, Nov 24, 2016 at 12:05 AM, Aniket Adnaik 
> wrote:
>
>> Hi Vimal,
>>
>> Thanks for your suggestions.
>> For the 1st point, I tend to agree with Manish's comments. But it's worth
>> looking into different ways to optimize the performance.
>> I guess query performance may take priority over update performance.
>> Basically, we may need a better compaction approach to merge delta files
>> into regular carbon files to maintain adequate performance.
>> For the 2nd point, CarbonData will support updating multiple rows, but
>> not the same row multiple times in a single update operation. It is
>> possible that the join condition in the sub-select of the original update
>> statement can result in multiple rows from the source table for the same
>> row in the target table. This is an ambiguous condition, and the common
>> ways to solve it are to error out, to apply the first matching row, or to
>> apply the last matching row. CarbonData will choose to error out and let
>> the user resolve the ambiguity, which is the safer/standard choice.
>>
>> Best Regards,
>> Aniket
>>
>> On Wed, Nov 23, 2016 at 4:54 AM, manish gupta 
>> wrote:
>>
>> > Hi Vimal,
>> >
>> > I have a few queries regarding the 1st suggestion.
>> >
>> > 1. Dimensions can be both dictionary and no-dictionary. If we update
>> > the dictionary file, then we will have to maintain two flows, one for
>> > dictionary columns and one for no-dictionary columns. Will that be ok?
>> >
>> > 2. We write dictionary files in append mode. Updating dictionary files
>> > would mean completely rewriting the dictionary file, which will also
>> > modify the dictionary metadata and sort index file; OR some other
>> > approach needs to be followed, like maintaining an update delta mapping
>> > for the dictionary file.
>> >
>> > Regards
>> > Manish Gupta
>> >
>> > On Wed, Nov 23, 2016 at 10:47 AM, Vimal Das Kammath <
>> > vimaldas.kamm...@gmail.com> wrote:
>> >
>> > > Hi Aniket,
>> > >
>> > > The design looks sound and the documentation is great.
>> > > I have few suggestions.
>> > >
>> > > 1) Measure update vs dimension update: in the case of a dimension
>> > > update, for example, the user wants to change dept1 to dept2 for all
>> > > users who are under dept1. Can we just update the dictionary for
>> > > faster performance?
>> > > 2) Update semantics (one matching record vs multiple matching
>> > > records): I could not understand this section. I wanted to confirm
>> > > whether we will support one update statement updating multiple rows.
>> > >
>> > > -Vimal
>> > >
>> > > On Tue, Nov 22, 2016 at 2:30 PM, Liang Chen 
>> > > wrote:
>> > >
>> > > > Hi  Aniket
>> > > >
>> > > > Thank you for finishing the good design documents. A couple of
>> > > > inputs from my side:
>> > > >
>> > > > 1. Please add the below mentioned info (RowId definition etc.) to
>> > > > the design documents also.
>> > > > 2. On page 6: "Schema change operation can run in parallel with
>> > > > Update or Delete operations, but not with another schema change
>> > > > operation"; can you explain this item?
>> > > > 3. Please unify the description: use "CarbonData" to replace
>> > > > "Carbon", and unify the description of "destination table" and
>> > > > "target table".
>> > > > 4. Is the Update operation's delete delta the same as the Delete
>> > > > operation's delete delta?
>> > > >
>> > > > BTW, it would be much better if you could provide Google Docs for
>> > > > review next time; it is really difficult to comment on PDF
>> > > > documents :)
>> > > >
>> > > > Regards
>> > > > Liang
>> > > >
>> > > > Aniket Adnaik wrote
>> > > > > Hi Sujith,
>> > > > >
>> > > > > Please see my comments inline.
>> > > > >
>> > > > > Best Regards,
>> > > > > Aniket
>> > > > >
>> > > > > On Sun, Nov 20, 2016 at 9:11 PM, sujith chacko <sujithchacko.2010@...> wrote:
>> > > > >
>> > > > >> Hi Aniket,
>> > > > >>
>> > > > >> It's a well documented design; I just want to know a few points,
>> > > > >> like:
>> > > > >>
>> > > > >> a. Format of the RowID and its datatype
>>
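
A minimal sketch of the compaction policy described above (the ~60% figure
comes from Kumar Vishal's message; the names and the configurable threshold
are assumptions, not CarbonData configuration): small updates only merge
delta files (horizontal compaction), while updates touching most of the
data rewrite the base files (vertical compaction), so queries stop paying
for base data + delete delta + update delta.

    public class CompactionPolicySketch {

        enum Compaction { HORIZONTAL, VERTICAL }

        // Decide which compaction to trigger after an update/delete, given
        // how much of the segment the operation touched.
        static Compaction choose(long updatedRows, long totalRows) {
            double updatedFraction = (double) updatedRows / totalRows;
            // More than ~60% updated: rewrite the base files immediately;
            // otherwise just merge the delta files.
            return updatedFraction > 0.60 ? Compaction.VERTICAL
                                          : Compaction.HORIZONTAL;
        }
    }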

Re: [Feature ]Design Document for Update/Delete support in CarbonData

2016-11-24 Thread Kumar Vishal
Hi Aniket,

I agree with Vimal's opinion, but that use case will be very rare.

I have one query about this update and delete feature.
When will we start compaction after each update or delete operation?

-Regards
Kumar Vishal



On Thu, Nov 24, 2016 at 12:05 AM, Aniket Adnaik 
wrote:

> Hi Vimal,
>
> Thanks for your suggestions.
> For the 1st point, I tend to agree with Manish's comments. But it's worth
> looking into different ways to optimize the performance.
> I guess query performance may take priority over update performance.
> Basically, we may need a better compaction approach to merge delta files
> into regular carbon files to maintain adequate performance.
> For the 2nd point, CarbonData will support updating multiple rows, but not
> the same row multiple times in a single update operation. It is possible
> that the join condition in the sub-select of the original update statement
> can result in multiple rows from the source table for the same row in the
> target table. This is an ambiguous condition, and the common ways to solve
> it are to error out, to apply the first matching row, or to apply the last
> matching row. CarbonData will choose to error out and let the user resolve
> the ambiguity, which is the safer/standard choice.
>
> Best Regards,
> Aniket
>
> On Wed, Nov 23, 2016 at 4:54 AM, manish gupta 
> wrote:
>
> > Hi Vimal,
> >
> > I have a few queries regarding the 1st suggestion.
> >
> > 1. Dimensions can be both dictionary and no-dictionary. If we update
> > the dictionary file, then we will have to maintain two flows, one for
> > dictionary columns and one for no-dictionary columns. Will that be ok?
> >
> > 2. We write dictionary files in append mode. Updating dictionary files
> > would mean completely rewriting the dictionary file, which will also
> > modify the dictionary metadata and sort index file; OR some other
> > approach needs to be followed, like maintaining an update delta mapping
> > for the dictionary file.
> >
> > Regards
> > Manish Gupta
> >
> > On Wed, Nov 23, 2016 at 10:47 AM, Vimal Das Kammath <
> > vimaldas.kamm...@gmail.com> wrote:
> >
> > > Hi Aniket,
> > >
> > > The design looks sound and the documentation is great.
> > > I have few suggestions.
> > >
> > > 1) Measure update vs dimension update: in the case of a dimension
> > > update, for example, the user wants to change dept1 to dept2 for all
> > > users who are under dept1. Can we just update the dictionary for
> > > faster performance?
> > > 2) Update semantics (one matching record vs multiple matching
> > > records): I could not understand this section. I wanted to confirm
> > > whether we will support one update statement updating multiple rows.
> > >
> > > -Vimal
> > >
> > > On Tue, Nov 22, 2016 at 2:30 PM, Liang Chen 
> > > wrote:
> > >
> > > > Hi  Aniket
> > > >
> > > > Thank you for finishing the good design documents. A couple of
> > > > inputs from my side:
> > > >
> > > > 1. Please add the below mentioned info (RowId definition etc.) to
> > > > the design documents also.
> > > > 2. On page 6: "Schema change operation can run in parallel with
> > > > Update or Delete operations, but not with another schema change
> > > > operation"; can you explain this item?
> > > > 3. Please unify the description: use "CarbonData" to replace
> > > > "Carbon", and unify the description of "destination table" and
> > > > "target table".
> > > > 4. Is the Update operation's delete delta the same as the Delete
> > > > operation's delete delta?
> > > >
> > > > BTW, it would be much better if you could provide Google Docs for
> > > > review next time; it is really difficult to comment on PDF
> > > > documents :)
> > > >
> > > > Regards
> > > > Liang
> > > >
> > > > Aniket Adnaik wrote
> > > > > Hi Sujith,
> > > > >
> > > > > Please see my comments inline.
> > > > >
> > > > > Best Regards,
> > > > > Aniket
> > > > >
> > > > > On Sun, Nov 20, 2016 at 9:11 PM, sujith chacko <sujithchacko.2010@...> wrote:
> > > > >
> > > > >> Hi Aniket,
> > > > >>
> > > > >> It's a well documented design; I just want to know a few points,
> > > > >> like:
> > > > >>
> > > > >> a. Format of the RowID and its datatype
> > > > >>
> > > > >  AA>> The following format can be used to represent a unique
> > > > > RowID:
> > > > >
> > > > >  [ ... ]
> > > > >  A simple way would be to use the String data type and store it
> > > > > as a text file. However, a more efficient way could be to use
> > > > > BitSets/Bitmaps as a further optimization. Compressed bitmaps such
> > > > > as Roaring bitmaps can be used for better performance and
> > > > > efficient storage.
> > > > >
> > > > > b. Impact of this feature on select queries: since every query
> > > > > now has to exclude each deleted record and include the
> > > > > corresponding updated record, is any optimization considered for
> > > > > tackling the query performance issue, since one of the major highligh

Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-24 Thread Raghunandan S
+1
On Thu, 24 Nov 2016 at 2:30 PM, Liang Chen  wrote:

> Hi
>
> Ya, good proposal.
> CarbonData 0.x versions integrate with Spark 1.x, and the data load
> solution of 0.x versions uses Kettle.
> CarbonData 1.x versions integrate with Spark 2.x, and the data load
> solution of 1.x versions will not use Kettle.
>
> That would help reduce maintenance cost by distinguishing the major
> versions.
>
> +1 for the proposal.
>
> Regards
> Liang
>
>
> Venkata Gollamudi wrote
> > Hi All,
> >
> > CarbonData 0.2.0 has been a good and stable release, with a lot of
> > defects fixed and a number of performance improvements.
> >
> https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> >
> > Many major new value-added features are planned for the next version,
> > taking CarbonData's capability to the next level.
> > Like
> > - IUD(Insert-Update-Delete) support,
> > - complete rewrite of the data load flow without Kettle,
> > - Spark 2.x support,
> > - Standardize CarbonInputFormat and CarbonOutputFormat,
> > - alluxio(tachyon) file system support,
> > - Carbon thrift format optimization for fast query,
> > - Data loading performance improvement and In memory off heap sorting,
> > - Query performance improvement using off heap,
> > - Support Vectorized batch reader.
> >
> >
> https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> >
> > I think it makes sense to change CarbonData Major version in next version
> > to 1.0.0.
> > Please comment and vote on this.
> >
> > Thanks,
> > Ramana
>
>
>
>
>
> --
> View this message in context:
> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/CarbonData-propose-major-version-number-increment-for-next-version-to-1-0-0-tp3131p3157.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>


Re: CarbonData propose major version number increment for next version (to 1.0.0)

2016-11-24 Thread Liang Chen
Hi

Ya, good proposal.
CarbonData 0.x versions integrate with Spark 1.x, and the data load solution
of 0.x versions uses Kettle.
CarbonData 1.x versions integrate with Spark 2.x, and the data load solution
of 1.x versions will not use Kettle.

That would help reduce maintenance cost by distinguishing the major
versions.

+1 for the proposal.

Regards
Liang


Venkata Gollamudi wrote
> Hi All,
> 
> CarbonData 0.2.0 has been a good and stable release, with a lot of
> defects fixed and a number of performance improvements.
> https://issues.apache.org/jira/browse/CARBONDATA-320?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.2.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> 
> Many major new value-added features are planned for the next version,
> taking CarbonData's capability to the next level.
> Like
> - IUD(Insert-Update-Delete) support,
> - complete rewrite of the data load flow without Kettle,
> - Spark 2.x support,
> - Standardize CarbonInputFormat and CarbonOutputFormat,
> - alluxio(tachyon) file system support,
> - Carbon thrift format optimization for fast query,
> - Data loading performance improvement and In memory off heap sorting,
> - Query performance improvement using off heap,
> - Support Vectorized batch reader.
> 
> https://issues.apache.org/jira/browse/CARBONDATA-301?jql=project%20%3D%20CARBONDATA%20AND%20fixVersion%20%3D%200.3.0-incubating%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
> 
> I think it makes sense to change CarbonData Major version in next version
> to 1.0.0.
> Please comment and vote on this.
> 
> Thanks,
> Ramana





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/CarbonData-propose-major-version-number-increment-for-next-version-to-1-0-0-tp3131p3157.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.


Re: install carbon question

2016-11-24 Thread Liang Chen
Hi

No, it is not a typo; they are two different configurations.

Regards
Liang


Cao Lu 曹鲁 wrote
> Hi Carbondata team,
> I’m following the guidance below to install Carbondata on yarn cluster:
> https://cwiki.apache.org/confluence/display/CARBONDATA/Cluster+deployment+guide
> 
> I see that #2 and #3 both say to copy carbon.properties.template, yet you
> mention carbonplugins in #2.
> Is that a typo, and should step 2 be to copy carbonplugins under
> /processing/carbonplugins to /carbonlib/ ?
> 
> Installing and Configuring Carbon on "Spark on YARN" Cluster
> 
> This section provides the procedure to install Carbon on "Spark on YARN"
> cluster.
> 
> Prerequisite
> 
> ・ Hadoop HDFS and Yarn should be installed and running.
> 
> ・ Spark should be installed and running in all the clients.
> 
> ・ CarbonData user should have permission to access HDFS.
> 
> Procedure
> 
> Follow the below steps only for Driver nodes (Driver nodes are the ones
> which start the Spark context).
> 
> 1. Build the CarbonData project and get the assembly jar from
> "./assembly/target/scala-2.10/carbondata_xxx.jar" and put it in the
> "/carbonlib" folder.
>
>  (Note: create the carbonlib folder if it does not exist inside the ""
> path.)
>
> 2. Copy the carbon.properties.template to "/conf/carbon.properties"
> folder from "./conf/" of the CarbonData repository.
>
> carbonplugins will contain the .kettle folder.
>
> 3. Copy the "carbon.properties.template" to "/conf/carbon.properties"
> folder from the conf folder of the carbondata repository.
>
> 4. Modify the parameters in "spark-default.conf" located in "/conf".
> 
> Thanks,
> Lu





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/install-carbon-question-tp3151p3156.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.


Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-24 Thread Kumar Vishal
Hi Xiaoqiao He,
+1,
For the forward dictionary case it will be a very good optimisation, as our
case is very specific: storing a byte-array-to-int mapping [data to
surrogate key mapping]. I think we will get a much better memory footprint,
and performance will also be good (2x). We can also try a radix tree (radix
trie); it is more optimised for storage.

-Regards
Kumar Vishal

On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen 
wrote:

> Hi xiaoqiao
>
> For the below example, 600K dictionary data:
> Is it to say that using "DAT" can save 36MB of memory against
> "ConcurrentHashMap", whereas the performance loss is small (1718 ms)?
>
> One more question: if the dictionary data size increases, what are the
> comparison results for "ConcurrentHashMap" vs "DAT"?
>
> Regards
> Liang
> 
> --
> a. memory footprint (approximate quantity) in 64-bit JVM:
> ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>
> b. retrieval performance: total time(ms) of 500 million query:
> 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > hi Liang,
> >
> > Thanks for your reply. I need to correct the experiment result, because
> > the first (NO.1) column of the result table was in the wrong order.
> >
> > In order to compare the performance of Trie and HashMap, two different
> > structures were constructed using the same dictionary data, whose size
> > is 600K and whose items are between 2 and 50 bytes long.
> >
> > ConcurrentHashMap (the structure currently used in CarbonData) vs
> > Double Array Trie (one implementation of Trie structures)
> >
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards,
> > He Xiaoqiao
> >
> >
> > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <chenliang6136@...> wrote:
> >
> >> Hi xiaoqiao
> >>
> >> This improvement looks great!
> >> Can you please explain the below data, what does it mean?
> >> --
> >> ConcurrentHashMap
> >> ~68MB 14543
> >> Double Array Trie
> >> ~104MB 12825
> >>
> >> Regards
> >> Liang
> >>
> >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <xq.he2009@...>:
> >>
> >> >  Hi All,
> >> >
> >> > I would like to propose a Dictionary improvement which uses a Trie
> >> > in place of HashMap.
> >> >
> >> > In order to speed up aggregation, reduce run-time memory footprint,
> >> > enable fast distinct count, etc., CarbonData encodes data using a
> >> > dictionary at file level or table level based on cardinality. This is
> >> > a general and efficient approach in many big data systems, but with
> >> > the ConcurrentHashMap currently used to maintain the Dictionary in
> >> > CarbonData, the memory overhead of the Driver is very large, since it
> >> > has to load the whole Dictionary to decode actual data values,
> >> > especially when column cardinality is a large number. Also, by
> >> > default CarbonData will not build a dictionary if cardinality > 1
> >> > million.
> >> >
> >> > I propose using Trie in place of HashMap for the following three
> >> > reasons:
> >> > (1) a Trie is a natural structure for a Dictionary,
> >> > (2) it reduces the memory footprint,
> >> > (3) it does not impact retrieval performance.
> >> >
> >> > The experimental results show that Trie is able to meet the
> >> > requirement.
> >> > a. ConcurrentHashMap vs Double Array Trie (one implementation of Trie
> >> > structures)
> >> > b. Dictionary size: 600K
> >> > c. Memory footprint and query time:
> >> > - structure: memory footprint (64-bit JVM) / 500 million query time (ms)
> >> > - ConcurrentHashMap: ~68MB / 14543
> >> > - Double Array Trie: ~104MB / 12825
> >> >
> >> > Please share your suggestions about the proposed improvement of
> >> Dictionary.
> >> >
> >> > Regards
> >> > He Xiaoqiao
> >> >
> >>
> >>
> >>
> >> --
> >> Regards
> >> Liang
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/Improvement-Use-
> Trie-in-place-of-HashMap-to-reduce-memory-footprint-of-
> Dictionary-tp3132p3143.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>


[jira] [Created] (CARBONDATA-446) Add Unit Tests For Scan.collector.impl package

2016-11-24 Thread SWATI RAO (JIRA)
SWATI RAO created CARBONDATA-446:


 Summary: Add Unit Tests For Scan.collector.impl package
 Key: CARBONDATA-446
 URL: https://issues.apache.org/jira/browse/CARBONDATA-446
 Project: CarbonData
  Issue Type: Test
Reporter: SWATI RAO
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


install carbon question

2016-11-24 Thread Cao Lu 曹鲁
Hi Carbondata team,
I’m following the guidance below to install Carbondata on yarn cluster:
https://cwiki.apache.org/confluence/display/CARBONDATA/Cluster+deployment+guide

I see that #2 and #3 both say to copy carbon.properties.template, yet you
mention carbonplugins in #2.
Is that a typo, and should step 2 be to copy carbonplugins under
/processing/carbonplugins to /carbonlib/ ?

Installing and Configuring Carbon on "Spark on YARN" Cluster

This section provides the procedure to install Carbon on "Spark on YARN" 
cluster.

Prerequisite

・ Hadoop HDFS and Yarn should be installed and running.

・ Spark should be installed and running in all the clients.

・ CarbonData user should have permission to access HDFS.

Procedure

Follow the below steps only for Driver nodes (Driver nodes are the ones
which start the Spark context).

1. Build the CarbonData project and get the assembly jar from
"./assembly/target/scala-2.10/carbondata_xxx.jar" and put it in the
"/carbonlib" folder.

 (Note: create the carbonlib folder if it does not exist inside the ""
path.)

2. Copy the carbon.properties.template to "/conf/carbon.properties" folder
from "./conf/" of the CarbonData repository.

carbonplugins will contain the .kettle folder.

3. Copy the "carbon.properties.template" to "/conf/carbon.properties"
folder from the conf folder of the carbondata repository.

4. Modify the parameters in "spark-default.conf" located in "/conf".

Thanks,
Lu