Re: [ANNOUNCE] Ravindra as new Apache CarbonData PMC

2017-05-21 Thread manish gupta
Congratulations Ravindra..:)

Regards
Manish Gupta

On Sat, May 20, 2017 at 2:28 PM, Naresh P R <prnaresh.nar...@gmail.com>
wrote:

> Congrats Ravindra.
> ---
> Regards,
> Naresh P R
>
> On May 19, 2017 4:56 PM, "Liang Chen" <chenliang...@apache.org> wrote:
>
> > Hi all
> >
> > We are pleased to announce that the PMC has invited Ravindra as new
> Apache
> > CarbonData PMC member, and the invite has been accepted !
> >
> > Congrats to Ravindra and welcome aboard.
> >
> > Thanks
> > The Apache CarbonData team
> >
>


Re: [DISCUSSION] In 1.2.0, use Spark 2.1 and Hadoop 2.7.2 as default compilation in pom.

2017-06-15 Thread manish gupta
+1
Better to use Spark 2.1 and Hadoop 2.7.2 as default compilation.

Regards
Manish Gupta

On Fri, Jun 16, 2017 at 9:28 AM, xm_zzc <441586...@qq.com> wrote:

> +1, using Spark 2.1 and Hadoop 2.7.2 as default compilation.
>
>
>
> --
> View this message in context: http://apache-carbondata-dev-
> mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-In-1-
> 2-0-use-Spark-2-1-and-Hadoop-2-7-2-as-default-compilation-
> in-pom-tp15278p15280.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list
> archive at Nabble.com.
>


Re: [VOTE] Apache CarbonData 1.2.0(RC3) release

2017-09-28 Thread manish gupta
+1

Regards
Manish Gupta

On Thu, Sep 28, 2017 at 10:21 AM, Gururaj Shetty <sgururajshe...@gmail.com>
wrote:

> +1
>
> Regards,
> Gururaj
>
> On Thu, Sep 28, 2017 at 5:34 AM, jarray <jarray...@163.com> wrote:
>
> > +1
> >
> >
> >
> >
> >
> >
> > On 09/26/2017 00:25, Kumar Vishal wrote:
> > +1
> > -Regards
> > Kumar Vishal
> >
> >
> >
> >
> > > On 25-Sep-2017, at 20:08, Lu Cao <whuca...@gmail.com> wrote:
> > >
> > > +1
> > >
> > > Regards,
> > > Lionel
> > >
> > >> On Mon, Sep 25, 2017 at 8:26 PM, Jacky Li <jacky.li...@qq.com> wrote:
> > >>
> > >> +1
> > >>
> > >> Regards,
> > >> Jacky
> > >>
> > >>> 在 2017年9月25日,下午6:51,Venkata Gollamudi <g.ramana...@gmail.com> 写道:
> > >>>
> > >>> +1
> > >>>
> > >>> Regards,
> > >>> Venkata Ramana G
> > >>>
> > >>> On Mon, Sep 25, 2017 at 7:32 AM, David CaiQiang <
> david.c...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>>
> > >>>> +1 Release this package as Apache CarbonData 1.2.0
> > >>>>
> > >>>> 1. Release
> > >>>> There are important new features and the integration of new platform
> > >>>>
> > >>>> 2. The tag
> > >>>> " mvn clean -DskipTests -Pspark-2.1 -Pbuild-with-format package"
> > passed
> > >>>> "mvn clean -DskipTests -Pspark-2.1 -Pbuild-with-format install"
> passed
> > >>>>
> > >>>> 3. The artifacts
> > >>>> both md5sum and sha512 are correct
> > >>>>
> > >>>>
> > >>>>
> > >>>> -
> > >>>> Best Regards
> > >>>> David Cai
> > >>>> --
> > >>>> Sent from: http://apache-carbondata-dev-
> mailing-list-archive.1130556.
> > >>>> n5.nabble.com/
> > >>>>
> > >>
> > >>
> > >>
> > >>
> >
>


Re: [DISCUSSION] Support only spark 2 in carbon 1.3.0

2017-10-09 Thread manish gupta
+1

Regards
Manish Gupta

On Mon, 9 Oct 2017 at 5:11 PM, Suprith T Jain <t.supr...@gmail.com> wrote:

> +1
>
> On 09-Oct-2017 7:27 AM, "Lu Cao" <whuca...@gmail.com> wrote:
>
> > Hi community,
> > Currently we have three Spark related modules in carbondata (Spark 1.5,
> > 1.6, 2.1). The project has become more and more difficult to maintain and
> > has a lot of redundant code.
> > I propose to stop supporting Spark 1.5 and 1.6 and focus on Spark 2.1 (2.2).
> > That will keep the project clean and simple to maintain.
> > Maybe we can provide some key patches to the old versions, but new features
> > would support Spark 2 only.
> > Any ideas?
> >
> >
> > Thanks & Regards,
> > Lionel Cao
> >
>


Re: [Discussion] Carbon Local Dictionary Support

2018-06-04 Thread manish gupta
+1

It is a good feature to have. Once the design document is uploaded we will
get a better idea of how it will be implemented.

Regards
Manish Gupta

On Tue, Jun 5, 2018 at 11:18 AM, Kumar Vishal 
wrote:

> Hi Xuchuanyin,
>
> I am working on design document, and all the points you have mentioned I
> have already captured. I will share once it is finished.
>
> -Regards
> Kumar Vishal
>
> On Tue, Jun 5, 2018 at 9:22 AM, xuchuanyin  wrote:
>
> > Hi, Kumar:
> >   Local dictionary will be a nice feature, and other formats like parquet
> > also support this.
> >
> >   My concern is that: How will you implement this feature?
> >
> >   1. What's the scope of the `local`? Page level (for all containing
> rows),
> > Blocklet level (for all containing pages), Block level(for all containing
> > blocklets)?
> >
> >   2. Where will you store the local dictionary?
> >
> >   3. How do you decide to enable the local dictionary for a column?
> >
> >   4. Have you considered falling back to plain encoding if the local
> > dictionary encoding consumes more space?
> >
> >   5. Will you still work on V3 format or start a new V4 (or v3.1)
> version?
> >
> >   Anyway, I'm concerned about the data loading performance. Please pay
> > attention to it while you are implementing this feature.
> >
> >
> >
> > --
> > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.
> > n5.nabble.com/
> >
>


Re: [Discussion] Carbon Local Dictionary Support

2018-06-07 Thread manish gupta
Hi Vishal,

Thanks for uploading the design document. The document is good and gives a
detailed picture of the requirement.

I have a few questions and suggestions. Kindly consider them if applicable.

1. Will the local dictionary be read once and put into offheap/onheap
memory, or will it be read for every query?

2. Will the columnCardinality integer array now contain the actual
cardinality for no-dictionary columns in the block footer or in any other
metadata?
If not, then we can store it, as it is one of the statistics that can help
in deciding pushdown for LIKE queries on no-dictionary columns.

3. Apart from the default threshold we can also define a max threshold for
the local dictionary (let's say 1 lakh). If the user configures a value greater
than the max allowed threshold then we can use the max and continue.

Regards
Manish Gupta

On Wed, Jun 6, 2018 at 6:54 PM, Kumar Vishal 
wrote:

> Hi Xuchuanyin,
>
> Please find the JIRA link for local dictionary support.
>
> https://issues.apache.org/jira/browse/CARBONDATA-2584
>
> -Regards
> Kumar Vishal
>
> On Wed, Jun 6, 2018 at 6:25 PM, xuchuanyin  wrote:
>
> > Hi, Kumar:
> >   Can you raise a Jira and provide the document as attachment? I cannot
> > open the links since it is blocked.
>


Re: [Discussion] Blocklet DataMap caching in driver

2018-06-23 Thread manish gupta
Thanks for the feedback Jacky.

As of now we have min/max at each block and blocklet level, and while
loading the metadata cache we compute the task-level min/max. Segment-level
min/max is not considered as of now, but this solution can surely be
enhanced to consider it.
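
A rough sketch of what the task-level computation amounts to (hypothetical
class and method names; values are compared as unsigned bytes, the ordering
used for pruning):

import java.util.List;

final class MinMaxMerger {
  // Merge the per-blocklet minimum values into a task-level minimum per column.
  static byte[][] mergeMin(List<byte[][]> blockletMins, int columnCount) {
    byte[][] taskMin = new byte[columnCount][];
    for (byte[][] blockletMin : blockletMins) {
      for (int col = 0; col < columnCount; col++) {
        if (taskMin[col] == null || compareUnsigned(blockletMin[col], taskMin[col]) < 0) {
          taskMin[col] = blockletMin[col];
        }
      }
    }
    return taskMin;
  }

  // mergeMax is symmetric: keep the larger value instead of the smaller one.
  private static int compareUnsigned(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }
}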

We can discuss this further in detail and decide whether to consider it now
or enhance it in the near future.

Regards
Manish Gupta

On Fri, Jun 22, 2018 at 8:34 PM, Jacky Li  wrote:

> Hi Manish,
>
> +1 for solution 1 for next carbon version. Solution 2 should be also
> considered, but for a future version after next version.
>
> In my observation, in many scenarios users will filter on a time range,
> and since Carbon’s segments are created per incremental load they are
> normally related to time. So if we can have min/max for sort_columns at
> segment level, I think it will further help keep the driver index minimal.
> Will you also consider this?
>
> Regards,
> Jacky
>
>
> > 在 2018年6月21日,下午5:24,manish gupta  写道:
> >
> > Hi Dev,
> >
> > The current implementation of Blocklet dataMap caching in driver is that
> it
> > caches the min and max values of all the columns in schema by default.
> >
> > The problem with this implementation is that as the number of loads
> > increases the memory required to hold min and max values also increases
> > considerably. We know that in most of the scenarios there is a single
> > driver and memory configured for driver is less as compared to executor.
> > With a continuous increase in memory requirement the driver can even go out
> > of memory, which makes the situation even worse.
> >
> > *Proposed Solution to solve the above problem:*
> >
> > Carbondata uses min and max values for blocklet level pruning. It might
> not
> > be necessary that user has filter on all the columns specified in the
> > schema instead it could be only few columns that has filter applied on
> them
> > in the query.
> >
> > 1. We provide user an option to cache the min and max values of only the
> > required columns. Caching only the required columns can optimize the
> > blocklet dataMap memory usage as well as solve the driver memory problem
> to
> > a greater extent.
> >
> > 2. Using an external storage/DB to cache min and max values. We can also
> > implement a solution to create a table in the external DB and store min
> and
> > max values for all the columns in that table. This will not use any
> driver
> > memory and hence the driver memory usage will be optimized further as
> > compared to solution 1.
> >
> > *Solution 1* will not have any performance impact as the user will cache
> > the required filter columns and it will not have any external dependency
> > for query execution.
> > *Solution 2* will degrade the query performance as it will involve
> querying
> > for min and max values from external DB required for Blocklet pruning.
> >
> > *So from my point of view we should go with solution 1 and in near future
> > propose a design for solution 2. User can have an option to select
> between
> > the 2 options*. Kindly share your suggestions.
> >
> > Regards
> > Manish Gupta
>
>
>
>


Re: Initiating Apache CarbonData-1.3.0 Release

2017-12-24 Thread manish gupta
+1

Regards
Manish Gupta

On Sun, 24 Dec 2017 at 4:56 PM, Kumar Vishal <kumarvishal1...@gmail.com>
wrote:

> +1
> -Regards
> Kumar Vishal
> Sent from my iPhone
>
> > On 24-Dec-2017, at 16:29, Jacky Li <jacky.li...@qq.com> wrote:
> >
> > +1
> >
> >> 在 2017年12月24日,上午1:56,Ravindra Pesala <ravi.pes...@gmail.com> 写道:
> >>
> >> Hi All,
> >>
> >> We are initiating CarbonData 1.3.0 release so no new features are
> allowed
> >> to commit on master branch till the release is done. We will stabilize
> the
> >> code and only defect fixes are allowed to commit.  Please let us know if
> >> any urgent features need to be merged into 1.3.0 version so that we will
> >> plan accordingly.
> >>
> >> Major features done in the CarbonData 1.3.0 release:
> >> 1. Supported Streaming in CarbonData
> >> 2. Supported Spark 2.2 version in Carbon.
> >> 3. Added Pre-aggregation support to carbon.
> >> 4. Supported standard hive type of partitioning in carbon.
> >> 5. Added CTAS support in Carbon
> >>
> >> --
> >> Thanks & Regards,
> >> Ravindra.
> >
>


[Discussion] Blocklet DataMap caching in driver

2018-06-21 Thread manish gupta
Hi Dev,

The current implementation of Blocklet dataMap caching in the driver is that
it caches the min and max values of all the columns in the schema by default.

The problem with this implementation is that as the number of loads
increases, the memory required to hold the min and max values also increases
considerably. We know that in most scenarios there is a single driver and the
memory configured for the driver is small compared to the executors. With a
continuous increase in memory requirement the driver can even go out of
memory, which makes the situation even worse.

*Proposed Solution to solve the above problem:*

Carbondata uses min and max values for blocklet level pruning. It might not
be necessary that the user has filters on all the columns specified in the
schema; instead it could be only a few columns that have filters applied on
them in the query.

1. We provide the user an option to cache the min and max values of only the
required columns. Caching only the required columns can optimize the
blocklet dataMap memory usage as well as solve the driver memory problem to
a great extent (a minimal sketch of this option follows the two options).

2. Using an external storage/DB to cache min and max values. We can also
implement a solution to create a table in the external DB and store min and
max values for all the columns in that table. This will not use any driver
memory and hence the driver memory usage will be optimized further as
compared to solution 1.
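
A minimal sketch of what solution 1 implies for driver-side pruning
(hypothetical names; the set of cached columns is assumed to come from a
user-specified option):

import java.util.Map;
import java.util.Set;

final class BlockletPruner {
  private final Set<String> cachedColumns;          // columns the user chose to cache
  private final Map<String, byte[]> blockletMin;    // min kept only for cachedColumns
  private final Map<String, byte[]> blockletMax;    // max kept only for cachedColumns

  BlockletPruner(Set<String> cachedColumns, Map<String, byte[]> min, Map<String, byte[]> max) {
    this.cachedColumns = cachedColumns;
    this.blockletMin = min;
    this.blockletMax = max;
  }

  // Equality filter: prune only when min/max is cached and the value falls outside the range.
  boolean isScanRequired(String filterColumn, byte[] filterValue) {
    if (!cachedColumns.contains(filterColumn)) {
      return true; // no min/max cached for this column, so the blocklet must be scanned
    }
    return compareUnsigned(filterValue, blockletMin.get(filterColumn)) >= 0
        && compareUnsigned(filterValue, blockletMax.get(filterColumn)) <= 0;
  }

  private static int compareUnsigned(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      int d = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (d != 0) {
        return d;
      }
    }
    return a.length - b.length;
  }
}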

*Solution 1* will not have any performance impact as the user will cache
the required filter columns and it will not have any external dependency
for query execution.
*Solution 2* will degrade the query performance as it will involve querying
for min and max values from external DB required for Blocklet pruning.

*So from my point of view we should go with solution 1 and in the near future
propose a design for solution 2. The user can have an option to select between
the two*. Kindly share your suggestions.

Regards
Manish Gupta


Re: [VOTE] Apache CarbonData 1.4.1(RC2) release

2018-08-13 Thread manish gupta
+1

Regards
Manish Gupta

On Mon, 13 Aug 2018 at 8:29 PM, Kumar Vishal 
wrote:

> +1
> Regards
> Kumar Vishal
>
> On Fri, 10 Aug 2018 at 08:14, Ravindra Pesala 
> wrote:
>
> > Hi
> >
> >
> > I submit the Apache CarbonData 1.4.1 (RC2) for your vote.
> >
> >
> > 1.Release Notes:
> >
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12343148
> >
> > Some key features and improvements in this release:
> >
> >1. Supported Local dictionary to improve IO and query performance.
> >2. Improved and stabilized Bloom filter datamap.
> >3. Supported left outer join MV datamap(Alpha feature)
> >4. Supported driver min max caching for specified columns and
> >segregate block and blocklet cache.
> >5. Support Flat folder structure in carbon to maintain the same folder
> >structure as Hive.
> >6. Supported S3 read and write on carbondata files
> >7. Support projection push down for struct data type.
> >8. Improved complex datatypes compression and performance through
> >adaptive encoding.
> >9. Many Bug fixes and stabilized carbondata.
> >
> >
> >  2. The tag to be voted upon : apache-carbondata-1.4.1.rc2(commit:
> > a17db2439aa51f6db7da293215f9732ffb200bd9)
> >
> >
> >
> https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.4.1-rc2
> >
> >
> > 3. The artifacts to be voted on are located here:
> >
> > https://dist.apache.org/repos/dist/dev/carbondata/1.4.1-rc2/
> >
> >
> > 4. A staged Maven repository is available for review at:
> >
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1032
> >
> >
> > 5. Release artifacts are signed with the following key:
> >
> > *https://people.apache.org/keys/committer/ravipesala.asc
> > <
> >
> https://link.getmailspring.com/link/1524823736.local-38e60b2f-d8f4-v1.2.1-7e744...@getmailspring.com/9?redirect=https%3A%2F%2Fpeople.apache.org%2Fkeys%2Fcommitter%2Fravipesala.asc=ZGV2QGNhcmJvbmRhdGEuYXBhY2hlLm9yZw%3D%3D
> > >*
> >
> >
> > Please vote on releasing this package as Apache CarbonData 1.4.1,  The
> vote
> >
> > will be open for the next 72 hours and passes if a majority of
> >
> > at least three +1 PMC votes are cast.
> >
> >
> > [ ] +1 Release this package as Apache CarbonData 1.4.1
> >
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> >
> > [ ] -1 Do not release this package because...
> >
> >
> > Regards,
> > Ravindra.
> >
>


Re: [SUGGESTION]Support Decoder based fallback mechanism in local dictionary

2018-08-27 Thread manish gupta
+1
@Akash: I suggest not exposing any property to the user for this right now. The
design should support taking this decision based on a property, but the decision
on whether to expose it to the end user can be taken once you complete your
performance testing.

Regards
Manish Gupta

On Mon, 27 Aug 2018 at 1:57 PM, Kumar Vishal 
wrote:

> +1
> @ xuchuanyin
> This will not impact data map writing flow as actual column page will be
> cleared only after consuming all the records by data map writer,
> there will not be any change in that area.
>
> -Regards
> Kumar Vishal
> ,
>
> On Mon, Aug 27, 2018 at 1:01 PM xuchuanyin  wrote:
>
> > This means, no need to keep the actual data along with encoded data in
> > encoded column page.
> > ---
> > A problem is that, currently index datamap needs the actual data to
> > generate
> > index. You may affect this procedure if you do not keep the actual data.
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [DISCUSSION] Implement file-level Min/Max index for streaming segment

2018-08-27 Thread manish gupta
+1

Regards
Manish Gupta

On Mon, 27 Aug 2018 at 9:47 AM, Kumar Vishal 
wrote:

> +1
> Regards
> Kumar Vishal
>
> On Mon, 27 Aug 2018 at 07:15, xm_zzc <441586...@qq.com> wrote:
>
> > +1.
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: The size of the tablestatus file is getting larger, does it impact the performance of reading this file?

2018-03-14 Thread manish gupta
I think maintaining a tablestatus backlog file is a good idea. This will
also help us quickly filter valid segments as the number of segments
increases during query execution, which involves reading the table status
file.

The SHOW SEGMENTS DDL can read both files to display the output.
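
A minimal sketch of the idea (hypothetical helper; the entry format and the
'tablestatus.history' name follow the proposal quoted below, not an existing API):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

final class TableStatusHistory {
  // Called during 'CLEAN FILES FOR TABLE': keep only visible segments in
  // tablestatus and append the invisible ones to tablestatus.history.
  static void separateSegments(Path tableStatus, Path history,
                               List<String> visibleEntries,
                               List<String> invisibleEntries) throws IOException {
    Files.write(history, invisibleEntries, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    Files.write(tableStatus, visibleEntries, StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    // SHOW SEGMENTS can read both files when the full (visible + invisible) list is needed.
  }
}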

Regards
Manish Gupta

On Thu, 15 Mar 2018 at 10:19 AM, xm_zzc <441586...@qq.com> wrote:

> Hi Jacky, Raghunandan S:
>   Thanks for your reply.
>   Currently I am working on PR2045; this PR will automatically delete the
> segment lock files when the method
> 'SegmentStatusManager.deleteLoadsAndUpdateMetadata' is executed, and it will
> scan the 'tablestatus' file to decide which segment lock files need to be
> deleted. Ravindra Pesala is concerned about the performance of reading the
> tablestatus file as its size gets larger. So I want to know whether we can
> reduce the size of the tablestatus file.
>   According to Raghunandan S's suggestion, I think we can *append* the
> invisible segment list to a file called 'tablestatus.history' each time the
> command 'CLEAN FILES FOR TABLE' is executed, separating visible and invisible
> segments into two files. If later we need to support listing all segments
> (both visible and invisible) when executing 'SHOW SEGMENTS FOR TABLE', we
> just need to read from both files. Is it OK to do so?
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: CarbonData Performance Optimization

2018-09-27 Thread manish gupta
+1

Regards
Manish Gupta

On Thu, 27 Sep 2018 at 11:36 AM, Kumar Vishal 
wrote:

> +1
> Regards
> Kumar Vishal
>
> On Thu, Sep 27, 2018 at 8:57 AM Jacky Li  wrote:
>
> > +1
> >
> > > 在 2018年9月21日,上午10:20,Ravindra Pesala  写道:
> > >
> > > Hi,
> > >
> > > In case of querying data using Spark or Presto, carbondata is not well
> > > optimized for reading data and filling the vector. The major issues are
> > > as follows.
> > > 1. CarbonData has a long method stack for reading and filling the data
> > > into the vector.
> > > 2. Many conditions and checks before filling the data into the vector.
> > > 3. Maintaining intermediate copies of data leads to more CPU utilization.
> > > Because of the above issues, there is a high chance of missing the CPU
> > > cache while processing, which leads to poor performance.
> > >
> > > So here I am proposing the optimization to fill the vector without much
> > > method stack and condition checks and no intermediate copies to utilize
> > > more CPU cache.
> > >
> > > *Full Scan queries:*
> > >  After decompressing the page in our V3 reader we can immediately fill
> > the
> > > data to a vector without any condition checks inside loops. So here
> > > complete column page data is set to column vector in a single batch and
> > > gives back data to Spark/Presto.
> > > *Filter Queries:*
> > >  First, apply page level pruning using the min/max of each page and get
> > > the valid pages of blocklet.  Decompress only valid pages and fill the
> > > vector directly as mentioned in full scan query scenario.
> > >
> > > In this method, we can also get the advantage of avoiding two times
> > > filtering in Spark/Presto as they do the filtering again even though we
> > > return the filtered data.
> > >
> > > Please find the *TPCH performance report of updated carbon* as per the
> > > changes mentioned above. Please note that I have done the changes in POC
> > > quality, so it will take some time to stabilize them.
> > >
> > > *Configurations*
> > > Laptop with i7 processor and 16 GB RAM.
> > > TPCH Data Scale: 100 GB
> > > No Sort with no inverted index data.
> > > Total CarbonData Size : 32 GB
> > > Total Parquet Size :  31 GB
> > >
> > >
> > > Queries          | Parquet | Carbon New | Carbon Old | Carbon Old vs Carbon New | Carbon New vs Parquet | Carbon Old vs Parquet
> > > Q1               | 101     | 96         | 128        | 25.00%   | 4.95%    | -26.73%
> > > Q2               | 85      | 82         | 85         | 3.53%    | 3.53%    | 0.00%
> > > Q3               | 118     | 112        | 135        | 17.04%   | 5.08%    | -14.41%
> > > Q4               | 473     | 424        | 486        | 12.76%   | 10.36%   | -2.75%
> > > Q5               | 228     | 201        | 205        | 1.95%    | 11.84%   | 10.09%
> > > Q6               | 19.2    | 19.2       | 48         | 60.00%   | 0.00%    | -150.00%
> > > Q7               | 194     | 181        | 198        | 8.59%    | 6.70%    | -2.06%
> > > Q8               | 285     | 263        | 275        | 4.36%    | 7.72%    | 3.51%
> > > Q9               | 362     | 345        | 363        | 4.96%    | 4.70%    | -0.28%
> > > Q10              | 101     | 92         | 93         | 1.08%    | 8.91%    | 7.92%
> > > Q11              | 64      | 61         | 62         | 1.61%    | 4.69%    | 3.13%
> > > Q12              | 41.4    | 44         | 63         | 30.16%   | -6.28%   | -52.17%
> > > Q13              | 43.4    | 43.6       | 43.7       | 0.23%    | -0.46%   | -0.69%
> > > Q14              | 36.9    | 31.5       | 41         | 23.17%   | 14.63%   | -11.11%
> > > Q15              | 70      | 59         | 80         | 26.25%   | 15.71%   | -14.29%
> > > Q16              | 64      | 60         | 64         | 6.25%    | 6.25%    | 0.00%
> > > Q17              | 426     | 418        | 432        | 3.24%    | 1.88%    | -1.41%
> > > Q18              | 1015    | 921        | 1001       | 7.99%    | 9.26%    | 1.38%
> > > Q19              | 62      | 53         | 59         | 10.17%   | 14.52%   | 4.84%
> > > Q20              | 406     | 326        | 426        | 23.47%   | 19.70%   | -4.93%
> > > Full Scan Query* | 140     | 116        | 164        | 29.27%   | 17.14%   | -17.14%
> > > *Full Scan Query means a count of every column of lineitem; in this way
> > > we check the full scan query performance.
> > >
> > > The above optimization is not just limited to fileformat and Presto
> > > integration but also improves for CarbonSession integration.
> > > We can further optimize carbon with tasks (Vishal is already working on
> > > it) like adaptive encoding for all types of columns and storing lengths
> > > and values in separate pages in case of the string datatype. Please refer
> > >
> >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html
> > > .
> > >
> > > --
> > > Thanks & Regards,
> > > Ravi
> > >
> >
> >
> >
> >
>


Re: [ANNOUNCE] Raghunandan as new committer of Apache CarbonData

2018-09-26 Thread manish gupta
Congratulations Raghunandan

On Wed, 26 Sep 2018 at 1:07 PM, Kunal Kapoor 
wrote:

> Congratulations raghunandan
>
> On Wed, Sep 26, 2018, 1:04 PM Ravindra Pesala 
> wrote:
>
> > Congrats Raghu
> >
> > On Wed, 26 Sep 2018 at 12:53, sujith chacko  >
> > wrote:
> >
> > > Congratulations Raghu
> > >
> > > On Wed, 26 Sep 2018 at 12:44 PM, Rahul Kumar 
> > > wrote:
> > >
> > > > congrats Raghunandan !!
> > > >
> > > >
> > > > Rahul Kumar
> > > > *Sr. Software Consultant*
> > > > *Knoldus Inc.*
> > > > m: 9555480074
> > > > w: www.knoldus.com  e: rahul.ku...@knoldus.in
> > > > 
> > > >
> > > >
> > > > On Wed, Sep 26, 2018 at 12:41 PM Kumar Vishal <
> > kumarvishal1...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Congratulations Raghunandan.
> > > > >
> > > > > -Regards
> > > > > Kumar Vishal
> > > > >
> > > > > On Wed, Sep 26, 2018 at 12:36 PM Liang Chen <
> chenliang6...@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi all
> > > > > >
> > > > > > We are pleased to announce that the PMC has invited Raghunandan
> as
> > > new
> > > > > > committer of Apache CarbonData, and the invite has been accepted!
> > > > > >
> > > > > > Congrats to Raghunandan and welcome aboard.
> > > > > >
> > > > > > Regards
> > > > > > Apache CarbonData PMC
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
> >
>


Re: [VOTE] Apache CarbonData 1.5.1(RC2) release

2018-12-02 Thread manish gupta
+1

Regards
Manish Gupta

On Mon, 3 Dec 2018 at 12:22 PM, Jacky Li  wrote:

> I think there are other places that are using apache-common-log, like
> https://github.com/apache/carbondata/blob/382ce430a18ca3d7d0b444777c66591e2c2e705f/hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonInputFormat.java#L103
>
> Since this is not introduced in this version, I think it is OK to raise a
> jira now and solve it in next version.
> So +1 for the voting.
>
> Regards,
> Jacky
>
>
> > 在 2018年12月1日,下午9:41,xuchuanyin  写道:
> >
> > Hi, please consider this line of code:
> https://github.com/apache/carbondata/blob/master/core/src/main/java/org/apache/carbondata/core/datamap/TableDataMap.java#L78
> >
> > It uses apache-common-log directly instead of carbondata log. I’m not
> sure about the impact of this.
> > Please take care of this before voting.
> >
> > Sent from laptop
> >
> > From: Ravindra Pesala
> > Sent: Saturday, December 1, 2018 08:58
> > To: dev
> > Subject: [VOTE] Apache CarbonData 1.5.1(RC2) release
> >
> > Hi
> >
> > I submit the Apache CarbonData 1.5.1 (RC2) for your vote.
> >
> > 1.Release Notes:
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220=12344320
> >
> >
> >Some key features and improvements in this release:
> >
> >   1. Optimized scan performance by avoiding multiple data copies and
> >   avoided double filtering for spark fileformat and presto by lazy
> loading
> >   and page pruning.
> >   2. Supported customize column compressor
> >   3. Supported concurrent reading through SDK reader to improve read
> >   performance.
> >   4. Supported fallback mechanism when offheap memory is not enough then
> >   switch to onheap
> >   5. Supported C++ interface for writing carbon data in CSDK
> >   6. Supported VectorizedReader for SDK Reader to improve read
> performance.
> >   7. Improved Blocklet DataMap pruning in driver using multi-threading.
> >   8. Make inverted index false by default
> >   9. Enable Local dictionary by default
> >   10. Support prefetch for compaction to improve compaction performance.
> >   11. Many Bug fixes and stabilized Carbondata.
> >
> >
> > 2. The tag to be voted upon : apache-carbondata-1.5.1-rc2 (commit:
> > 1d1eb7bd625f1af1745c555274dd69298a79ab65)
> >
> https://github.com/apache/carbondata/releases/tag/apache-carbondata-1.5.1-rc2
> >
> >
> > 3. The artifacts to be voted on are located here:
> > https://dist.apache.org/repos/dist/dev/carbondata/1.5.1-rc2/
> >
> >
> > 4. A staged Maven repository is available for review at:
> >
> https://repository.apache.org/content/repositories/orgapachecarbondata-1036/
> >
> >
> > 5. Release artifacts are signed with the following key:
> >
> > *https://people.apache.org/keys/committer/ravipesala.asc
> > <
> https://link.getmailspring.com/link/1524823736.local-38e60b2f-d8f4-v1.2.1-7e744...@getmailspring.com/9?redirect=https%3A%2F%2Fpeople.apache.org%2Fkeys%2Fcommitter%2Fravipesala.asc=ZGV2QGNhcmJvbmRhdGEuYXBhY2hlLm9yZw%3D%3D
> >*
> >
> >
> > Please vote on releasing this package as Apache CarbonData 1.5.1,  The
> vote
> >
> > will be open for the next 72 hours and passes if a majority of
> >
> > at least three +1 PMC votes are cast.
> >
> >
> > [ ] +1 Release this package as Apache CarbonData 1.5.1
> >
> > [ ] 0 I don't feel strongly about it, but I'm okay with the release
> >
> > [ ] -1 Do not release this package because...
> >
> >
> > Regards,
> > Ravindra.
> >
> >
> >
>
>


Re: [ANNOUNCE] Chuanyin Xu as new PMC for Apache CarbonData

2019-01-02 Thread manish gupta
Congratulations Chuanyin..!!!

Regards
Manish Gupta

On Wed, 2 Jan 2019 at 5:22 PM, Mohammad Shahid Khan <
mohdshahidkhan1...@gmail.com> wrote:

> Congrats
> Regards,
> Shahid
>
> On Wed 2 Jan, 2019, 5:49 AM Liang Chen 
> > Hi
> >
> > We are pleased to announce that Chuanyin Xu as new PMC for Apache
> > CarbonData
> > .
> >
> > Congrats to Chuanyin Xu!
> >
> > Apache CarbonData PMC
> >
>


[DISCUSSION] Optimizing the writing of min max for a column

2018-09-15 Thread manish gupta
Hi Dev

I am currently working on a min/max optimization wherein, for string/varchar
data type columns, we will decide internally whether to write min/max or not.

*Background*
Currently we store min/max for all the columns: page min/max, blocklet min/max
in the file footer, and all the blocklet metadata entries in the shard.
Consider the case where each column data size is more than 1 characters. In
this case, if we write min/max then min/max will be written 3 times for each
column, which will lead to an increase in store size and impact the query
performance.

*Design proposal* (a minimal sketch follows the list below)
1. We will introduce a configurable system level property for max
characters *"carbon.string.allowed.character.count".* If the data crosses
this limit then min max will not be stored for that column.
2. If a page does not contain min max for a column, then blocklet min max
will also not contain the entry for min max of that column.
3. The thrift file will be modified to introduce an optional Boolean flag which
will be used in queries to identify whether min/max is stored for the filter
column or not.
4. As of now it will be supported only for dimensions of string/varchar
type. We can extend it further to support bigDecimal type measures also in
the future if required.
5. The block and blocklet dataMap cache will also store the min/max Boolean
flag for dimension columns, based on which filter pruning will be done. If
min/max is not written for a column then isScanRequired will return true
during driver pruning.
6. In the executor, page and blocklet level min/max will again be checked for
the filter column. If min/max is not written then the complete page data will
be scanned.
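
A minimal sketch of the decision in point 1 (hypothetical names; only the
property 'carbon.string.allowed.character.count' comes from this proposal):

final class MinMaxWriteDecision {
  private final int allowedCharacterCount;  // value of carbon.string.allowed.character.count

  MinMaxWriteDecision(int allowedCharacterCount) {
    this.allowedCharacterCount = allowedCharacterCount;
  }

  // Returns false when any value in the page crosses the limit; in that case
  // min/max is skipped for the column and the Boolean flag (point 3) records it.
  boolean shouldWriteMinMax(String[] pageValues) {
    for (String value : pageValues) {
      if (value != null && value.length() > allowedCharacterCount) {
        return false;
      }
    }
    return true;
  }
}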

*Backward compatibility*
1. For stores prior to 1.5.0, the min/max flag for all the columns will be set
to true while loading the dataMap in the query flow.

Please feel free to share your inputs and suggestions.

Regards
Manish Gupta


Re: [DISCUSSION] Optimizing the writing of min max for a column

2018-09-16 Thread manish gupta
Hi Xuchuanyin

Please find below the details for the property
‘carbon.string.allowed.character.count’.

Property name                           | Default value | Max value | Min value
carbon.string.allowed.character.count   | 500           | 1000      | 10

Regards
Manish Gupta

On Sun, Sep 16, 2018 at 9:32 AM xuchuanyin  wrote:

> What is the default value of the property
> ‘carbon.string.allowed.character.count’ ?
>
> Actually many IDs are string, as a result I think we can make it a
> reasonable value so that it will not affect the behavior of common usage.


Re: [VOTE] Apache CarbonData 1.5.3(RC1) release

2019-04-08 Thread manish gupta
+1

Regards
Manish Gupta

On Mon, 8 Apr 2019 at 9:15 AM, Kumar Vishal 
wrote:

> +1
> Regards
> Kumar vishal
>
> On Mon, 8 Apr 2019 at 09:09, David CaiQiang  wrote:
>
> > +1
> >
> >
> >
> > -
> > Best Regards
> > David Cai
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [Discussion] DDLs to operate on CarbonLRUCache

2019-02-18 Thread manish gupta
Hi Naman

+1 for points 1, 2 and 6.
-1 for points 3, 4 and 5.

1. For points 1, 2 --> Add a design doc to mention all those things that
will be considered for caching while displaying the cache size.
2. For points 3, 4 --> I feel that cleaning of the cache should be an internal
thing and not exposed to the user. Exposing it might also suppress any bugs
that exist while cleaning the cache at the time of dropping the table. You
can think of stale cache clean-up through a separate thread which checks for
stale entries at intervals (a sketch follows these points), or you can try to
integrate the functionality with the Clean DDL command.
3. For point 5 --> We should think of introducing a command to collect
system statistics, something like Spark does, and from there calculate the
memory requirements instead of exposing a DDL specifically for cache
calculations.
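
A minimal sketch of the background clean-up suggested in point 2 (hypothetical
names; the actual LRU cache structure and table lookup are assumptions):

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

final class StaleCacheCleaner {
  // Periodically evict cache entries whose table no longer exists, instead of
  // exposing a user-facing DDL for cache clean-up.
  static ScheduledExecutorService start(Map<String, ?> lruCache,
                                        Predicate<String> tableStillExists,
                                        long intervalMinutes) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(
        () -> lruCache.keySet().removeIf(key -> !tableStillExists.test(key)),
        intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    return scheduler;
  }
}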

Regards
Manish Gupta

On Tue, Feb 19, 2019 at 7:28 AM dhatchayani 
wrote:

> Hi Naman,
>
> This will be very useful for users to control the cache size and the
> utilization of the cache.
>
> Please clarify the below point for me.
>
> Dynamic "max cache size" configuration should be supported?
> "carbon.max.driver.lru.cache.size" is a system level configuration whereas
> dynamic property is the session level property. We can support the
> dynamically SET for which the purpose of the property still holds good to
> the system. I think in this case, it does not hold well to the system.
>
> Thanks,
> Dhatchayani
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [DISCUSSION] Distributed Index Cache Server

2019-03-05 Thread manish gupta
+1
Thanks Kunal for working on the design.

Regards
Manish Gupta

On Mon, Mar 4, 2019 at 2:59 PM Kunal Kapoor 
wrote:

> Hi xuchuanyin,
> ok, we will be moving the pruning logic to this module.
>
> Please give +1 to the design if you are happy with it.
>
> Thanks
> Kunal Kapoor
>
> On Wed, Feb 20, 2019 at 6:25 PM xuchuanyin  wrote:
>
> > Hi kunal,
> >
> > At last I'd suggest again that the code for pruning procedure should be
> > moved to a separate module.
> > The earlier we do this, the easier will be if we want to implement other
> > types of IndexServer later.
> >
> >
> >
> > --
> > Sent from:
> > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> >
>


Re: [VOTE] Apache CarbonData 1.5.4(RC1) release

2019-05-27 Thread manish gupta
+1

Regards
Manish Gupta

On Mon, May 27, 2019 at 11:34 AM kanaka  wrote:

> +1
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANNOUNCE] Akash as new Apache CarbonData committer

2019-04-25 Thread manish gupta
Congratulations Akash..!!!

On Thu, 25 Apr 2019 at 8:26 PM, Kunal Kapoor 
wrote:

> Congratulations Akash
>
> On Thu, Apr 25, 2019, 7:56 PM Mohammad Shahid Khan <
> mohdshahidkhan1...@gmail.com> wrote:
>
>> Congrats Akash
>> Regards,
>> Mohammad Shahid Khan
>>
>> On Thu 25 Apr, 2019, 7:51 PM Liang Chen,  wrote:
>>
>> > Hi all
>> >
>> > We are pleased to announce that the PMC has invited Akash as new
>> > Apache CarbonData
>> > committer, and the invite has been accepted!
>> >
>> > Congrats to Akash and welcome aboard.
>> >
>> > Regards
>> > Apache CarbonData PMC
>> >
>> >
>> > --
>> > Regards
>> > Liang
>> >
>>
>


Re: [Discussion] Roadmap for Apache CarbonData 2

2019-08-13 Thread manish gupta
Hi Team

It's great to see how CarbonData has grown and become popular over time. It
was important to take a fresh look and come up with a roadmap as per future
needs. The CarbonData 2.0 proposal looks good, as we are trying to align it
with the cloud, which will more or less be the prominent runtime environment
in the near future. A lot of code refactoring will be required as per the
roadmap. I would like to add a couple of points.

1. Complex type support: Although we do have complex type support, there is
scope for improvement. Use cases for nested columns are growing
extensively. We should work on improving the storage of nested columns and
should also support creating compound/multi-column indexes for nested
columns.
2. Feature code segregation and pluggability: The current code is tightly
coupled. The ideal case would be to have a base and make all the features
pluggable into it, but that will be hard to achieve. We can try segregation
at the package level for major features, but for any new feature developed
we should think in terms of pluggability.

[Clarification] Carbon UI: I did not understand the usage of the Carbon segment
management UI. For the cloud scenario we will have to expose REST endpoints,
which will make Carbon more like a microservice, and that does not go along
with the CarbonData use case. A UI/tool makes more sense for internal testing,
but I am not sure how it will be beneficial for the end user. Maybe a tool
showing the data stored in each table would be more useful to the end user.

Regards
Manish Gupta

On Tue, Aug 13, 2019 at 4:51 PM Kumar Vishal 
wrote:

> Hi Ravi,
>
> We can add below requirements in 2.0:
>
> 1. Data Loading performance improvement.(Need to analyze and improve)
> 2. Unify reading for carbon data file, currently data is read in two parts
> dimension and measure because of this number of IO is more.
> 3. Carbon Store size optimization(Already PR is raised need to revisit) and
> we can explore some more optimization(like RLE hybrid Bit Packing).
> 4. Presto enhancement(Like write support, Presto SQL adaptation, Complex
> type read support)
> 5. Spark Data Source V2 integration.
> 6. Spatial Index Support.
>
>
> -Regards
> Kumar Vishal
>
> On Thu, Jul 18, 2019 at 8:20 PM Ravindra Pesala 
> wrote:
>
> > Hi Kevin,
> >
> > Yes, we can improve it. The implementation is closely related to
> supporting
> > pre-aggregate datamaps on the streaming table which we have already
> > implemented some time ago. And same will be reimplemented for MV datamap
> > soon as well.
> > The implementation allows using of pre-aggregate datamap for
> non-streaming
> > segments and main table for streaming segments. We update the query plan
> to
> > do union on both the tables and query only the streaming segments for
> main
> > table.
> > So even in our case also we can use the same way, we can do the union
> query
> > of MV table and main table(only non loaded datamap segments) and execute
> > the query.  We can definitely consider after we support streaming table
> for
> > MV datamap.
> >
> > Regards,
> > Ravindra.
> >
> > On Wed, 17 Jul 2019 at 07:55, kevinjmh  wrote:
> >
> > > Currently, a datamap in carbon applies to all segments.
> > > The roadmap refers to commands like add/drop segment, and also maybe
> > > something about incremental loading for MV. For these scenarios, it is
> > > better to make the datamap usable at segment level instead of disabling
> > > the datamap when its data is not ready for some segment. This can also
> > > make the datamap fail-safe and enhance carbon's stability.
> > > Maybe we can consider this as well.
> > >
> > >
> > >
> > >
> > > -
> > > Regards
> > > Manhua
> > >
> > >
> > >
> > > ---Original---
> > > From: "Ravindra Pesala"
> > > Date: Tue, Jul 16, 2019 22:31 PM
> > > To: "dev";
> > > Subject: [Discussion] Roadmap for Apache CarbonData 2
> > >
> > >
> > > Hi Community,
> > >
> > > Three years have passed since the launch of the Apache CarbonData
> > > project, and CarbonData has become a popular data management solution for
> > > various scenarios. As new workloads like AI and new runtime environments
> > > like the cloud are emerging quickly, I think we are reaching a point where
> > > we need to discuss the future of CarbonData.
> > >
> > > To bring CarbonData to a new level to satisfy those new requirements,
> > Jacky
> > > and I drafted a roadmap for CarbonData 2 in the cwiki website.
> > > - English Version:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/CARBONDATA/Apache+CarbonData+2+Roadmap+Proposal
> > > - Chinese Version:
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120737492
> > >
> > > Please feel free to discuss the roadmap in this thread, and we welcome
> > > every feedback to make CarbonData better.
> > >
> > > Thanks and Regards,
> > > Ravindra.
> >
> >
> >
> > --
> > Thanks & Regards,
> > Ravi
> >
>


Re: [DISCUSSION]: Changes to SHOW METACACHE command

2019-12-18 Thread manish gupta
Hi Vikram

Is the behavior of SHOW METACACHE generic irrespective of the database the
user is in? I.e., does every execution of this command display the details of
all databases, OR only those of the database the user is currently logged
into? If it displays all databases by default, then as a suggestion there
should be some segregation (it can be an option in the DDL command) and the
default behavior should display details only for the logged-in database.

Regards
Manish Gupta

On Tue, Dec 17, 2019 at 5:25 PM Vikram Ahuja 
wrote:

> Hi All,
> Please find the attached design document for the same.
>
>
> https://docs.google.com/document/d/1qbr8-Ci_tCvh1tuEdxo3xJkLhDksdnfKuZiDtQCUujk/edit?usp=sharing
>
> On Tue, Nov 26, 2019 at 10:28 PM Kunal Kapoor 
> wrote:
>
> > Hi Vikram,
> > What is the background for these changes and what are the benefits this
> > will add to carbondata?
> > Better to add a detailed design document in this thread.
> >
> > Thanks
> > Kunal Kapoor
> >
> > On Tue, Nov 26, 2019 at 7:01 PM vikramahuja1001 <
> vikramahuja8...@gmail.com
> > >
> > wrote:
> >
> > > Current result of Show Metacache command:
> > > <
> > >
> >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t423/currentSM.png
> > >
> > >
> > >
> > > Proposed result of Show metacache command for the Driver and the Index
> > > Server:
> > > <
> > >
> >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/t423/proposedSM.png
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Sent from:
> > >
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > >
> >
>


Re: [Discussion] Support Secondary Index on Carbon Table

2020-02-06 Thread manish gupta
+1

Regards
Manish Gupta

On Thu, 6 Feb 2020 at 1:50 PM, David CaiQiang  wrote:

> +1
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANNOUNCE] Kunal Kapoor as new PMC for Apache CarbonData

2020-03-30 Thread manish gupta
Congratulations Kunal !!!

Regards
Manish Gupta

On Sun, Mar 29, 2020 at 8:25 PM kanaka kumar avvaru <
kanakakumaravv...@gmail.com> wrote:

> Congratulations Kunal !!!
>
> -Regards
> Kanaka
>
> On Sun, 29 Mar, 2020, 12:37 Liang Chen,  wrote:
>
> > Hi
> >
> >
> > We are pleased to announce that Kunal Kapoor as new PMC for Apache
> > CarbonData.
> >
> >
> > Congrats to Kunal Kapoor!
> >
> >
> > Apache CarbonData PMC
> >
>


Re: [ANN] Indhumathi as new Apache CarbonData committer

2020-10-06 Thread manish gupta
Congratulations Indhumathi

Regards
Manish Gupta

On Wed, 7 Oct 2020 at 10:23 AM, brijoobopanna 
wrote:

> Congrats Indhumathi, best of luck for your new role in the community
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANNOUNCE] Ajantha as new PMC for Apache CarbonData

2020-11-20 Thread manish gupta
Congratulations Ajantha 

On Fri, 20 Nov 2020 at 1:21 PM, BrooksLi  wrote:

> Congratulations to Ajantha!
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: [ANNOUNCE] Akash R Nilugal as new PMC for Apache CarbonData

2021-04-12 Thread manish gupta
Congratulations Akash !!!

On Mon, 12 Apr 2021 at 11:36 AM, Indhumathi  wrote:

> Congratulations Akash
>
> Regards,
> Indhumathi
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


[jira] [Created] (CARBONDATA-1062) Data load fails if a column specified as sort column is of numeric data type

2017-05-17 Thread Manish Gupta (JIRA)
Manish Gupta created CARBONDATA-1062:


 Summary: Data load fails if a column specified as sort column is 
of numeric data type
 Key: CARBONDATA-1062
 URL: https://issues.apache.org/jira/browse/CARBONDATA-1062
 Project: CarbonData
  Issue Type: Bug
Reporter: Manish Gupta
Assignee: Manish Gupta
Priority: Minor
 Fix For: 1.2.0


If a numeric data type column is specified as a sort column and it contains a
non-numeric value, then the data load fails with the below error.
ERROR UnsafeBatchParallelReadMergeSorterImpl: pool-20-thread-1 
java.lang.ClassCastException: java.lang.String cannot be cast to [B
at 
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
at 
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
at 
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
at 
org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150)
at 
org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Steps to reproduce
--
CREATE TABLE test_sort_col (id INT, name STRING, age INT) STORED BY 
'org.apache.carbondata.format' TBLPROPERTIES('SORT_COLUMNS'='id,age')
LOAD DATA local inpath '' INTO TABLE test_sort_col
select * from test_sort_col

Data
---
id,name,age
1,Pallavi,25
2,Rahul,24
3,Prabhat,twenty six
7,Neha,25
2,Geetika,22
3,Sangeeta,26



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-1212) Memory leak in case of compaction when unsafe is true

2017-06-21 Thread Manish Gupta (JIRA)
Manish Gupta created CARBONDATA-1212:


 Summary: Memory leak in case of compaction when unsafe is true
 Key: CARBONDATA-1212
 URL: https://issues.apache.org/jira/browse/CARBONDATA-1212
 Project: CarbonData
  Issue Type: Bug
Reporter: Manish Gupta
Assignee: Manish Gupta
 Fix For: 1.2.0
 Attachments: data.csv

In case of compaction, a queryExecutor object is created for each of multiple
blocks, but the objects are not retained and the finish method is called only
on the last query executor instance created. Due to this, the memory allocated
to the previous objects is not released, which can lead to an out of memory issue.
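
A minimal sketch of the kind of handling this implies (hypothetical names; the
real executor type is only modeled by the interface below):

import java.util.ArrayList;
import java.util.List;

final class CompactionExecutorTracker {
  interface QueryExecutor { void finish(); }

  private final List<QueryExecutor> executors = new ArrayList<>();

  QueryExecutor track(QueryExecutor executor) {
    executors.add(executor);   // retain every executor, not only the last one
    return executor;
  }

  void finishAll() {
    for (QueryExecutor executor : executors) {
      executor.finish();       // release the memory held for each block's query
    }
    executors.clear();
  }
}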

Steps to reproduce:
--
CREATE TABLE IF NOT EXISTS t3 (ID Int, date Date, country String, name String, 
phonetype String, serialname char(10), salary Int) STORED BY 'carbondata' 
TBLPROPERTIES('DICTIONARY_EXCLUDE'='name')
LOAD DATA LOCAL INPATH 'data.csv' into table t3
LOAD DATA LOCAL INPATH 'data.csv' into table t3
alter table t3 compact 'major'




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (CARBONDATA-1217) Failure in data load when we first load the bad record and then valid record and bad record action is set to Fail

2017-06-23 Thread Manish Gupta (JIRA)
Manish Gupta created CARBONDATA-1217:


 Summary: Failure in data load when we first load the bad record 
and then valid record and bad record action is set to Fail
 Key: CARBONDATA-1217
 URL: https://issues.apache.org/jira/browse/CARBONDATA-1217
 Project: CarbonData
  Issue Type: Bug
Reporter: Manish Gupta
Assignee: Manish Gupta
Priority: Minor
 Fix For: 1.2.0
 Attachments: bigtabbad.csv, bigtab.csv

When we load a bad record into the table and the bad record action is set to
"FAIL", the data load fails. During the load, a static bad record logger map is
maintained which holds a key for the bad record. When the data load fails due
to a bad record, an exception is thrown and the key is not cleared from the
static map, because of which the next load of valid data also fails, since the
key still exists in the map.
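
A minimal sketch of the clean-up this calls for (hypothetical names; only the
idea of a static key map comes from the description above):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class BadRecordKeyCleanup {
  private static final ConcurrentMap<String, Boolean> BAD_RECORD_PRESENT =
      new ConcurrentHashMap<>();

  static void runLoad(String tableKey, Runnable doLoad) {
    try {
      doLoad.run();                           // may throw when BAD_RECORDS_ACTION is FAIL
    } finally {
      BAD_RECORD_PRESENT.remove(tableKey);    // always clear, even when the load fails
    }
  }
}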

Steps to reproduce
---
Execute the below commands in sequence in the same session.
create table bigtab (val string, bal int) STORED BY 'carbondata'
load data  inpath 'bigtabbad.csv' into table bigtab 
options('DELIMITER'=',','QUOTECHAR'='"','BAD_RECORDS_ACTION'='FAIL','FILEHEADER'='val,bal')
load data  inpath 'bigtab.csv' into table bigtab 
options('DELIMITER'=',','QUOTECHAR'='"','BAD_RECORDS_ACTION'='FAIL','FILEHEADER'='val,bal')



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (CARBONDATA-1104) Query failure while using unsafe for query execution numeric data type column specified as sort column

2017-05-29 Thread Manish Gupta (JIRA)
Manish Gupta created CARBONDATA-1104:


 Summary: Query failure while using unsafe for query execution 
numeric data type column specified as sort column
 Key: CARBONDATA-1104
 URL: https://issues.apache.org/jira/browse/CARBONDATA-1104
 Project: CarbonData
  Issue Type: Bug
Reporter: Manish Gupta
Assignee: Manish Gupta
 Fix For: 1.2.0


Steps to reproduce
-
1. Set the parameter enable.unsafe.in.query.processing = true

2. CREATE TABLE sorttable1 (empno int, empname String, designation String, doj 
Timestamp, workgroupcategory int, workgroupcategoryname String, deptno int, 
deptname String, projectcode int, projectjoindate Timestamp, projectenddate 
Timestamp,attendance int,utilization int,salary int) STORED BY 
'org.apache.carbondata.format' tblproperties('sort_columns'='empno')

3. LOAD DATA local inpath '' INTO TABLE sorttable1 
OPTIONS('DELIMITER'= ',', 'QUOTECHAR'= '"')

4. select empno from sorttable1

Exception thrown

17/05/29 08:43:20 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 12)
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.vectorized.ColumnVector.arrayData(ColumnVector.java:858)
at 
org.apache.spark.sql.execution.vectorized.OffHeapColumnVector.putByteArray(OffHeapColumnVector.java:421)
at 
org.apache.spark.sql.execution.vectorized.ColumnVector.putByteArray(ColumnVector.java:569)
at 
org.apache.carbondata.spark.vectorreader.ColumnarVectorWrapper.putBytes(ColumnarVectorWrapper.java:85)
at 
org.apache.carbondata.core.datastore.chunk.store.impl.unsafe.UnsafeVariableLengthDimesionDataChunkStore.fillRow(UnsafeVariableLengthDimesionDataChunkStore.java:167)
at 
org.apache.carbondata.core.datastore.chunk.impl.VariableLengthDimensionDataChunk.fillConvertedChunkData(VariableLengthDimensionDataChunk.java:112)
at 
org.apache.carbondata.core.scan.result.AbstractScannedResult.fillColumnarNoDictionaryBatch(AbstractScannedResult.java:228)
at 
org.apache.carbondata.core.scan.collector.impl.DictionaryBasedVectorResultCollector.scanAndFillResult(DictionaryBasedVectorResultCollector.java:154)
at 
org.apache.carbondata.core.scan.collector.impl.DictionaryBasedVectorResultCollector.collectVectorBatch(DictionaryBasedVectorResultCollector.java:147)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-1094) Wrong results returned by the query in case inverted index is not created on a column

2017-05-25 Thread Manish Gupta (JIRA)
Manish Gupta created CARBONDATA-1094:


 Summary: Wrong results returned by the query in case inverted 
index is not created on a column
 Key: CARBONDATA-1094
 URL: https://issues.apache.org/jira/browse/CARBONDATA-1094
 Project: CarbonData
  Issue Type: Bug
Reporter: Manish Gupta
Assignee: Manish Gupta
 Fix For: 1.2.0


While creating a table if a column is not specified as sort column or included 
as no inverted index then the column will not be sorted during data load. 
Unsorted data will have incorrect min/max values and inverted index will not be 
created for that column.

During query, if filter exists for that column it gives incorrect results as 
binary search cannot be applied on the unsorted data.

Commands to reproduce
-
CREATE TABLE IF NOT EXISTS index1 (id Int, name String, city String) STORED BY 
'org.apache.carbondata.format' TBLPROPERTIES('NO_INVERTED_INDEX'='name,city', 
'DICTIONARY_EXCLUDE'='city')
LOAD DATA LOCAL INPATH '' into table index1
SELECT * FROM index1 WHERE city >= 'Shanghai'
+---+--+--+
| id|  name|  city|
+---+--+--+
| 11| James|Washington|
|  5|  John|   Beijing|
| 20| Kevin| Singapore|
| 17|  Lisa|  Hangzhou|
| 12| Maria|Berlin|
|  2|  Mark| Paris|
|  9|  Mary| Tokyo|
|  6|Michel|   Chicago|
| 16|  Paul|  Shanghai|
| 14| Peter|Boston|
|  7|Robert|   Houston|
|  4|  Sara| Tokyo|
|  8| Sunny|Boston|
+---+--+--+



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (CARBONDATA-1133) Executor lost failure in case of data load failure due to bad records

2017-06-06 Thread Manish Gupta (JIRA)
Manish Gupta created CARBONDATA-1133:


 Summary: Executor lost failure in case of data load failure due to 
bad records
 Key: CARBONDATA-1133
 URL: https://issues.apache.org/jira/browse/CARBONDATA-1133
 Project: CarbonData
  Issue Type: Bug
Reporter: Manish Gupta
Assignee: Manish Gupta
 Fix For: 1.2.0


When we try to do data loads with bad records continuously, after some time it
is observed that the executor is lost due to an OOM error and the application
also gets restarted by YARN after some time. This happens because, in case of a
data load failure due to bad records, an exception is thrown by the executor and
the task keeps retrying till the maximum number of retry attempts is reached.
This keeps happening continuously and after some time the application is
restarted by YARN.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)