Re: Create table with columns contains spaces in name.

2016-10-18 Thread Harmeet Singh
Thanks Ravi, I will raise it on JIRA. 



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Create-table-with-columns-contains-spaces-in-name-tp2030p2035.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.


Re: Create table with columns contains spaces in name.

2016-10-18 Thread ravipesala
Probably it is a bug. 





[jira] [Created] (CARBONDATA-325) Create table with columns contains spaces in name.

2016-10-18 Thread Harmeet Singh (JIRA)
Harmeet Singh created CARBONDATA-325:


 Summary: Create table with columns contains spaces in name.
 Key: CARBONDATA-325
 URL: https://issues.apache.org/jira/browse/CARBONDATA-325
 Project: CarbonData
  Issue Type: Bug
Reporter: Harmeet Singh


I want to create a table using column names that contain spaces. I am using the Thrift 
Server and the Beeline client to access CarbonData. Whenever I try to 
create a table whose column names contain spaces, I get an error. 
Below are the steps:

Step 1:
create table three (`first name` string, `age` int) stored by 'carbondata';

Whenever I execute the above query, I get the following error:
Error: org.apache.carbondata.spark.exception.MalformedCarbonCommandException: 
Unsupported data type : FieldSchema(name:first name, type:string, 
comment:null).getType (state=,code=0)

The error misleadingly suggests that an unsupported data type is being used. 

If I remove `stored by 'carbondata'` from the query, it works fine 
because it then runs on Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-carbondata pull request #151: [CARBONDATA-210] Support BZIP2 compr...

2016-10-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/151


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Create table with columns contains spaces in name.

2016-10-18 Thread Harmeet Singh
I want to create a table using column names that contain spaces. I am using the Thrift
Server and the Beeline client to access CarbonData. Whenever I try to
create a table whose column names contain spaces, I get an
error. Below are the steps: 

Step 1: 
create table three (`first name` string, `age` int) stored by 'carbondata';

Whenever I execute the above query, I get the following error: 
*Error:
org.apache.carbondata.spark.exception.MalformedCarbonCommandException:
Unsupported data type : FieldSchema(name:first name, type:string,
comment:null).getType (state=,code=0)*

The error misleadingly suggests that an unsupported data type is being used. 

If I remove `stored by 'carbondata'` from the query, it works
fine.

Please confirm: is this an issue in CarbonData, or does CarbonData simply not
support column names with spaces?
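
Until this is confirmed as a bug, one client-side workaround (a hypothetical sketch, not part of CarbonData) is to normalize the column names before generating the DDL, since names without spaces create fine:

```python
import re

def sanitize_column(name: str) -> str:
    """Replace runs of whitespace with underscores to avoid the parser error."""
    return re.sub(r"\s+", "_", name.strip())

# Build the DDL from the intended (space-containing) column names.
columns = [("first name", "string"), ("age", "int")]
ddl = "create table three ({}) stored by 'carbondata'".format(
    ", ".join("`{}` {}".format(sanitize_column(c), t) for c, t in columns))
print(ddl)
# create table three (`first_name` string, `age` int) stored by 'carbondata'
```

The original spelling can be kept in a separate mapping if the application needs to display "first name" to users.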






[jira] [Created] (CARBONDATA-324) Decimal and Bigint type columns contains Null, after load data

2016-10-18 Thread Harmeet Singh (JIRA)
Harmeet Singh created CARBONDATA-324:


 Summary: Decimal and Bigint type columns contains Null, after load 
data
 Key: CARBONDATA-324
 URL: https://issues.apache.org/jira/browse/CARBONDATA-324
 Project: CarbonData
  Issue Type: Bug
Reporter: Harmeet Singh








[jira] [Created] (CARBONDATA-323) Fix the load data local syntax

2016-10-18 Thread Fei Wang (JIRA)
Fei Wang created CARBONDATA-323:
---

 Summary: Fix the load data local syntax
 Key: CARBONDATA-323
 URL: https://issues.apache.org/jira/browse/CARBONDATA-323
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 0.1.0-incubating
Reporter: Fei Wang
Assignee: Fei Wang
 Fix For: 0.2.0-incubating


Carbon should not support the LOAD DATA LOCAL syntax, so fix it.





RE: Discussion(New feature) regarding single pass data loading solution.

2016-10-18 Thread Jihong Ma
Hi Ravi, 

Making the in-memory backup copy consistent is only part of the story; we also 
need the on-disk backup (through the DB persistence they support) to be consistent with the 
in-memory copy. How do we achieve that? Probably the safest way is to leverage 
their transaction support. Please look into what they can vs. can't do, as well 
as the amount of effort/complexity required. 

Jihong



-Original Message-
From: Ravindra Pesala [mailto:ravi.pes...@gmail.com] 
Sent: Tuesday, October 18, 2016 7:21 AM
To: dev
Subject: Re: Discussion(New feature) regarding single pass data loading 
solution.

Hi Jihong,

Yes, Hazelcast maintains only part of the data on each node, because it
splits the data into partitions and allocates partition ownership to the nodes.
But if the requested data is not present on a node, it can still fetch the
data from another partition in the cluster if available. In any case, we can
maintain a full local data cache on each node, look up the key in Hazelcast
only when it is not in the local cache, and update the local cache once it is
retrieved from Hazelcast.

Yes, it allows data backup on multiple nodes as per configuration for high
availability. The backup is done in sync/async mode, and consistency is
guaranteed if we use sync-mode backup, because putting a key-value pair into a
Hazelcast map blocks the call until it is copied to all the backup nodes in
memory. The Hazelcast map also supports locks to ensure data consistency;
we can use APIs like map.putIfAbsent or the map.lock & map.unlock features.

Thanks,
Ravi.

On 18 October 2016 at 00:08, Jihong Ma  wrote:

> Hi Ravi,
>
> I took a quick look at Hazlecast, what they offer is a distributed map
> across cluster (on any single node only portion of the map is stored), to
> facilitate parallel data loading I think we need a complete copy on each
> node, is this the structure we are looking for?
>
> it does allow map in-memory backup in case one node goes down, to ensure
> its persistency, they allow storing map to db, but requires implementing
> their API to hook them up, there are async/ sync mode supported with no
> guarantee in terms of consistency, unless going further for a transaction
> support, 2-phase commit/XA are offered with read-committed isolation, to
> achieve that is quite complicated when we need to ensure ACID on changes to
> the map. I suggest you to investigate further to understand the implication
> and effort.
>
> We all understand We couldn't afford any inconsistency on dictionary, that
> means we couldn't decode the data back correctly. correctness is even more
> critical compared to performance.
>
>
> Jihong
>
> -Original Message-
> From: Ravindra Pesala [mailto:ravi.pes...@gmail.com]
> Sent: Saturday, October 15, 2016 12:50 AM
> To: dev
> Subject: Re: Discussion(New feature) regarding single pass data loading
> solution.
>
> Hi Jacky/Jihong,
>
> I agree that new dictionary values are less in case of incremental data
> load but that is completely depends on user data scenarios.  In some
> user scenarios new dictionary values may be more we cannot overrule that.
> And also for users convenience we should provide single pass solution with
> out insisting them to run external tool first. We can provide the option to
> run external tool first and provide dictionary to improve performance.
>
> My opinion is better to use some professional distributed map like
> Hazlecast than Zookeeper + HDFS.  It is lite weight and does not require to
> have separate cluster, it can form the cluster within the executor jvm's .
> May be we can have a try, after all it will be just one interface
> implementation for dictionary generation. We can have multiple
> implementations and then decide based on optimal performance.
>
> Regards,
> Ravi
>
> On 15 October 2016 at 10:50, Jacky Li  wrote:
>
> > Hi,
> >
> > I can offer one more approach for this discussion, since new dictionary
> > values are rare in case of incremental load (ensure first load having as
> > much dictionary value as possible), so synchronization should be rare. So
> > how about using Zookeeper + HDFS file to provide this service. This is
> what
> > carbon is doing today, we can wrap Zookeeper + HDFS to provide the global
> > dictionary interface.
> > It has the benefit of
> > 1. automated: without bordering the user
> > 2. not introducing more dependency: we already using zookeeper and HDFS.
> > 3. performance? since new dictionary value and synchronization is rare.
> >
> > What do you think?
> >
> > Regards,
> > Jacky
> >
> > > 在 2016年10月15日,上午2:38,Jihong Ma  写道:
> > >
> > > Hi Ravi,
> > >
> > > The major concern I have for generating global dictionary from scratch
> > with a single scan is performance, the way to handle an occasional update
> > to the dictionary is way simpler and cost effective in terms of
> > synchronization cost and refresh the global/local cache copy.
> > >
> > > There are a lot to worry about for distributed map, and leveraging KV
> > store is overkill if 

Re: Discussion(New feature) regarding single pass data loading solution.

2016-10-18 Thread Ravindra Pesala
Hi Jihong,

Yes, Hazelcast maintains only part of the data on each node, because it
splits the data into partitions and allocates partition ownership to the nodes.
But if the requested data is not present on a node, it can still fetch the
data from another partition in the cluster if available. In any case, we can
maintain a full local data cache on each node, look up the key in Hazelcast
only when it is not in the local cache, and update the local cache once it is
retrieved from Hazelcast.

Yes, it allows data backup on multiple nodes as per configuration for high
availability. The backup is done in sync/async mode, and consistency is
guaranteed if we use sync-mode backup, because putting a key-value pair into a
Hazelcast map blocks the call until it is copied to all the backup nodes in
memory. The Hazelcast map also supports locks to ensure data consistency;
we can use APIs like map.putIfAbsent or the map.lock & map.unlock features.
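
The putIfAbsent-style key assignment described above can be sketched in miniature. This is an illustrative single-process model only: a plain dict and a threading lock stand in for the distributed Hazelcast IMap, its sync backups, and its per-key map.lock/map.unlock; it is not Hazelcast's actual client API.

```python
import threading

class GlobalDictionary:
    """Toy model of global-dictionary surrogate-key assignment over a
    distributed map. A dict plus a lock stand in for Hazelcast's IMap."""

    def __init__(self):
        self._map = {}                 # dictionary value -> surrogate key
        self._lock = threading.Lock()

    def get_or_assign(self, value):
        # First check without locking, mimicking a local-cache hit.
        key = self._map.get(value)
        if key is not None:
            return key
        with self._lock:               # map.lock(value) in Hazelcast
            # Re-check under the lock: another loader may have won the
            # race, which is what map.putIfAbsent guards against.
            key = self._map.get(value)
            if key is None:
                key = len(self._map)
                self._map[value] = key # sync backup would block here until replicated
            return key

d = GlobalDictionary()
assert d.get_or_assign("india") == 0
assert d.get_or_assign("usa") == 1
assert d.get_or_assign("india") == 0   # existing values keep their keys
```

The double-checked lookup is the point: concurrent loaders on different nodes must never assign two keys to the same value, or the data cannot be decoded back correctly.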

Thanks,
Ravi.

On 18 October 2016 at 00:08, Jihong Ma  wrote:

> Hi Ravi,
>
> I took a quick look at Hazlecast, what they offer is a distributed map
> across cluster (on any single node only portion of the map is stored), to
> facilitate parallel data loading I think we need a complete copy on each
> node, is this the structure we are looking for?
>
> it does allow map in-memory backup in case one node goes down, to ensure
> its persistency, they allow storing map to db, but requires implementing
> their API to hook them up, there are async/ sync mode supported with no
> guarantee in terms of consistency, unless going further for a transaction
> support, 2-phase commit/XA are offered with read-committed isolation, to
> achieve that is quite complicated when we need to ensure ACID on changes to
> the map. I suggest you to investigate further to understand the implication
> and effort.
>
> We all understand We couldn't afford any inconsistency on dictionary, that
> means we couldn't decode the data back correctly. correctness is even more
> critical compared to performance.
>
>
> Jihong
>
> -Original Message-
> From: Ravindra Pesala [mailto:ravi.pes...@gmail.com]
> Sent: Saturday, October 15, 2016 12:50 AM
> To: dev
> Subject: Re: Discussion(New feature) regarding single pass data loading
> solution.
>
> Hi Jacky/Jihong,
>
> I agree that new dictionary values are less in case of incremental data
> load but that is completely depends on user data scenarios.  In some
> user scenarios new dictionary values may be more we cannot overrule that.
> And also for users convenience we should provide single pass solution with
> out insisting them to run external tool first. We can provide the option to
> run external tool first and provide dictionary to improve performance.
>
> My opinion is better to use some professional distributed map like
> Hazlecast than Zookeeper + HDFS.  It is lite weight and does not require to
> have separate cluster, it can form the cluster within the executor jvm's .
> May be we can have a try, after all it will be just one interface
> implementation for dictionary generation. We can have multiple
> implementations and then decide based on optimal performance.
>
> Regards,
> Ravi
>
> On 15 October 2016 at 10:50, Jacky Li  wrote:
>
> > Hi,
> >
> > I can offer one more approach for this discussion, since new dictionary
> > values are rare in case of incremental load (ensure first load having as
> > much dictionary value as possible), so synchronization should be rare. So
> > how about using Zookeeper + HDFS file to provide this service. This is
> what
> > carbon is doing today, we can wrap Zookeeper + HDFS to provide the global
> > dictionary interface.
> > It has the benefit of
> > 1. automated: without bordering the user
> > 2. not introducing more dependency: we already using zookeeper and HDFS.
> > 3. performance? since new dictionary value and synchronization is rare.
> >
> > What do you think?
> >
> > Regards,
> > Jacky
> >
> > > 在 2016年10月15日,上午2:38,Jihong Ma  写道:
> > >
> > > Hi Ravi,
> > >
> > > The major concern I have for generating global dictionary from scratch
> > with a single scan is performance, the way to handle an occasional update
> > to the dictionary is way simpler and cost effective in terms of
> > synchronization cost and refresh the global/local cache copy.
> > >
> > > There are a lot to worry about for distributed map, and leveraging KV
> > store is overkill if simply just for dictionary generation.
> > >
> > > Regards.
> > >
> > > Jihong
> > >
> > > -Original Message-
> > > From: Ravindra Pesala [mailto:ravi.pes...@gmail.com]
> > > Sent: Friday, October 14, 2016 11:03 AM
> > > To: dev
> > > Subject: Re: Discussion(New feature) regarding single pass data loading
> > solution.
> > >
> > > Hi Jihong,
> > >
> > > I agree, we can use external tool for first load, but for incremental
> > load
> > > we should have solution to add global dictionary. So this solution
> should
> > > be enough to generate global dictionary even if user does no

[jira] [Created] (CARBONDATA-322) integrate spark 2.x

2016-10-18 Thread Fei Wang (JIRA)
Fei Wang created CARBONDATA-322:
---

 Summary: integrate spark 2.x 
 Key: CARBONDATA-322
 URL: https://issues.apache.org/jira/browse/CARBONDATA-322
 Project: CarbonData
  Issue Type: Bug
  Components: spark-integration
Affects Versions: 0.2.0-incubating
Reporter: Fei Wang
 Fix For: 0.3.0-incubating


As Spark 2.0 has been released, there are many nice features such as a more 
efficient parser, vectorized execution, and adaptive execution. It is good to 
integrate with Spark 2.x.

On the other side, the current Spark integration is tightly coupled with Spark, 
so we should redesign it. It should satisfy the following requirements:

1. Decoupled from Spark; integrate according to the Spark DataSource API (V2).
2. The integration should support a vectorized Carbon reader.
3. Support writing to CarbonData from a DataFrame.
...



