Re: Create table with columns contains spaces in name.
Thanks Ravi, I will raise it on Jira. -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Create-table-with-columns-contains-spaces-in-name-tp2030p2035.html Sent from the Apache CarbonData Mailing List archive at Nabble.com.
Re: Create table with columns contains spaces in name.
Probably it is a bug.
[jira] [Created] (CARBONDATA-325) Create table with columns contains spaces in name.
Harmeet Singh created CARBONDATA-325:

Summary: Create table with columns contains spaces in name.
Key: CARBONDATA-325
URL: https://issues.apache.org/jira/browse/CARBONDATA-325
Project: CarbonData
Issue Type: Bug
Reporter: Harmeet Singh

I want to create a table using column names that contain spaces. I am using the Thrift Server and the Beeline client to access CarbonData. Whenever I try to create a table whose column names contain spaces, I get an error. Steps:

Step 1: create table three (`first name` string, `age` int) stored by 'carbondata';

Executing the above query produces the following error:

Error: org.apache.carbondata.spark.exception.MalformedCarbonCommandException: Unsupported data type : FieldSchema(name:first name, type:string, comment:null).getType (state=,code=0)

The error message suggests that an unsupported data type is being used. If I remove `stored by 'carbondata'` from the query, it works fine because it then runs on Hive.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] incubator-carbondata pull request #151: [CARBONDATA-210] Support BZIP2 compr...
Github user asfgit closed the pull request at: https://github.com/apache/incubator-carbondata/pull/151
Create table with columns contains spaces in name.
I want to create a table using column names that contain spaces. I am using the Thrift Server and the Beeline client to access CarbonData. Whenever I try to create a table whose column names contain spaces, I get an error. Steps:

Step 1: create table three (`first name` string, `age` int) stored by 'carbondata';

Executing the above query produces the following error:

*Error: org.apache.carbondata.spark.exception.MalformedCarbonCommandException: Unsupported data type : FieldSchema(name:first name, type:string, comment:null).getType (state=,code=0)*

The error message suggests that an unsupported data type is being used. If I remove `stored by 'carbondata'` from the query, it works fine. Please confirm: is this a CarbonData issue, or does CarbonData simply not support column names with spaces?
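Until CarbonData supports spaces in column names, one practical workaround is to normalize the names before issuing the DDL. The helper names below are hypothetical, a minimal sketch of that idea, not part of CarbonData:

```python
import re

def sanitize_identifier(name):
    # Strip surrounding backticks and collapse whitespace runs into
    # underscores, yielding an identifier without spaces.
    return re.sub(r"\s+", "_", name.strip().strip("`"))

def build_create_table(table, columns):
    # Render a CREATE TABLE ... STORED BY 'carbondata' statement from
    # (name, type) pairs, sanitizing each column name first.
    cols = ", ".join(
        "`%s` %s" % (sanitize_identifier(n), t) for n, t in columns
    )
    return "create table %s (%s) stored by 'carbondata'" % (table, cols)

print(build_create_table("three", [("first name", "string"), ("age", "int")]))
# create table three (`first_name` string, `age` int) stored by 'carbondata'
```

The renamed columns then load through the CarbonData path without tripping the FieldSchema type check quoted above.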
[jira] [Created] (CARBONDATA-324) Decimal and Bigint type columns contains Null, after load data
Harmeet Singh created CARBONDATA-324:

Summary: Decimal and Bigint type columns contains Null, after load data
Key: CARBONDATA-324
URL: https://issues.apache.org/jira/browse/CARBONDATA-324
Project: CarbonData
Issue Type: Bug
Reporter: Harmeet Singh
[jira] [Created] (CARBONDATA-323) Fix the load data local syntax
Fei Wang created CARBONDATA-323:

Summary: Fix the load data local syntax
Key: CARBONDATA-323
URL: https://issues.apache.org/jira/browse/CARBONDATA-323
Project: CarbonData
Issue Type: Bug
Components: spark-integration
Affects Versions: 0.1.0-incubating
Reporter: Fei Wang
Assignee: Fei Wang
Fix For: 0.2.0-incubating

Carbon should not support the LOAD DATA LOCAL syntax, so fix it.
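The fix described above amounts to rejecting the LOCAL variant at statement-validation time. The exception name mirrors the one CarbonData raises elsewhere, but this validator and its wording are a hypothetical sketch, not the actual patch:

```python
import re

class MalformedCarbonCommandException(Exception):
    pass

def validate_load_statement(sql):
    # Reject LOAD DATA LOCAL INPATH, which Carbon should not accept;
    # a plain LOAD DATA INPATH statement passes through unchanged.
    if re.match(r"\s*load\s+data\s+local\s+inpath", sql, re.IGNORECASE):
        raise MalformedCarbonCommandException(
            "LOAD DATA LOCAL is not supported; use LOAD DATA INPATH "
            "with a path visible to the cluster")
    return sql

validate_load_statement("LOAD DATA INPATH 'hdfs:///data/t.csv' INTO TABLE t")  # ok
```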
RE: Discussion(New feature) regarding single pass data loading solution.
Hi Ravi,

Making the in-memory backup copy consistent is only part of the story; we also need an on-disk backup (through the DBs they support) to be consistent with the in-memory copy. How do we achieve that? Probably the safest way is leveraging their transaction support. Please look into what they can and cannot do, as well as the amount of effort/complexity required.

Jihong

-----Original Message-----
From: Ravindra Pesala [mailto:ravi.pes...@gmail.com]
Sent: Tuesday, October 18, 2016 7:21 AM
To: dev
Subject: Re: Discussion(New feature) regarding single pass data loading solution.

Hi Jihong,

Yes, Hazelcast maintains only part of the data on each node because it splits the map into partitions and allocates partition ownership to the nodes. But if the requested data is not present on a node, it can still fetch it from another partition in the cluster. In any case, we can maintain a local cache of the full data on each node, look up Hazelcast only when a key is missing from the local cache, and update the local cache once the value is retrieved. Yes, it allows backing up data on multiple nodes, as configured, for high availability. Backup is done in sync or async mode, and consistency is guaranteed with sync-mode backup, because putting a key-value pair into a Hazelcast map blocks the call until it is copied to all the backup nodes in memory. Hazelcast maps also support locks to ensure data consistency; we can use APIs such as map.putIfAbsent or the map.lock and map.unlock features.

Thanks,
Ravi.

On 18 October 2016 at 00:08, Jihong Ma wrote: > Hi Ravi, > > I took a quick look at Hazlecast, what they offer is a distributed map > across cluster (on any single node only portion of the map is stored), to > facilitate parallel data loading I think we need a complete copy on each > node, is this the structure we are looking for? 
> > it does allow map in-memory backup in case one node goes down, to ensure > its persistency, they allow storing map to db, but requires implementing > their API to hook them up, there are async/ sync mode supported with no > guarantee in terms of consistency, unless going further for a transaction > support, 2-phase commit/XA are offered with read-committed isolation, to > achieve that is quite complicated when we need to ensure ACID on changes to > the map. I suggest you to investigate further to understand the implication > and effort. > > We all understand We couldn't afford any inconsistency on dictionary, that > means we couldn't decode the data back correctly. correctness is even more > critical compared to performance. > > > Jihong > > -Original Message- > From: Ravindra Pesala [mailto:ravi.pes...@gmail.com] > Sent: Saturday, October 15, 2016 12:50 AM > To: dev > Subject: Re: Discussion(New feature) regarding single pass data loading > solution. > > Hi Jacky/Jihong, > > I agree that new dictionary values are less in case of incremental data > load but that is completely depends on user data scenarios. In some > user scenarios new dictionary values may be more we cannot overrule that. > And also for users convenience we should provide single pass solution with > out insisting them to run external tool first. We can provide the option to > run external tool first and provide dictionary to improve performance. > > My opinion is better to use some professional distributed map like > Hazlecast than Zookeeper + HDFS. It is lite weight and does not require to > have separate cluster, it can form the cluster within the executor jvm's . > May be we can have a try, after all it will be just one interface > implementation for dictionary generation. We can have multiple > implementations and then decide based on optimal performance. 
> > Regards, > Ravi > > On 15 October 2016 at 10:50, Jacky Li wrote: > > > Hi, > > > > I can offer one more approach for this discussion, since new dictionary > > values are rare in case of incremental load (ensure first load having as > > much dictionary value as possible), so synchronization should be rare. So > > how about using Zookeeper + HDFS file to provide this service. This is > what > > carbon is doing today, we can wrap Zookeeper + HDFS to provide the global > > dictionary interface. > > It has the benefit of > > 1. automated: without bordering the user > > 2. not introducing more dependency: we already using zookeeper and HDFS. > > 3. performance? since new dictionary value and synchronization is rare. > > > > What do you think? > > > > Regards, > > Jacky > > > > > 在 2016年10月15日,上午2:38,Jihong Ma 写道: > > > > > > Hi Ravi, > > > > > > The major concern I have for generating global dictionary from scratch > > with a single scan is performance, the way to handle an occasional update > > to the dictionary is way simpler and cost effective in terms of > > synchronization cost and refresh the global/local cache copy. > > > > > > There are a lot to worry about for distributed map, and leveraging KV > > store is overkill if
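The put-if-absent pattern discussed in this thread assigns each distinct column value a surrogate key exactly once, no matter how many loaders race on it. Hazelcast is a Java library, so the sketch below is only a single-process Python stand-in: a lock-protected map plays the role of the distributed map, and the class and method names are made up for illustration:

```python
import itertools
import threading

class GlobalDictionary:
    # Single-process stand-in for a Hazelcast-style distributed map:
    # each distinct value gets a surrogate id exactly once, via
    # put-if-absent semantics under a lock.
    def __init__(self):
        self._lock = threading.Lock()
        self._map = {}
        self._next_id = itertools.count(1)

    def get_or_assign(self, value):
        # In Hazelcast, map.putIfAbsent provides this atomicity
        # across the cluster; here a local lock does the same job.
        with self._lock:
            if value not in self._map:
                self._map[value] = next(self._next_id)
            return self._map[value]

d = GlobalDictionary()
print([d.get_or_assign(v) for v in ["red", "blue", "red"]])  # [1, 2, 1]
```

Repeated values always come back with their original id, which is the consistency property the dictionary encode/decode path depends on.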
Re: Discussion(New feature) regarding single pass data loading solution.
Hi Jihong,

Yes, Hazelcast maintains only part of the data on each node because it splits the map into partitions and allocates partition ownership to the nodes. But if the requested data is not present on a node, it can still fetch it from another partition in the cluster. In any case, we can maintain a local cache of the full data on each node, look up Hazelcast only when a key is missing from the local cache, and update the local cache once the value is retrieved. Yes, it allows backing up data on multiple nodes, as configured, for high availability. Backup is done in sync or async mode, and consistency is guaranteed with sync-mode backup, because putting a key-value pair into a Hazelcast map blocks the call until it is copied to all the backup nodes in memory. Hazelcast maps also support locks to ensure data consistency; we can use APIs such as map.putIfAbsent or the map.lock and map.unlock features.

Thanks,
Ravi.

On 18 October 2016 at 00:08, Jihong Ma wrote: > Hi Ravi, > > I took a quick look at Hazlecast, what they offer is a distributed map > across cluster (on any single node only portion of the map is stored), to > facilitate parallel data loading I think we need a complete copy on each > node, is this the structure we are looking for? > > it does allow map in-memory backup in case one node goes down, to ensure > its persistency, they allow storing map to db, but requires implementing > their API to hook them up, there are async/ sync mode supported with no > guarantee in terms of consistency, unless going further for a transaction > support, 2-phase commit/XA are offered with read-committed isolation, to > achieve that is quite complicated when we need to ensure ACID on changes to > the map. I suggest you to investigate further to understand the implication > and effort. > > We all understand We couldn't afford any inconsistency on dictionary, that > means we couldn't decode the data back correctly. 
correctness is even more > critical compared to performance. > > > Jihong > > -Original Message- > From: Ravindra Pesala [mailto:ravi.pes...@gmail.com] > Sent: Saturday, October 15, 2016 12:50 AM > To: dev > Subject: Re: Discussion(New feature) regarding single pass data loading > solution. > > Hi Jacky/Jihong, > > I agree that new dictionary values are less in case of incremental data > load but that is completely depends on user data scenarios. In some > user scenarios new dictionary values may be more we cannot overrule that. > And also for users convenience we should provide single pass solution with > out insisting them to run external tool first. We can provide the option to > run external tool first and provide dictionary to improve performance. > > My opinion is better to use some professional distributed map like > Hazlecast than Zookeeper + HDFS. It is lite weight and does not require to > have separate cluster, it can form the cluster within the executor jvm's . > May be we can have a try, after all it will be just one interface > implementation for dictionary generation. We can have multiple > implementations and then decide based on optimal performance. > > Regards, > Ravi > > On 15 October 2016 at 10:50, Jacky Li wrote: > > > Hi, > > > > I can offer one more approach for this discussion, since new dictionary > > values are rare in case of incremental load (ensure first load having as > > much dictionary value as possible), so synchronization should be rare. So > > how about using Zookeeper + HDFS file to provide this service. This is > what > > carbon is doing today, we can wrap Zookeeper + HDFS to provide the global > > dictionary interface. > > It has the benefit of > > 1. automated: without bordering the user > > 2. not introducing more dependency: we already using zookeeper and HDFS. > > 3. performance? since new dictionary value and synchronization is rare. > > > > What do you think? 
> > > > Regards, > > Jacky > > > > > 在 2016年10月15日,上午2:38,Jihong Ma 写道: > > > > > > Hi Ravi, > > > > > > The major concern I have for generating global dictionary from scratch > > with a single scan is performance, the way to handle an occasional update > > to the dictionary is way simpler and cost effective in terms of > > synchronization cost and refresh the global/local cache copy. > > > > > > There are a lot to worry about for distributed map, and leveraging KV > > store is overkill if simply just for dictionary generation. > > > > > > Regards. > > > > > > Jihong > > > > > > -Original Message- > > > From: Ravindra Pesala [mailto:ravi.pes...@gmail.com] > > > Sent: Friday, October 14, 2016 11:03 AM > > > To: dev > > > Subject: Re: Discussion(New feature) regarding single pass data loading > > solution. > > > > > > Hi Jihong, > > > > > > I agree, we can use external tool for first load, but for incremental > > load > > > we should have solution to add global dictionary. So this solution > should > > > be enough to generate global dictionary even if user does no
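Jacky's Zookeeper + HDFS alternative quoted above serializes dictionary updates behind a coarse distributed lock and appends only new values to a shared file, so existing surrogate ids never change. The sketch below is a local simulation under stated assumptions: a threading lock stands in for the Zookeeper lock, a local file for the HDFS dictionary file, and the function name is invented for illustration:

```python
import os
import tempfile
import threading

# Stand-in for a Zookeeper distributed lock guarding the dictionary file.
_zk_lock = threading.Lock()

def append_new_dictionary_values(dict_file, incoming):
    # Under the lock: reload the on-disk dictionary, append only the
    # values not already present, and return the merged value -> id map.
    # Ids are 1-based line positions, so existing ids are stable.
    with _zk_lock:
        existing = []
        if os.path.exists(dict_file):
            with open(dict_file) as f:
                existing = [line.rstrip("\n") for line in f]
        known = set(existing)
        to_add = []
        for value in incoming:
            if value not in known:
                known.add(value)
                to_add.append(value)
        with open(dict_file, "a") as f:
            for value in to_add:
                f.write(value + "\n")
        merged = existing + to_add
        return {value: i + 1 for i, value in enumerate(merged)}

path = os.path.join(tempfile.mkdtemp(), "dictionary.txt")
print(append_new_dictionary_values(path, ["red", "blue"]))    # {'red': 1, 'blue': 2}
print(append_new_dictionary_values(path, ["blue", "green"]))  # {'red': 1, 'blue': 2, 'green': 3}
```

Because new values are rare in incremental loads, the lock is taken infrequently, which is the performance argument made in the thread.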
[jira] [Created] (CARBONDATA-322) integrate spark 2.x
Fei Wang created CARBONDATA-322:

Summary: integrate spark 2.x
Key: CARBONDATA-322
URL: https://issues.apache.org/jira/browse/CARBONDATA-322
Project: CarbonData
Issue Type: Bug
Components: spark-integration
Affects Versions: 0.2.0-incubating
Reporter: Fei Wang
Fix For: 0.3.0-incubating

Spark 2.0 has been released, bringing many nice features such as a more efficient parser, vectorized execution, and adaptive execution, so it would be good to integrate with Spark 2.x. On another note, the current Spark integration is tightly coupled with Spark; we should redesign it to satisfy the following requirements:

1. Decoupled from Spark, integrating through the Spark DataSource API (V2)
2. The integration should support a vectorized Carbon reader
3. Support writing to CarbonData from a DataFrame
...