[RESULT][VOTE] Apache CarbonData 0.1.1-incubating release
Hi all,

Finally, this vote passed with the following result:

+1 (binding): Justin Mclean, Henry Saputra, Julian Hyde, Uma Gangumalla

Thanks all for your votes.

Regards
Liang
[jira] [Created] (CARBONDATA-289) Support MB/M for table block size and update the doc about this new feature.
zhangshunyu created CARBONDATA-289:
-----------------------------------

Summary: Support MB/M for table block size and update the doc about this new feature.
Key: CARBONDATA-289
URL: https://issues.apache.org/jira/browse/CARBONDATA-289
Project: CarbonData
Issue Type: Bug
Components: spark-integration
Affects Versions: 0.1.0-incubating
Reporter: zhangshunyu
Assignee: zhangshunyu
Priority: Minor
Fix For: 0.2.0-incubating

Support MB/M for table block size and update the doc about this new feature.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] incubator-carbondata pull request #219: [WIP] [CARBONDATA-37]Support differe...
GitHub user lion-x opened a pull request:

https://github.com/apache/incubator-carbondata/pull/219

[WIP] [CARBONDATA-37] Support different time format input style

# Why raise this PR?
Support different time-format input styles. In some scenarios, different time dimensions may use different time formats; we should support these requirements.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lion-x/incubator-carbondata timeformat

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/219.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #219

commit 15d1e640f96b15411fb39eaf7208b3b656ebda0a
Author: X-Lion
Date: 2016-09-29T14:33:18Z

    Lionx0929

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
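The idea behind this PR can be illustrated with a small sketch (the class and method names below are illustrative only, not the actual CarbonData API from PR #219): each time dimension carries its own format pattern instead of all dimensions sharing one global pattern.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative sketch only: parse a time value with a per-dimension pattern.
// The class and method names here are invented for demonstration.
public class PerDimensionTimeParser {
    public static Date parse(String value, String pattern) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern);
        fmt.setLenient(false); // reject values that do not match the pattern exactly
        return fmt.parse(value);
    }
}
```

With this shape, one dimension can use "yyyy-MM-dd" while another uses "dd/MM/yyyy HH:mm" in the same load.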
[GitHub] incubator-carbondata pull request #218: [CARBONDATA-288] In hdfs bad record ...
GitHub user mohammadshahidkhan opened a pull request:

https://github.com/apache/incubator-carbondata/pull/218

[CARBONDATA-288] In HDFS, the bad record logger fails to write the bad records to the log file

**Problem**
For the HDFS file system:

    CarbonFile logFile = FileFactory.getCarbonFile(filePath, FileType.HDFS);

If filePath does not exist, calling CarbonFile.getPath() throws a NullPointerException.

**Solution:**
If the file does not exist, it must be created before it is accessed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mohammadshahidkhan/incubator-carbondata badrecord_log_file_writting_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/218.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #218

commit 6a9ca5a78bc128f1e4f8c37164fc81dec6b70894
Author: mohammadshahidkhan
Date: 2016-10-08T20:54:57Z

    [CARBONDATA-288] In hdfs bad record logger is failing in writting the bad records
[jira] [Created] (CARBONDATA-288) In HDFS, the bad record logger fails to write the bad records
Mohammad Shahid Khan created CARBONDATA-288:
-------------------------------------------

Summary: In HDFS, the bad record logger fails to write the bad records
Key: CARBONDATA-288
URL: https://issues.apache.org/jira/browse/CARBONDATA-288
Project: CarbonData
Issue Type: Bug
Reporter: Mohammad Shahid Khan
Assignee: Mohammad Shahid Khan

For the HDFS file system:

    CarbonFile logFile = FileFactory.getCarbonFile(filePath, FileType.HDFS);

If filePath does not exist, calling CarbonFile.getPath() throws a NullPointerException.

Solution: If the file does not exist, it must be created first.
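The proposed fix amounts to a create-before-access check. A minimal sketch using plain java.io (illustrative only; the actual patch goes through CarbonFile/FileFactory, whose API is not reproduced here):

```java
import java.io.File;
import java.io.IOException;

// Sketch of the fix: make sure the bad-record log file exists before any
// code asks for its path, so a later getPath() can never hit a missing file.
public class BadRecordLogFileFix {
    public static File ensureLogFile(String filePath) throws IOException {
        File logFile = new File(filePath);
        if (!logFile.exists()) {
            File parent = logFile.getParentFile();
            if (parent != null) {
                parent.mkdirs(); // create missing parent directories first
            }
            logFile.createNewFile(); // then create the (empty) log file itself
        }
        return logFile;
    }
}
```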
[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505613

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonInputFormatBase.java ---
@@ -0,0 +1,69 @@
+/*
--- End diff --

Yes, it is like that only.
[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505582

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/StreamingSegment.java ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.hadoop.internal.segment;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
+import org.apache.carbondata.scan.model.QueryModel;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+
+public class StreamingSegment extends Segment {
--- End diff --

If I understand the comment correctly, the answer is that all segments are handled uniformly in `CarbonInputFormatBase`; however, the internal read implementation of this segment is different from `IndexedSegment`: it uses the row input format to read. Is this the question?
[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505527

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/internal/segment/IndexedSegment.java ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.hadoop.internal.segment;
+
+import java.io.IOException;
+import java.util.LinkedList;
+import java.util.List;
+
+import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
+import org.apache.carbondata.hadoop.internal.index.Index;
+import org.apache.carbondata.hadoop.internal.index.IndexLoader;
+import org.apache.carbondata.scan.filter.resolver.FilterResolverIntf;
+import org.apache.carbondata.scan.model.QueryModel;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.JobContext;
+
+public class IndexedSegment extends Segment {
--- End diff --

Do you mean a non-indexed and non-streaming segment? Yes, I think we can have a segment like that; reading such a segment would then be a pure scan, without any index. But what is the benefit of doing that?

I think the real issue is how to let the optimizer decide when to use the index and when not to. I think I will create another `Estimator` interface to expose cost information to the optimizer. What do you think?
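One possible shape for the `Estimator` idea mentioned in that comment (purely hypothetical; no such interface exists in the PR at this point): each segment exposes rough cost figures so the optimizer can compare an index-assisted scan against a plain full scan.

```java
// Hypothetical sketch of the proposed Estimator interface. The names and
// signatures are invented to illustrate the design discussion above.
public interface Estimator {
    // Estimated number of rows touched when the filter goes through the index.
    long estimateIndexedScanRows(Object filter);

    // Estimated number of rows touched by a full scan of the segment.
    long estimateFullScanRows();
}
```

The optimizer would then pick the indexed path only when its estimate is meaningfully lower than the full-scan estimate.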
[GitHub] incubator-carbondata pull request #208: [CARBONDATA-284][WIP] Abstracting in...
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/208#discussion_r82505442

--- Diff: hadoop/src/main/java/org/apache/carbondata/hadoop/api/row/CarbonRowInputFormat.java ---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.hadoop.api.row;
+
+import java.io.IOException;
+
+import org.apache.carbondata.hadoop.api.CarbonInputFormatBase;
+import org.apache.carbondata.hadoop.internal.segment.SegmentManager;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+
+public class CarbonRowInputFormat extends CarbonInputFormatBase {
--- End diff --

The row format is for streaming ingest. I can remove it now and submit it later as part of the streaming-ingest feature.
[GitHub] incubator-carbondata pull request #198: [CARBONDATA-273]Using carbon common ...
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/198
Re: Discussion about using multiple local directories to improve data loading performance
Yes, I think it is a good feature to have. Please feel free to create a JIRA issue and pull request.

Regards,
Jacky

> On 9 Oct 2016, at 12:04 AM, caiqiang wrote:
>
> Hi All,
> For each data load, we write the sorted temp files into only one local
> directory. I think this is a bottleneck of data loading. It is necessary
> to use multiple local directories on multiple disks for each data load to
> improve data loading performance.
[GitHub] incubator-carbondata pull request #216: Update DOC about table level blocksi...
Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/216#discussion_r82504992

--- Diff: docs/DDL-Operations-on-Carbon.md ---
@@ -67,6 +67,14 @@ Here, DICTIONARY_EXCLUDE will exclude dictionary creation. This is applicable fo
 ```ruby
 TBLPROPERTIES ("COLUMN_GROUPS"="(column1,column3),(Column4,Column5,Column6)")
 ```
+  - **Table Block Size Configuration**
+
+    The block size of a table's files on HDFS can be defined using an int value in MB. The range is from 1MB to 2048MB, and the default value is 1024MB; if the user does not define this value in the DDL, the default is used.
+
+  ```ruby
+  TBLPROPERTIES ("TABLE_BLOCKSIZE"="512")
--- End diff --

I think it is better to have the MB inside the value; it is clearer to the user. For example: `TBLPROPERTIES ("TABLE_BLOCKSIZE"="512MB")`. And we can also add support for GB in the future, if required. Can you modify the code accordingly?
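The reviewer's suggestion (accept "512MB" rather than a bare "512") could be parsed along these lines. This is a hypothetical helper, not the code under review; the 1-2048 MB range comes from the documentation text in the diff above.

```java
// Illustrative parser for TABLE_BLOCKSIZE values such as "512", "512M" or
// "512MB", enforcing the documented 1-2048 MB range. Hypothetical helper,
// not CarbonData's actual implementation.
public class TableBlockSize {
    public static int parseMb(String value) {
        String v = value.trim().toUpperCase();
        if (v.endsWith("MB")) {
            v = v.substring(0, v.length() - 2).trim();
        } else if (v.endsWith("M")) {
            v = v.substring(0, v.length() - 1).trim();
        }
        int mb = Integer.parseInt(v);
        if (mb < 1 || mb > 2048) {
            throw new IllegalArgumentException(
                "TABLE_BLOCKSIZE must be between 1 and 2048 MB, got " + mb);
        }
        return mb;
    }
}
```

Keeping the unit inside the value also leaves room for a "GB" suffix later, as the review suggests.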
Discussion about using multiple local directories to improve data loading performance
Hi All,

For each data load, we write the sorted temp files into only one local directory. I think this is a bottleneck of data loading. It is necessary to use multiple local directories on multiple disks for each data load to improve data loading performance.
[GitHub] incubator-carbondata pull request #217: [CARBONDATA-287]Using multi local di...
GitHub user QiangCai opened a pull request:

https://github.com/apache/incubator-carbondata/pull/217

[CARBONDATA-287] Using multiple local directories to improve data loading performance

**1 Analysis**
Now, for each data load, we use only one local directory to save the sorted temp files. I think it is necessary to use multiple local directories on multiple disks for each data load to improve data loading performance.

**2 Solution**
Modify the SortDataRows class to spread the sorted temp files across different local directories on different disks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/QiangCai/incubator-carbondata localDirsWorkLoadBalance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/217.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #217

commit 4d0a276bb7283dc0683efe9f3684fdff4cadf729
Author: c00318382
Date: 2016-10-08T12:54:32Z

    localdirs

commit 41c027833c19c3f3331470f48759f469159eb59c
Author: QiangCai
Date: 2016-10-08T15:26:39Z

    clean temp folder
[jira] [Created] (CARBONDATA-287) Save the sorted temp files to multiple local dirs to improve data loading performance
QiangCai created CARBONDATA-287:
--------------------------------

Summary: Save the sorted temp files to multiple local dirs to improve data loading performance
Key: CARBONDATA-287
URL: https://issues.apache.org/jira/browse/CARBONDATA-287
Project: CarbonData
Issue Type: Improvement
Components: data-load
Affects Versions: 0.2.0-incubating
Reporter: QiangCai
Assignee: QiangCai
Priority: Minor
Fix For: 0.2.0-incubating

Now, for each data load, we use only one local dir to save the sorted temp files. I think it is necessary to use multiple local dirs for each data load to improve data loading performance.
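The load-balancing idea in CARBONDATA-287 reduces to handing successive temp files to successive directories. A minimal round-robin sketch (illustrative only; the actual change lives inside the SortDataRows class in PR #217):

```java
import java.util.concurrent.atomic.AtomicLong;

// Round-robin assignment of sorted temp files across several local
// directories, so writes spread over all underlying disks. Sketch only;
// the class and method names are invented.
public class TempFileLocator {
    private final String[] localDirs;
    private final AtomicLong counter = new AtomicLong();

    public TempFileLocator(String[] localDirs) {
        this.localDirs = localDirs;
    }

    // Each call picks the next directory in turn; AtomicLong keeps the
    // rotation consistent when several writer threads request paths.
    public String nextTempFilePath(String fileName) {
        int idx = (int) (counter.getAndIncrement() % localDirs.length);
        return localDirs[idx] + "/" + fileName;
    }
}
```

With two directories on two disks, alternate temp files land on alternate disks, roughly doubling available write bandwidth for the sort phase.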
Re: Discussion regarding design of data load after kettle removal.
Hi Ravi,

We can move the mdkey generation step before sorting; this will compress the dictionary data and reduce the IO.

-Regards
Kumar Vishal

On Sat, Oct 8, 2016 at 3:30 PM, Ravindra Pesala wrote:
> Hi All,
>
> Removing kettle from carbondata is necessary, as this legacy kettle
> framework has become an overhead to carbondata. This discussion is
> regarding the design of carbon load without kettle.
>
> The main interface for data loading here is DataLoadProcessorStep.
Discussion regarding design of data load after kettle removal.
Hi All,

Removing kettle from carbondata is necessary, as this legacy kettle framework has become an overhead to carbondata. This discussion is regarding the design of carbon load without kettle.

The main interface for data loading here is DataLoadProcessorStep:

    /**
     * This base interface for data loading. It can do transformation jobs as
     * per the implementation.
     */
    public interface DataLoadProcessorStep {

      /**
       * The output meta for this step. The data returned from this step is as
       * per this meta.
       * @return
       */
      DataField[] getOutput();

      /**
       * Initialization process for this step.
       * @param configuration
       * @param child
       * @throws CarbonDataLoadingException
       */
      void initialize(CarbonDataLoadConfiguration configuration,
          DataLoadProcessorStep child) throws CarbonDataLoadingException;

      /**
       * Transform the data as per the implementation.
       * @return Iterator of data
       * @throws CarbonDataLoadingException
       */
      Iterator
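To make the pipeline idea concrete, here is a greatly simplified sketch of how such steps could chain. This is not the real DataLoadProcessorStep signature (which is truncated above); the row type and step names are invented. Each step wraps its child's iterator and transforms rows lazily as they flow through.

```java
import java.util.Iterator;

// Simplified chain-of-steps sketch: a source step yields rows, a transform
// step wraps it and rewrites each row as it is pulled. Names are invented.
interface SimpleLoadStep {
    Iterator<String[]> execute();
}

class SourceStep implements SimpleLoadStep {
    private final Iterator<String[]> rows;
    SourceStep(Iterator<String[]> rows) { this.rows = rows; }
    public Iterator<String[]> execute() { return rows; }
}

class UpperCaseStep implements SimpleLoadStep {
    private final SimpleLoadStep child;
    UpperCaseStep(SimpleLoadStep child) { this.child = child; }
    public Iterator<String[]> execute() {
        final Iterator<String[]> it = child.execute(); // pull from the child step
        return new Iterator<String[]>() {
            public boolean hasNext() { return it.hasNext(); }
            public String[] next() {
                String[] in = it.next();
                String[] out = new String[in.length];
                for (int i = 0; i < in.length; i++) {
                    out[i] = in[i].toUpperCase(); // the per-row transformation
                }
                return out;
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}
```

Because every step only wraps its child's iterator, the whole load stays streaming: no step materializes the full data set, which is the main point of replacing the kettle pipeline.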