[jira] [Commented] (CARBONDATA-306) block size info should be show in Desc Formatted and executor log

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574337#comment-15574337
 ] 

ASF GitHub Bot commented on CARBONDATA-306:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83361557
  
--- Diff: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala
 ---
@@ -1422,6 +1422,7 @@ private[sql] case class DescribeCommandFormatted(
results ++= Seq(("Table Name : ", relation.tableMeta.carbonTableIdentifier.getTableName, ""))
results ++= Seq(("CARBON Store Path : ", relation.tableMeta.storePath, ""))
val carbonTable = relation.tableMeta.carbonTable
+results ++= Seq(("Table Block Size : ", carbonTable.getBlocksize + " MB", ""))
--- End diff --

If so, can you change the corresponding variable name and function name to 
indicate the size is in MB, like `getBlockSizeInMB`, and add a comment to 
`CarbonCommonConstants.TABLE_BLOCKSIZE`?
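The rename the comment asks for can be sketched as follows. This is a minimal, hypothetical illustration; `CarbonTableInfo` and `describeRow` are made-up names, not the actual CarbonData API:

```java
/** Hypothetical table metadata holder illustrating the suggested rename. */
class CarbonTableInfo {
  private final int blockSizeInMB;  // unit is carried in the name, as the reviewer suggests

  CarbonTableInfo(int blockSizeInMB) {
    this.blockSizeInMB = blockSizeInMB;
  }

  /** Block size in megabytes; the method name makes the unit explicit. */
  int getBlockSizeInMB() {
    return blockSizeInMB;
  }

  /** Row as it would appear in the `desc formatted` output quoted above. */
  String describeRow() {
    return "Table Block Size : " + getBlockSizeInMB() + " MB";
  }
}
```

Encoding the unit in the accessor name avoids the ambiguity the reviewer points out: a bare `getBlocksize` could be bytes, KB, or MB.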


> block size info should be show in Desc Formatted and executor log
> -
>
> Key: CARBONDATA-306
> URL: https://issues.apache.org/jira/browse/CARBONDATA-306
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: Jay
>Priority: Minor
>
> When running the desc formatted command, the table block size should be
> shown, as well as in the executor log when running the load command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-310) Compilation failed when using spark 1.6.2

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574321#comment-15574321
 ] 

ASF GitHub Bot commented on CARBONDATA-310:
---

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/232


> Compilation failed when using spark 1.6.2
> -
>
> Key: CARBONDATA-310
> URL: https://issues.apache.org/jira/browse/CARBONDATA-310
> Project: CarbonData
>  Issue Type: Bug
>Reporter: Gin-zhj
>Assignee: Gin-zhj
>Priority: Minor
>
> Compilation failed when using Spark 1.6.2,
> caused by a class-not-found error: AggregateExpression





[jira] [Commented] (CARBONDATA-297) 2. Add interfaces for data loading.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574308#comment-15574308
 ] 

ASF GitHub Bot commented on CARBONDATA-297:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83354231
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/AbstractDataLoadProcessorStep.java
 ---
@@ -73,15 +72,15 @@ public AbstractDataLoadProcessorStep(CarbonDataLoadConfiguration configuration,
* Create the iterator using child iterator.
*
* @param childIter
-   * @return
+   * @return new iterator with step specific processing.
*/
-  protected Iterator getIterator(final Iterator childIter) {
-return new CarbonIterator() {
+  protected Iterator getIterator(final Iterator childIter) {
--- End diff --

In order to support batch conversion, it is better to use `Iterator` here instead.
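The batch conversion the reviewer refers to could be sketched as an iterator that wraps the child iterator and yields fixed-size batches of rows. This is a hypothetical sketch: the real step API and row types are simplified to plain generics here:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Wraps a child iterator and yields fixed-size batches, so each step can process a batch at a time. */
class BatchingIterator<T> implements Iterator<List<T>> {
  private final Iterator<T> child;
  private final int batchSize;

  BatchingIterator(Iterator<T> child, int batchSize) {
    this.child = child;
    this.batchSize = batchSize;
  }

  @Override public boolean hasNext() {
    return child.hasNext();
  }

  @Override public List<T> next() {
    // Drain up to batchSize elements from the child; the last batch may be smaller.
    List<T> batch = new ArrayList<>(batchSize);
    while (child.hasNext() && batch.size() < batchSize) {
      batch.add(child.next());
    }
    return batch;
  }
}
```

With this shape, a processor step converts whole batches instead of single rows, which amortizes per-row call overhead across the pipeline.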


> 2. Add interfaces for data loading.
> ---
>
> Key: CARBONDATA-297
> URL: https://issues.apache.org/jira/browse/CARBONDATA-297
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: Ravindra Pesala
> Fix For: 0.2.0-incubating
>
>
> Add the major interface classes for data loading so that the following jiras 
> can use this interfaces to implement it.





[jira] [Commented] (CARBONDATA-299) 4. Add dictionary generator interfaces and give implementation for pre created dictionary.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574306#comment-15574306
 ] 

ASF GitHub Bot commented on CARBONDATA-299:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/236#discussion_r83360819
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/dictionary/InMemBiDictionary.java
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.processing.newflow.dictionary;
+
+import java.util.Map;
+
+import org.apache.carbondata.core.devapi.DictionaryGenerator;
+import org.apache.carbondata.core.devapi.GeneratingBiDictionary;
+
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+
+public class InMemBiDictionary<K, V> extends GeneratingBiDictionary<K, V> {
--- End diff --

This `InMemBiDictionary` is for one column; when it comes to caching, I think 
the cached object is `BiDictionary`. Since the cache provides a KV API, we 
need a TableDictionary to make use of the cache. I am thinking of adding another 
class for the table dictionary, like 
```
CachedTableDictionary tableDict = new CachedTableDictionary(tableName)
key = tableDict.getOrGenerateKey(columnName, value)
```
Inside `CachedTableDictionary`, it will maintain a cache of 
`HashMap`.
I am also thinking of using 
[Cache](https://github.com/google/guava/wiki/CachesExplained) from Guava
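The `CachedTableDictionary` the reviewer proposes could look roughly like the sketch below. This is a hypothetical illustration built on a plain `ConcurrentHashMap`; in the reviewer's proposal Guava's `LoadingCache` would replace the map to gain size- or time-based eviction:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-table dictionary: one generating dictionary per column, kept in a map. */
class CachedTableDictionary {
  private final String tableName;
  // columnName -> (value -> surrogate key); a Guava Cache could replace the outer map
  // to add eviction, as suggested in the comment above.
  private final Map<String, Map<String, Integer>> columnDicts = new ConcurrentHashMap<>();
  private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

  CachedTableDictionary(String tableName) {
    this.tableName = tableName;
  }

  /** Returns the existing key for (column, value), generating a new key if absent. */
  int getOrGenerateKey(String columnName, String value) {
    Map<String, Integer> dict =
        columnDicts.computeIfAbsent(columnName, c -> new ConcurrentHashMap<>());
    AtomicInteger counter = counters.computeIfAbsent(columnName, c -> new AtomicInteger());
    return dict.computeIfAbsent(value, v -> counter.incrementAndGet());
  }
}
```

The table-level wrapper gives the cache a single KV entry point (`columnName`, `value`) while each per-column dictionary stays independent.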


> 4. Add dictionary generator interfaces and give implementation for pre 
> created dictionary.
> --
>
> Key: CARBONDATA-299
> URL: https://issues.apache.org/jira/browse/CARBONDATA-299
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: Jacky Li
> Fix For: 0.2.0-incubating
>
>
> Add dictionary generator interfaces and give implementation for pre-created 
> dictionary (which is generated separately).





[jira] [Commented] (CARBONDATA-299) 4. Add dictionary generator interfaces and give implementation for pre created dictionary.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574290#comment-15574290
 ] 

ASF GitHub Bot commented on CARBONDATA-299:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/236#discussion_r83360359
  
--- Diff: 
core/src/main/java/org/apache/carbondata/core/devapi/BiDictionary.java ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.core.devapi;
+
+public interface BiDictionary<K, V> {
--- End diff --

It means bidirectional; I borrowed the term from the Guava library. See 
[BiMap](https://github.com/google/guava/wiki/NewCollectionTypesExplained#bimap)
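A bidirectional dictionary supports lookups in both directions, key to value and value to key. A minimal two-`HashMap` sketch of the idea (hypothetical, not the CarbonData interface itself):

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal bidirectional dictionary: forward and reverse maps are kept in sync. */
class SimpleBiDictionary<K, V> {
  private final Map<K, V> forward = new HashMap<>();
  private final Map<V, K> reverse = new HashMap<>();

  void put(K key, V value) {
    // As with Guava's BiMap, values must be unique, otherwise the reverse lookup is ambiguous.
    if (reverse.containsKey(value)) {
      throw new IllegalArgumentException("value already present: " + value);
    }
    forward.put(key, value);
    reverse.put(value, key);
  }

  V getValue(K key) { return forward.get(key); }
  K getKey(V value) { return reverse.get(value); }
}
```

For dictionary encoding this is exactly the pair of lookups needed: value to surrogate key while loading, and surrogate key back to value while decoding query results.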



> 4. Add dictionary generator interfaces and give implementation for pre 
> created dictionary.
> --
>
> Key: CARBONDATA-299
> URL: https://issues.apache.org/jira/browse/CARBONDATA-299
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: Jacky Li
> Fix For: 0.2.0-incubating
>
>
> Add dictionary generator interfaces and give implementation for pre-created 
> dictionary (which is generated separately).





[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574279#comment-15574279
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83355842
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/io/StringArrayWritable.java 
---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.io;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Writable;
+
+/**
+ * A String sequence that is usable as a key or value.
+ */
+public class StringArrayWritable implements Writable {
+  private String[] values;
+
+  public String[] toStrings() {
+return values;
+  }
+
+  public void set(String[] values) {
+this.values = values;
+  }
+
+  public String[] get() {
+return values;
+  }
+
+  @Override public void readFields(DataInput in) throws IOException {
--- End diff --

`@Override` should be put on the previous line
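The `readFields`/`write` round-trip such a Writable implements can be illustrated with plain `DataOutput`/`DataInput` streams. This is a hedged sketch of the usual length-prefixed encoding, not the exact CarbonData implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Length-prefixed encoding of a String[], mirroring a typical Writable's write/readFields pair. */
class StringArrayCodec {

  /** Serialize as the write(DataOutput) side would: count first, then each string length-prefixed. */
  static byte[] write(String[] values) {
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(buf);
      out.writeInt(values.length);
      for (String s : values) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
      }
      return buf.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);  // cannot occur for in-memory streams
    }
  }

  /** Deserialize as readFields(DataInput) would, reading fields in the same order. */
  static String[] read(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      String[] values = new String[in.readInt()];
      for (int i = 0; i < values.length; i++) {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        values[i] = new String(bytes, StandardCharsets.UTF_8);
      }
      return values;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The invariant to preserve in any Writable is simply that `readFields` consumes fields in exactly the order `write` emitted them.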


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 





[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574281#comment-15574281
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83359593
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.InputFormat} for csv files. Files are broken into lines.
+ * Values are the lines of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat<NullWritable, StringArrayWritable> {
+
+  @Override
+  public RecordReader<NullWritable, StringArrayWritable> createRecordReader(InputSplit inputSplit,
+      TaskAttemptContext context) throws IOException, InterruptedException {
+    return new NewCSVRecordReader();
+  }
+
+  /**
+   * Treats value as line in file. Key is null.
+   */
+  public static class NewCSVRecordReader extends RecordReader<NullWritable, StringArrayWritable> {
--- End diff --

Why is it a static class? And you can rename it to `CSVRecordReader`


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 





[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574283#comment-15574283
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83359938
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/util/CSVInputFormatUtil.java 
---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.util;
+
+import com.univocity.parsers.csv.CsvParserSettings;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * CSVInputFormatUtil is a util class.
+ */
+public class CSVInputFormatUtil {
+
+  public static final String DELIMITER = "carbon.csvinputformat.delimiter";
+  public static final String DELIMITER_DEFAULT = ",";
+  public static final String COMMENT = "carbon.csvinputformat.comment";
+  public static final String COMMENT_DEFAULT = "#";
+  public static final String QUOTE = "carbon.csvinputformat.quote";
+  public static final String QUOTE_DEFAULT = "\"";
+  public static final String ESCAPE = "carbon.csvinputformat.escape";
+  public static final String ESCAPE_DEFAULT = "\\";
+  public static final String HEADER_PRESENT = "carbon.csvinputformat.header.present";
+  public static final boolean HEADER_PRESENT_DEFAULT = false;
+
+  public static CsvParserSettings extractCsvParserSettings(Configuration job, long start) {
--- End diff --

I think this class is not needed; move this function into the `CSVRecordReader` class.
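The extraction step this discussion is about, mapping configuration keys to parser settings, can be sketched with a plain `Map` standing in for Hadoop's `Configuration` and a simple holder standing in for Univocity's `CsvParserSettings`. This is hypothetical: only the key names and defaults quoted in the diff above are taken from the source, and the treatment of `start` (header only consumed by the first split) is an assumption:

```java
import java.util.Map;

/** Stand-in for the Univocity settings object, holding only the fields extracted below. */
class CsvSettings {
  char delimiter;
  char quote;
  char escape;
  char comment;
  boolean headerExtraction;
}

class CsvSettingsExtractor {
  /** Keys and defaults mirror the constants quoted in the diff above. */
  static CsvSettings extract(Map<String, String> conf, long start) {
    CsvSettings s = new CsvSettings();
    s.delimiter = conf.getOrDefault("carbon.csvinputformat.delimiter", ",").charAt(0);
    s.quote = conf.getOrDefault("carbon.csvinputformat.quote", "\"").charAt(0);
    s.escape = conf.getOrDefault("carbon.csvinputformat.escape", "\\").charAt(0);
    s.comment = conf.getOrDefault("carbon.csvinputformat.comment", "#").charAt(0);
    // Assumption: only the split that begins at byte 0 should consume the header row,
    // otherwise every mapper would skip its first data line.
    boolean headerPresent = Boolean.parseBoolean(
        conf.getOrDefault("carbon.csvinputformat.header.present", "false"));
    s.headerExtraction = headerPresent && start == 0;
    return s;
  }
}
```

Folding this helper into the record reader, as the reviewer suggests, keeps the settings next to the only code that uses them.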


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 





[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574278#comment-15574278
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83360081
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.InputFormat} for csv files. Files are broken into lines.
+ * Values are the lines of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat<NullWritable, StringArrayWritable> {
+
+  @Override
+  public RecordReader<NullWritable, StringArrayWritable> createRecordReader(InputSplit inputSplit,
+      TaskAttemptContext context) throws IOException, InterruptedException {
+    return new NewCSVRecordReader();
+  }
+
+  /**
+   * Treats value as line in file. Key is null.
+   */
+  public static class NewCSVRecordReader extends RecordReader<NullWritable, StringArrayWritable> {
+
+private long start;
+private long end;
+private BoundedInputStream boundedInputStream;
+private Reader reader;
+private CsvParser csvParser;
+private StringArrayWritable value;
+private String[] columns;
+private Seekable filePosition;
+private boolean isCompressedInput;
+private Decompressor decompressor;
+
+@Override
+public void initialize(InputSplit inputSplit, TaskAttemptContext context)
+throws IOException, InterruptedException {
+  FileSplit split = (FileSplit) inputSplit;
+  this.start = split.getStart();
--- End diff --

No need to use `this.start`; you can use `start` directly. 
The same applies to all occurrences in this file.


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 




[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574280#comment-15574280
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83359637
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
--- End diff --

I think the code style check will fail; the import order is incorrect.


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 





[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574276#comment-15574276
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83360131
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.InputFormat} for csv files. Files are broken into lines.
+ * Values are the lines of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat<NullWritable, StringArrayWritable> {
+
+  @Override
+  public RecordReader<NullWritable, StringArrayWritable> createRecordReader(InputSplit inputSplit,
+      TaskAttemptContext context) throws IOException, InterruptedException {
+    return new NewCSVRecordReader();
+  }
+
+  /**
+   * Treats value as line in file. Key is null.
+   */
+  public static class NewCSVRecordReader extends RecordReader<NullWritable, StringArrayWritable> {
+
+private long start;
+private long end;
+private BoundedInputStream boundedInputStream;
+private Reader reader;
+private CsvParser csvParser;
+private StringArrayWritable value;
+private String[] columns;
+private Seekable filePosition;
+private boolean isCompressedInput;
+private Decompressor decompressor;
+
+@Override
+public void initialize(InputSplit inputSplit, TaskAttemptContext context)
+throws IOException, InterruptedException {
+  FileSplit split = (FileSplit) inputSplit;
+  this.start = split.getStart();
+  this.end = this.start + split.getLength();
+  Path file = split.getPath();
+  Configuration job = context.getConfiguration();
+  CompressionCodec codec = (new CompressionCodecFactory(job)).getCodec(file);
+  FileSystem fs = file.getFileSystem(job);
+  FSDataInputStream fileIn = fs.open(file);
+  InputStream inputStream = null;
+  if (codec != null) {
+this.isCompressedInput = true;
+this.decompressor = CodecPool.getDecompressor(codec);
+if (codec instanceof SplittableCompressionCodec) {
+  SplitCompressionInputStream scIn = 

[jira] [Commented] (CARBONDATA-299) 4. Add dictionary generator interfaces and give implementation for pre created dictionary.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574254#comment-15574254
 ] 

ASF GitHub Bot commented on CARBONDATA-299:
---

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/236#discussion_r83359748
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/dictionary/InMemBiDictionary.java
 ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.processing.newflow.dictionary;
+
+import java.util.Map;
+
+import org.apache.carbondata.core.devapi.DictionaryGenerator;
+import org.apache.carbondata.core.devapi.GeneratingBiDictionary;
+
+import com.google.common.collect.BiMap;
+import com.google.common.collect.HashBiMap;
+
+public class InMemBiDictionary<K, V> extends GeneratingBiDictionary<K, V> {
--- End diff --

We can add a DictionaryCache-based implementation when we need it; we may not 
require this implementation now.


> 4. Add dictionary generator interfaces and give implementation for pre 
> created dictionary.
> --
>
> Key: CARBONDATA-299
> URL: https://issues.apache.org/jira/browse/CARBONDATA-299
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: Jacky Li
> Fix For: 0.2.0-incubating
>
>
> Add dictionary generator interfaces and give implementation for pre-created 
> dictionary (which is generated separately).





[jira] [Commented] (CARBONDATA-299) 4. Add dictionary generator interfaces and give implementation for pre created dictionary.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574246#comment-15574246
 ] 

ASF GitHub Bot commented on CARBONDATA-299:
---

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/236#discussion_r83359504
  
--- Diff: 
core/src/main/java/org/apache/carbondata/core/devapi/BiDictionary.java ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.carbondata.core.devapi;
+
+public interface BiDictionary {
--- End diff --

What does `Bi` mean? It is not clear.


> 4. Add dictionary generator interfaces and give implementation for pre 
> created dictionary.
> --
>
> Key: CARBONDATA-299
> URL: https://issues.apache.org/jira/browse/CARBONDATA-299
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: Jacky Li
> Fix For: 0.2.0-incubating
>
>
> Add dictionary generator interfaces and give implementation for pre-created 
> dictionary (which is generated separately).





[jira] [Commented] (CARBONDATA-306) block size info should be shown in Desc Formatted and executor log

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574086#comment-15574086
 ] 

ASF GitHub Bot commented on CARBONDATA-306:
---

Github user Jay357089 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83354112
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/store/writer/AbstractFactDataWriter.java
 ---
@@ -252,6 +252,15 @@ private static long getMaxOfBlockAndFileSize(long 
blockSize, long fileSize) {
 if (remainder > 0) {
   maxSize = maxSize + HDFS_CHECKSUM_LENGTH - remainder;
 }
+long setBlockSizeInMb = blockSize / 
CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR /
+CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR;
+// actual file size may be less than 1KB or 1MB, need to classify.
+String readableFileSize = ByteUtil.convertByteToReadable(fileSize);
+long maxSizeInMb = maxSize / 
CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR /
+CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR;
+LOGGER.info("The configured block size is " + setBlockSizeInMb + " MB, 
" +
--- End diff --

done. CI passed. 
http://136.243.101.176:8080/job/ApacheCarbonManualPRBuilder/427/
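The diff above converts the configured block size and the actual file size from bytes to MB before logging, and the thread notes that an actual file may be smaller than 1 KB or 1 MB, so the size must be classified. A rough sketch of that kind of classification (the real code uses `ByteUtil.convertByteToReadable` and `CarbonCommonConstants.BYTE_TO_KB_CONVERSION_FACTOR`; the helper here is hypothetical):

```java
// Hypothetical sketch of converting a byte count into a human-readable
// string, mirroring the B/KB/MB classification discussed in the thread.
public class ReadableSize {
  private static final long KB = 1024L; // stands in for BYTE_TO_KB_CONVERSION_FACTOR

  public static String toReadable(long bytes) {
    if (bytes < KB) {
      return bytes + " B";
    } else if (bytes < KB * KB) {
      return (bytes / KB) + " KB";
    } else {
      return (bytes / (KB * KB)) + " MB";
    }
  }

  public static void main(String[] args) {
    System.out.println(toReadable(512));                // prints "512 B"
    System.out.println(toReadable(8 * 1024));           // prints "8 KB"
    System.out.println(toReadable(256L * 1024 * 1024)); // prints "256 MB"
  }
}
```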


> block size info should be shown in Desc Formatted and executor log
> -
>
> Key: CARBONDATA-306
> URL: https://issues.apache.org/jira/browse/CARBONDATA-306
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: Jay
>Priority: Minor
>
> when running the desc formatted command, the table block size should be shown, as 
> well as in the executor log when running the load command





[jira] [Commented] (CARBONDATA-306) block size info should be shown in Desc Formatted and executor log

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574011#comment-15574011
 ] 

ASF GitHub Bot commented on CARBONDATA-306:
---

Github user Jay357089 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83352166
  
--- Diff: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala
 ---
@@ -1422,6 +1422,7 @@ private[sql] case class DescribeCommandFormatted(
 results ++= Seq(("Table Name : ", 
relation.tableMeta.carbonTableIdentifier.getTableName, ""))
 results ++= Seq(("CARBON Store Path : ", relation.tableMeta.storePath, 
""))
 val carbonTable = relation.tableMeta.carbonTable
+results ++= Seq(("Table Block Size : ", carbonTable.getBlocksize + " 
MB", ""))
--- End diff --

In carbonTable, the block size is set in MB, so it is already readable; I don't 
think it needs to be formatted.


> block size info should be shown in Desc Formatted and executor log
> -
>
> Key: CARBONDATA-306
> URL: https://issues.apache.org/jira/browse/CARBONDATA-306
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: Jay
>Priority: Minor
>
> when running the desc formatted command, the table block size should be shown, as 
> well as in the executor log when running the load command





[jira] [Commented] (CARBONDATA-285) Use path parameter in Spark datasource API

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573986#comment-15573986
 ] 

ASF GitHub Bot commented on CARBONDATA-285:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/212#discussion_r83351395
  
--- Diff: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala
 ---
@@ -861,9 +861,11 @@ private[sql] case class CreateTable(cm: tableModel) 
extends RunnableCommand {
   val tablePath = catalog.createTableFromThrift(tableInfo, dbName, 
tbName, null)(sqlContext)
   try {
 sqlContext.sql(
-  s"""CREATE TABLE $dbName.$tbName USING carbondata""" +
-  s""" OPTIONS (tableName "$dbName.$tbName", tablePath 
"$tablePath") """)
-  .collect
+  s"""
+ | CREATE TABLE $dbName.$tbName
+ | USING carbondata
+ | OPTIONS (path "$tablePath")
--- End diff --

ok, will fix it


> Use path parameter in Spark datasource API
> --
>
> Key: CARBONDATA-285
> URL: https://issues.apache.org/jira/browse/CARBONDATA-285
> Project: CarbonData
>  Issue Type: Improvement
>  Components: spark-integration
>Affects Versions: 0.1.0-incubating
>Reporter: Jacky Li
> Fix For: 0.2.0-incubating
>
>
> Currently, when using carbon with the Spark datasource API, the database name and 
> table name need to be given as parameters, which is not the normal way of using 
> the datasource API. With this PR, the database name and table name are no longer 
> required; the user only needs to specify the `path` parameter (indicating the 
> path to the table folder) when using the datasource API





[jira] [Commented] (CARBONDATA-285) Use path parameter in Spark datasource API

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573945#comment-15573945
 ] 

ASF GitHub Bot commented on CARBONDATA-285:
---

Github user ravipesala commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/212#discussion_r83350430
  
--- Diff: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala
 ---
@@ -861,9 +861,11 @@ private[sql] case class CreateTable(cm: tableModel) 
extends RunnableCommand {
   val tablePath = catalog.createTableFromThrift(tableInfo, dbName, 
tbName, null)(sqlContext)
   try {
 sqlContext.sql(
-  s"""CREATE TABLE $dbName.$tbName USING carbondata""" +
-  s""" OPTIONS (tableName "$dbName.$tbName", tablePath 
"$tablePath") """)
-  .collect
+  s"""
+ | CREATE TABLE $dbName.$tbName
+ | USING carbondata
+ | OPTIONS (path "$tablePath")
--- End diff --

There would be backward compatibility issues here. Old tables cannot work 
because `path` was not present.
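The backward compatibility concern above (old tables were created with a `tablePath` option, new ones with `path`) is typically handled by falling back across option names. A hedged sketch of that pattern, not the actual CarbonData fix; the class name `TablePathResolver` is illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative option resolution with a backward-compatible fallback:
// prefer the new "path" option, fall back to the legacy "tablePath".
public class TablePathResolver {
  public static String resolve(Map<String, String> options) {
    String path = options.get("path");
    if (path == null) {
      path = options.get("tablePath"); // legacy key used by older tables
    }
    if (path == null) {
      throw new IllegalArgumentException("Neither 'path' nor 'tablePath' option is set");
    }
    return path;
  }

  public static void main(String[] args) {
    Map<String, String> legacy = new HashMap<>();
    legacy.put("tablePath", "/store/db/t1");
    System.out.println(resolve(legacy)); // prints "/store/db/t1"
  }
}
```

With this shape, tables created before the option rename keep working while new code only ever writes the `path` key.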


> Use path parameter in Spark datasource API
> --
>
> Key: CARBONDATA-285
> URL: https://issues.apache.org/jira/browse/CARBONDATA-285
> Project: CarbonData
>  Issue Type: Improvement
>  Components: spark-integration
>Affects Versions: 0.1.0-incubating
>Reporter: Jacky Li
> Fix For: 0.2.0-incubating
>
>
> Currently, when using carbon with the Spark datasource API, the database name and 
> table name need to be given as parameters, which is not the normal way of using 
> the datasource API. With this PR, the database name and table name are no longer 
> required; the user only needs to specify the `path` parameter (indicating the 
> path to the table folder) when using the datasource API





[jira] [Commented] (CARBONDATA-306) block size info should be shown in Desc Formatted and executor log

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573919#comment-15573919
 ] 

ASF GitHub Bot commented on CARBONDATA-306:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/230#discussion_r83349721
  
--- Diff: 
integration/spark/src/main/scala/org/apache/spark/sql/execution/command/carbonTableSchema.scala
 ---
@@ -1422,6 +1422,7 @@ private[sql] case class DescribeCommandFormatted(
 results ++= Seq(("Table Name : ", 
relation.tableMeta.carbonTableIdentifier.getTableName, ""))
 results ++= Seq(("CARBON Store Path : ", relation.tableMeta.storePath, 
""))
 val carbonTable = relation.tableMeta.carbonTable
+results ++= Seq(("Table Block Size : ", carbonTable.getBlocksize + " 
MB", ""))
--- End diff --

Use the new function here to format it as a readable size


> block size info should be shown in Desc Formatted and executor log
> -
>
> Key: CARBONDATA-306
> URL: https://issues.apache.org/jira/browse/CARBONDATA-306
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: Jay
>Priority: Minor
>
> when running the desc formatted command, the table block size should be shown, as 
> well as in the executor log when running the load command





[jira] [Created] (CARBONDATA-317) CSV having only space char is throwing NullPointerException

2016-10-13 Thread Mohammad Shahid Khan (JIRA)
Mohammad Shahid Khan created CARBONDATA-317:
---

 Summary: CSV having only space char is throwing 
NullPointerException
 Key: CARBONDATA-317
 URL: https://issues.apache.org/jira/browse/CARBONDATA-317
 Project: CarbonData
  Issue Type: Bug
Reporter: Mohammad Shahid Khan
Priority: Minor








[jira] [Commented] (CARBONDATA-297) 2. Add interfaces for data loading.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571835#comment-15571835
 ] 

ASF GitHub Bot commented on CARBONDATA-297:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/229#discussion_r83208961
  
--- Diff: 
processing/src/main/java/org/apache/carbondata/processing/newflow/DataLoadProcessorStep.java
 ---
@@ -0,0 +1,40 @@
+package org.apache.carbondata.processing.newflow;
+
+import java.util.Iterator;
+
+import 
org.apache.carbondata.processing.newflow.exception.CarbonDataLoadingException;
+
+/**
+ * This is the base interface for data loading. It can do transformation jobs as per the implementation.
+ */
+public interface DataLoadProcessorStep {
+
+  /**
+   * The output meta for this step. The data returned from this step conforms to this meta.
+   * @return
+   */
+  DataField[] getOutput();
+
+  /**
+   * Initialization process for this step.
+   * @param configuration
+   * @param child
+   * @throws CarbonDataLoadingException
+   */
+  void intialize(CarbonDataLoadConfiguration configuration, DataLoadProcessorStep child) throws
+      CarbonDataLoadingException;
+
+  /**
+   * Transform the data as per the implementation.
+   * @return Iterator of data
+   * @throws CarbonDataLoadingException
+   */
+  Iterator execute() throws CarbonDataLoadingException;
--- End diff --

I thought the SortStep is a singleton object within the executor, and if 
there is only one executor per datanode, then the SortStep sorts the data at 
datanode scope, which is what we want. Synchronization means the SortStep is 
thread-safe, so that multiple tasks can insert rows into it. 
Does your design look like this? Otherwise, how do you ensure data is sorted 
within a datanode?
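The design being asked about above — one shared SortStep that several loading threads feed — can be sketched roughly as a synchronized collector. This is an assumption about the shape of the design, not the actual CarbonData code; the class name `SortStepCollector` is illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a thread-safe collector shared by several loading
// tasks on one node. addRow is synchronized so multiple threads can insert
// concurrently; finishAndSort produces the node-scoped sorted result once
// all input has arrived.
public class SortStepCollector {
  private final List<String> rows = new ArrayList<>();

  public synchronized void addRow(String row) {
    rows.add(row);
  }

  public synchronized List<String> finishAndSort() {
    Collections.sort(rows);
    return rows;
  }

  public static void main(String[] args) throws InterruptedException {
    SortStepCollector c = new SortStepCollector();
    Thread t1 = new Thread(() -> { c.addRow("b"); c.addRow("d"); });
    Thread t2 = new Thread(() -> { c.addRow("c"); c.addRow("a"); });
    t1.start(); t2.start();
    t1.join(); t2.join();
    System.out.println(c.finishAndSort()); // prints "[a, b, c, d]"
  }
}
```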



> 2. Add interfaces for data loading.
> ---
>
> Key: CARBONDATA-297
> URL: https://issues.apache.org/jira/browse/CARBONDATA-297
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: Ravindra Pesala
> Fix For: 0.2.0-incubating
>
>
> Add the major interface classes for data loading so that the following jiras 
> can use this interfaces to implement it.





[jira] [Commented] (CARBONDATA-292) add COLUMNDICT operation info in DML operation guide

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571797#comment-15571797
 ] 

ASF GitHub Bot commented on CARBONDATA-292:
---

Github user Jay357089 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/223#discussion_r83205716
  
--- Diff: docs/DML-Operations-on-Carbon.md ---
@@ -104,8 +109,10 @@ Following are the options that can be used in load 
data:
  'MULTILINE'='true', 'ESCAPECHAR'='\', 
  'COMPLEX_DELIMITER_LEVEL_1'='$', 
  'COMPLEX_DELIMITER_LEVEL_2'=':',
- 
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary'
+ 
'ALL_DICTIONARY_PATH'='/opt/alldictionary/data.dictionary',
+ 
'COLUMNDICT'='empno:/dictFilePath/empnoDict.csv, 
empname:/dictFilePath/empnameDict.csv'
--- End diff --

done.


> add COLUMNDICT operation info in DML operation guide
> 
>
> Key: CARBONDATA-292
> URL: https://issues.apache.org/jira/browse/CARBONDATA-292
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: Jay
>Priority: Minor
>
> there is no COLUMNDICT operation guide in DML-Operations-on-Carbon.md, so 
> it needs to be added. 





[jira] [Created] (CARBONDATA-315) Data loading fails if parsing a double value returns infinity

2016-10-13 Thread Manish Gupta (JIRA)
Manish Gupta created CARBONDATA-315:
---

 Summary: Data loading fails if parsing a double value returns 
infinity
 Key: CARBONDATA-315
 URL: https://issues.apache.org/jira/browse/CARBONDATA-315
 Project: CarbonData
  Issue Type: Bug
Affects Versions: 0.1.0-incubating, 0.2.0-incubating
Reporter: Manish Gupta
Assignee: Manish Gupta
Priority: Minor
 Fix For: 0.2.0-incubating


During data load, if a specified value is too big for a double DataType column, 
then parsing that value as a double returns "Infinity". Because of this, an 
exception is thrown while calculating the min and max values for measures in the 
carbon data writer step.

ERROR 13-10 15:27:56,968 - [t3: Graph - MDKeyGent3][partitionID:0] 
org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException
java.util.concurrent.ExecutionException: 
org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at 
org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.processWriteTaskSubmitList(CarbonFactDataHandlerColumnar.java:812)
at 
org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.finish(CarbonFactDataHandlerColumnar.java:779)
at 
org.apache.carbondata.processing.mdkeygen.MDKeyGenStep.processRow(MDKeyGenStep.java:222)
at org.pentaho.di.trans.step.RunThread.run(RunThread.java:50)
at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException
at 
org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar$Producer.call(CarbonFactDataHandlerColumnar.java:1244)
at 
org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar$Producer.call(CarbonFactDataHandlerColumnar.java:1215)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
... 1 more





[jira] [Commented] (CARBONDATA-315) Data loading fails if parsing a double value returns infinity

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571509#comment-15571509
 ] 

ASF GitHub Bot commented on CARBONDATA-315:
---

GitHub user manishgupta88 opened a pull request:

https://github.com/apache/incubator-carbondata/pull/234

[CARBONDATA-315] Data loading fails if parsing a double value returns 
infinity

Problem: Data loading fails if parsing a double value returns infinity

Analysis: During data load, if a specified value is too big for a double 
DataType column, then parsing that value as a double returns "Infinity". Because 
of this, an exception is thrown while calculating the min and max values for 
measures in the carbon data writer step.

Fix: If the result of parsing a double value is Infinity or NaN, then set the 
value to null and add the row to bad records.

Impact area: Data loads that contain non-parseable values for a data type.
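The fix described above can be sketched roughly as follows; `parseDoubleOrNull` is an illustrative helper name, not the actual method in the patch:

```java
// Hypothetical sketch of the described fix: parse a double, but reject
// values whose parse result is Infinity or NaN (too big / not a number),
// returning null so the row can be routed to bad records.
public class DoubleParser {
  public static Double parseDoubleOrNull(String raw) {
    try {
      double v = Double.parseDouble(raw);
      if (Double.isInfinite(v) || Double.isNaN(v)) {
        return null; // would be counted as a bad record by the loader
      }
      return v;
    } catch (NumberFormatException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(parseDoubleOrNull("12.5"));  // prints "12.5"
    System.out.println(parseDoubleOrNull("1e999")); // prints "null" (parses to Infinity)
    System.out.println(parseDoubleOrNull("abc"));   // prints "null"
  }
}
```

Note that `Double.parseDouble("1e999")` does not throw; it silently returns `Infinity`, which is why the explicit `isInfinite`/`isNaN` check is needed on top of the exception handling.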

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/manishgupta88/incubator-carbondata 
double_value_range_failure

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #234


commit f7225f974828edd8b340f88fbfaa2f60d8a7d582
Author: manishgupta88 
Date:   2016-10-13T09:47:52Z

Problem: Data loading fails if parsing a double value returns infinity

Analysis: During data load, if a specified value is too big for a double 
DataType column, then parsing that value as a double returns "Infinity". Because 
of this, an exception is thrown while calculating the min and max values for 
measures in the carbon data writer step.

Fix: If the result of parsing a double value is Infinity or NaN, then set the 
value to null and add the row to bad records.

Impact area: Data loads that contain non-parseable values for a data type.




> Data loading fails if parsing a double value returns infinity
> -
>
> Key: CARBONDATA-315
> URL: https://issues.apache.org/jira/browse/CARBONDATA-315
> Project: CarbonData
>  Issue Type: Bug
>Affects Versions: 0.1.0-incubating, 0.2.0-incubating
>Reporter: Manish Gupta
>Assignee: Manish Gupta
>Priority: Minor
> Fix For: 0.2.0-incubating
>
>
> During data load, if a specified value is too big for a double DataType 
> column, then parsing that value as a double returns as 
> "Infinity". Because of this, an exception is thrown while calculating the min 
> and max values for measures in the carbon data writer step.
> ERROR 13-10 15:27:56,968 - [t3: Graph - MDKeyGent3][partitionID:0] 
> org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException
> java.util.concurrent.ExecutionException: 
> org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:188)
> at 
> org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.processWriteTaskSubmitList(CarbonFactDataHandlerColumnar.java:812)
> at 
> org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar.finish(CarbonFactDataHandlerColumnar.java:779)
> at 
> org.apache.carbondata.processing.mdkeygen.MDKeyGenStep.processRow(MDKeyGenStep.java:222)
> at org.pentaho.di.trans.step.RunThread.run(RunThread.java:50)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.carbondata.processing.store.writer.exception.CarbonDataWriterException
> at 
> org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar$Producer.call(CarbonFactDataHandlerColumnar.java:1244)
> at 
> org.apache.carbondata.processing.store.CarbonFactDataHandlerColumnar$Producer.call(CarbonFactDataHandlerColumnar.java:1215)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> ... 1 more





[jira] [Commented] (CARBONDATA-213) Remove thrift compiler dependency

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571304#comment-15571304
 ] 

ASF GitHub Bot commented on CARBONDATA-213:
---

GitHub user QiangCai reopened a pull request:

https://github.com/apache/incubator-carbondata/pull/127

[CARBONDATA-213] Remove dependency: thrift compiler

[CARBONDATA-213] Remove dependency: thrift compiler

**analysis**

I think it is unnecessary for users/developers to download the thrift compiler when 
building the CarbonData project.

**solution**

Provide the Java code generated by the thrift compiler.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/QiangCai/incubator-carbondata fixthrifterror

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #127


commit ff895c5276569bef358ec02356400210014911de
Author: QiangCai 
Date:   2016-10-13T08:44:22Z

add format java module




> Remove thrift compiler dependency
> -
>
> Key: CARBONDATA-213
> URL: https://issues.apache.org/jira/browse/CARBONDATA-213
> Project: CarbonData
>  Issue Type: Bug
>Reporter: QiangCai
>Assignee: QiangCai
>Priority: Minor
> Fix For: 0.2.0-incubating
>
>






[jira] [Updated] (CARBONDATA-307) Support executor side scan using CarbonInputFormat

2016-10-13 Thread Jacky Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/CARBONDATA-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacky Li updated CARBONDATA-307:

Description: 
Currently, there are two read paths in the carbon-spark module: 
1. CarbonContext => CarbonDatasourceRelation => CarbonScanRDD => QueryExecutor
In this case, CarbonScanRDD uses CarbonInputFormat to get the splits and uses 
QueryExecutor for the scan.

2. SqlContext => CarbonDatasourceHadoopRelation => CarbonHadoopFSRDD => 
CarbonRecordReader => QueryExecutor
In this case, CarbonHadoopFSRDD uses CarbonInputFormat to both get the splits 
and scan.

Because of this, there is unnecessary duplicate code that needs to be unified.
The target approach should be:
sqlContext/carbonContext => CarbonDatasourceHadoopRelation => CarbonScanRDD => 
QueryExecutor


  was:
Currently, there are two read path in carbon-spark module: 
1. CarbonContext => CarbonDatasourceRelation => CarbonScanRDD => QueryExecutor
In this case, CarbonScanRDD uses CarbonInputFormat to get the split, and use 
QueryExecutor for scan.

2. SqlContext => CarbonDatasourceHadoopRelation => CarbonHadoopFSRDD => 
CarbonRecordReader => QueryExecutor
In this case, CarbonHadoopFSRDD uses CarbonInputFormat to do both get split and 
scan

Because of this, there are unnecessary duplicate code, they need to be unified.



> Support executor side scan using CarbonInputFormat
> --
>
> Key: CARBONDATA-307
> URL: https://issues.apache.org/jira/browse/CARBONDATA-307
> Project: CarbonData
>  Issue Type: Improvement
>  Components: spark-integration
>Affects Versions: 0.1.0-incubating
>Reporter: Jacky Li
> Fix For: 0.2.0-incubating
>
>
> Currently, there are two read paths in the carbon-spark module: 
> 1. CarbonContext => CarbonDatasourceRelation => CarbonScanRDD => QueryExecutor
> In this case, CarbonScanRDD uses CarbonInputFormat to get the splits and uses 
> QueryExecutor for the scan.
> 2. SqlContext => CarbonDatasourceHadoopRelation => CarbonHadoopFSRDD => 
> CarbonRecordReader => QueryExecutor
> In this case, CarbonHadoopFSRDD uses CarbonInputFormat to both get the splits 
> and scan.
> Because of this, there is unnecessary duplicate code that needs to be 
> unified.
> The target approach should be:
> sqlContext/carbonContext => CarbonDatasourceHadoopRelation => CarbonScanRDD 
> => QueryExecutor





[jira] [Commented] (CARBONDATA-213) Remove thrift compiler dependency

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571158#comment-15571158
 ] 

ASF GitHub Bot commented on CARBONDATA-213:
---

Github user QiangCai closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/127


> Remove thrift compiler dependency
> -
>
> Key: CARBONDATA-213
> URL: https://issues.apache.org/jira/browse/CARBONDATA-213
> Project: CarbonData
>  Issue Type: Bug
>Reporter: QiangCai
>Assignee: QiangCai
>Priority: Minor
> Fix For: 0.2.0-incubating
>
>






[jira] [Commented] (CARBONDATA-218) Remove Dependency: spark-csv and Unify CSV Reader for dataloading

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571147#comment-15571147
 ] 

ASF GitHub Bot commented on CARBONDATA-218:
---

Github user QiangCai closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/132


> Remove Dependency: spark-csv and Unify CSV Reader for dataloading
> -
>
> Key: CARBONDATA-218
> URL: https://issues.apache.org/jira/browse/CARBONDATA-218
> Project: CarbonData
>  Issue Type: Improvement
>Reporter: QiangCai
>Assignee: QiangCai
>Priority: Minor
> Fix For: 0.2.0-incubating
>
>






[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571101#comment-15571101
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

GitHub user QiangCai opened a pull request:

https://github.com/apache/incubator-carbondata/pull/233

[CARBONDATA-296]1.Add CSVInputFormat to read csv files.

**1 Add CSVInputFormat to read csv files**
MRv1:
hadoop/src/main/java/org/apache/carbondata/hadoop/mapred/CSVInputFormat.java
MRv2:

hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java

**2 Use univocity parser to parse csv files.**

**3 Customize StringArrayWritable to wrap String array values of each line 
in csv files.**

**4 Add BoundedInputStream to limit input stream**
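Item 4 above — a stream wrapper that stops reading after a fixed number of bytes, so a reader working on one input split does not run past its boundary — can be sketched as follows. This is an illustrative version under that assumption, not the PR's actual `BoundedInputStream` class:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative bounded stream: delegates to an underlying InputStream but
// reports EOF once `limit` bytes have been consumed, so a split reader
// cannot read past the split boundary.
public class SimpleBoundedInputStream extends InputStream {
  private final InputStream in;
  private final long limit;
  private long consumed = 0;

  public SimpleBoundedInputStream(InputStream in, long limit) {
    this.in = in;
    this.limit = limit;
  }

  @Override
  public int read() throws IOException {
    if (consumed >= limit) {
      return -1; // pretend EOF at the boundary
    }
    int b = in.read();
    if (b >= 0) {
      consumed++;
    }
    return b;
  }

  public static void main(String[] args) throws IOException {
    InputStream bounded = new SimpleBoundedInputStream(
        new ByteArrayInputStream("abcdef".getBytes(StandardCharsets.UTF_8)), 3);
    StringBuilder sb = new StringBuilder();
    int b;
    while ((b = bounded.read()) != -1) {
      sb.append((char) b);
    }
    System.out.println(sb); // prints "abc"
  }
}
```

A production version would also override the bulk `read(byte[], int, int)` for performance; this sketch keeps only the single-byte path for clarity.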

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/QiangCai/incubator-carbondata dataloadinginput

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #233


commit cfb177bbca12cbe72a5947d7fdec1bc906d8aa7e
Author: QiangCai 
Date:   2016-10-12T09:53:05Z

csvinputformat




> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files; it should use the Univocity parser to 
> read csv files to get optimal performance. 


