[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15600821#comment-15600821
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-carbondata/pull/233


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15588762#comment-15588762
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r84068389
  
--- Diff: 
hadoop/src/test/java/org/apache/carbondata/hadoop/csv/CSVInputFormatTest.java 
---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.csv;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+
+import junit.framework.TestCase;
+import org.junit.Assert;
+import org.junit.Test;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.compress.BZip2Codec;
+import org.apache.hadoop.io.compress.CompressionOutputStream;
+import org.apache.hadoop.io.compress.GzipCodec;
+import org.apache.hadoop.io.compress.Lz4Codec;
+import org.apache.hadoop.io.compress.SnappyCodec;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+
+public class CSVInputFormatTest extends TestCase {
+
+  /**
+   * generate compressed files, no need to call this method.
+   * @throws Exception
+   */
+  public void testGenerateCompressFiles() throws Exception {
+String pwd = new File("src/test/resources").getCanonicalPath();
+String inputFile = pwd + "/data.csv";
+FileInputStream input = new FileInputStream(inputFile);
+Configuration conf = new Configuration();
+
+// .gz
+String outputFile = pwd + "/data.csv.gz";
+FileOutputStream output = new FileOutputStream(outputFile);
+GzipCodec gzip = new GzipCodec();
+gzip.setConf(conf);
+CompressionOutputStream outputStream = gzip.createOutputStream(output);
+int i = -1;
+while ((i = input.read()) != -1) {
+  outputStream.write(i);
+}
+outputStream.close();
+input.close();
+
+// .bz2
+input = new FileInputStream(inputFile);
+outputFile = pwd + "/data.csv.bz2";
+output = new FileOutputStream(outputFile);
+BZip2Codec bzip2 = new BZip2Codec();
+bzip2.setConf(conf);
+outputStream = bzip2.createOutputStream(output);
+i = -1;
+while ((i = input.read()) != -1) {
+  outputStream.write(i);
+}
+outputStream.close();
+input.close();
+
+// .snappy
+input = new FileInputStream(inputFile);
+outputFile = pwd + "/data.csv.snappy";
+output = new FileOutputStream(outputFile);
+SnappyCodec snappy = new SnappyCodec();
+snappy.setConf(conf);
+outputStream = snappy.createOutputStream(output);
+i = -1;
+while ((i = input.read()) != -1) {
+  outputStream.write(i);
+}
+outputStream.close();
+input.close();
+
+//.lz4
+input = new FileInputStream(inputFile);
+outputFile = pwd + "/data.csv.lz4";
+output = new FileOutputStream(outputFile);
+Lz4Codec lz4 = new Lz4Codec();
+lz4.setConf(conf);
+outputStream = lz4.createOutputStream(output);
+i = -1;
+while ((i = input.read()) != -1) {
+  outputStream.write(i);
+}
+outputStream.close();
+input.close();
+
+  }
+
+  /**
+   * CSVCheckMapper check the content of csv files.
+   */
+  public static class CSVCheckMapper extends Mapper {
+@Override
+protected void map(NullWritable ke

[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575358#comment-15575358
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83422823
  
--- Diff: 
hadoop/src/test/java/org/apache/carbondata/hadoop/csv/CSVInputFormatTest.java 
---
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.csv;
+
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+
+import junit.framework.TestCase;
+import org.junit.Assert;
+import org.junit.Test;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.compress.BZip2Codec;
+import org.apache.hadoop.io.compress.CompressionOutputStream;
+import org.apache.hadoop.io.compress.GzipCodec;
+import org.apache.hadoop.io.compress.Lz4Codec;
+import org.apache.hadoop.io.compress.SnappyCodec;
+import org.apache.hadoop.mapreduce.Job;
+import org.apache.hadoop.mapreduce.Mapper;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
+
+public class CSVInputFormatTest extends TestCase {
+
+  /**
+   * generate compressed files, no need to call this method.
+   * @throws Exception
+   */
+  public void testGenerateCompressFiles() throws Exception {
+String pwd = new File("src/test/resources").getCanonicalPath();
+String inputFile = pwd + "/data.csv";
+FileInputStream input = new FileInputStream(inputFile);
+Configuration conf = new Configuration();
+
+// .gz
+String outputFile = pwd + "/data.csv.gz";
+FileOutputStream output = new FileOutputStream(outputFile);
+GzipCodec gzip = new GzipCodec();
+gzip.setConf(conf);
+CompressionOutputStream outputStream = gzip.createOutputStream(output);
+int i = -1;
+while ((i = input.read()) != -1) {
+  outputStream.write(i);
+}
+outputStream.close();
+input.close();
+
+// .bz2
+input = new FileInputStream(inputFile);
+outputFile = pwd + "/data.csv.bz2";
+output = new FileOutputStream(outputFile);
+BZip2Codec bzip2 = new BZip2Codec();
+bzip2.setConf(conf);
+outputStream = bzip2.createOutputStream(output);
+i = -1;
+while ((i = input.read()) != -1) {
+  outputStream.write(i);
+}
+outputStream.close();
+input.close();
+
+// .snappy
+input = new FileInputStream(inputFile);
+outputFile = pwd + "/data.csv.snappy";
+output = new FileOutputStream(outputFile);
+SnappyCodec snappy = new SnappyCodec();
+snappy.setConf(conf);
+outputStream = snappy.createOutputStream(output);
+i = -1;
+while ((i = input.read()) != -1) {
+  outputStream.write(i);
+}
+outputStream.close();
+input.close();
+
+//.lz4
+input = new FileInputStream(inputFile);
+outputFile = pwd + "/data.csv.lz4";
+output = new FileOutputStream(outputFile);
+Lz4Codec lz4 = new Lz4Codec();
+lz4.setConf(conf);
+outputStream = lz4.createOutputStream(output);
+i = -1;
+while ((i = input.read()) != -1) {
+  outputStream.write(i);
+}
+outputStream.close();
+input.close();
+
+  }
+
+  /**
+   * CSVCheckMapper check the content of csv files.
+   */
+  public static class CSVCheckMapper extends Mapper {
+@Override
+protected void map(NullWritable key

[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574786#comment-15574786
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83387415
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.InputFormat} for csv files.  
Files are broken into lines.
+ * Values are the line of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat {
+
+  @Override
+  public RecordReader 
createRecordReader(InputSplit inputSplit,
+  TaskAttemptContext context) throws IOException, InterruptedException 
{
+return new NewCSVRecordReader();
+  }
+
+  /**
+   * Treats value as line in file. Key is null.
+   */
+  public static class NewCSVRecordReader extends 
RecordReader {
+
+private long start;
+private long end;
+private BoundedInputStream boundedInputStream;
+private Reader reader;
+private CsvParser csvParser;
+private StringArrayWritable value;
+private String[] columns;
+private Seekable filePosition;
+private boolean isCompressedInput;
+private Decompressor decompressor;
+
+@Override
+public void initialize(InputSplit inputSplit, TaskAttemptContext 
context)
+throws IOException, InterruptedException {
+  FileSplit split = (FileSplit) inputSplit;
+  this.start = split.getStart();
+  this.end = this.start + split.getLength();
+  Path file = split.getPath();
+  Configuration job = context.getConfiguration();
+  CompressionCodec codec = (new 
CompressionCodecFactory(job)).getCodec(file);
+  FileSystem fs = file.getFileSystem(job);
+  FSDataInputStream fileIn = fs.open(file);
+  InputStream inputStream = null;
+  if (codec != null) {
+this.isCompressedInput = true;
+this.decompressor = CodecPool.getDecompressor(codec);
+if (codec instanceof SplittableCompressionCodec) {
+  SplitCompressionInputStream scIn = ((SplittableCompressionCodec) 
codec)
+  .createInputStream(fileIn, this.decompressor, this

[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574785#comment-15574785
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83387366
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.InputFormat} for csv files.  
Files are broken into lines.
+ * Values are the line of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat {
+
+  @Override
+  public RecordReader 
createRecordReader(InputSplit inputSplit,
+  TaskAttemptContext context) throws IOException, InterruptedException 
{
+return new NewCSVRecordReader();
+  }
+
+  /**
+   * Treats value as line in file. Key is null.
+   */
+  public static class NewCSVRecordReader extends 
RecordReader {
+
+private long start;
+private long end;
+private BoundedInputStream boundedInputStream;
+private Reader reader;
+private CsvParser csvParser;
+private StringArrayWritable value;
+private String[] columns;
+private Seekable filePosition;
+private boolean isCompressedInput;
+private Decompressor decompressor;
+
+@Override
+public void initialize(InputSplit inputSplit, TaskAttemptContext 
context)
+throws IOException, InterruptedException {
+  FileSplit split = (FileSplit) inputSplit;
+  this.start = split.getStart();
--- End diff --

fixed


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574771#comment-15574771
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83386474
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
--- End diff --

fixed


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574761#comment-15574761
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user QiangCai commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83386400
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/io/StringArrayWritable.java 
---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.io;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Writable;
+
+/**
+ * A String sequence that is usable as a key or value.
+ */
+public class StringArrayWritable implements Writable {
+  private String[] values;
+
+  public String[] toStrings() {
+return values;
+  }
+
+  public void set(String[] values) {
+this.values = values;
+  }
+
+  public String[] get() {
+return values;
+  }
+
+  @Override public void readFields(DataInput in) throws IOException {
--- End diff --

fixed


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574281#comment-15574281
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83359593
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.InputFormat} for csv files.  
Files are broken into lines.
+ * Values are the line of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat {
+
+  @Override
+  public RecordReader 
createRecordReader(InputSplit inputSplit,
+  TaskAttemptContext context) throws IOException, InterruptedException 
{
+return new NewCSVRecordReader();
+  }
+
+  /**
+   * Treats value as line in file. Key is null.
+   */
+  public static class NewCSVRecordReader extends 
RecordReader {
--- End diff --

Why is it a static class? And you can rename it to `CSVRecordReader`


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574279#comment-15574279
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83355842
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/io/StringArrayWritable.java 
---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.io;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.charset.Charset;
+import java.util.Arrays;
+
+import org.apache.hadoop.io.Writable;
+
+/**
+ * A String sequence that is usable as a key or value.
+ */
+public class StringArrayWritable implements Writable {
+  private String[] values;
+
+  public String[] toStrings() {
+return values;
+  }
+
+  public void set(String[] values) {
+this.values = values;
+  }
+
+  public String[] get() {
+return values;
+  }
+
+  @Override public void readFields(DataInput in) throws IOException {
--- End diff --

`@Override` should be put to previous line


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574277#comment-15574277
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83355953
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapred/CSVInputFormat.java ---
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapred;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapred.FileInputFormat;
+import org.apache.hadoop.mapred.FileSplit;
+import org.apache.hadoop.mapred.InputSplit;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.JobConfigurable;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapred.InputFormat} for csv files.  Files 
are broken into lines.
+ * Values are the line of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat implements
--- End diff --

Agree, there is no legacy application use hadoop 1.X, right?


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574283#comment-15574283
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83359938
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/util/CSVInputFormatUtil.java 
---
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.util;
+
+import com.univocity.parsers.csv.CsvParserSettings;
+import org.apache.hadoop.conf.Configuration;
+
+/**
+ * CSVInputFormatUtil is a util class.
+ */
+public class CSVInputFormatUtil {
+
+  public static final String DELIMITER = "carbon.csvinputformat.delimiter";
+  public static final String DELIMITER_DEFAULT = ",";
+  public static final String COMMENT = "carbon.csvinputformat.comment";
+  public static final String COMMENT_DEFAULT = "#";
+  public static final String QUOTE = "carbon.csvinputformat.quote";
+  public static final String QUOTE_DEFAULT = "\"";
+  public static final String ESCAPE = "carbon.csvinputformat.escape";
+  public static final String ESCAPE_DEFAULT = "\\";
+  public static final String HEADER_PRESENT = 
"caron.csvinputformat.header.present";
+  public static final boolean HEADER_PRESENT_DEFAULT = false;
+
+  public static CsvParserSettings extractCsvParserSettings(Configuration 
job, long start) {
--- End diff --

I think this class is not needed, move this function into CSVRecordReader 
class


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574278#comment-15574278
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83360081
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.InputFormat} for csv files.  
Files are broken into lines.
+ * Values are the line of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat {
+
+  @Override
+  public RecordReader 
createRecordReader(InputSplit inputSplit,
+  TaskAttemptContext context) throws IOException, InterruptedException 
{
+return new NewCSVRecordReader();
+  }
+
+  /**
+   * Treats value as line in file. Key is null.
+   */
+  public static class NewCSVRecordReader extends 
RecordReader {
+
+private long start;
+private long end;
+private BoundedInputStream boundedInputStream;
+private Reader reader;
+private CsvParser csvParser;
+private StringArrayWritable value;
+private String[] columns;
+private Seekable filePosition;
+private boolean isCompressedInput;
+private Decompressor decompressor;
+
+@Override
+public void initialize(InputSplit inputSplit, TaskAttemptContext 
context)
+throws IOException, InterruptedException {
+  FileSplit split = (FileSplit) inputSplit;
+  this.start = split.getStart();
--- End diff --

No need to use `this.start`, you can use `start` directly. 
The same for all occurrence in this file


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574280#comment-15574280
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83359637
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
--- End diff --

I think code style will fail, incorrect order


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574276#comment-15574276
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83360131
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.Reader;
+
+import org.apache.carbondata.hadoop.io.BoundedInputStream;
+import org.apache.carbondata.hadoop.io.StringArrayWritable;
+import org.apache.carbondata.hadoop.util.CSVInputFormatUtil;
+
+import com.univocity.parsers.csv.CsvParser;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.Seekable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.compress.CodecPool;
+import org.apache.hadoop.io.compress.CompressionCodec;
+import org.apache.hadoop.io.compress.CompressionCodecFactory;
+import org.apache.hadoop.io.compress.CompressionInputStream;
+import org.apache.hadoop.io.compress.Decompressor;
+import org.apache.hadoop.io.compress.SplitCompressionInputStream;
+import org.apache.hadoop.io.compress.SplittableCompressionCodec;
+import org.apache.hadoop.mapreduce.InputSplit;
+import org.apache.hadoop.mapreduce.RecordReader;
+import org.apache.hadoop.mapreduce.TaskAttemptContext;
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.util.LineReader;
+
+/**
+ * An {@link org.apache.hadoop.mapreduce.InputFormat} for csv files.  
Files are broken into lines.
+ * Values are the line of csv files.
+ */
+public class CSVInputFormat extends FileInputFormat {
+
+  @Override
+  public RecordReader 
createRecordReader(InputSplit inputSplit,
+  TaskAttemptContext context) throws IOException, InterruptedException 
{
+return new NewCSVRecordReader();
+  }
+
+  /**
+   * Treats value as line in file. Key is null.
+   */
+  public static class NewCSVRecordReader extends 
RecordReader {
+
+private long start;
+private long end;
+private BoundedInputStream boundedInputStream;
+private Reader reader;
+private CsvParser csvParser;
+private StringArrayWritable value;
+private String[] columns;
+private Seekable filePosition;
+private boolean isCompressedInput;
+private Decompressor decompressor;
+
+@Override
+public void initialize(InputSplit inputSplit, TaskAttemptContext 
context)
+throws IOException, InterruptedException {
+  FileSplit split = (FileSplit) inputSplit;
+  this.start = split.getStart();
+  this.end = this.start + split.getLength();
+  Path file = split.getPath();
+  Configuration job = context.getConfiguration();
+  CompressionCodec codec = (new 
CompressionCodecFactory(job)).getCodec(file);
+  FileSystem fs = file.getFileSystem(job);
+  FSDataInputStream fileIn = fs.open(file);
+  InputStream inputStream = null;
+  if (codec != null) {
+this.isCompressedInput = true;
+this.decompressor = CodecPool.getDecompressor(codec);
+if (codec instanceof SplittableCompressionCodec) {
+  SplitCompressionInputStream scIn = ((SplittableCompressionCodec) 
codec)
+  .createInputStream(fileIn, this.decompressor, this.

[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574282#comment-15574282
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

Github user jackylk commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/233#discussion_r83359457
  
--- Diff: 
hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java 
---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.carbondata.hadoop.mapreduce;
--- End diff --

suggest to move it to org.apache.carbondata.hadoop.csv. And it is carbon 
project internal class, not meant to be used by user application.


> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CARBONDATA-296) 1.Add CSVInputFormat to read csv files.

2016-10-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/CARBONDATA-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571101#comment-15571101
 ] 

ASF GitHub Bot commented on CARBONDATA-296:
---

GitHub user QiangCai opened a pull request:

https://github.com/apache/incubator-carbondata/pull/233

[CARBONDATA-296]1.Add CSVInputFormat to read csv files.

**1 Add CSVInputFormat to read csv files**
MRv1:
hadoop/src/main/java/org/apache/carbondata/hadoop/mapred/CSVInputFormat.java
MRv2:

hadoop/src/main/java/org/apache/carbondata/hadoop/mapreduce/CSVInputFormat.java

**2 Use univocity parser to parse csv files.**

**3 Customize StringArrayWritable to wrap String array values of each line 
in csv files.**

**4 Add BoundedInputStream to limit input stream**

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/QiangCai/incubator-carbondata dataloadinginput

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #233


commit cfb177bbca12cbe72a5947d7fdec1bc906d8aa7e
Author: QiangCai 
Date:   2016-10-12T09:53:05Z

csvinputformat




> 1.Add CSVInputFormat to read csv files.
> ---
>
> Key: CARBONDATA-296
> URL: https://issues.apache.org/jira/browse/CARBONDATA-296
> Project: CarbonData
>  Issue Type: Sub-task
>Reporter: Ravindra Pesala
>Assignee: QiangCai
> Fix For: 0.2.0-incubating
>
>
> Add CSVInputFormat to read csv files, it should use Univocity parser to read 
> csv files to get optimal performance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)