[jira] [Commented] (PARQUET-124) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread swetha k (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996536#comment-14996536 ] swetha k commented on PARQUET-124: -- [~b...@cloudera.com] I still see the issues. Please see the Warning

Re: Proposal for Union type

2015-11-09 Thread Julien Le Dem
This sounds good to me. We should have a UNION logical type in parquet-format to capture this information. A UNION type is defined as a GROUP and should always have exactly one field populated. By default the name of the field is the type name, but in the case of Thrift it is provided by the IDL.
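A minimal sketch, assuming the parquet-mr Types builder from org.apache.parquet.schema, of the group shape such a union would take. The UNION annotation itself is only proposed in this thread, so the sketch shows just the group-of-optional-fields structure, and the field names are illustrative:

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.OriginalType;
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
    import org.apache.parquet.schema.Types;

    public class UnionSchemaSketch {
      public static void main(String[] args) {
        // A union is modeled as a group of optional fields; a writer
        // populates exactly one of them per record. Field names here
        // are illustrative; Thrift would supply them from the IDL.
        MessageType schema = Types.buildMessage()
            .requiredGroup()
                .optional(PrimitiveTypeName.BINARY).as(OriginalType.UTF8)
                    .named("string_value")
                .optional(PrimitiveTypeName.INT64).named("long_value")
                .named("value")
            .named("Event");
        System.out.println(schema);
      }
    }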

[jira] [Commented] (PARQUET-124) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread swetha k (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997989#comment-14997989 ] swetha k commented on PARQUET-124: -- [~rdblue] I can create a JIRA issue for this. Just to confirm,

Re: Reading Parquet data from input stream and write to output stream

2015-11-09 Thread Ryan Blue
Selina, I would use parquet-avro to create a writer. Kafka messages are commonly encoded as Avro, so you may already be working with Avro objects. If not, convert to Avro and then write with the AvroParquetWriter. You can create a writer that creates S3 files by setting up your S3
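A minimal sketch of that approach, assuming parquet-avro's AvroParquetWriter builder and a hand-built Avro GenericRecord; the bucket name, file name, and one-field schema are placeholders:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class KafkaToParquetSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder one-field schema; a real Kafka payload would map
        // its fields into an equivalent Avro record.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Message\","
            + "\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}");

        // Any Hadoop FileSystem path works here, including an S3 one.
        Path path = new Path("s3a://my-bucket/events/part-00000.parquet");

        ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(path)
                .withSchema(schema)
                .build();

        GenericRecord record = new GenericData.Record(schema);
        record.put("body", "hello");
        writer.write(record);
        writer.close();
      }
    }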

[jira] [Commented] (PARQUET-124) parquet.hadoop.ParquetOutputCommitter.commitJob() throws parquet.io.ParquetEncodingException

2015-11-09 Thread Ryan Blue (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996949#comment-14996949 ] Ryan Blue commented on PARQUET-124: --- [~swethakasireddy], it looks like this wasn't completely addressed

Re: Reading Parquet data from input stream and write to output stream

2015-11-09 Thread Ryan Blue
Selina, You should be able to write to S3 without needing to flush to an output stream. You would just use the S3 FileSystem to write data instead of HDFS. This doesn't require Parquet to write to an OutputStream instead of a file. Is there a reason why you want to supply an output
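A minimal sketch of that point, assuming the s3a FileSystem that ships with Hadoop: the writer's Path carries the URI scheme, so Hadoop resolves an S3 FileSystem instead of HDFS and no OutputStream ever surfaces in user code. The bucket and credential values are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3PathSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // s3a reads its credentials from the Hadoop configuration.
        conf.set("fs.s3a.access.key", "...");
        conf.set("fs.s3a.secret.key", "...");

        // The FileSystem implementation is chosen by the URI scheme,
        // so the same Parquet writer code targets S3 instead of HDFS.
        Path path = new Path("s3a://my-bucket/events/");
        FileSystem fs = path.getFileSystem(conf);
        System.out.println(fs.getUri());
      }
    }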

[jira] [Commented] (PARQUET-390) GroupType.union(Type toMerge, boolean strict) does not honor strict parameter

2015-11-09 Thread Ryan Blue (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997083#comment-14997083 ] Ryan Blue commented on PARQUET-390: --- You're right that my suggestion is a much larger issue. For this

Re: Reading Parquet data from input stream and write to output stream

2015-11-09 Thread Selina Tech
Hi, Ryan: Thanks a lot for your suggestion. I would not need the output stream if I could write my continuous Kafka messages (in JSON, CSV, or Avro format) to AWS S3 in Parquet format. Could you share a bit more detail about it, so that I can find a solution in