Re: Re: Re: Re: Re: parquet performance

2018-03-14 Thread Wes McKinney
Adding the mailing list back and adding the benchmark script. I notice one likely-serious problem: you are spawning num_columns * num_row_groups threads all at once. Based on what you've described about your data, that's ~300 threads simultaneously. I would recommend setting the number of threads e
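
The concern in this message (one thread per column per row group, all started at once) can be illustrated with a bounded worker pool. This is only a minimal sketch in Python, not the poster's benchmark script: `read_chunk` and `read_all_chunks` are hypothetical names, the reader is a stand-in for whatever actually decodes a column chunk, and the default of one worker per CPU core is an assumption, since the preview is cut off before the recommended thread count.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def read_chunk(row_group, column):
    """Hypothetical stand-in for reading one column chunk of one row group."""
    return (row_group, column)


def read_all_chunks(num_row_groups, num_columns, max_workers=None, reader=read_chunk):
    """Read every (row group, column) pair with at most max_workers threads."""
    if max_workers is None:
        # Assumption: one worker per CPU core; the original advice is truncated.
        max_workers = os.cpu_count() or 1
    tasks = list(product(range(num_row_groups), range(num_columns)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # All tasks are queued, but only max_workers of them run at any moment.
        # map() returns results in the same order as the task list.
        return list(pool.map(lambda task: reader(*task), tasks))
```

With 10 row groups and 30 columns this still performs 300 reads, but never more than `max_workers` of them concurrently, which avoids the scheduling and memory overhead of ~300 simultaneous threads.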

Re: Question about my use case.

2018-03-14 Thread Ryan Blue
Yeah, sounds like something went wrong. What is your data model? Parquet can handle Avro records pretty seamlessly if you already have them. On Wed, Mar 14, 2018 at 9:20 AM, ALeX Wang wrote: > Hi Ryan, > > Thanks for the reply, > > We are using samza for streaming, > > Regarding parquet java, th

Re: Question about my use case.

2018-03-14 Thread ALeX Wang
Hi Ryan, Thanks for the reply. We are using Samza for streaming. Regarding parquet java, I must not have used the APIs right, since last time we tried, we had 7 hadoop processes spawned for writing to a single file and it was much slower than our parquet c++ alternative. Thanks, On 14

Re: Question about my use case.

2018-03-14 Thread Ryan Blue
Hi Alex, I don't think what you're trying to do makes sense. If you're using Scala, then your data is already in the JVM and it is probably much easier to write it to Parquet using the Java library. While that library depends on Hadoop, you don't have to use it with HDFS. The Hadoop FileSystem int

[jira] [Created] (PARQUET-1248) java.lang.UnsupportedOperationException: Unimplemented type: StringType

2018-03-14 Thread Shrutika modi (JIRA)
Shrutika modi created PARQUET-1248: Summary: java.lang.UnsupportedOperationException: Unimplemented type: StringType Key: PARQUET-1248 URL: https://issues.apache.org/jira/browse/PARQUET-1248 Projec

[jira] [Created] (PARQUET-1247) org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary

2018-03-14 Thread Shrutika modi (JIRA)
Shrutika modi created PARQUET-1247: Summary: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary Key: PARQUET-1247 URL: https://issues.apache.org/jira/browse/PARQUET-1247

[jira] [Commented] (PARQUET-1242) parquet.thrift refers to wrong releases for the new compressions

2018-03-14 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398641#comment-16398641 ] ASF GitHub Bot commented on PARQUET-1242: zivanfi opened a new pull request #87

[jira] [Commented] (PARQUET-1246) Ignore float/double statistics in case of NaN

2018-03-14 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398633#comment-16398633 ] ASF GitHub Bot commented on PARQUET-1246: zivanfi commented on a change in pull

[jira] [Commented] (PARQUET-1246) Ignore float/double statistics in case of NaN

2018-03-14 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398630#comment-16398630 ] ASF GitHub Bot commented on PARQUET-1246: zivanfi commented on a change in pull

[jira] [Commented] (PARQUET-1246) Ignore float/double statistics in case of NaN

2018-03-14 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398631#comment-16398631 ] ASF GitHub Bot commented on PARQUET-1246: zivanfi commented on a change in pull

[jira] [Commented] (PARQUET-1246) Ignore float/double statistics in case of NaN

2018-03-14 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398632#comment-16398632 ] ASF GitHub Bot commented on PARQUET-1246: zivanfi commented on a change in pull

[jira] [Commented] (PARQUET-1246) Ignore float/double statistics in case of NaN

2018-03-14 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398629#comment-16398629 ] ASF GitHub Bot commented on PARQUET-1246: zivanfi commented on a change in pull

[jira] [Assigned] (PARQUET-1212) Write column indexes: Show indexes in tools

2018-03-14 Thread Gabor Szadovszky (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-1212: Assignee: Gabor Szadovszky > Write column indexes: Show indexes in tools >