Re: Question about my use case.
Also, could I get a pointer to an example that writes a Parquet file from an Arrow memory buffer directly? The part I'm currently missing is how to derive the repetition level and definition level. Thanks, On 13 March 2018 at 17:52, Alex Wang wrote: > hi, > > I know this may not be the best place to ask, but I would like to try anyway, as it is quite hard for me to find a good example of this online. > > My use case: > > I'd like to convert streaming data (using Scala) into Arrow format in a memory-mapped file and then have my parquet-cpp program write it as a Parquet file to disk. > > My understanding is that Java Parquet only implements an HDFS writer, which does not fit my use case (I'm not using Hadoop), and parquet-cpp is much more succinct. > > My question: > > Does my use case make sense? Or is there a better way? > > Thanks, > -- > Alex Wang, > Open vSwitch developer > -- Alex Wang, Open vSwitch developer
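The repetition/definition-level question above is the Dremel record-shredding step that Parquet writers perform per column. As a minimal pure-Python sketch (not a parquet-cpp API; the function name and schema are illustrative), consider the simplest nested case — a column typed as an optional list of required ints, so the max definition level is 2 (optional list, repeated element) and the max repetition level is 1 (one repeated field in the path):

```python
def shred_optional_list(records):
    """Compute Dremel-style (values, definition_levels, repetition_levels)
    for a column typed as: optional list of required ints.

    Definition levels: 0 = list is null, 1 = list present but empty,
    2 = element value present. Repetition level 0 starts a new record;
    1 continues the current list."""
    values, def_levels, rep_levels = [], [], []
    for record in records:
        if record is None:            # the list itself is null
            def_levels.append(0)
            rep_levels.append(0)
        elif len(record) == 0:        # list present but empty
            def_levels.append(1)
            rep_levels.append(0)
        else:
            for i, v in enumerate(record):
                values.append(v)
                def_levels.append(2)                    # fully defined value
                rep_levels.append(0 if i == 0 else 1)   # 0 opens a new record
    return values, def_levels, rep_levels

vals, defs, reps = shred_optional_list([[1, 2], None, [], [3]])
# vals == [1, 2, 3]; defs == [2, 2, 0, 1, 2]; reps == [0, 1, 0, 0, 0]
```

Note that nulls and empty lists produce level entries but no value — that is why the level streams are longer than the value stream.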
Question about my use case.
hi, I know this may not be the best place to ask, but I would like to try anyway, as it is quite hard for me to find a good example of this online. My use case: I'd like to convert streaming data (using Scala) into Arrow format in a memory-mapped file and then have my parquet-cpp program write it as a Parquet file to disk. My understanding is that Java Parquet only implements an HDFS writer, which does not fit my use case (I'm not using Hadoop), and parquet-cpp is much more succinct. My question: does my use case make sense? Or is there a better way? Thanks, -- Alex Wang, Open vSwitch developer
[jira] [Commented] (PARQUET-1143) Update Java for format 2.4.0 changes
[ https://issues.apache.org/jira/browse/PARQUET-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397914#comment-16397914 ] ASF GitHub Bot commented on PARQUET-1143: - rdblue commented on issue #430: PARQUET-1143: Update to Parquet format 2.4.0. URL: https://github.com/apache/parquet-mr/pull/430#issuecomment-372866727 I'd like to get 1.10.0 out in the next week or two. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Update Java for format 2.4.0 changes > > > Key: PARQUET-1143 > URL: https://issues.apache.org/jira/browse/PARQUET-1143 > Project: Parquet > Issue Type: Task > Components: parquet-mr >Affects Versions: 1.9.0, 1.8.2 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PARQUET-1143) Update Java for format 2.4.0 changes
[ https://issues.apache.org/jira/browse/PARQUET-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397790#comment-16397790 ] ASF GitHub Bot commented on PARQUET-1143: - scottcarey commented on issue #430: PARQUET-1143: Update to Parquet format 2.4.0. URL: https://github.com/apache/parquet-mr/pull/430#issuecomment-372844823 This is great! I would love to test out writing some Parquet files using zstd compression. It appears I cannot do so without a Parquet release containing this work, however. Am I mistaken? Is there a way to manually supply parquet-format 2.4, combine it with released versions of parquet-avro/mr/etc. and Spark, and output zstd files? If not, what is the rough ETA on a 1.9.1 or 1.10.0 release of Parquet that would unlock zstd compression?
[jira] [Updated] (PARQUET-1222) Definition of float and double sort order is ambiguous
[ https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Le Dem updated PARQUET-1222: --- Summary: Definition of float and double sort order is ambiguous (was: Definition of float and double sort order is ambigious) > Definition of float and double sort order is ambiguous > -- > > Key: PARQUET-1222 > URL: https://issues.apache.org/jira/browse/PARQUET-1222 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Zoltan Ivanfi >Priority: Critical > Fix For: format-2.5.0 > > > Currently parquet-format specifies the sort order for floating point numbers > as follows: > {code:java} >* FLOAT - signed comparison of the represented value >* DOUBLE - signed comparison of the represented value > {code} > The problem is that the comparison of floating point numbers is only a > partial ordering with strange behaviour in specific corner cases. For > example, according to IEEE 754, -0 is neither less nor more than \+0 and > comparing NaN to anything always returns false. This ordering is not suitable > for statistics. Additionally, the Java implementation already uses a > different (total) ordering that handles these cases correctly but differently > than the C\+\+ implementations, which leads to interoperability problems. > TypeDefinedOrder for doubles and floats should be deprecated and a new > TotalFloatingPointOrder should be introduced. The default for writing doubles > and floats would be the new TotalFloatingPointOrder. This ordering should be > effective and easy to implement in all programming languages. > For reading existing stats created using TypeDefinedOrder, the following > compatibility rules should be applied: > * When looking for NaN values, min and max should be ignored. > * If the min is a NaN, it should be ignored. > * If the max is a NaN, it should be ignored. > * If the min is \+0, the row group may contain -0 values as well. 
> * If the max is -0, the row group may contain \+0 values as well.
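The corner cases called out above are easy to reproduce, and the proposed total ordering can be sketched with the usual bit-pattern trick — the same idea behind Java's `Double.compare`. This is an illustrative sketch, not the ordering the format will ultimately specify; `total_order_key` is a hypothetical helper:

```python
import math
import struct

nan = float("nan")
# IEEE 754 comparison is only a partial order:
assert not (nan < 1.0) and not (nan > 1.0) and nan != nan
assert -0.0 == 0.0  # -0 is neither less nor greater than +0

def total_order_key(x):
    # Reinterpret the double as a signed 64-bit integer, then flip the
    # low 63 bits of negatives so that plain integer order becomes a
    # total order: -NaN < -inf < ... < -0.0 < +0.0 < ... < +inf < +NaN
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    return bits if bits >= 0 else bits ^ 0x7FFFFFFFFFFFFFFF

xs = sorted([1.5, -0.0, nan, 0.0, -2.0], key=total_order_key)
# xs is now [-2.0, -0.0, 0.0, 1.5, nan]
```

Such a key is cheap to compute in any language with IEEE 754 doubles, which is the "effective and easy to implement" property the proposal asks for.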
[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet
[ https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397313#comment-16397313 ] ASF GitHub Bot commented on PARQUET-968: qinghui-xu commented on a change in pull request #411: PARQUET-968 Add Hive/Presto support in ProtoParquet URL: https://github.com/apache/parquet-mr/pull/411#discussion_r174216544 ## File path: parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java ## @@ -129,10 +131,14 @@ public void add(Object value) { }; } -return newScalarConverter(parent, parentBuilder, fieldDescriptor, parquetType); +OriginalType originalType = parquetType.getOriginalType() == null ? OriginalType.UTF8 : parquetType.getOriginalType(); Review comment: I guess data generated by previous versions of parquet-protobuf does not have the "OriginalType" annotation for repeated fields, hence this conditional test for backward compatibility. > Add Hive/Presto support in ProtoParquet > --- > > Key: PARQUET-968 > URL: https://issues.apache.org/jira/browse/PARQUET-968 > Project: Parquet > Issue Type: Task >Reporter: Constantin Muraru >Priority: Major >
Parquet sync starting now
https://meet.google.com/jpy-mump-ngc
[jira] [Commented] (PARQUET-1246) Ignore float/double statistics in case of NaN
[ https://issues.apache.org/jira/browse/PARQUET-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397217#comment-16397217 ] ASF GitHub Bot commented on PARQUET-1246: - gszadovszky opened a new pull request #461: PARQUET-1246: Ignore float/double statistics in case of NaN URL: https://github.com/apache/parquet-mr/pull/461 Because of the ambiguous sorting order of float/double, the following changes were made on the read path of the related statistics: - Statistics are ignored if they contain a NaN value. - -0.0 is used as the min value and +0.0 as the max value, regardless of which 0.0 value was saved in the statistics. > Ignore float/double statistics in case of NaN > - > > Key: PARQUET-1246 > URL: https://issues.apache.org/jira/browse/PARQUET-1246 > Project: Parquet > Issue Type: Bug >Affects Versions: 1.8.1 >Reporter: Gabor Szadovszky >Assignee: Gabor Szadovszky >Priority: Major > Fix For: 1.10.0 > > > The sorting order of floating point values is not properly specified, > therefore NaN values can cause valid values to be skipped when filtering. See > PARQUET-1222 for more info. > This issue is for ignoring float/double statistics that contain NaN, to > prevent data loss on the read path when filtering.
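The read-path compatibility rule described in this pull request can be sketched as follows. This is a hedged illustration of the rule, not parquet-mr's actual API; `usable_min_max` is a hypothetical helper:

```python
import math

def usable_min_max(min_val, max_val):
    """Sanitize legacy float/double column statistics for filtering:
    drop the bounds entirely if either is NaN, and widen zero bounds
    because the writer may have collapsed -0.0/+0.0 either way."""
    if math.isnan(min_val) or math.isnan(max_val):
        return None        # statistics are unusable for filtering
    if min_val == 0.0:     # true for both +0.0 and -0.0
        min_val = -0.0     # row group may still contain -0.0 values
    if max_val == 0.0:
        max_val = 0.0      # assigns +0.0; row group may contain +0.0
    return (min_val, max_val)
```

A reader applying this never skips a row group that could contain a matching -0.0/+0.0 value, and never trusts bounds polluted by NaN.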
[jira] [Commented] (PARQUET-323) INT96 should be marked as deprecated
[ https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397196#comment-16397196 ] ASF GitHub Bot commented on PARQUET-323: rdblue commented on issue #86: PARQUET-323: Mark INT96 as deprecated URL: https://github.com/apache/parquet-format/pull/86#issuecomment-372726185 +1 > INT96 should be marked as deprecated > > > Key: PARQUET-323 > URL: https://issues.apache.org/jira/browse/PARQUET-323 > Project: Parquet > Issue Type: Bug > Components: parquet-format >Reporter: Cheng Lian >Assignee: Lars Volker >Priority: Major > > As discussed in the mailing list, {{INT96}} is only used to represent nanosec > timestamp in Impala for some historical reasons, and should be deprecated. > Since nanosec precision is rarely a real requirement, one possible and simple > solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or > {{INT64 (TIMESTAMP_MICROS)}}. > Several projects (Impala, Hive, Spark, ...) support INT96. > We need a clear spec of the replacement and the path to deprecation.
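For context on the replacement path: the Impala-style INT96 value packs 8 bytes of little-endian nanoseconds-of-day followed by a 4-byte little-endian Julian day. Converting one to the suggested {{INT64 (TIMESTAMP_MICROS)}} representation can be sketched as below (the helper name is illustrative, not from any Parquet library):

```python
import struct

UNIX_EPOCH_JULIAN_DAY = 2440588  # Julian day number of 1970-01-01

def int96_to_timestamp_micros(raw12):
    """Decode a 12-byte Impala-style INT96 timestamp into INT64 epoch
    microseconds, truncating nanoseconds to microsecond precision."""
    # '<qi' = little-endian int64 (nanos of day) then int32 (Julian day)
    nanos_of_day, julian_day = struct.unpack("<qi", raw12)
    days_since_epoch = julian_day - UNIX_EPOCH_JULIAN_DAY
    return days_since_epoch * 86_400_000_000 + nanos_of_day // 1000

# midnight 1970-01-01 encodes to epoch microsecond 0
assert int96_to_timestamp_micros(struct.pack("<qi", 0, 2440588)) == 0
```

The lossy step is only the final nanosecond-to-microsecond truncation, which is why the issue notes that nanosecond precision is rarely a real requirement.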
[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns
[ https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397186#comment-16397186 ] ASF GitHub Bot commented on PARQUET-1245: - xhochy commented on a change in pull request #447: PARQUET-1245: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r174192351 ## File path: src/parquet/schema.cc ## @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& node_path) const { return search->second; } -int SchemaDescriptor::ColumnIndex(const Node& node) const { - int result = ColumnIndex(node.path()->ToDotString()); - if (result < 0) { -return -1; - } - DCHECK(result < num_columns()); - if (!node.Equals(Column(result)->schema_node().get())) { -// Same path but not the same node -return -1; +int SchemaDescriptor::ColumnIndex(const Node* node) const { Review comment: @wesm What is your opinion on this? > [C++] Segfault when writing Arrow table with duplicate columns > -- > > Key: PARQUET-1245 > URL: https://issues.apache.org/jira/browse/PARQUET-1245 > Project: Parquet > Issue Type: Bug > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: cpp-1.5.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. 
> {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.
[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns
[ https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397136#comment-16397136 ] ASF GitHub Bot commented on PARQUET-1245: - pitrou commented on a change in pull request #447: PARQUET-1245: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r174183059 ## File path: src/parquet/schema.cc ## @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& node_path) const { return search->second; } -int SchemaDescriptor::ColumnIndex(const Node& node) const { - int result = ColumnIndex(node.path()->ToDotString()); - if (result < 0) { -return -1; - } - DCHECK(result < num_columns()); - if (!node.Equals(Column(result)->schema_node().get())) { -// Same path but not the same node -return -1; +int SchemaDescriptor::ColumnIndex(const Node* node) const { Review comment: By the way, other methods such as `Node::Equals` take a node pointer, even though passing a null pointer isn't supported. Should I still convert back to a reference?
[jira] [Created] (PARQUET-1246) Ignore float/double statistics in case of NaN
Gabor Szadovszky created PARQUET-1246: - Summary: Ignore float/double statistics in case of NaN Key: PARQUET-1246 URL: https://issues.apache.org/jira/browse/PARQUET-1246 Project: Parquet Issue Type: Bug Reporter: Gabor Szadovszky Assignee: Gabor Szadovszky The sorting order of floating point values is not properly specified, therefore NaN values can cause valid values to be skipped when filtering. See PARQUET-1222 for more info. This issue is for ignoring float/double statistics that contain NaN, to prevent data loss on the read path when filtering.