Re: Question about my use case.

2018-03-13 Thread ALeX Wang
Also could i get a pointer to example that write parquet file from arrow
memory buffer directly?

The part i'm currently missing is how to derive the repetition level and
definition level@@

Thanks,

On 13 March 2018 at 17:52, ALeX Wang  wrote:

> hi,
>
> i know it is may not be the best place to ask but would like to try
> anyways, as it is quite hard for me to find good example of this online.
>
> My usecase:
>
> i'd like to generate from streaming data (using Scala) into arrow format
> in memory mapped file and then have my parquet-cpp program writing it as
> parquet file to disk.
>
> my understanding is that java parquet only implements HDFS writer, which
> is not my use case (not using hadoop) and parquet-cpp is much more
> succinct.
>
> My question:
>
> does my usecase make sense? or if there is better way?
>
> Thanks,
> --
> Alex Wang,
> Open vSwitch developer
>



-- 
Alex Wang,
Open vSwitch developer


Question about my use case.

2018-03-13 Thread ALeX Wang
hi,

i know it is may not be the best place to ask but would like to try
anyways, as it is quite hard for me to find good example of this online.

My usecase:

i'd like to generate from streaming data (using Scala) into arrow format in
memory mapped file and then have my parquet-cpp program writing it as
parquet file to disk.

my understanding is that java parquet only implements HDFS writer, which is
not my use case (not using hadoop) and parquet-cpp is much more succinct.

My question:

does my usecase make sense? or if there is better way?

Thanks,
-- 
Alex Wang,
Open vSwitch developer


[jira] [Commented] (PARQUET-1143) Update Java for format 2.4.0 changes

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397914#comment-16397914
 ] 

ASF GitHub Bot commented on PARQUET-1143:
-

rdblue commented on issue #430: PARQUET-1143: Update to Parquet format 2.4.0.
URL: https://github.com/apache/parquet-mr/pull/430#issuecomment-372866727
 
 
   I'd like to get 1.10.0 out in the next week or two.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Update Java for format 2.4.0 changes
> 
>
> Key: PARQUET-1143
> URL: https://issues.apache.org/jira/browse/PARQUET-1143
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.8.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1143) Update Java for format 2.4.0 changes

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397790#comment-16397790
 ] 

ASF GitHub Bot commented on PARQUET-1143:
-

scottcarey commented on issue #430: PARQUET-1143: Update to Parquet format 
2.4.0.
URL: https://github.com/apache/parquet-mr/pull/430#issuecomment-372844823
 
 
   This is great!   I would love to test out writing some parquet files using 
zstd compression.
   
   It appears I can not do so without a parquet release however, containing 
this work.
   
   Am I mistaken? Is there a way to manually supply parquet-format 2.4 and 
combine it with released versions of parquet-avro/mr/etc and spark and output 
zstd files?
   
   If not, what is the rough ETA on a 1.9.1 or 1.10.0 release of parquet that 
would unlock zstd compression?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Update Java for format 2.4.0 changes
> 
>
> Key: PARQUET-1143
> URL: https://issues.apache.org/jira/browse/PARQUET-1143
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.9.0, 1.8.2
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1222) Definition of float and double sort order is ambiguous

2018-03-13 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-1222:
---
Summary: Definition of float and double sort order is ambiguous  (was: 
Definition of float and double sort order is ambigious)

> Definition of float and double sort order is ambiguous
> --
>
> Key: PARQUET-1222
> URL: https://issues.apache.org/jira/browse/PARQUET-1222
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Priority: Critical
> Fix For: format-2.5.0
>
>
> Currently parquet-format specifies the sort order for floating point numbers 
> as follows:
> {code:java}
>*   FLOAT - signed comparison of the represented value
>*   DOUBLE - signed comparison of the represented value
> {code}
> The problem is that the comparison of floating point numbers is only a 
> partial ordering with strange behaviour in specific corner cases. For 
> example, according to IEEE 754, -0 is neither less nor more than \+0 and 
> comparing NaN to anything always returns false. This ordering is not suitable 
> for statistics. Additionally, the Java implementation already uses a 
> different (total) ordering that handles these cases correctly but differently 
> than the C\+\+ implementations, which leads to interoperability problems.
> TypeDefinedOrder for doubles and floats should be deprecated and a new 
> TotalFloatingPointOrder should be introduced. The default for writing doubles 
> and floats would be the new TotalFloatingPointOrder. This ordering should be 
> effective and easy to implement in all programming languages.
> For reading existing stats created using TypeDefinedOrder, the following 
> compatibility rules should be applied:
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is \+0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain \+0 values as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-968) Add Hive/Presto support in ProtoParquet

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397313#comment-16397313
 ] 

ASF GitHub Bot commented on PARQUET-968:


qinghui-xu commented on a change in pull request #411: PARQUET-968 Add 
Hive/Presto support in ProtoParquet
URL: https://github.com/apache/parquet-mr/pull/411#discussion_r174216544
 
 

 ##
 File path: 
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
 ##
 @@ -129,10 +131,14 @@ public void add(Object value) {
   };
 }
 
-return newScalarConverter(parent, parentBuilder, fieldDescriptor, 
parquetType);
+OriginalType originalType = parquetType.getOriginalType() == null ? 
OriginalType.UTF8 : parquetType.getOriginalType();
 
 Review comment:
   I guess for the data generated in previous version of parquet-protobuf, it 
is not having the "OriginalType" annotation for repeated fields, thus this 
conditional test to be backward compatible.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Hive/Presto support in ProtoParquet
> ---
>
> Key: PARQUET-968
> URL: https://issues.apache.org/jira/browse/PARQUET-968
> Project: Parquet
>  Issue Type: Task
>Reporter: Constantin Muraru
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Parquet sync starting now

2018-03-13 Thread Julien Le Dem
https://meet.google.com/jpy-mump-ngc


[jira] [Commented] (PARQUET-1246) Ignore float/double statistics in case of NaN

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397217#comment-16397217
 ] 

ASF GitHub Bot commented on PARQUET-1246:
-

gszadovszky opened a new pull request #461: PARQUET-1246: Ignore float/double 
statistics in case of NaN
URL: https://github.com/apache/parquet-mr/pull/461
 
 
   Because of the ambigous sorting order of float/double the following changes 
made at the reading path of the related statistics:
   - Ignoring statistics in case of it contains a NaN value.
   - Using -0.0 as min value and +0.0 as max value independently from which 0.0 
value was saved in the statistics.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Ignore float/double statistics in case of NaN
> -
>
> Key: PARQUET-1246
> URL: https://issues.apache.org/jira/browse/PARQUET-1246
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.8.1
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.10.0
>
>
> The sorting order of the floating point values are not properly specified, 
> therefore NaN values can cause skipping valid values when filtering. See 
> PARQUET-1222 for more info.
> This issue is for ignoring statistics for float/double if it contains NaN to 
> prevent data loss at the read path when filtering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-323) INT96 should be marked as deprecated

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397196#comment-16397196
 ] 

ASF GitHub Bot commented on PARQUET-323:


rdblue commented on issue #86: PARQUET-323: Mark INT96 as deprecated
URL: https://github.com/apache/parquet-format/pull/86#issuecomment-372726185
 
 
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> INT96 should be marked as deprecated
> 
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Lars Volker
>Priority: Major
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosec 
> timestamp in Impala for some historical reasons, and should be deprecated. 
> Since nanosec precision is rarely a real requirement, one possible and simple 
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.
> Several projects (Impala, Hive, Spark, ...) support INT96.
> We need a clear spec of the replacement and the path to deprecation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397186#comment-16397186
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

xhochy commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r174192351
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& 
node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-// Same path but not the same node
-return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   @wesm What is your opinion on this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Segfault when writing Arrow table with duplicate columns
> --
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397136#comment-16397136
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

pitrou commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r174183059
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& 
node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-// Same path but not the same node
-return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   By the way other methods such `Node::Equals` take a node pointer, even 
though passing a null pointer isn't supported. Should I still convert back to a 
reference?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Segfault when writing Arrow table with duplicate columns
> --
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1246) Ignore float/double statistics in case of NaN

2018-03-13 Thread Gabor Szadovszky (JIRA)
Gabor Szadovszky created PARQUET-1246:
-

 Summary: Ignore float/double statistics in case of NaN
 Key: PARQUET-1246
 URL: https://issues.apache.org/jira/browse/PARQUET-1246
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


The sorting order of the floating point values are not properly specified, 
therefore NaN values can cause skipping valid values when filtering. See 
PARQUET-1222 for more info.
This issue is for ignoring statistics for float/double if it contains NaN to 
prevent data loss at the read path when filtering.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)