[jira] [Updated] (PARQUET-1935) [C++][Parquet] nullptr access violation when writing arrays of non-nullable values

2020-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1935:

Labels: pull-request-available  (was: )

> [C++][Parquet] nullptr access violation when writing arrays of non-nullable 
> values
> --
>
> Key: PARQUET-1935
> URL: https://issues.apache.org/jira/browse/PARQUET-1935
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Tanguy Fautre
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm updating ParquetSharp to build against Arrow 2.0.0 (it currently builds 
> against Arrow 1.0.1). One of our unit tests is now throwing a {{nullptr}} access violation.
> I have narrowed it down to writing arrays of non-nullable values (in this 
> case the column contains {{int[]}}). If the values are nullable, the test 
> passes.
> The Parquet file schema is as follows:
> * {{GroupNode("schema", LogicalType.None, Repetition.Required)}}
> ** {{GroupNode("array_of_ints_column", LogicalType.List, Repetition.Optional)}}
> *** {{GroupNode("list", LogicalType.None, Repetition.Repeated)}}
> **** {{PrimitiveNode("item", LogicalType.Int(32, signed), Repetition.Required)}}
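
For reference, a minimal sketch of how a schema like the one above could be declared with the parquet-cpp schema API (node names mirror the list; the helper function name is made up, and the exact factory overloads should be treated as illustrative rather than authoritative):

{code:c++}
#include <memory>
#include <parquet/schema.h>

using parquet::LogicalType;
using parquet::Repetition;
using parquet::Type;
namespace schema = parquet::schema;

// required group "schema" {
//   optional group "array_of_ints_column" (LIST) {
//     repeated group "list" { required int32 "item" (INT(32, signed)) } } }
std::shared_ptr<schema::GroupNode> MakeArrayOfIntsSchema() {
  auto item = schema::PrimitiveNode::Make(
      "item", Repetition::REQUIRED, LogicalType::Int(32, /*is_signed=*/true), Type::INT32);
  auto list = schema::GroupNode::Make("list", Repetition::REPEATED, {item});
  auto column = schema::GroupNode::Make(
      "array_of_ints_column", Repetition::OPTIONAL, {list}, LogicalType::List());
  return std::static_pointer_cast<schema::GroupNode>(
      schema::GroupNode::Make("schema", Repetition::REQUIRED, {column}));
}
{code}
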
> The test crashes when calling {{TypedColumnWriter::WriteBatchSpaced}} with the following arguments:
> * {{num_values = 1}}
> * {{def_levels = {0}}}
> * {{rep_levels = {0}}}
> * {{valid_bits = {0}}}
> * {{valid_bit_offset = 0}}
> * {{values = {}}} (i.e. {{nullptr}})
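
Continuing the sketch above, the failing call itself looks roughly like this (obtaining the {{Int32Writer}} from a file/row-group writer is elided and the helper name is made up; only the argument values match what the debugger reports):

{code:c++}
#include <parquet/column_writer.h>

// Writes one row whose "array_of_ints_column" is null. Because the leaf "item"
// is Required there is no per-value validity, so the caller passes values = nullptr.
void WriteNullList(parquet::Int32Writer* writer) {
  const int16_t def_levels[] = {0};   // below the list's definition level: the list itself is null
  const int16_t rep_levels[] = {0};   // first (and only) entry of the row
  const uint8_t valid_bits[] = {0};
  writer->WriteBatchSpaced(/*num_values=*/1, def_levels, rep_levels,
                           valid_bits, /*valid_bits_offset=*/0,
                           /*values=*/nullptr);  // crashes as described below
}
{code}
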
> This call is effectively trying to write a null array, and therefore (to my 
> understanding) does not need to pass any values. Yet further down the 
> call stack, the implementation tries to read one value out of {{values}} 
> (which is {{nullptr}}).
> I believe the problem lies with
> {code:c++}
>   void MaybeCalculateValidityBits(
>       const int16_t* def_levels,
>       int64_t batch_size,
>       int64_t* out_values_to_write,
>       int64_t* out_spaced_values_to_write,
>       int64_t* null_count) {
>     if (bits_buffer_ == nullptr) {
>       if (!level_info_.HasNullableValues()) {
>         *out_values_to_write = batch_size;
>         *out_spaced_values_to_write = batch_size;
>         *null_count = 0;
>       } else {
>         for (int x = 0; x < batch_size; x++) {
>           *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;
>           *out_spaced_values_to_write +=
>               def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
>         }
>         *null_count = *out_values_to_write - *out_spaced_values_to_write;
>       }
>       return;
>     }
>     // ...
>   }
> {code}
> In particular, {{level_info_.HasNullableValues()}} returns {{false}} given 
> that the arrays cannot contain null values. My understanding is that this is 
> wrong, since the arrays themselves are nullable.
> This code appears to have been introduced by ARROW-9603.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] shangxinli commented on a change in pull request #819: PARQUET-1915: Add nullify column

2020-10-23 Thread GitBox


shangxinli commented on a change in pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819#discussion_r511212212



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/ColumnMasker.java
##
@@ -0,0 +1,274 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.ColumnReader;
+import org.apache.parquet.column.ColumnWriter;
+import org.apache.parquet.column.Encoding;
+import org.apache.parquet.column.ParquetProperties;
+import org.apache.parquet.column.ParquetProperties.WriterVersion;
+import org.apache.parquet.column.impl.ColumnReadStoreImpl;
+import org.apache.parquet.column.page.DataPage;
+import org.apache.parquet.column.page.DictionaryPage;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.column.page.PageWriteStore;
+import org.apache.parquet.column.page.PageWriter;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileReader;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileWriter;
+import org.apache.parquet.internal.column.columnindex.ColumnIndex;
+import org.apache.parquet.internal.column.columnindex.OffsetIndex;
+import org.apache.parquet.io.ParquetEncodingException;
+import org.apache.parquet.io.api.Converter;
+import org.apache.parquet.io.api.GroupConverter;
+import org.apache.parquet.io.api.PrimitiveConverter;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.apache.parquet.schema.Type;
+
+import java.io.IOException;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+public class ColumnMasker {
+  /**
+   *
+   * @param reader Reader of source file
+   * @param writer Writer of destination file
+   * @param meta Metadata of source file
+   * @param schema Schema of source file
+   * @param paths Column paths that need to be masked
+   * @param maskMode Masking mode to apply
+   * @throws IOException
+   */
+  public void processBlocks(TransParquetFileReader reader, TransParquetFileWriter writer, ParquetMetadata meta,
+                            MessageType schema, List<String> paths, MaskMode maskMode) throws IOException {
+    Set<ColumnPath> nullifyColumns = convertToColumnPaths(paths);
+    int blockIndex = 0;
+    PageReadStore store = reader.readNextRowGroup();
+
+    while (store != null) {
+      writer.startBlock(store.getRowCount());
+      List<ColumnChunkMetaData> columnsInOrder = meta.getBlocks().get(blockIndex).getColumns();
+      Map<ColumnPath, ColumnDescriptor> descriptorsMap = schema.getColumns().stream().collect(
+        Collectors.toMap(x -> ColumnPath.get(x.getPath()), x -> x));
+      ColumnReadStoreImpl crStore = new ColumnReadStoreImpl(store, new DummyGroupConverter(), schema,
+        meta.getFileMetaData().getCreatedBy());
+
+      for (int i = 0; i < columnsInOrder.size(); i += 1) {
+        ColumnChunkMetaData chunk = columnsInOrder.get(i);
+        ColumnDescriptor descriptor = descriptorsMap.get(chunk.getPath());
+        processChunk(descriptor, chunk, crStore, reader, writer, schema, nullifyColumns, maskMode);
+      }
+
+      writer.endBlock();
+      store = reader.readNextRowGroup();
+      blockIndex++;
+    }
+  }
+
+  private void processChunk(ColumnDescriptor descriptor, ColumnChunkMetaData chunk, ColumnReadStoreImpl crStore,
+                            TransParquetFileReader reader, TransParquetFileWriter writer, MessageType schema,
+                            Set<ColumnPath> paths, MaskMode maskMode) throws IOException {
+    reader.setStreamPosition(chunk.getStartingPos());
+
+    if (paths.contains(chunk.getPath())) {
+      if (maskMode.equals(MaskMode.NULLIFY)) {
+

[jira] [Commented] (PARQUET-1915) Add null command

2020-10-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219981#comment-17219981
 ] 

ASF GitHub Bot commented on PARQUET-1915:
-

shangxinli commented on a change in pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819#discussion_r511212212



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/ColumnMasker.java
##
@@ -0,0 +1,274 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.ColumnReader;
+import org.apache.parquet.column.ColumnWriter;
+import org.apache.parquet.column.Encoding;
+import org.apache.parquet.column.ParquetProperties;
+import org.apache.parquet.column.ParquetProperties.WriterVersion;
+import org.apache.parquet.column.impl.ColumnReadStoreImpl;
+import org.apache.parquet.column.page.DataPage;
+import org.apache.parquet.column.page.DictionaryPage;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.column.page.PageWriteStore;
+import org.apache.parquet.column.page.PageWriter;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileReader;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileWriter;
+import org.apache.parquet.internal.column.columnindex.ColumnIndex;
+import org.apache.parquet.internal.column.columnindex.OffsetIndex;
+import org.apache.parquet.io.ParquetEncodingException;
+import org.apache.parquet.io.api.Converter;
+import org.apache.parquet.io.api.GroupConverter;
+import org.apache.parquet.io.api.PrimitiveConverter;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.apache.parquet.schema.Type;
+
+import java.io.IOException;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+public class ColumnMasker {
+  /**
+   *
+   * @param reader Reader of source file
+   * @param writer Writer of destination file
+   * @param meta Metadata of source file
+   * @param schema Schema of source file
+   * @param paths Column paths that need to be masked
+   * @param maskMode Masking mode to apply
+   * @throws IOException
+   */
+  public void processBlocks(TransParquetFileReader reader, TransParquetFileWriter writer, ParquetMetadata meta,
+                            MessageType schema, List<String> paths, MaskMode maskMode) throws IOException {
+    Set<ColumnPath> nullifyColumns = convertToColumnPaths(paths);
+    int blockIndex = 0;
+    PageReadStore store = reader.readNextRowGroup();
+
+    while (store != null) {
+      writer.startBlock(store.getRowCount());
+      List<ColumnChunkMetaData> columnsInOrder = meta.getBlocks().get(blockIndex).getColumns();
+      Map<ColumnPath, ColumnDescriptor> descriptorsMap = schema.getColumns().stream().collect(
+        Collectors.toMap(x -> ColumnPath.get(x.getPath()), x -> x));
+      ColumnReadStoreImpl crStore = new ColumnReadStoreImpl(store, new DummyGroupConverter(), schema,
+        meta.getFileMetaData().getCreatedBy());
+
+      for (int i = 0; i < columnsInOrder.size(); i += 1) {
+        ColumnChunkMetaData chunk = columnsInOrder.get(i);
+        ColumnDescriptor descriptor = descriptorsMap.get(chunk.getPath());
+        processChunk(descriptor, chunk, crStore, reader, writer, schema, nullifyColumns, maskMode);
+      }
+
+      writer.endBlock();
+      store = reader.readNextRowGroup();
+      blockIndex++;
+    }
+  }
+
+  private void processChunk(ColumnDescriptor descriptor, ColumnChunkMetaData chunk, ColumnReadStoreImpl crStore,
+                            TransParquetFileReader reader, TransParquetFileWriter wri

[jira] [Commented] (PARQUET-1935) [C++][Parquet] nullptr access violation when writing arrays of non-nullable values

2020-10-23 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219933#comment-17219933
 ] 

Micah Kornfield commented on PARQUET-1935:
--

One workaround for this is to detect that the leaf column doesn't contain nulls and 
call WriteBatch (not the spaced variant) instead; that is what the Arrow code path 
does. I need to think about the right fix here (it will involve changing the check 
that you initially identified as problematic).
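
A rough caller-side sketch of that workaround, assuming the leaf node's repetition can be checked through the column descriptor (names and the exact detection are illustrative, not the eventual fix):

{code:c++}
#include <parquet/column_writer.h>
#include <parquet/schema.h>

// If the leaf ("item") is Required, its values can never be null, so the
// non-spaced WriteBatch can be used and only the defined values are passed
// (none here, because the single entry is a null list).
void WriteNullList(parquet::Int32Writer* writer) {
  const int16_t def_levels[] = {0};
  const int16_t rep_levels[] = {0};
  if (writer->descr()->schema_node()->is_required()) {
    writer->WriteBatch(/*num_values=*/1, def_levels, rep_levels, /*values=*/nullptr);
  } else {
    const uint8_t valid_bits[] = {0};
    writer->WriteBatchSpaced(1, def_levels, rep_levels, valid_bits,
                             /*valid_bits_offset=*/0, /*values=*/nullptr);
  }
}
{code}
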

> [C++][Parquet] nullptr access violation when writing arrays of non-nullable 
> values
> --
>
> Key: PARQUET-1935
> URL: https://issues.apache.org/jira/browse/PARQUET-1935
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Tanguy Fautre
>Assignee: Micah Kornfield
>Priority: Major
>
> I'm updating ParquetSharp to build against Arrow 2.0.0 (it currently builds 
> against Arrow 1.0.1). One of our unit tests is now throwing a {{nullptr}} access violation.
> I have narrowed it down to writing arrays of non-nullable values (in this 
> case the column contains {{int[]}}). If the values are nullable, the test 
> passes.
> The Parquet file schema is as follows:
> * {{GroupNode("schema", LogicalType.None, Repetition.Required)}}
> ** {{GroupNode("array_of_ints_column", LogicalType.List, Repetition.Optional)}}
> *** {{GroupNode("list", LogicalType.None, Repetition.Repeated)}}
> **** {{PrimitiveNode("item", LogicalType.Int(32, signed), Repetition.Required)}}
> The test crashes when calling {{TypedColumnWriter::WriteBatchSpaced}} with the following arguments:
> * {{num_values = 1}}
> * {{def_levels = {0}}}
> * {{rep_levels = {0}}}
> * {{valid_bits = {0}}}
> * {{valid_bit_offset = 0}}
> * {{values = {}}} (i.e. {{nullptr}})
> This call is effectively trying to write a null array, and therefore (to my 
> understanding) does not need to pass any values. Yet further down the 
> call stack, the implementation tries to read one value out of {{values}} 
> (which is {{nullptr}}).
> I believe the problem lies with
> {code:c++}
>   void MaybeCalculateValidityBits(
>       const int16_t* def_levels,
>       int64_t batch_size,
>       int64_t* out_values_to_write,
>       int64_t* out_spaced_values_to_write,
>       int64_t* null_count) {
>     if (bits_buffer_ == nullptr) {
>       if (!level_info_.HasNullableValues()) {
>         *out_values_to_write = batch_size;
>         *out_spaced_values_to_write = batch_size;
>         *null_count = 0;
>       } else {
>         for (int x = 0; x < batch_size; x++) {
>           *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;
>           *out_spaced_values_to_write +=
>               def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
>         }
>         *null_count = *out_values_to_write - *out_spaced_values_to_write;
>       }
>       return;
>     }
>     // ...
>   }
> {code}
> In particular, {{level_info_.HasNullableValues()}} returns {{false}} given 
> that the arrays cannot contain null values. My understanding is that this is 
> wrong, since the arrays themselves are nullable.
> This code appears to have been introduced by ARROW-9603.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Moved] (PARQUET-1935) [C++][Parquet] nullptr access violation when writing arrays of non-nullable values

2020-10-23 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield moved ARROW-10377 to PARQUET-1935:
--

  Key: PARQUET-1935  (was: ARROW-10377)
Affects Version/s: (was: 2.0.0)
 Workflow: patch-available, re-open possible  (was: jira)
  Project: Parquet  (was: Apache Arrow)

> [C++][Parquet] nullptr access violation when writing arrays of non-nullable 
> values
> --
>
> Key: PARQUET-1935
> URL: https://issues.apache.org/jira/browse/PARQUET-1935
> Project: Parquet
>  Issue Type: Bug
>Reporter: Tanguy Fautre
>Assignee: Micah Kornfield
>Priority: Major
>
> I'm updating ParquetSharp to build against Arrow 2.0.0 (it currently builds 
> against Arrow 1.0.1). One of our unit tests is now throwing a {{nullptr}} access violation.
> I have narrowed it down to writing arrays of non-nullable values (in this 
> case the column contains {{int[]}}). If the values are nullable, the test 
> passes.
> The Parquet file schema is as follows:
> * {{GroupNode("schema", LogicalType.None, Repetition.Required)}}
> ** {{GroupNode("array_of_ints_column", LogicalType.List, Repetition.Optional)}}
> *** {{GroupNode("list", LogicalType.None, Repetition.Repeated)}}
> **** {{PrimitiveNode("item", LogicalType.Int(32, signed), Repetition.Required)}}
> The test crashes when calling {{TypedColumnWriter::WriteBatchSpaced}} with the following arguments:
> * {{num_values = 1}}
> * {{def_levels = {0}}}
> * {{rep_levels = {0}}}
> * {{valid_bits = {0}}}
> * {{valid_bit_offset = 0}}
> * {{values = {}}} (i.e. {{nullptr}})
> This call is effectively trying to write a null array, and therefore (to my 
> understanding) does not need to pass any values. Yet further down the 
> call stack, the implementation tries to read one value out of {{values}} 
> (which is {{nullptr}}).
> I believe the problem lies with
> {code:c++}
>   void MaybeCalculateValidityBits(
>       const int16_t* def_levels,
>       int64_t batch_size,
>       int64_t* out_values_to_write,
>       int64_t* out_spaced_values_to_write,
>       int64_t* null_count) {
>     if (bits_buffer_ == nullptr) {
>       if (!level_info_.HasNullableValues()) {
>         *out_values_to_write = batch_size;
>         *out_spaced_values_to_write = batch_size;
>         *null_count = 0;
>       } else {
>         for (int x = 0; x < batch_size; x++) {
>           *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;
>           *out_spaced_values_to_write +=
>               def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
>         }
>         *null_count = *out_values_to_write - *out_spaced_values_to_write;
>       }
>       return;
>     }
>     // ...
>   }
> {code}
> In particular, {{level_info_.HasNullableValues()}} returns {{false}} given 
> that the arrays cannot contain null values. My understanding is that this is 
> wrong, since the arrays themselves are nullable.
> This code appears to have been introduced by ARROW-9603.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1935) [C++][Parquet] nullptr access violation when writing arrays of non-nullable values

2020-10-23 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated PARQUET-1935:
-
Component/s: parquet-cpp

> [C++][Parquet] nullptr access violation when writing arrays of non-nullable 
> values
> --
>
> Key: PARQUET-1935
> URL: https://issues.apache.org/jira/browse/PARQUET-1935
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Tanguy Fautre
>Assignee: Micah Kornfield
>Priority: Major
>
> I'm updating ParquetSharp to build against Arrow 2.0.0 (it currently builds 
> against Arrow 1.0.1). One of our unit tests is now throwing a {{nullptr}} access violation.
> I have narrowed it down to writing arrays of non-nullable values (in this 
> case the column contains {{int[]}}). If the values are nullable, the test 
> passes.
> The Parquet file schema is as follows:
> * {{GroupNode("schema", LogicalType.None, Repetition.Required)}}
> ** {{GroupNode("array_of_ints_column", LogicalType.List, Repetition.Optional)}}
> *** {{GroupNode("list", LogicalType.None, Repetition.Repeated)}}
> **** {{PrimitiveNode("item", LogicalType.Int(32, signed), Repetition.Required)}}
> The test crashes when calling {{TypedColumnWriter::WriteBatchSpaced}} with the following arguments:
> * {{num_values = 1}}
> * {{def_levels = {0}}}
> * {{rep_levels = {0}}}
> * {{valid_bits = {0}}}
> * {{valid_bit_offset = 0}}
> * {{values = {}}} (i.e. {{nullptr}})
> This call is effectively trying to write a null array, and therefore (to my 
> understanding) does not need to pass any values. Yet further down the 
> call stack, the implementation tries to read one value out of {{values}} 
> (which is {{nullptr}}).
> I believe the problem lies with
> {code:c++}
>   void MaybeCalculateValidityBits(
>       const int16_t* def_levels,
>       int64_t batch_size,
>       int64_t* out_values_to_write,
>       int64_t* out_spaced_values_to_write,
>       int64_t* null_count) {
>     if (bits_buffer_ == nullptr) {
>       if (!level_info_.HasNullableValues()) {
>         *out_values_to_write = batch_size;
>         *out_spaced_values_to_write = batch_size;
>         *null_count = 0;
>       } else {
>         for (int x = 0; x < batch_size; x++) {
>           *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;
>           *out_spaced_values_to_write +=
>               def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
>         }
>         *null_count = *out_values_to_write - *out_spaced_values_to_write;
>       }
>       return;
>     }
>     // ...
>   }
> {code}
> In particular, {{level_info_.HasNullableValues()}} returns {{false}} given 
> that the arrays cannot contain null values. My understanding is that this is 
> wrong, since the arrays themselves are nullable.
> This code appears to have been introduced by ARROW-9603.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-23 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang reassigned PARQUET-1927:


Assignee: Xinli Shang

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating the Parquet ColumnIndex, I found that we need Parquet to tell 
> us how many records were skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advances to the next row group 
> without telling the caller. See the code here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator whose hasNext() performs the 
> following check:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the number of skipped values, it is hard to determine 
> whether hasNext() should return true. 
>  
> Currently, we can work around this with a flag: when readNextFilteredRowGroup() 
> returns null, we consider the whole file done and hasNext() simply returns 
> false. 
>  
>  
>  
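
A minimal sketch of that flag-based workaround (class and field names are made up for illustration; the only parquet-mr API used is {{readNextFilteredRowGroup()}}):

{code:java}
import java.io.IOException;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.hadoop.ParquetFileReader;

// Rather than tracking how many rows ColumnIndex filtering skipped (which the
// reader does not expose), flip a flag once readNextFilteredRowGroup() returns
// null and let hasNext() rely on it.
class FilteredRowGroupCursor {
  private final ParquetFileReader reader;
  private PageReadStore currentRowGroup;
  private boolean fileExhausted = false;

  FilteredRowGroupCursor(ParquetFileReader reader) {
    this.reader = reader;
  }

  /** Returns false once the reader has no more (possibly filtered) row groups. */
  boolean advance() throws IOException {
    if (fileExhausted) {
      return false;
    }
    currentRowGroup = reader.readNextFilteredRowGroup();
    if (currentRowGroup == null) {
      fileExhausted = true;  // treat the whole file as done
      return false;
    }
    return true;
  }

  PageReadStore currentRowGroup() {
    return currentRowGroup;
  }
}
{code}
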



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1915) Add null command

2020-10-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219690#comment-17219690
 ] 

ASF GitHub Bot commented on PARQUET-1915:
-

shangxinli commented on a change in pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819#discussion_r510896852



##
File path: 
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ColumnMaskingCommand.java
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.HadoopReadOptions;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.ColumnMasker;
+import org.apache.parquet.hadoop.util.ColumnMasker.MaskMode;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileReader;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileWriter;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.slf4j.Logger;
+
+import java.io.IOException;
+import java.util.List;
+
+import static org.apache.parquet.format.converter.ParquetMetadataConverter.NO_FILTER;
+
+@Parameters(commandDescription="Replace columns with masked values and write to a new Parquet file")
+public class ColumnMaskingCommand extends BaseCommand {
+
+  private ColumnMasker masker;
+
+  public ColumnMaskingCommand(Logger console) {
+    super(console);
+    masker = new ColumnMasker();
+  }
+
+  @Parameter(description = "")

Review comment:
   > I am not sure it is a good idea to list the different modes and throw 
exceptions for the ones that are not supported/implemented. We are heading to 
the next major release, and I think it is not a good user experience to list 
such features as if they work properly when they are not implemented yet.
   
   Sounds good! I will remove the unsupported ones from the list.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add null command 
> -
>
> Key: PARQUET-1915
> URL: https://issues.apache.org/jira/browse/PARQUET-1915
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] shangxinli commented on a change in pull request #819: PARQUET-1915: Add nullify column

2020-10-23 Thread GitBox


shangxinli commented on a change in pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819#discussion_r510896852



##
File path: 
parquet-cli/src/main/java/org/apache/parquet/cli/commands/ColumnMaskingCommand.java
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.HadoopReadOptions;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.ColumnMasker;
+import org.apache.parquet.hadoop.util.ColumnMasker.MaskMode;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileReader;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileWriter;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.slf4j.Logger;
+
+import java.io.IOException;
+import java.util.List;
+
+import static org.apache.parquet.format.converter.ParquetMetadataConverter.NO_FILTER;
+
+@Parameters(commandDescription="Replace columns with masked values and write to a new Parquet file")
+public class ColumnMaskingCommand extends BaseCommand {
+
+  private ColumnMasker masker;
+
+  public ColumnMaskingCommand(Logger console) {
+    super(console);
+    masker = new ColumnMasker();
+  }
+
+  @Parameter(description = "")

Review comment:
   > I am not sure it is a good idea to list the different modes and throw 
exceptions for the ones that are not supported/implemented. We are heading to 
the next major release, and I think it is not a good user experience to list 
such features as if they work properly when they are not implemented yet.
   
   Sounds good! I will remove the unsupported ones from the list.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] ggershinsky commented on a change in pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…

2020-10-23 Thread GitBox


ggershinsky commented on a change in pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#discussion_r510692882



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
##
@@ -279,6 +279,11 @@ public ParquetWriter(Path file, Configuration conf, WriteSupport writeSupport
     WriteSupport.WriteContext writeContext = writeSupport.init(conf);
     MessageType schema = writeContext.getSchema();
 
+    // encryptionProperties could be built from the implementation of EncryptionPropertiesFactory when it is attached.
+    if (encryptionProperties == null) {
+      encryptionProperties = ParquetOutputFormat.createEncryptionProperties(conf, new Path(file.toString()), writeContext);

Review comment:
Looks good. Please also replace `new Path(file.toString())` with `file.getPath()`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org