[jira] [Updated] (PARQUET-1935) [C++][Parquet] nullptr access violation when writing arrays of non-nullable values
[ https://issues.apache.org/jira/browse/PARQUET-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-1935:
------------------------------------
    Labels: pull-request-available  (was: )

> [C++][Parquet] nullptr access violation when writing arrays of non-nullable values
> ----------------------------------------------------------------------------------
>
>                 Key: PARQUET-1935
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1935
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Tanguy Fautre
>            Assignee: Micah Kornfield
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm updating ParquetSharp to build against Arrow 2.0.0 (currently using Arrow 1.0.1). One of our unit tests now throws a {{nullptr}} access violation.
> I have narrowed it down to writing arrays of non-nullable values (in this case the column contains {{int[]}}). If the values are nullable, the test passes.
> The Parquet file schema is as follows:
> * {{GroupNode("schema", LogicalType.None, Repetition.Required)}}
> ** {{GroupNode("array_of_ints_column", LogicalType.List, Repetition.Optional)}}
> *** {{GroupNode("list", LogicalType.None, Repetition.Repeated)}}
> **** {{PrimitiveNode("item", LogicalType.Int(32, signed), Repetition.Required)}}
> The test crashes when calling {{TypedColumnWriter::WriteBatchSpaced}} with the following arguments:
> * {{num_values = 1}}
> * {{def_levels = {0}}}
> * {{rep_levels = {0}}}
> * {{valid_bits = {0}}}
> * {{valid_bit_offset = 0}}
> * {{values = {}}} (i.e. {{nullptr}})
> This call is effectively trying to write a null array, and therefore (to my understanding) does not need to pass any values. Yet further down the call stack, the implementation tries to read one value out of {{values}} (which is {{nullptr}}).
> I believe the problem lies with
> {code:c++}
> void MaybeCalculateValidityBits(const int16_t* def_levels, int64_t batch_size,
>                                 int64_t* out_values_to_write,
>                                 int64_t* out_spaced_values_to_write,
>                                 int64_t* null_count) {
>   if (bits_buffer_ == nullptr) {
>     if (!level_info_.HasNullableValues()) {
>       *out_values_to_write = batch_size;
>       *out_spaced_values_to_write = batch_size;
>       *null_count = 0;
>     } else {
>       for (int x = 0; x < batch_size; x++) {
>         *out_values_to_write += def_levels[x] == level_info_.def_level ? 1 : 0;
>         *out_spaced_values_to_write +=
>             def_levels[x] >= level_info_.repeated_ancestor_def_level ? 1 : 0;
>       }
>       *null_count = *out_values_to_write - *out_spaced_values_to_write;
>     }
>     return;
>   }
>   // ...
> }
> {code}
> In particular, {{level_info_.HasNullableValues()}} returns {{false}} given that the arrays cannot contain null values. My understanding is that this is wrong, since the arrays themselves are nullable.
> This code appears to have been introduced by ARROW-9603.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
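For illustration, the def-level counting quoted above can be reproduced in a small self-contained sketch. This is not the Arrow C++ implementation: it is Java, the level constants are assumed values for the schema in the report (optional list of required ints), and the helper name is hypothetical. It shows the mismatch behind the crash: with {{def_levels = {0}}} the non-nullable shortcut claims one value to write even though no value was supplied.

```java
public class ValidityBitsSketch {
  // Assumed levels for: optional list group (1) + repeated level (2),
  // required leaf. A present leaf value has def level 2.
  static final int DEF_LEVEL = 2;
  static final int REPEATED_ANCESTOR_DEF_LEVEL = 2;

  // Returns {valuesToWrite, spacedValuesToWrite, nullCount}: how many leaf
  // values the writer will try to read out of the values buffer.
  static long[] count(int[] defLevels, boolean hasNullableValues) {
    long valuesToWrite = 0;
    long spacedValuesToWrite = 0;
    if (!hasNullableValues) {
      // Shortcut taken when the leaf itself is Repetition.Required: it
      // assumes every batch entry carries a value, even when an ancestor
      // (the optional list) is null and no value exists for that entry.
      valuesToWrite = defLevels.length;
      spacedValuesToWrite = defLevels.length;
    } else {
      for (int d : defLevels) {
        if (d == DEF_LEVEL) valuesToWrite++;
        if (d >= REPEATED_ANCESTOR_DEF_LEVEL) spacedValuesToWrite++;
      }
    }
    long nullCount = spacedValuesToWrite - valuesToWrite; // spaced slots holding no value
    return new long[] {valuesToWrite, spacedValuesToWrite, nullCount};
  }
}
```

With `count(new int[]{0}, false)` the shortcut reports one value to write, so the writer dereferences the (null) values buffer; the nullable path correctly reports zero.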
[GitHub] [parquet-mr] shangxinli commented on a change in pull request #819: PARQUET-1915: Add nullify column
shangxinli commented on a change in pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819#discussion_r511212212

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/ColumnMasker.java
## @@ -0,0 +1,274 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.util;
+
+import org.apache.parquet.bytes.BytesInput;
+import org.apache.parquet.column.ColumnDescriptor;
+import org.apache.parquet.column.ColumnReader;
+import org.apache.parquet.column.ColumnWriter;
+import org.apache.parquet.column.Encoding;
+import org.apache.parquet.column.ParquetProperties;
+import org.apache.parquet.column.ParquetProperties.WriterVersion;
+import org.apache.parquet.column.impl.ColumnReadStoreImpl;
+import org.apache.parquet.column.page.DataPage;
+import org.apache.parquet.column.page.DictionaryPage;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.column.page.PageWriteStore;
+import org.apache.parquet.column.page.PageWriter;
+import org.apache.parquet.column.statistics.Statistics;
+import org.apache.parquet.column.values.bloomfilter.BloomFilter;
+import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ColumnPath;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileReader;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileWriter;
+import org.apache.parquet.internal.column.columnindex.ColumnIndex;
+import org.apache.parquet.internal.column.columnindex.OffsetIndex;
+import org.apache.parquet.io.ParquetEncodingException;
+import org.apache.parquet.io.api.Converter;
+import org.apache.parquet.io.api.GroupConverter;
+import org.apache.parquet.io.api.PrimitiveConverter;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.PrimitiveType;
+import org.apache.parquet.schema.Type;
+
+import java.io.IOException;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+public class ColumnMasker {
+  /**
+   * @param reader Reader of source file
+   * @param writer Writer of destination file
+   * @param meta Metadata of source file
+   * @param schema Schema of source file
+   * @param paths Column paths that need to be masked
+   * @param maskMode Mode to mask
+   * @throws IOException
+   */
+  public void processBlocks(TransParquetFileReader reader, TransParquetFileWriter writer, ParquetMetadata meta,
+                            MessageType schema, List<String> paths, MaskMode maskMode) throws IOException {
+    Set<ColumnPath> nullifyColumns = convertToColumnPaths(paths);
+    int blockIndex = 0;
+    PageReadStore store = reader.readNextRowGroup();
+
+    while (store != null) {
+      writer.startBlock(store.getRowCount());
+      List<ColumnChunkMetaData> columnsInOrder = meta.getBlocks().get(blockIndex).getColumns();
+      Map<ColumnPath, ColumnDescriptor> descriptorsMap = schema.getColumns().stream().collect(
+        Collectors.toMap(x -> ColumnPath.get(x.getPath()), x -> x));
+      ColumnReadStoreImpl crStore = new ColumnReadStoreImpl(store, new DummyGroupConverter(), schema,
+        meta.getFileMetaData().getCreatedBy());
+
+      for (int i = 0; i < columnsInOrder.size(); i += 1) {
+        ColumnChunkMetaData chunk = columnsInOrder.get(i);
+        ColumnDescriptor descriptor = descriptorsMap.get(chunk.getPath());
+        processChunk(descriptor, chunk, crStore, reader, writer, schema, nullifyColumns, maskMode);
+      }
+
+      writer.endBlock();
+      store = reader.readNextRowGroup();
+      blockIndex++;
+    }
+  }
+
+  private void processChunk(ColumnDescriptor descriptor, ColumnChunkMetaData chunk, ColumnReadStoreImpl crStore,
+                            TransParquetFileReader reader, TransParquetFileWriter writer, MessageType schema,
+                            Set<ColumnPath> paths, MaskMode maskMode) throws IOException {
+    reader.setStreamPosition(chunk.getStartingPos());
+
+    if (paths.contains(chunk.getPath())) {
+      if (maskMode.equals(MaskMode.NULLIFY)) {
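As a rough illustration of the per-chunk branch in processChunk above, the sketch below reduces the decision to a self-contained function. The types, method name, and returned action strings are stand-ins, not the parquet-mr API: a chunk whose path is in the nullify set and whose mask mode is NULLIFY is rewritten with all values null; every other chunk is copied through unchanged.

```java
import java.util.Set;

public class ChunkActionSketch {
  enum MaskMode { NULLIFY }

  // Decide what to do with one column chunk (illustrative only).
  static String actionFor(String columnPath, Set<String> nullifyColumns, MaskMode maskMode) {
    if (nullifyColumns.contains(columnPath)) {
      if (maskMode == MaskMode.NULLIFY) {
        return "nullify"; // re-encode the column, writing a null for every value
      }
      return "unsupported"; // other mask modes are not implemented in this PR
    }
    return "copy"; // translate the chunk's pages verbatim from reader to writer
  }
}
```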
[jira] [Commented] (PARQUET-1915) Add null command
[ https://issues.apache.org/jira/browse/PARQUET-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219981#comment-17219981 ]

ASF GitHub Bot commented on PARQUET-1915:
-----------------------------------------

shangxinli commented on a change in pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819#discussion_r511212212

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/ColumnMasker.java
[jira] [Commented] (PARQUET-1935) [C++][Parquet] nullptr access violation when writing arrays of non-nullable values
[ https://issues.apache.org/jira/browse/PARQUET-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219933#comment-17219933 ]

Micah Kornfield commented on PARQUET-1935:
------------------------------------------

One workaround for this is to detect that the column doesn't contain nulls and call WriteBatch (not the spaced variant) instead; that is what the Arrow code path does. I need to think about the right fix here (it will involve changing the check that you initially indicated is problematic).
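The workaround described above amounts to a small dispatch: if the batch contains no nulls, take the dense WriteBatch-style path that never touches the validity bitmap; otherwise take the spaced path. A minimal sketch (Java with illustrative names, not the Arrow C++ API):

```java
public class WriteDispatchSketch {
  // Pick the write path based on the validity bitmap (here a boolean array
  // for simplicity). Returns the name of the path to take.
  static String chooseWritePath(boolean[] valid) {
    for (boolean v : valid) {
      if (!v) return "WriteBatchSpaced"; // nulls present: spaced write required
    }
    return "WriteBatch"; // no nulls: dense write, values buffer is fully populated
  }
}
```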
[jira] [Moved] (PARQUET-1935) [C++][Parquet] nullptr access violation when writing arrays of non-nullable values
[ https://issues.apache.org/jira/browse/PARQUET-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield moved ARROW-10377 to PARQUET-1935:
--------------------------------------------------
                  Key: PARQUET-1935  (was: ARROW-10377)
    Affects Version/s:     (was: 2.0.0)
             Workflow: patch-available, re-open possible  (was: jira)
              Project: Parquet  (was: Apache Arrow)
[jira] [Updated] (PARQUET-1935) [C++][Parquet] nullptr access violation when writing arrays of non-nullable values
[ https://issues.apache.org/jira/browse/PARQUET-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield updated PARQUET-1935:
-------------------------------------
    Component/s: parquet-cpp
[jira] [Assigned] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinli Shang reassigned PARQUET-1927:
------------------------------------
    Assignee: Xinli Shang

> ColumnIndex should provide number of records skipped
> ----------------------------------------------------
>
>                 Key: PARQUET-1927
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet how many records were skipped due to ColumnIndex filtering. When rowCount is 0, readNextFilteredRowGroup() just advances to the next row group without telling the caller. See the code here:
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969
>
> In Iceberg, Parquet records are read with an iterator whose hasNext() contains the following check:
> valuesRead + skippedValues < totalValues
> See https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115
>
> So without knowing the skipped values, it is hard to determine hasNext().
>
> Currently, we can work around this with a flag: when readNextFilteredRowGroup() returns null, we consider the whole file done and hasNext() simply returns false.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
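The hasNext() accounting described above can be sketched as a self-contained class (field names are illustrative, not Iceberg's actual code). The point of the issue: the check is only computable if Parquet reports how many values ColumnIndex filtering skipped.

```java
public class SkippedValuesIterator {
  final long totalValues;   // total record count for the file/row groups
  long valuesRead;          // records actually returned to the caller
  long skippedValues;       // records dropped by ColumnIndex filtering --
                            // this is the number Parquet does not currently expose

  SkippedValuesIterator(long totalValues) { this.totalValues = totalValues; }

  boolean hasNext() {
    // Without an accurate skippedValues, this can report true forever
    // for a file whose remaining rows were all filtered out.
    return valuesRead + skippedValues < totalValues;
  }
}
```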
[jira] [Commented] (PARQUET-1915) Add null command
[ https://issues.apache.org/jira/browse/PARQUET-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219690#comment-17219690 ]

ASF GitHub Bot commented on PARQUET-1915:
-----------------------------------------

shangxinli commented on a change in pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819#discussion_r510896852

## File path: parquet-cli/src/main/java/org/apache/parquet/cli/commands/ColumnMaskingCommand.java
## @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.parquet.cli.commands;
+
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.Parameters;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.HadoopReadOptions;
+import org.apache.parquet.cli.BaseCommand;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.ParquetFileWriter;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.hadoop.util.ColumnMasker;
+import org.apache.parquet.hadoop.util.ColumnMasker.MaskMode;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileReader;
+import org.apache.parquet.hadoop.util.CompressionConverter.TransParquetFileWriter;
+import org.apache.parquet.hadoop.util.HadoopInputFile;
+import org.apache.parquet.schema.MessageType;
+import org.slf4j.Logger;
+
+import java.io.IOException;
+import java.util.List;
+
+import static org.apache.parquet.format.converter.ParquetMetadataConverter.NO_FILTER;
+
+@Parameters(commandDescription = "Replace columns with masked values and write to a new Parquet file")
+public class ColumnMaskingCommand extends BaseCommand {
+
+  private ColumnMasker masker;
+
+  public ColumnMaskingCommand(Logger console) {
+    super(console);
+    masker = new ColumnMasker();
+  }
+
+  @Parameter(description = "")

Review comment:
> I am not sure it is a good idea to list the different modes and throw exceptions that some of them are not supported/implemented. We are heading to the next major release and I think it is not a good user experience to have such features listed as if they work properly when they are not implemented yet.

Sounds good! I will remove unsupported ones from the list.

----
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Add null command
> ----------------
>
>                 Key: PARQUET-1915
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1915
>             Project: Parquet
>          Issue Type: Sub-task
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[GitHub] [parquet-mr] shangxinli commented on a change in pull request #819: PARQUET-1915: Add nullify column
shangxinli commented on a change in pull request #819:
URL: https://github.com/apache/parquet-mr/pull/819#discussion_r510896852

## File path: parquet-cli/src/main/java/org/apache/parquet/cli/commands/ColumnMaskingCommand.java
[GitHub] [parquet-mr] ggershinsky commented on a change in pull request #808: Parquet-1396: Cryptodata Interface for Schema Activation of Parquet E…
ggershinsky commented on a change in pull request #808:
URL: https://github.com/apache/parquet-mr/pull/808#discussion_r510692882

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
## @@ -279,6 +279,11 @@ public ParquetWriter(Path file, Configuration conf, WriteSupport writeSupport
   WriteSupport.WriteContext writeContext = writeSupport.init(conf);
   MessageType schema = writeContext.getSchema();
+  // encryptionProperties could be built from the implementation of EncryptionPropertiesFactory when it is attached.
+  if (encryptionProperties == null) {
+    encryptionProperties = ParquetOutputFormat.createEncryptionProperties(conf, new Path(file.toString()), writeContext);

Review comment:
Looks good. Please also replace `new Path(file.toString())` with `file.getPath()`.