[jira] [Updated] (PARQUET-1178) Parquet modular encryption

2020-02-11 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated PARQUET-1178:
--
Fix Version/s: (was: format-2.7.0)

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> data to be kept fully encrypted in storage while enabling efficient 
> analytics on the data, via reader-side extraction / authentication / 
> decryption of the data subsets required by columnar projection and 
> predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different 
> security and performance requirements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1178) Parquet modular encryption

2020-02-11 Thread Jason Brugger (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034916#comment-17034916
 ] 

Jason Brugger commented on PARQUET-1178:


What's the best way to get started with this on a Databricks cluster? If I 
install _format-2.7.0_ as a new library, how would I reference this data source 
in lieu of the cluster's default parquet library?

> Parquet modular encryption
> --
>
> Key: PARQUET-1178
> URL: https://issues.apache.org/jira/browse/PARQUET-1178
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Major
> Fix For: format-2.7.0
>
>
> A mechanism for modular encryption and decryption of Parquet files. Allows 
> data to be kept fully encrypted in storage while enabling efficient 
> analytics on the data, via reader-side extraction / authentication / 
> decryption of the data subsets required by columnar projection and 
> predicate push-down.
> Enables fine-grained access control to column data by encrypting different 
> columns with different keys.
> Supports a number of encryption algorithms, to account for different 
> security and performance requirements.





[jira] [Updated] (PARQUET-1770) [C++][CI] Add fuzz target for reading Parquet files

2020-02-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1770:

Labels: pull-request-available  (was: )

> [C++][CI] Add fuzz target for reading Parquet files
> ---
>
> Key: PARQUET-1770
> URL: https://issues.apache.org/jira/browse/PARQUET-1770
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Now that Arrow has been accepted on OSS-Fuzz, we should check for crashes 
> and potential vulnerabilities when reading Parquet files.
> The Parquet fuzz target should follow conventions similar to the IPC fuzz 
> targets in {{cpp/src/arrow/ipc/}}. An executable to generate a seed corpus 
> should be added as well, to make fuzzing more efficient.
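A minimal sketch of the harness shape being described: a single entry point receives one arbitrary byte buffer and must never crash, only return or fail in a controlled way. The real target is C++ code living next to the IPC fuzz targets; the "parser" below is a stub and all names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Illustrative-only sketch of the fuzz-target convention. The stub parser
// stands in for the real Parquet reader; only the harness shape is shown.
public class ParquetFuzzSketch {
    // Mirrors libFuzzer's one-input entry point: parse, never crash.
    static boolean testOneInput(byte[] data) {
        try {
            if (data.length < 8) return false;  // too short for header + footer magic
            String magic = new String(data, 0, 4, StandardCharsets.US_ASCII);
            return magic.equals("PAR1");        // real code would run the full reader here
        } catch (RuntimeException e) {
            return false;                       // controlled failure, not a crash
        }
    }

    public static void main(String[] args) {
        // Naive mutation loop standing in for the fuzzer: flip bytes in a seed
        // input (this is what a generated seed corpus makes more effective).
        byte[] seed = "PAR1datadataPAR1".getBytes(StandardCharsets.US_ASCII);
        Random rng = new Random(42);
        for (int i = 0; i < 1_000; i++) {
            byte[] mutated = seed.clone();
            mutated[rng.nextInt(mutated.length)] ^= (byte) (1 + rng.nextInt(255));
            testOneInput(mutated);              // must return normally every time
        }
        System.out.println("ok");
    }
}
```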





[jira] [Assigned] (PARQUET-1770) [C++][CI] Add fuzz target for reading Parquet files

2020-02-11 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned PARQUET-1770:
---

Assignee: Antoine Pitrou

> [C++][CI] Add fuzz target for reading Parquet files
> ---
>
> Key: PARQUET-1770
> URL: https://issues.apache.org/jira/browse/PARQUET-1770
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> Now that Arrow has been accepted on OSS-Fuzz, we should check for crashes 
> and potential vulnerabilities when reading Parquet files.
> The Parquet fuzz target should follow conventions similar to the IPC fuzz 
> targets in {{cpp/src/arrow/ipc/}}. An executable to generate a seed corpus 
> should be added as well, to make fuzzing more efficient.





[jira] [Comment Edited] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-02-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034588#comment-17034588
 ] 

Xinli Shang edited comment on PARQUET-1792 at 2/11/20 4:36 PM:
---

[~gershinsky], this is just a simple offline tool to replace the raw columns 
with masked values. It is different from what we talked about earlier for the 
data obfuscation feature. The difference is that users have to run this tool 
explicitly, and they are aware of what the data will be after translation. 
There is no chance of it happening accidentally, implicitly, or by default.

The tool can provide different ways to translate the raw data to masked 
values, and can allow users to define their own if they have security 
concerns. We just provide the tool to make their work easier. In addition, 
ORC has already released such a masking mechanism.

As mentioned earlier, I can send an email to the dev mailing list to see if 
there is a need for this tool.

Again, this proposal is independent of the data obfuscation feature that we 
are jointly working on.

was (Author: sha...@uber.com):
[~gershinsky], this is just a simple offline tool to replace the raw columns 
with masked values. It is different from what we talked about earlier for the 
data obfuscation feature. The difference is that users have to run this tool 
explicitly, and they are aware of what the data will be after translation. 
There is no chance of it happening accidentally, implicitly, or by default.

The tool can provide different ways to translate the raw data to masked 
values, and can allow users to define their own if they have security 
concerns. We just provide the tool to make their work easier. In addition, 
ORC has already released such a masking mechanism.

As mentioned earlier, I can send an email to the dev mailing list to see if 
there is a need for this tool.

Again, this proposal is independent of the data obfuscation feature that we 
are jointly working on.

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned 
> (PARQUET-1791). We need a tool to replace the raw data columns with masked 
> values. The masked value could be a hash, null, a redacted value, etc. The 
> unchanged columns should be moved as a whole, as the 'merge' and 'prune' 
> commands in parquet-tools do.
>  
> Implementing this feature at the file-format level is 10x faster than doing 
> it by rewriting the table data in the query engine. 
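To make the replacement concrete, here is a hypothetical sketch of masking one sensitive column by hashing it (SHA-256 hex digests), with the other columns copied through untouched. The class and method names are illustrative only; the real command would operate on Parquet column chunks rather than Java strings:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical sketch of the 'mask' operation: replace each raw value in a
// sensitive column with a SHA-256 hex digest; other columns pass through.
public class MaskSketch {
    static String mask(String rawValue) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(rawValue.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        String[] emailColumn = {"alice@example.com", "bob@example.com"};  // to be masked
        String[] idColumn = {"1", "2"};                                   // unchanged
        for (int i = 0; i < emailColumn.length; i++) {
            System.out.println(idColumn[i] + "," + mask(emailColumn[i]));
        }
    }
}
```

Hashing preserves equality joins on the masked column; null or redaction, the other options mentioned in the description, would simply drop that property.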





[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-02-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034588#comment-17034588
 ] 

Xinli Shang commented on PARQUET-1792:
--

[~gershinsky], this is just a simple offline tool to replace the raw columns 
with masked values. It is different from what we talked about earlier for the 
data obfuscation feature. The difference is that users have to run this tool 
explicitly, and they are aware of what the data will be after translation. 
There is no chance of it happening accidentally, implicitly, or by default.

The tool can provide different ways to translate the raw data to masked 
values, and can allow users to define their own if they have security 
concerns. We just provide the tool to make their work easier. In addition, 
ORC has already released such a masking mechanism.

As mentioned earlier, I can send an email to the dev mailing list to see if 
there is a need for this tool.

Again, this proposal is independent of the data obfuscation feature that we 
are jointly working on.

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned 
> (PARQUET-1791). We need a tool to replace the raw data columns with masked 
> values. The masked value could be a hash, null, a redacted value, etc. The 
> unchanged columns should be moved as a whole, as the 'merge' and 'prune' 
> commands in parquet-tools do.
>  
> Implementing this feature at the file-format level is 10x faster than doing 
> it by rewriting the table data in the query engine. 





[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-02-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034553#comment-17034553
 ] 

Xinli Shang commented on PARQUET-1792:
--

[~gszadovszky] the tool can be run in parallel in a cluster. For example, we 
can easily write a Spark application to do it. Actually, even for 'prune', we 
still need to write a Spark application to parallelize it; otherwise, the 
time to finish is still significant, although it is already faster than doing 
it in query engines.

Regarding reading the original values and generating the hashes/statistics, 
we only need to do that for the columns to be masked. In many cases, what we 
see is that only very few columns need to be masked. All other columns that 
don't need to be masked are just moved as a whole, as the 'merge' and 'prune' 
commands do, which is a big saving. Yes, this operation would be slower than 
the 'prune' command, but it still saves a lot compared with doing it via a 
query engine.

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned 
> (PARQUET-1791). We need a tool to replace the raw data columns with masked 
> values. The masked value could be a hash, null, a redacted value, etc. The 
> unchanged columns should be moved as a whole, as the 'merge' and 'prune' 
> commands in parquet-tools do.
>  
> Implementing this feature at the file-format level is 10x faster than doing 
> it by rewriting the table data in the query engine. 
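The per-file parallelism discussed in the comment above can be sketched on a single node with a thread pool; in a cluster, a Spark job with one task per file plays the same role, since every Parquet file is an independent unit of work. `maskFile` and the file names below are stand-ins for the real per-file rewrite:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Single-node analogue of per-file parallel masking: submit one task per
// file to a pool; a Spark application would do the same across a cluster.
public class ParallelMaskSketch {
    static String maskFile(String path) {
        return path + ".masked";                 // stub: rewrite one file's columns
    }

    public static void main(String[] args) throws Exception {
        List<String> files = Arrays.asList("part-0.parquet", "part-1.parquet", "part-2.parquet");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> done = new ArrayList<>();
        for (String f : files) done.add(pool.submit(() -> maskFile(f)));
        for (Future<String> d : done) System.out.println(d.get());
        pool.shutdown();
    }
}
```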





[jira] [Comment Edited] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-02-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034553#comment-17034553
 ] 

Xinli Shang edited comment on PARQUET-1792 at 2/11/20 3:47 PM:
---

[~gszadovszky] the tool can be run in parallel in a cluster. For example, we 
can easily write a Spark application to do it. Actually, even for 'prune', we 
still need to write a Spark application to parallelize it; otherwise, the 
time to finish is still significant, although it is already faster than doing 
it in query engines.

Regarding reading the original values and generating the hashes/statistics, 
we only need to do that for the columns to be masked. In many cases, what we 
see is that only very few columns need to be masked. All other columns that 
don't need to be masked are just moved as a whole, as the 'merge' and 'prune' 
commands do, which is a big saving. Yes, this operation would be slower than 
the 'prune' command, but it still saves a lot compared with doing it via a 
query engine.


was (Author: sha...@uber.com):
[~gszadovszky] the tool can be run in parallel in a cluster. For example, we 
can easily write a Spark application to do it. Actually, even for 'prune', we 
still need to write a Spark application to parallelize it; otherwise, the 
time to finish is still significant, although it is already faster than doing 
it in query engines.

Regarding reading the original values and generating the hashes/statistics, 
we only need to do that for the columns to be masked. In many cases, what we 
see is that only very few columns need to be masked. All other columns that 
don't need to be masked are just moved as a whole, as the 'merge' and 'prune' 
commands do, which is a big saving. Yes, this operation would be slower than 
the 'prune' command, but it still saves a lot compared with doing it via a 
query engine.

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned 
> (PARQUET-1791). We need a tool to replace the raw data columns with masked 
> values. The masked value could be a hash, null, a redacted value, etc. The 
> unchanged columns should be moved as a whole, as the 'merge' and 'prune' 
> commands in parquet-tools do.
>  
> Implementing this feature at the file-format level is 10x faster than doing 
> it by rewriting the table data in the query engine. 





Re: [DISCUSS] merge bloom filter branch to master

2020-02-11 Thread Gabor Szadovszky
I think the best way is to create a PR. It would be more transparent.

On Tue, Feb 11, 2020 at 2:11 PM Junjie Chen 
wrote:

> Thanks for the patch @Walid.
>
> @Gabor, do I need to create a PR for merging, or will a committer help to
> merge?
>
>
> On Sun, Jan 12, 2020 at 7:39 PM Gara Walid  wrote:
>
> > Hi Junjie,
> >
> > Thank you Junjie and the community for all your efforts on the bloom
> filter
> > topic.
> > I've been testing the bloom filter branch with Map Reduce examples and it
> > sounds good.
> > I'll add an example to the parquet-hadoop package.
> >
> > Cheers,
> > Walid
> >
> > On Fri, Jan 10, 2020 at 09:22, Junjie Chen  wrote:
> >
> > > Hi Community
> > >
> > > The bloom filter branch now contains a basic functional logic that
> > includes
> > > read/write and filtering.  Though the feature still needs polishing and
> > > improving, I'd suggest merging back to master first so that more
> people
> > > could use it and provide feedback. What do you think?
> > >
> >
>
>
> --
> Best Regards
>


Re: [DISCUSS] merge bloom filter branch to master

2020-02-11 Thread Junjie Chen
Thanks for the patch @Walid.

@Gabor, do I need to create a PR for merging, or will a committer help to
merge?


On Sun, Jan 12, 2020 at 7:39 PM Gara Walid  wrote:

> Hi Junjie,
>
> Thank you Junjie and the community for all your efforts on the bloom filter
> topic.
> I've been testing the bloom filter branch with Map Reduce examples and it
> sounds good.
> I'll add an example to the parquet-hadoop package.
>
> Cheers,
> Walid
>
> On Fri, Jan 10, 2020 at 09:22, Junjie Chen  wrote:
>
> > Hi Community
> >
> > The bloom filter branch now contains a basic functional logic that
> includes
> > read/write and filtering.  Though the feature still needs polishing and
> > improving, I'd suggest merging back to master first so that more people
> > could use it and provide feedback. What do you think?
> >
>


-- 
Best Regards


[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-02-11 Thread Gidon Gershinsky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034416#comment-17034416
 ] 

Gidon Gershinsky commented on PARQUET-1792:
---

There is also a security aspect. While "prune" cleanly removes a sensitive 
column, and is therefore safe, "mask"/"redact" replaces one version of the 
column data with another version, and can easily leak sensitive information 
if not done properly. I believe that at this stage, it's best done above 
Parquet, by the users, who can simply add columns with the masked data. It 
can also be faster than Parquet tools, if run on a multi-threaded engine, as 
mentioned by Gabor.

We're working on a system that would allow analyzing Parquet files with 
masked/redacted columns and detecting information leakage. This would also 
allow performing masking inside the Parquet libraries, making it fast and 
multi-threaded. But this project will take a while to complete. It's not 
urgent, though, since masking/redaction can easily be implemented by the 
users today, above Parquet. 

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned 
> (PARQUET-1791). We need a tool to replace the raw data columns with masked 
> values. The masked value could be a hash, null, a redacted value, etc. The 
> unchanged columns should be moved as a whole, as the 'merge' and 'prune' 
> commands in parquet-tools do.
>  
> Implementing this feature at the file-format level is 10x faster than doing 
> it by rewriting the table data in the query engine. 





[jira] [Created] (PARQUET-1793) Support writing INT96 timestamp from avro

2020-02-11 Thread Tamas Palfy (Jira)
Tamas Palfy created PARQUET-1793:


 Summary: Support writing INT96 timestamp from avro
 Key: PARQUET-1793
 URL: https://issues.apache.org/jira/browse/PARQUET-1793
 Project: Parquet
  Issue Type: Improvement
Reporter: Tamas Palfy


Add support for writing Avro LONG/timestamp-millis data in INT96 (or in the 
current INT64) format in Parquet.

Add a config flag to select the required timestamp output format (INT96 or 
INT64). 
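For reference, the INT96 timestamp layout conventionally written by Impala/Hive is 12 bytes: the first 8 hold little-endian nanoseconds within the day, the last 4 the little-endian Julian day number (2440588 for 1970-01-01). A hedged sketch of converting an Avro timestamp-millis (epoch milliseconds) to that layout, with illustrative names:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the INT96 timestamp encoding: nanos-of-day (8 bytes LE)
// followed by Julian day (4 bytes LE). Class/method names are hypothetical.
public class Int96Sketch {
    static final long MILLIS_PER_DAY = 86_400_000L;
    static final int JULIAN_EPOCH_DAY = 2_440_588;   // Julian day of 1970-01-01

    static byte[] toInt96(long epochMillis) {
        long day = Math.floorDiv(epochMillis, MILLIS_PER_DAY);
        long nanosOfDay = Math.floorMod(epochMillis, MILLIS_PER_DAY) * 1_000_000L;
        return ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt((int) (day + JULIAN_EPOCH_DAY))
                .array();
    }

    public static void main(String[] args) {
        byte[] ts = toInt96(0L);                     // 1970-01-01T00:00:00Z
        int julianDay = ByteBuffer.wrap(ts, 8, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        System.out.println(julianDay);
    }
}
```

The proposed config flag would simply choose between this 12-byte path and the existing INT64 output.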





Re: Using JsonRecordFormatter

2020-02-11 Thread Ben Watson
Hi Gabor,

Thanks a lot for the quick reply. I've been looking into
AvroParquetReader this morning and it looks like a much better fit for
my problem:

ParquetReader<GenericData.Record> pReader =
    AvroParquetReader.<GenericData.Record>builder(localInputFile).build();
for (GenericData.Record value = pReader.read();
     value != null; value = pReader.read()) {
  ...
}

Seems like it gets me data in the format I want.

Regards,

Ben

On 2/11/20, Gabor Szadovszky  wrote:
> Hi Ben,
>
> SimpleRecord is pretty old and was not upgraded to the newer concepts. You
> need to extend ParquetReader for SimpleRecord; see AvroParquetReader for
> details.
> After you have specified SimpleRecordReader and the related Builder, you
> can add the methods you need.
> I also want to highlight that parquet-tools is not really meant to be used
> from code but from the command line. There are no proper unit tests for
> that code and no guarantees of backward code compatibility between
> releases. I would not recommend using these code parts in production.
>
> Cheers,
> Gabor
>
> On Mon, Feb 10, 2020 at 9:55 PM Ben Watson  wrote:
>
>> Hello,
>>
>> I want to read Parquet records into JSON with Java, and it seems that
>> JsonRecordFormatter is the way to do it (like in
>>
>> https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/command/CatCommand.java#L84
>> ).
>>
>> Unlike the above example, I want to avoid passing a Hadoop Path object,
>> and
>> instead I want to use the ParquetReader.read(InputFile).build(); builder.
>> However this returns a ParquetReader, and not the
>> ParquetReader that I need for JsonRecordFormatter. It looks
>> like I need to insert a new SimpleReadSupport() somewhere, but I can't
>> find
>> any method in that builder that accepts it.
>>
>> I've tried looking for other usages online etc but haven't had any luck.
>> Any pointers greatly appreciated.
>>
>> Regards,
>>
>> Ben
>>
>


Re: Using JsonRecordFormatter

2020-02-11 Thread Gabor Szadovszky
Hi Ben,

SimpleRecord is pretty old and was not upgraded to the newer concepts. You
need to extend ParquetReader for SimpleRecord; see AvroParquetReader for
details.
After you have specified SimpleRecordReader and the related Builder, you can
add the methods you need.
I also want to highlight that parquet-tools is not really meant to be used
from code but from the command line. There are no proper unit tests for that
code and no guarantees of backward code compatibility between releases. I
would not recommend using these code parts in production.

Cheers,
Gabor

On Mon, Feb 10, 2020 at 9:55 PM Ben Watson  wrote:

> Hello,
>
> I want to read Parquet records into JSON with Java, and it seems that
> JsonRecordFormatter is the way to do it (like in
>
> https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/command/CatCommand.java#L84
> ).
>
> Unlike the above example, I want to avoid passing a Hadoop Path object, and
> instead I want to use the ParquetReader.read(InputFile).build(); builder.
> However this returns a ParquetReader, and not the
> ParquetReader that I need for JsonRecordFormatter. It looks
> like I need to insert a new SimpleReadSupport() somewhere, but I can't find
> any method in that builder that accepts it.
>
> I've tried looking for other usages online etc but haven't had any luck.
> Any pointers greatly appreciated.
>
> Regards,
>
> Ben
>


[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-02-11 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034294#comment-17034294
 ] 

Gabor Szadovszky commented on PARQUET-1792:
---

If you are talking about one file at a time, you might be right that it is 
10x faster than doing it by a query engine. But the tool runs on one node 
while the query engine uses several at the same time, so I am not sure about 
the 10x performance.
Pruning the file makes sense to me to be implemented at the library level, 
because it can be done in an efficient way (there is no need to unpack/decode 
the pages or the entire column chunks). Masking the values, on the other 
hand, requires reading the actual values and generating the hashes. You also 
need to generate the related statistics.
Therefore, I am not sure this masking feature is properly suited for 
parquet-mr.

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned 
> (PARQUET-1791). We need a tool to replace the raw data columns with masked 
> values. The masked value could be a hash, null, a redacted value, etc. The 
> unchanged columns should be moved as a whole, as the 'merge' and 'prune' 
> commands in parquet-tools do.
>  
> Implementing this feature at the file-format level is 10x faster than doing 
> it by rewriting the table data in the query engine. 


