Re: File IO - Customer Encrypted keys

2018-08-09 Thread Lukasz Cwik
Can you use a combiner[1] for the aggregations?

1: https://beam.apache.org/documentation/programming-guide/#combine
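
As a rough illustration (not from the original thread): a Combine-based
aggregation in the Java SDK keeps memory bounded because the runner merges
partial results on each worker instead of materializing all records. The
key/value types and method name below are assumptions for the example.

import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class AggregationExample {
  // Sum values per key without holding the whole input in memory; the runner
  // combines partial sums on each worker before the final merge.
  static PCollection<KV<String, Long>> sumPerKey(PCollection<KV<String, Long>> records) {
    return records.apply("SumPerKey", Combine.perKey(Sum.ofLongs()));
  }
}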

On Thu, Aug 9, 2018 at 11:03 AM Aniruddh Sharma 
wrote:

> Thanks for the reply.
>
> In my application I need to do some aggregations, so I would have to
> store the contents of the file in memory, and that will not scale
> (unless I can keep reading it record by record and store it in some
> intermediary form; I am not allowed to store temporary unencrypted
> intermediate data on GCS). In this case, do you suggest I read the
> channel, write record by record to Pub/Sub, use it as a buffer, and
> build something to consume the records from Pub/Sub?
>
> Thanks
> Aniruddh
>
> On Thu, Aug 9, 2018 at 11:57 AM, Lukasz Cwik  wrote:
>
>> Sharding is runner specific. For example, Dataflow chooses shards at
>> sources (like TextIO/FileIO/AvroIO/...) and at GroupByKey. Dataflow uses
>> a capability exposed in Apache Beam to ask a source to "split" a shard
>> that is currently being processed into smaller shards. Dataflow also has
>> the ability to concatenate multiple shards to create a larger shard.
>> There is a lot more detail about this in the execution model
>> documentation[1].
>>
>> FileIO doesn't "read" the entire file; it just produces metadata records
>> which contain the name of the file and a few other properties. The
>> ParDo, when accessing the FileIO output, gets a handle to a
>> ReadableByteChannel. The ParDo that you write will likely use a
>> decrypting ReadableByteChannel that wraps the ReadableByteChannel
>> provided by the FileIO output. This means that the ParDo will only use
>> roughly the memory needed to read one record, plus whatever overhead the
>> decryption implementation and any buffering of the file contents
>> require. So the file size doesn't matter; only the record size within
>> the file matters.
>>
>> 1: https://beam.apache.org/documentation/execution-model/
>>
>>
>>
>> On Thu, Aug 9, 2018 at 6:46 AM Aniruddh Sharma 
>> wrote:
>>
>>> Thanks, Lukasz, for the help. It's very helpful and I understand it
>>> now. But one further question on it.
>>> Let's say, using the Java SDK, I implement this approach: 'Your best
>>> bet would be to use FileIO[1] followed by a ParDo that accesses the KMS
>>> to get the decryption key and wraps the readable channel with a
>>> decryptor.'
>>>
>>> Now, how does Beam create shards for functions called within a ParDo?
>>> Is it based on some size of data that it assesses when executing inside
>>> the ParDo, and how are these shards executed across different machines?
>>>
>>> For example:
>>> Scenario 1 - The file size is 1 MB and I keep default autoscaling on.
>>> On execution, Beam decides to execute the decryption (inside the ParDo)
>>> on one machine, and it works.
>>> Scenario 2a - The file size is 1 GB (the machine is n1-standard-4) and
>>> default autoscaling is on. It still decides to spin up only one
>>> machine, which runs out of heap memory and gets killed.
>>> Scenario 2b - The file size is 1 GB (the machine is n1-standard-4),
>>> autoscaling is off, and I force it to start with 2 machines. It still
>>> decides to execute the file-decryption logic on one machine, the other
>>> machine doesn't do anything, and it still gets killed.
>>>
>>> I know the file is unsplittable (for decryption), and I actually want
>>> it to execute on only one machine. But because this is inside a ParDo
>>> and is application logic, I have not understood in this context when a
>>> ParDo decides to execute its function on different machines.
>>>
>>> For example, in Spark the partitions of an RDD get parallelized and
>>> executed either in the same executor (in a different wave or on another
>>> core) or on different executors.
>>>
>>> In the above example, how do I make sure that if the file size
>>> increases, the ParDo will not try to scale it? I am new to Beam - if
>>> this query doesn't make sense, then apologies in advance.
>>>
>>> Thanks
>>> Aniruddh
>>>
>>> On Wed, Aug 8, 2018 at 11:52 AM, Lukasz Cwik  wrote:
>>>
 a) FileIO produces elements which are file descriptors; each descriptor
 represents one file. If you have many files to read then you will get
 parallel processing, since multiple files will be read concurrently. You
 could get parallel reading within a single file, but you would need to
 be able to split the decrypted file by knowing how to seek to an
 arbitrary position in the encrypted stream and start reading the
 decrypted stream without having to decrypt everything before that point.

 b) You need to add support to the underlying filesystem, like S3[1] or
 GCS[2], to support file decryption.

 Note that for solutions a) and b), you can only get efficient parallel
 reading within a single file if you can efficiently seek within the
 "decrypted" stream. Many encryption methods do not allow for this and
 typically require you to decrypt everything up to the point where you
 want to read, making parallel reading of a single file extremely
 inefficient. This is a common problem for compressed files as well: you
 can't choose an arbitrary spot in a compressed file and start
 decompressing without decompressing everything before it.

Re: [ANNOUNCE] Apache Beam 2.6.0 released!

2018-08-09 Thread Ismaël Mejía
Two really interesting features in 2.6.0 not mentioned in the announcement
email:


- Bounded SplittableDoFn support is available now in all runners (SDF is
the new IO connector API).

- HBaseIO was updated to be the first IO supporting Bounded SDF (using
readAll).




On Fri, Aug 10, 2018 at 12:14 AM Connell O'Callaghan 
wrote:

> Pablo and all involved, thank you for working to get this release
> completed!!!
>
> On Thu, Aug 9, 2018 at 3:06 PM Pablo Estrada  wrote:
>
>> Of course, I messed that link up again. The release notes should be here:
>>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12343392
>>
>> On Thu, Aug 9, 2018 at 10:11 AM Pablo Estrada  wrote:
>>
>>> As was pointed out to me by Ryan Williams, the link I posted with the
>>> release notes is not right. Here is a link with Beam 2.6.0 release notes
>>> for those interested:
>>>
>>> https://issues.apache.org/jira/projects/BEAM/versions/12343392
>>>
>>> On Thu, Aug 9, 2018 at 9:41 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Great news!
 Thanks to Pablo for driving this and to all contributors of this
 release!


 On 9 Aug 2018, at 00:48, Pablo Estrada  wrote:

 The Apache Beam team is pleased to announce the release of version
 2.6.0!

 Apache Beam is an open source unified programming model to define and
 execute data processing pipelines, including ETL, batch and stream
 (continuous) processing. See https://beam.apache.org

 You can download the release here:

 https://beam.apache.org/get-started/downloads/

 This release includes the following major new features & improvements,
 among others:
 - Improvements for internal Context Management in Python SDK
 - A number of improvements to the Portability Framework
 - A Universal Local Runner has been added to Beam. This runner runs on a
 single machine using portability and containerized SDK harnesses.
 - Increased the coverage of ErrorProne analysis of the codebase.
 - Updates to various dependency versions
 - Updates to stability, performance, and documentation.
 - SQL improvements: support for the EXISTS operator, implemented SUM()
 aggregations, fixes to the CASE expression, support for date comparison,
 and support for LIMIT on unbounded data
 - Provide automatic schema registration for POJOs

 You can take a look at the Release Notes for more details:

 https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343392&projectId=12319527

 Thanks to everyone who participated in this release, and we hope you'll
 have a good time using Beam 2.6.0.
 --
 Pablo Estrada, on behalf of The Apache Beam team
 --
 Got feedback? go/pabloem-feedback
 


>


Re: File IO - Customer Encrypted keys

2018-08-09 Thread Aniruddh Sharma
Thanks for the reply.

In my application I need to do some aggregations, so I would have to
store the contents of the file in memory, and that will not scale
(unless I can keep reading it record by record and store it in some
intermediary form; I am not allowed to store temporary unencrypted
intermediate data on GCS). In this case, do you suggest I read the
channel, write record by record to Pub/Sub, use it as a buffer, and
build something to consume the records from Pub/Sub?

Thanks
Aniruddh

On Thu, Aug 9, 2018 at 11:57 AM, Lukasz Cwik  wrote:

> Sharding is runner specific. For example, Dataflow chooses shards at
> sources (like TextIO/FileIO/AvroIO/...) and at GroupByKey. Dataflow uses
> a capability exposed in Apache Beam to ask a source to "split" a shard
> that is currently being processed into smaller shards. Dataflow also has
> the ability to concatenate multiple shards to create a larger shard.
> There is a lot more detail about this in the execution model
> documentation[1].
>
> FileIO doesn't "read" the entire file; it just produces metadata records
> which contain the name of the file and a few other properties. The
> ParDo, when accessing the FileIO output, gets a handle to a
> ReadableByteChannel. The ParDo that you write will likely use a
> decrypting ReadableByteChannel that wraps the ReadableByteChannel
> provided by the FileIO output. This means that the ParDo will only use
> roughly the memory needed to read one record, plus whatever overhead the
> decryption implementation and any buffering of the file contents
> require. So the file size doesn't matter; only the record size within
> the file matters.
>
> 1: https://beam.apache.org/documentation/execution-model/
>
>
>
> On Thu, Aug 9, 2018 at 6:46 AM Aniruddh Sharma 
> wrote:
>
>> Thanks, Lukasz, for the help. It's very helpful and I understand it
>> now. But one further question on it.
>> Let's say, using the Java SDK, I implement this approach: 'Your best
>> bet would be to use FileIO[1] followed by a ParDo that accesses the KMS
>> to get the decryption key and wraps the readable channel with a
>> decryptor.'
>>
>> Now, how does Beam create shards for functions called within a ParDo?
>> Is it based on some size of data that it assesses when executing inside
>> the ParDo, and how are these shards executed across different machines?
>>
>> For example:
>> Scenario 1 - The file size is 1 MB and I keep default autoscaling on.
>> On execution, Beam decides to execute the decryption (inside the ParDo)
>> on one machine, and it works.
>> Scenario 2a - The file size is 1 GB (the machine is n1-standard-4) and
>> default autoscaling is on. It still decides to spin up only one
>> machine, which runs out of heap memory and gets killed.
>> Scenario 2b - The file size is 1 GB (the machine is n1-standard-4),
>> autoscaling is off, and I force it to start with 2 machines. It still
>> decides to execute the file-decryption logic on one machine, the other
>> machine doesn't do anything, and it still gets killed.
>>
>> I know the file is unsplittable (for decryption), and I actually want
>> it to execute on only one machine. But because this is inside a ParDo
>> and is application logic, I have not understood in this context when a
>> ParDo decides to execute its function on different machines.
>>
>> For example, in Spark the partitions of an RDD get parallelized and
>> executed either in the same executor (in a different wave or on another
>> core) or on different executors.
>>
>> In the above example, how do I make sure that if the file size
>> increases, the ParDo will not try to scale it? I am new to Beam - if
>> this query doesn't make sense, then apologies in advance.
>>
>> Thanks
>> Aniruddh
>>
>> On Wed, Aug 8, 2018 at 11:52 AM, Lukasz Cwik  wrote:
>>
>>> a) FileIO produces elements which are file descriptors; each descriptor
>>> represents one file. If you have many files to read then you will get
>>> parallel processing, since multiple files will be read concurrently.
>>> You could get parallel reading within a single file, but you would need
>>> to be able to split the decrypted file by knowing how to seek to an
>>> arbitrary position in the encrypted stream and start reading the
>>> decrypted stream without having to decrypt everything before that
>>> point.
>>>
>>> b) You need to add support to the underlying filesystem, like S3[1] or
>>> GCS[2], to support file decryption.
>>>
>>> Note that for solutions a) and b), you can only get efficient parallel
>>> reading within a single file if you can efficiently seek within the
>>> "decrypted" stream. Many encryption methods do not allow for this and
>>> typically require you to decrypt everything up to the point where you
>>> want to read, making parallel reading of a single file extremely
>>> inefficient. This is a common problem for compressed files as well: you
>>> can't choose an arbitrary spot in a compressed file and start
>>> decompressing without decompressing everything before it.
>>>
>>> 1: htt

Re: File IO - Customer Encrypted keys

2018-08-09 Thread Lukasz Cwik
Sharding is runner specific. For example, Dataflow chooses shards at
sources (like TextIO/FileIO/AvroIO/...) and at GroupByKey. Dataflow uses
a capability exposed in Apache Beam to ask a source to "split" a shard
that is currently being processed into smaller shards. Dataflow also has
the ability to concatenate multiple shards to create a larger shard.
There is a lot more detail about this in the execution model
documentation[1].

FileIO doesn't "read" the entire file; it just produces metadata records
which contain the name of the file and a few other properties. The
ParDo, when accessing the FileIO output, gets a handle to a
ReadableByteChannel. The ParDo that you write will likely use a
decrypting ReadableByteChannel that wraps the ReadableByteChannel
provided by the FileIO output. This means that the ParDo will only use
roughly the memory needed to read one record, plus whatever overhead the
decryption implementation and any buffering of the file contents
require. So the file size doesn't matter; only the record size within
the file matters.

1: https://beam.apache.org/documentation/execution-model/
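
To make the "wrap the channel with a decryptor" idea concrete, here is a
minimal sketch (not from the original thread) of a DoFn over
FileIO.ReadableFile. The AES/CTR mode, the zero IV, newline-delimited
records, and the fetchDataKeyFromKms() helper are all assumptions for
illustration; a real pipeline would plug in its own KMS lookup and record
parsing.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;

// Reads each matched file through a decrypting wrapper around the channel
// that FileIO hands out, emitting one record (here: one text line) at a time.
class DecryptingReadFn extends DoFn<FileIO.ReadableFile, String> {

  @ProcessElement
  public void process(ProcessContext c) throws Exception {
    FileIO.ReadableFile file = c.element();

    // fetchDataKeyFromKms() is a hypothetical helper: call your KMS here to
    // resolve the customer-supplied key for this file. The AES/CTR mode and
    // the zero IV below are assumptions for illustration only.
    byte[] keyBytes = fetchDataKeyFromKms(file.getMetadata().resourceId().toString());
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    cipher.init(Cipher.DECRYPT_MODE,
        new SecretKeySpec(keyBytes, "AES"),
        new IvParameterSpec(new byte[16]));

    try (ReadableByteChannel channel = file.open();
        BufferedReader reader = new BufferedReader(new InputStreamReader(
            new CipherInputStream(Channels.newInputStream(channel), cipher),
            StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        c.output(line);  // memory use stays proportional to one record
      }
    }
  }

  private static byte[] fetchDataKeyFromKms(String fileName) {
    throw new UnsupportedOperationException("KMS lookup is environment-specific");
  }
}

// Hypothetical usage in a pipeline:
//   PCollection<String> lines = pipeline
//       .apply(FileIO.match().filepattern("gs://my-bucket/encrypted/*.enc"))
//       .apply(FileIO.readMatches())
//       .apply(ParDo.of(new DecryptingReadFn()));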



On Thu, Aug 9, 2018 at 6:46 AM Aniruddh Sharma  wrote:

> Thanks, Lukasz, for the help. It's very helpful and I understand it
> now. But one further question on it.
> Let's say, using the Java SDK, I implement this approach: 'Your best
> bet would be to use FileIO[1] followed by a ParDo that accesses the KMS
> to get the decryption key and wraps the readable channel with a
> decryptor.'
>
> Now, how does Beam create shards for functions called within a ParDo?
> Is it based on some size of data that it assesses when executing inside
> the ParDo, and how are these shards executed across different machines?
>
> For example:
> Scenario 1 - The file size is 1 MB and I keep default autoscaling on.
> On execution, Beam decides to execute the decryption (inside the ParDo)
> on one machine, and it works.
> Scenario 2a - The file size is 1 GB (the machine is n1-standard-4) and
> default autoscaling is on. It still decides to spin up only one
> machine, which runs out of heap memory and gets killed.
> Scenario 2b - The file size is 1 GB (the machine is n1-standard-4),
> autoscaling is off, and I force it to start with 2 machines. It still
> decides to execute the file-decryption logic on one machine, the other
> machine doesn't do anything, and it still gets killed.
>
> I know the file is unsplittable (for decryption), and I actually want
> it to execute on only one machine. But because this is inside a ParDo
> and is application logic, I have not understood in this context when a
> ParDo decides to execute its function on different machines.
>
> For example, in Spark the partitions of an RDD get parallelized and
> executed either in the same executor (in a different wave or on another
> core) or on different executors.
>
> In the above example, how do I make sure that if the file size
> increases, the ParDo will not try to scale it? I am new to Beam - if
> this query doesn't make sense, then apologies in advance.
>
> Thanks
> Aniruddh
>
> On Wed, Aug 8, 2018 at 11:52 AM, Lukasz Cwik  wrote:
>
>> a) FileIO produces elements which are file descriptors; each descriptor
>> represents one file. If you have many files to read then you will get
>> parallel processing, since multiple files will be read concurrently.
>> You could get parallel reading within a single file, but you would need
>> to be able to split the decrypted file by knowing how to seek to an
>> arbitrary position in the encrypted stream and start reading the
>> decrypted stream without having to decrypt everything before that
>> point.
>>
>> b) You need to add support to the underlying filesystem, like S3[1] or
>> GCS[2], to support file decryption.
>>
>> Note that for solutions a) and b), you can only get efficient parallel
>> reading within a single file if you can efficiently seek within the
>> "decrypted" stream. Many encryption methods do not allow for this and
>> typically require you to decrypt everything up to the point where you
>> want to read, making parallel reading of a single file extremely
>> inefficient. This is a common problem for compressed files as well: you
>> can't choose an arbitrary spot in a compressed file and start
>> decompressing without decompressing everything before it.
>>
>> 1:
>> https://github.com/apache/beam/blob/master/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java
>> 2:
>> https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java
>>
>> On Tue, Aug 7, 2018 at 1:06 PM Aniruddh Sharma 
>> wrote:
>>
>>> Thanks, Lukasz, for the reply.
>>> a) I have a further question; I may not be understanding it right. If I
>>> embed this logic in a ParDo, won't the logic limit the read to only one
>>> file (since it is written directly in the ParDo), or will it still be a
>>> parallel operation?
>>>

Re: Schema Aware PCollections

2018-08-09 Thread Akanksha Sharma B
Hi Anton,


Thank you !!!


Regards,

Akanksha


From: Anton Kedin 
Sent: Wednesday, August 8, 2018 9:57:33 PM
To: user@beam.apache.org
Cc: d...@beam.apache.org
Subject: Re: Schema Aware PCollections

Yes, this should be possible eventually. In fact, a limited version of this
functionality is already supported for Beans (e.g. see this test), but it's
still experimental and there are no good end-to-end examples yet.
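
For reference, a hedged sketch of what Bean-based schema inference looks
like: a plain JavaBean annotated so Beam can infer its schema from the
getters and setters. The annotation shown (@DefaultSchema with
JavaBeanSchema) reflects the experimental schema API and may differ between
releases; the field names are made up for the example.

import org.apache.beam.sdk.schemas.JavaBeanSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Beam can infer a schema (userId: STRING, amount: DOUBLE) from this bean
// at pipeline-construction time, so its PCollections become schema-aware.
@DefaultSchema(JavaBeanSchema.class)
public class Purchase {
  private String userId;
  private double amount;

  public Purchase() {}

  public String getUserId() { return userId; }
  public void setUserId(String userId) { this.userId = userId; }
  public double getAmount() { return amount; }
  public void setAmount(double amount) { this.amount = amount; }
}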

Regards,
Anton

On Wed, Aug 8, 2018 at 5:45 AM Akanksha Sharma B
<akanksha.b.sha...@ericsson.com> wrote:

Hi,


(changed the email-subject to make it generic)


It is mentioned in the Schema-Aware PCollections design doc
(https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc)


"There are a number of existing data types from which schemas can be inferred. 
Protocol buffers, Avro objects, Json objects, POJOs, primitive Java types - all 
of these have schemas that can be inferred from the type itself at 
pipeline-construction time. We should be able to automatically infer these 
schemas with a minimum of involvement from the programmer. "

Can I assume that the following use case will be possible sometime in the
future: "read Parquet (along with the inferred schema) into something like a
dataframe or Beam Rows, and vice versa for writes, i.e. get Rows and write
Parquet based on the Row's schema."

Regards,
Akanksha


From: Chamikara Jayalath <chamik...@google.com>
Sent: Wednesday, August 1, 2018 3:57 PM
To: user@beam.apache.org
Cc: d...@beam.apache.org
Subject: Re: pipeline with parquet and sql



On Wed, Aug 1, 2018 at 1:12 AM Akanksha Sharma B
<akanksha.b.sha...@ericsson.com> wrote:

Hi,


Thanks. I understood the Parquet point. I will wait a couple of days on this
topic. Even if this scenario cannot be achieved now, any design document or
future plans in this direction would also be helpful to me.


To summarize, I do not understand Beam well enough; can someone please help
me and comment on whether the following fits Beam's model and future
direction:

"read Parquet (along with the inferred schema) into something like a
dataframe or Beam Rows, and vice versa for writes, i.e. get Rows and write
Parquet based on the Row's schema."

Beam currently does not have a standard message format. A Beam pipeline
consists of PCollections and transforms (which convert PCollections into
other PCollections). You can transform the PCollection read from Parquet
using a ParDo and write the resulting PCollection back to Parquet format. I
think Schema-aware PCollections [1] might be close to what you need, but I'm
not sure it fulfills your exact requirement.

Thanks,
Cham

[1]  
https://lists.apache.org/thread.html/fe327866c6c81b7e55af28f81cedd9b2e588279def330940e8b8ebd7@%3Cdev.beam.apache.org%3E
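
As a hedged sketch of where this is heading: once records are available as
schema-aware Rows, Beam SQL can be applied directly. The schema, column
names, and query below are placeholders; SqlTransform is the entry point in
newer releases (older ones used BeamSql), and the input PCollection<Row> is
assumed to have its schema attached (e.g. via setRowSchema).

import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

class SqlOverRowsExample {
  // Beam schema describing the columns the SQL below refers to; a real
  // pipeline would attach it to the PCollection with setRowSchema(ROW_SCHEMA).
  static final Schema ROW_SCHEMA =
      Schema.builder().addStringField("user_id").addDoubleField("amount").build();

  // Applies a SQL aggregation to a schema-aware PCollection<Row>.
  static PCollection<Row> totalsPerUser(PCollection<Row> rows) {
    return rows.apply(
        "SumPerUser",
        SqlTransform.query(
            "SELECT user_id, SUM(amount) AS total FROM PCOLLECTION GROUP BY user_id"));
  }
}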





Regards,

Akanksha



From: Łukasz Gajowy <lukasz.gaj...@gmail.com>
Sent: Tuesday, July 31, 2018 12:43:32 PM
To: user@beam.apache.org
Cc: d...@beam.apache.org
Subject: Re: pipeline with parquet and sql

In terms of schema and the ParquetIO source/sink, there was an answer in a
previous thread:

Currently (without introducing any change in ParquetIO) there is no way to
avoid passing the Avro schema. It will probably be replaced with Beam's
schema in the future [1].

[1] 
https://lists.apache.org/thread.html/a466ddeb55e47fd780be3bcd8eec9d6b6eaf1dfd566ae5278b5fb9e8@%3Cuser.beam.apache.org%3E
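
For concreteness, a minimal sketch of what "passing the Avro schema" to
ParquetIO looks like today, assuming the beam-sdks-java-io-parquet module is
on the classpath; the file pattern and schema string are placeholders.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.values.PCollection;

class ParquetReadExample {
  // ParquetIO.read currently requires the Avro schema up front; the records
  // come back as GenericRecord and must be converted explicitly if Rows or
  // POJOs are needed downstream.
  static PCollection<GenericRecord> readParquet(Pipeline p, String avroSchemaJson) {
    Schema avroSchema = new Schema.Parser().parse(avroSchemaJson);
    return p.apply(
        "ReadParquet",
        ParquetIO.read(avroSchema).from("gs://my-bucket/input/*.parquet"));
  }
}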


On Tue, Jul 31, 2018 at 10:19 Akanksha Sharma B
<akanksha.b.sha...@ericsson.com> wrote:

Hi,


I am hoping to get some hints/pointers from the experts here.

I hope the scenario described below is understandable and that it is a valid
use case. Please let me know if I need to explain the scenario better.


Regards,

Akanksha


From: Akanksha Sharma B
Sent: Friday, July 27, 2018 9:44 AM
To: d...@beam.apache.org
Subject: Re: pipeline with parquet and sql


Hi,


Please consider the following pipeline:


The source is a Parquet file with hundreds of columns.

The sink is Parquet. Multiple output Parquet files are generated after
applying some SQL joins; the joins to be applied differ for each output
Parquet file. Let's assume we have a SQL query generator or some
configuration file with the needed info.


Can this be implemented generically, such that there is no need for the
schema of the Parquet files involved, or for any intermediate POJO or Beam
schema?

i.e. the way Spark can handle it: read Parquet into a dataframe, create a
temp view, apply SQL queries to it, and write it back to Parquet.

As I understand, Beam SQL needs (a Beam Schema or POJOs) and ParquetIO need

Re: File IO - Customer Encrypted keys

2018-08-09 Thread Aniruddh Sharma
Thanks, Lukasz, for the help. It's very helpful and I understand it now.
But one further question on it.
Let's say, using the Java SDK, I implement this approach: 'Your best bet
would be to use FileIO[1] followed by a ParDo that accesses the KMS to
get the decryption key and wraps the readable channel with a decryptor.'

Now, how does Beam create shards for functions called within a ParDo? Is
it based on some size of data that it assesses when executing inside the
ParDo, and how are these shards executed across different machines?

For example:
Scenario 1 - The file size is 1 MB and I keep default autoscaling on. On
execution, Beam decides to execute the decryption (inside the ParDo) on
one machine, and it works.
Scenario 2a - The file size is 1 GB (the machine is n1-standard-4) and
default autoscaling is on. It still decides to spin up only one machine,
which runs out of heap memory and gets killed.
Scenario 2b - The file size is 1 GB (the machine is n1-standard-4),
autoscaling is off, and I force it to start with 2 machines. It still
decides to execute the file-decryption logic on one machine, the other
machine doesn't do anything, and it still gets killed.

I know the file is unsplittable (for decryption), and I actually want it
to execute on only one machine. But because this is inside a ParDo and
is application logic, I have not understood in this context when a ParDo
decides to execute its function on different machines.

For example, in Spark the partitions of an RDD get parallelized and
executed either in the same executor (in a different wave or on another
core) or on different executors.

In the above example, how do I make sure that if the file size
increases, the ParDo will not try to scale it? I am new to Beam - if
this query doesn't make sense, then apologies in advance.

Thanks
Aniruddh

On Wed, Aug 8, 2018 at 11:52 AM, Lukasz Cwik  wrote:

> a) FileIO produces elements which are file descriptors; each descriptor
> represents one file. If you have many files to read then you will get
> parallel processing, since multiple files will be read concurrently. You
> could get parallel reading within a single file, but you would need to
> be able to split the decrypted file by knowing how to seek to an
> arbitrary position in the encrypted stream and start reading the
> decrypted stream without having to decrypt everything before that point.
>
> b) You need to add support to the underlying filesystem, like S3[1] or
> GCS[2], to support file decryption.
>
> Note that for solutions a) and b), you can only get efficient parallel
> reading within a single file if you can efficiently seek within the
> "decrypted" stream. Many encryption methods do not allow for this and
> typically require you to decrypt everything up to the point where you
> want to read, making parallel reading of a single file extremely
> inefficient. This is a common problem for compressed files as well: you
> can't choose an arbitrary spot in a compressed file and start
> decompressing without decompressing everything before it.
>
> 1: https://github.com/apache/beam/blob/master/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java
> 2: https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java
>
> On Tue, Aug 7, 2018 at 1:06 PM Aniruddh Sharma 
> wrote:
>
>> Thanks, Lukasz, for the reply.
>> a) I have a further question; I may not be understanding it right. If I
>> embed this logic in a ParDo, won't the logic limit the read to only one
>> file (since it is written directly in the ParDo), or will it still be a
>> parallel operation? If it is a parallel operation, please elaborate.
>>
>> b) Is it possible to extend TextIO or AvroIO to decrypt with
>> customer-supplied keys? If yes, please elaborate on how to do it.
>>
>> Thanks
>> Aniruddh
>>
>> On Mon, Aug 6, 2018 at 8:30 PM, Lukasz Cwik  wrote:
>>
>>> I'm assuming you're talking about using the Java SDK. Additional
>>> details about which SDK language, filesystem, and KMS you want to
>>> access would be useful if the approach below doesn't work for you.
>>>
>>> Your best bet would be to use FileIO[1] followed by a ParDo that
>>> accesses the KMS to get the decryption key and wraps the readable
>>> channel with a decryptor.
>>> Note that you'll need to apply your own file parsing logic to pull out
>>> the records, as you won't be able to use things like AvroIO/TextIO to
>>> do the record parsing for you.
>>>
>>> 1: https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/FileIO.html
>>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 10:27 AM asharma...@gmail.com <
>>> asharma...@gmail.com> wrote:
>>>

 What is the best suggested way to read a file encrypted with a customer
 key that is also wrapped in KMS?

>>>
>>