Thanks for the reply. In my application I need to do some aggregations, so I would have to hold the contents of the file in memory, and that will not scale (unless I can keep reading it record by record and store it in some intermediary form; I am not allowed to store temporary intermediate unencrypted data on GCS). In this case, would you suggest reading the channel, writing record by record to Pub/Sub to use it as a buffer, and building something to consume the records from Pub/Sub?

Thanks
Aniruddh
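A minimal sketch of that buffering idea, assuming the decrypting ParDo discussed below emits each record as a String. PubsubIO is Beam's Pub/Sub connector, but the project/topic names and the records/streamingPipeline variables here are placeholders for illustration:

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.joda.time.Duration;

    // Write side (the batch pipeline doing the decryption): buffer records in
    // Pub/Sub instead of holding them in memory or writing them to GCS.
    records  // PCollection<String> emitted by the decrypting ParDo
        .apply("BufferInPubsub", PubsubIO.writeStrings()
            .to("projects/my-project/topics/decrypted-records"));

    // Read side (a separate streaming pipeline): consume the buffer and
    // aggregate in windows, so no single worker holds the whole file.
    streamingPipeline
        .apply(PubsubIO.readStrings()
            .fromTopic("projects/my-project/topics/decrypted-records"))
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        .apply(Count.perElement());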
On Thu, Aug 9, 2018 at 11:57 AM, Lukasz Cwik <[email protected]> wrote:

Sharding is runner specific. For example, Dataflow chooses shards at sources (like TextIO/FileIO/AvroIO/...) and at GroupByKey. Dataflow uses a capability exposed in Apache Beam to ask a source to "split" a shard that is currently being processed into smaller shards. Dataflow also has the ability to concatenate multiple shards, creating a larger shard. There is a lot more detail about this in the execution model documentation[1].

FileIO doesn't "read" the entire file; it just produces metadata records that contain the name of the file and a few other properties. The ParDo, when accessing the FileIO output, gets a handle to a ReadableByteChannel. The ParDo that you write will likely use a decrypting ReadableByteChannel that wraps the ReadableByteChannel provided by the FileIO output. This means the ParDo only uses an amount of memory approximately equivalent to reading a record, plus whatever overhead the decryption implementation and any buffering of the file contents require. The file size therefore doesn't matter; only the record size within the file matters.

1: https://beam.apache.org/documentation/execution-model/
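A minimal sketch of that wrapping pattern with the Java SDK, assuming AES/CBC encryption with a 16-byte IV prepended to the ciphertext. fetchPlaintextKeyFromKms() is a hypothetical helper standing in for the KMS unwrap call (one possible shape for it is sketched further down the thread), and p is the Pipeline:

    import java.io.BufferedReader;
    import java.io.DataInputStream;
    import java.io.InputStreamReader;
    import java.nio.channels.Channels;
    import java.nio.charset.StandardCharsets;
    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    p.apply(FileIO.match().filepattern("gs://my-bucket/encrypted/*.enc"))
     .apply(FileIO.readMatches())
     .apply(ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
       @ProcessElement
       public void process(@Element FileIO.ReadableFile file,
                           OutputReceiver<String> out) throws Exception {
         // Hypothetical helper that unwraps the customer key via KMS.
         byte[] key = fetchPlaintextKeyFromKms();
         // Wrap the channel handed out by FileIO in a decrypting stream;
         // this sketch assumes the IV sits in the first 16 bytes of the file.
         DataInputStream raw =
             new DataInputStream(Channels.newInputStream(file.open()));
         byte[] iv = new byte[16];
         raw.readFully(iv);
         Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
         cipher.init(Cipher.DECRYPT_MODE,
             new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
         try (BufferedReader reader = new BufferedReader(new InputStreamReader(
             new CipherInputStream(raw, cipher), StandardCharsets.UTF_8))) {
           String line;
           while ((line = reader.readLine()) != null) {
             out.output(line);  // one record at a time; memory stays bounded
           }
         }
       }
     }));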
On Thu, Aug 9, 2018 at 6:46 AM Aniruddh Sharma <[email protected]> wrote:

Thanks, Lukasz, for the help. It's very helpful and I understand it now. But one further question on it. Let's say that, using the Java SDK, I implement this approach: "Your best bet would be use FileIO[1] followed by a ParDo that accesses the KMS to get the decryption key and wraps the readable channel with a decryptor."

Now how does Beam create shards for functions called within a ParDo? Is it based on some size of data that it assesses when executing inside the ParDo, and how are these shards executed against different machines?

For example:
Scenario 1 - File size is 1 MB and I keep default Autoscaling on. On execution, Beam decides to execute decryption (inside the ParDo) on one machine and it works.
Scenario 2a - File size is 1 GB (machine is n1-standard-4) and default Autoscaling is on. It still decides to spin up only one machine, goes out of heap memory, and gets itself killed.
Scenario 2b - File size is 1 GB (machine is n1-standard-4), default Autoscaling is off, and I force it to start with 2 machines. It still decides to execute the file-decrypting logic on one machine while the other machine does nothing, and it still gets killed.

I know the file is unsplittable (for decryption) and I actually do want it to execute on only one machine. But because this is inside a ParDo and is application logic, I have not understood in this context when a ParDo decides to execute a function on different machines.

For example, in Spark the number of partitions in an RDD will be parallelized and executed either in the same Executor (in a different wave or with another core) or on different Executors.

In the above example, how do I make sure that if the file size increases, the ParDo will not try to scale it? I am new to Beam - if this query doesn't make sense, then apologies in advance.

Thanks
Aniruddh

On Wed, Aug 8, 2018 at 11:52 AM, Lukasz Cwik <[email protected]> wrote:

a) FileIO produces elements which are file descriptors; each descriptor represents one file. If you have many files to read then you will get parallel processing, since multiple files will be read concurrently. You could get parallel reading within a single file, but you would need to be able to split the decrypted file by knowing how to randomly seek in the encrypted stream and start reading the decrypted stream without having to decrypt everything before that point.

b) You need to add support to the underlying filesystem, like S3[1] or GCS[2], to support file decryption.

Note for solutions a) and b): you can only get efficient parallel reading within a single file if you can efficiently seek within the "decrypted" stream. Many encryption methods do not allow for this and typically require you to decrypt everything up to the point where you want to read, making parallel reading of a single file extremely inefficient. This is a common problem for compressed files as well; you can't choose an arbitrary spot in a compressed file and start decompressing without decompressing everything before it.

1: https://github.com/apache/beam/blob/master/sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/s3/S3FileSystem.java
2: https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsFileSystem.java

On Tue, Aug 7, 2018 at 1:06 PM Aniruddh Sharma <[email protected]> wrote:

Thanks, Lukasz, for the reply.
a) I have a further question; I may not be understanding it right. If I embed this logic in a ParDo, won't the logic limit the read to only one file, since I am writing it directly in the ParDo, or will it still be a parallel operation? If it is a parallel operation, please elaborate more.

b) Is it possible to extend TextIO or AvroIO to decrypt keys (customer supplied)? If yes, please elaborate on how to do it.

Thanks
Aniruddh

On Mon, Aug 6, 2018 at 8:30 PM, Lukasz Cwik <[email protected]> wrote:

I'm assuming you're talking about using the Java SDK. Additional details about which SDK language, filesystem, and KMS you want to access would be useful if the approach below doesn't work for you.

Your best bet would be to use FileIO[1] followed by a ParDo that accesses the KMS to get the decryption key and wraps the readable channel with a decryptor. Note that you'll need to apply your own file-parsing logic to pull out the records, as you won't be able to use things like AvroIO/TextIO to do the record parsing for you.

1: https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/io/FileIO.html

On Tue, Jul 31, 2018 at 10:27 AM [email protected] <[email protected]> wrote:

What is the best suggested way to read a file encrypted with a customer key which is also wrapped in KMS?
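For the "wrapped in KMS" part of the original question, a possible shape for the fetchPlaintextKeyFromKms() helper used in the sketch above, based on the Google Cloud KMS client library. The key resource names are placeholders, and the wrapped data key is assumed to be supplied by the caller (for example, stored alongside the file):

    import com.google.cloud.kms.v1.CryptoKeyName;
    import com.google.cloud.kms.v1.DecryptResponse;
    import com.google.cloud.kms.v1.KeyManagementServiceClient;
    import com.google.protobuf.ByteString;

    static byte[] fetchPlaintextKeyFromKms(byte[] wrappedKey) throws Exception {
      try (KeyManagementServiceClient client = KeyManagementServiceClient.create()) {
        // Placeholder resource names for the KMS key that wraps the data key.
        CryptoKeyName keyName =
            CryptoKeyName.of("my-project", "global", "my-key-ring", "my-key");
        // Ask KMS to unwrap the customer-supplied data key.
        DecryptResponse response =
            client.decrypt(keyName, ByteString.copyFrom(wrappedKey));
        return response.getPlaintext().toByteArray();
      }
    }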
