Re: Docs/guidelines on writing filesystem sources and sinks
Hi Dmitry, I'm excited to hear that you'd like to do this work. If you haven't already, I'd first suggest that you open a JIRA issue to make sure other folks know you're working on this. I was involved in working on the recent java HDFS file system implementation, so I'll try and share what I know - I suspect knowledge about this is scattered around a bit, so hopefully others will chime in as well. > 1. Are there any official or non-official guidelines or docs on writing filesystems? Even Java-specific ones may be really useful. I don't know of any guides for writing IOs. I believe folks should be helpful here on the mailing list for specific questions, but there aren't that many that are experts in file system implementations. It's not expected to be a frequent task, so no one has tried to document it (it also means your contribution will have a wide impact!) If you wanted to write up your notes from the process, it'd likely be highly helpful to others. https://issues.apache.org/jira/browse/BEAM-2005 documents the work that we did to add the java Hadoop FileSystem implementation, so that might be a good guide - it has links to PRs, you can find out about design questions that came up there, etc.. The Hadoop FileSystem is relatively new, so reviewing its commit history may be very informative. > 2. Are there any existing generic test suites that every filesystem is supposed to pass? Again, even if they exist only in Java world, I'd still be down for trying to adopt them in Python SDK too. I don't know of any. If you put together a test plan, we'd be happy to discuss it. The tests for the java Hadoop FileSystem represent the current thinking, but could likely be expanded on. > 3. Are there any established ideas of how to pass AWS credentials to Beam for making the S3 filesystem actually work? Looks like you already found the past discussions of this on the mailing list, that was what I would refer you to. > I also stumbled upon a problem that I can't really pass additional configuration to a filesystem, We had a similar problem with the hadoop configuration object - inside of the hadoop filesystem registrar, we read the pipeline options to see if there is configuration info there, as well as some default hadoop configuration file locations. See https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45 The python folks will have to comment if that's the type of solution they want you to use though. I hope this helps! Stephen On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchukwrote: > I also stumbled upon a problem that I can't really pass additional > configuration to a filesystem, e.g. > > lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt', > aws_config=AWSConfig()) > > because the ReadFromText class relies on PTransform's constructor, which > has a pre-defined set of arguments. > > This is probably becoming a cross-topic for the dev list (have I added it > in the right way?) > > On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk > wrote: > >> Hi folks, >> >> I'm working on an S3 filesystem for the Python SDK, which already works >> in case of a happy path for both reading and writing, but I feel like there >> are quite a few edge cases that I'm likely missing. >> >> So far, my approach has been: "look at the generic FileSystem >> implementation, look at how gcsio.py and gcsfilesystem.py are written, try >> to copy their approach as much as possible, at least for getting to the >> proof of concept". >> >> That said, I'd like to know a few things: >> >> 1. Are there any official or non-official guidelines or docs on writing >> filesystems? Even Java-specific ones may be really useful. >> >> 2. Are there any existing generic test suites that every filesystem is >> supposed to pass? Again, even if they exist only in Java world, I'd still >> be down for trying to adopt them in Python SDK too. >> >> 3. Are there any established ideas of how to pass AWS credentials to Beam >> for making the S3 filesystem actually work? I currently rely on the >> existing environment variables, which boto just picks up, but sounds like >> setting them up in runners like Dataflow or Spark would be troublesome. >> I've seen this discussion a couple times in the list, but couldn't tell if >> any closure was found. My personal preference would be having AWS settings >> passed in some global context (pipeline options, perhaps?), but there may >> be exceptions to that (say, people want to use different credentials for >> different AWS operations). >> >> Thanks! >> >> -- >> Best regards, >> Dmitry Demeshchuk. >> > > > > -- > Best regards, > Dmitry Demeshchuk. >
Re: Docs/guidelines on writing filesystem sources and sinks
Currently we don't have official documentation or a testing guide for adding new FileSystems. Best source here would be existing FileSystem implementations, as you mentioned. I don't think parameters for initiating FileSystems should be passed when creating a read transform. Can you try to get any config parameters from the environment instead ? Note that for distributed runners, you will have to register environment variables in workers in a runner specific way (for example, for Dataflow runner, this could be through an additional package that gets installed in workers). I think +Sourabh Bajajwas looking into providing a better solution for this. - Cham On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk wrote: > I also stumbled upon a problem that I can't really pass additional > configuration to a filesystem, e.g. > > lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt', > aws_config=AWSConfig()) > > because the ReadFromText class relies on PTransform's constructor, which > has a pre-defined set of arguments. > > This is probably becoming a cross-topic for the dev list (have I added it > in the right way?) > > On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk > wrote: > >> Hi folks, >> >> I'm working on an S3 filesystem for the Python SDK, which already works >> in case of a happy path for both reading and writing, but I feel like there >> are quite a few edge cases that I'm likely missing. >> >> So far, my approach has been: "look at the generic FileSystem >> implementation, look at how gcsio.py and gcsfilesystem.py are written, try >> to copy their approach as much as possible, at least for getting to the >> proof of concept". >> >> That said, I'd like to know a few things: >> >> 1. Are there any official or non-official guidelines or docs on writing >> filesystems? Even Java-specific ones may be really useful. >> >> 2. Are there any existing generic test suites that every filesystem is >> supposed to pass? Again, even if they exist only in Java world, I'd still >> be down for trying to adopt them in Python SDK too. >> >> 3. Are there any established ideas of how to pass AWS credentials to Beam >> for making the S3 filesystem actually work? I currently rely on the >> existing environment variables, which boto just picks up, but sounds like >> setting them up in runners like Dataflow or Spark would be troublesome. >> I've seen this discussion a couple times in the list, but couldn't tell if >> any closure was found. My personal preference would be having AWS settings >> passed in some global context (pipeline options, perhaps?), but there may >> be exceptions to that (say, people want to use different credentials for >> different AWS operations). >> >> Thanks! >> >> -- >> Best regards, >> Dmitry Demeshchuk. >> > > > > -- > Best regards, > Dmitry Demeshchuk. >
Re: Docs/guidelines on writing filesystem sources and sinks
I also stumbled upon a problem that I can't really pass additional configuration to a filesystem, e.g. lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt', aws_config=AWSConfig()) because the ReadFromText class relies on PTransform's constructor, which has a pre-defined set of arguments. This is probably becoming a cross-topic for the dev list (have I added it in the right way?) On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchukwrote: > Hi folks, > > I'm working on an S3 filesystem for the Python SDK, which already works in > case of a happy path for both reading and writing, but I feel like there > are quite a few edge cases that I'm likely missing. > > So far, my approach has been: "look at the generic FileSystem > implementation, look at how gcsio.py and gcsfilesystem.py are written, try > to copy their approach as much as possible, at least for getting to the > proof of concept". > > That said, I'd like to know a few things: > > 1. Are there any official or non-official guidelines or docs on writing > filesystems? Even Java-specific ones may be really useful. > > 2. Are there any existing generic test suites that every filesystem is > supposed to pass? Again, even if they exist only in Java world, I'd still > be down for trying to adopt them in Python SDK too. > > 3. Are there any established ideas of how to pass AWS credentials to Beam > for making the S3 filesystem actually work? I currently rely on the > existing environment variables, which boto just picks up, but sounds like > setting them up in runners like Dataflow or Spark would be troublesome. > I've seen this discussion a couple times in the list, but couldn't tell if > any closure was found. My personal preference would be having AWS settings > passed in some global context (pipeline options, perhaps?), but there may > be exceptions to that (say, people want to use different credentials for > different AWS operations). > > Thanks! > > -- > Best regards, > Dmitry Demeshchuk. > -- Best regards, Dmitry Demeshchuk.
Docs/guidelines on writing filesystem sources and sinks
Hi folks, I'm working on an S3 filesystem for the Python SDK, which already works in case of a happy path for both reading and writing, but I feel like there are quite a few edge cases that I'm likely missing. So far, my approach has been: "look at the generic FileSystem implementation, look at how gcsio.py and gcsfilesystem.py are written, try to copy their approach as much as possible, at least for getting to the proof of concept". That said, I'd like to know a few things: 1. Are there any official or non-official guidelines or docs on writing filesystems? Even Java-specific ones may be really useful. 2. Are there any existing generic test suites that every filesystem is supposed to pass? Again, even if they exist only in Java world, I'd still be down for trying to adopt them in Python SDK too. 3. Are there any established ideas of how to pass AWS credentials to Beam for making the S3 filesystem actually work? I currently rely on the existing environment variables, which boto just picks up, but sounds like setting them up in runners like Dataflow or Spark would be troublesome. I've seen this discussion a couple times in the list, but couldn't tell if any closure was found. My personal preference would be having AWS settings passed in some global context (pipeline options, perhaps?), but there may be exceptions to that (say, people want to use different credentials for different AWS operations). Thanks! -- Best regards, Dmitry Demeshchuk.
Re: Providing HTTP client to DoFn
#1: For all runners, the side input needs to be ready (data needs to exist for the given window) before the main input is executed which means that in your case the whole side input will be materialized before the main input is executed. #2: For Dataflow, a map/multimap based side input is loaded lazily in parts based upon which key is being accessed. Each segment of the map is cached in memory (using an LRU policy) and the loading the data remotely is the largest cost in such a system. Depending on how large your main input is, performing a group by key on your access key will speed up your lookups (because you'll get a lot more cache hits) but you have to weight the cost of doing the GBK vs speed up in side input usage. What do you mean by "expanding the tuples to the expanded data"? * Are you trying to say that typically you'll look up the same value 100+ times from the side input ** In this case performing a GBK based upon your lookup key may be of benefit * Are you trying to say that you could have the data stored within the side input instead of just the index but it would be 100 times larger? ** A map based side input which has values which are 4 bytes vs 400 bytes isn't going to change much in lookup cost On Wed, Jul 5, 2017 at 6:22 PM, Randal Moorewrote: > Based on my understanding so far, I'm targeting Dataflow with a batch > pipeline. Just starting to experiment with the setup/teardown with the > local runner - that might work fine. > > Somewhat intrigued with the side inputs, though. The pipeline might > iterate over 1,000,000 tuples of two integers. The integers are indices > into a database of data. A given integer will be repeated in the inputs > many times. Am I prematurely optimizing to rule out expanding the tuples > to the expanded data as each value might be expanded 100 or more times? As > side inputs, it might expand to ~100GB. Expanding the input would be > significantly bigger. > > #1 how does Dataflow schedule the pipeline with a map side input - does it > wait until the whole map is collected? > #2 can the DoFn specify that it depends on only specific keys of the side > input map? does that affect the scheduling of the DoFn? > > Thanks for any pointers... > rdm > > On Wed, Jul 5, 2017 at 4:58 PM Lukasz Cwik wrote: > >> That should have said: >> ~100s MiBs per window in streaming pipelines >> >> On Wed, Jul 5, 2017 at 2:58 PM, Lukasz Cwik wrote: >> >>> #1, side inputs supported sizes and performance are specific to a >>> runner. For example, I know that Dataflow supports side inputs which are 1+ >>> TiB (aggregate) in batch pipelines and ~100s MiBs per window because there >>> have been several one off benchmarks/runs. What kinds of sizes/use case do >>> you want to support, some runners will do a much better job with really >>> small side inputs while others will be better with really large side inputs? >>> >>> #2, this depends on which library your using to perform the REST calls >>> and whether it is thread safe. DoFns can be shared across multiple bundles >>> and can contain methods marked with @Setup/@Teardown which only get invoked >>> once per DoFn instance (which is relatively infrequently) and you could >>> store an instance per DoFn instead of a singleton if the REST library was >>> not thread safe. >>> >>> On Wed, Jul 5, 2017 at 2:45 PM, Randal Moore >>> wrote: >>> I have a step in my beam pipeline that needs some data from a rest service. The data acquired from the rest service is dependent on the context of the data being processed and relatively large. The rest client I am using isn't serializable - nor is it likely possible to make it so (background threads, etc.). #1 What are the practical limits to the size of side inputs (e.g., I could try to gather all the data from the rest service and provide it as a side-input)? #2 Assuming that using the rest client is the better option, would a singleton instance be safe way to instantiate the rest client? Thanks, rdm >>> >>> >>
Re: Slack Invite
Thank you. I have joined the channel. From: Manu ZhangReply-To: "user@beam.apache.org" Date: Thursday, July 6, 2017 at 9:47 AM To: "user@beam.apache.org" Subject: Re: Slack Invite Hi Saravana, You've been invited. On Thu, Jul 6, 2017 at 9:43 PM Durairaj, Saravanakumar > wrote: Hi, I would like to join the slack channel. Thank you, Saravana The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment. The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.
Re: Slack Invite
Hi Saravana, You've been invited. On Thu, Jul 6, 2017 at 9:43 PM Durairaj, Saravanakumar < saravanakumar_durai...@homedepot.com> wrote: > Hi, > > > > I would like to join the slack channel. > > > > Thank you, > > Saravana > > -- > > The information in this Internet Email is confidential and may be legally > privileged. It is intended solely for the addressee. Access to this Email > by anyone else is unauthorized. If you are not the intended recipient, any > disclosure, copying, distribution or any action taken or omitted to be > taken in reliance on it, is prohibited and may be unlawful. When addressed > to our clients any opinions or advice contained in this Email are subject > to the terms and conditions expressed in any applicable governing The Home > Depot terms of business or client engagement letter. The Home Depot > disclaims all responsibility and liability for the accuracy and content of > this attachment and for any damages or losses arising from any > inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other > items of a destructive nature, which may be contained in this attachment > and shall not be liable for direct, indirect, consequential or special > damages in connection with this e-mail message or its attachment. >
Slack Invite
Hi, I would like to join the slack channel. Thank you, Saravana The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.
Re: Join slack
Hi Jb, Thank you! I have accepted and are in. Best Dennis On Thu, 6 Jul 2017 at 08:31 Jean-Baptiste Onofréwrote: > You should have received an invite. > > Welcome ! > > Regards > JB > > On 07/06/2017 08:29 AM, Dennis Mårtensson wrote: > > Hi, > > > > I would like to join the slack channel. > > > > Thank you > > > > Dennis > > -- > Jean-Baptiste Onofré > jbono...@apache.org > http://blog.nanthrax.net > Talend - http://www.talend.com >
Re: Join slack
You should have received an invite. Welcome ! Regards JB On 07/06/2017 08:29 AM, Dennis Mårtensson wrote: Hi, I would like to join the slack channel. Thank you Dennis -- Jean-Baptiste Onofré jbono...@apache.org http://blog.nanthrax.net Talend - http://www.talend.com
Join slack
Hi, I would like to join the slack channel. Thank you Dennis