Currently we don't have official documentation or a testing guide for adding new FileSystems. The best source here would be the existing FileSystem implementations, as you mentioned.
I don't think parameters for initiating FileSystems should be passed when creating a read transform. Can you try to get any config parameters from the environment instead? Note that for distributed runners, you will have to register environment variables on workers in a runner-specific way (for example, for the Dataflow runner, this could be done through an additional package that gets installed on workers). I think +Sourabh Bajaj <[email protected]> was looking into providing a better solution for this.

- Cham

On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk <[email protected]> wrote:

> I also stumbled upon a problem: I can't really pass additional
> configuration to a filesystem, e.g.
>
>     lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
>                                                aws_config=AWSConfig())
>
> because the ReadFromText class relies on PTransform's constructor, which
> has a pre-defined set of arguments.
>
> This is probably becoming a cross-topic for the dev list (have I added it
> in the right way?)
>
> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <[email protected]>
> wrote:
>
>> Hi folks,
>>
>> I'm working on an S3 filesystem for the Python SDK, which already works
>> for the happy path for both reading and writing, but I feel like there
>> are quite a few edge cases that I'm likely missing.
>>
>> So far, my approach has been: "look at the generic FileSystem
>> implementation, look at how gcsio.py and gcsfilesystem.py are written,
>> and try to copy their approach as much as possible, at least to get to a
>> proof of concept".
>>
>> That said, I'd like to know a few things:
>>
>> 1. Are there any official or unofficial guidelines or docs on writing
>> filesystems? Even Java-specific ones would be really useful.
>>
>> 2. Are there any existing generic test suites that every filesystem is
>> supposed to pass? Again, even if they exist only in the Java world, I'd
>> still be up for trying to adopt them in the Python SDK too.
>>
>> 3. Are there any established ideas on how to pass AWS credentials to
>> Beam to make the S3 filesystem actually work? I currently rely on the
>> existing environment variables, which boto just picks up, but it sounds
>> like setting them up in runners like Dataflow or Spark would be
>> troublesome. I've seen this discussed a couple of times on the list, but
>> couldn't tell whether any closure was reached. My personal preference
>> would be to have AWS settings passed in some global context (pipeline
>> options, perhaps?), but there may be exceptions to that (say, people
>> wanting to use different credentials for different AWS operations).
>>
>> Thanks!
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
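
To sketch the approach Cham describes, a filesystem implementation could read its AWS settings from environment variables on each worker instead of taking them as transform arguments. The snippet below is only an illustration under that assumption: AWSConfig and get_aws_config are hypothetical names, not part of the Beam API, and the variables used are the standard ones boto already picks up on its own.

    import os

    class AWSConfig(object):
        """Hypothetical container for the settings an S3 filesystem needs."""

        def __init__(self, access_key_id, secret_access_key, region):
            self.access_key_id = access_key_id
            self.secret_access_key = secret_access_key
            self.region = region


    def get_aws_config():
        """Builds an AWSConfig from environment variables set on each worker."""
        return AWSConfig(
            access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
            secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
            region=os.environ.get('AWS_DEFAULT_REGION', 'us-east-1'))

With something along these lines, the filesystem can build its S3 client lazily at first use on the worker, and ReadFromText('s3://...') needs no extra constructor arguments.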

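For the Dataflow case Cham mentions, one possible (and unofficial) way to get such configuration onto workers is a setup.py with custom commands, shipped with the --setup_file pipeline option, in the spirit of Beam's juliaset example. The sketch below only writes a placeholder boto config file while the package is installed on each worker; the package name and the commands are illustrative, and fetching real credentials securely is deliberately left out.

    # setup.py for a hypothetical package shipped to workers via --setup_file.
    import subprocess

    import setuptools
    from distutils.command.build import build as _build


    # Commands run on each worker while the package is being installed.
    # Writing a placeholder ~/.boto file is purely illustrative.
    CUSTOM_COMMANDS = [
        ['bash', '-c', 'echo "[Credentials]" > ~/.boto'],
    ]


    class build(_build):
        """Standard build step, extended to also run the custom commands."""
        sub_commands = _build.sub_commands + [('CustomCommands', None)]


    class CustomCommands(setuptools.Command):
        """Runs CUSTOM_COMMANDS during installation on the worker."""

        user_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            for command in CUSTOM_COMMANDS:
                subprocess.check_call(command)


    setuptools.setup(
        name='my-pipeline-deps',  # hypothetical package name
        version='0.0.1',
        packages=setuptools.find_packages(),
        cmdclass={'build': build, 'CustomCommands': CustomCommands})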