Re: Docs/guidelines on writing filesystem sources and sinks

Stephen Sisk Thu, 06 Jul 2017 17:22:34 -0700

Hi Dmitry,

I'm excited to hear that you'd like to do this work. If you haven't
already, I'd first suggest that you open a JIRA issue to make sure other
folks know you're working on this.


I was involved in the recent Java HDFS file system implementation, so
I'll try to share what I know. I suspect knowledge about this is
scattered around a bit, so hopefully others will chime in as well.

> 1. Are there any official or non-official guidelines or docs on writing
> filesystems? Even Java-specific ones may be really useful.
I don't know of any guides for writing IOs. Folks here on the mailing list
should be able to help with specific questions, but there aren't many
people who are experts in file system implementations. It's not expected
to be a frequent task, so no one has tried to document it (which also
means your contribution will have a wide impact!). If you wanted to write
up your notes from the process, they would likely be very helpful to others.

https://issues.apache.org/jira/browse/BEAM-2005 documents the work that we
did to add the Java Hadoop FileSystem implementation, so that might be a
good guide: it has links to PRs, and you can read about the design
questions that came up there. The Hadoop FileSystem is relatively new, so
reviewing its commit history may be very informative.

> 2. Are there any existing generic test suites that every filesystem is
> supposed to pass? Again, even if they exist only in Java world, I'd still
> be down for trying to adopt them in Python SDK too.

I don't know of any. If you put together a test plan, we'd be happy to
discuss it. The tests for the Java Hadoop FileSystem represent the current
thinking, but they could likely be expanded.

> 3. Are there any established ideas of how to pass AWS credentials to Beam
> for making the S3 filesystem actually work?

It looks like you already found the past discussions of this on the mailing
list; those are what I would have referred you to.

> I also stumbled upon a problem that I can't really pass additional
> configuration to a filesystem,
We had a similar problem with the Hadoop configuration object. Inside the
Hadoop filesystem registrar, we read the pipeline options to see if there
is configuration info there, and we also check some default Hadoop
configuration file locations. See
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45

The Python folks will have to comment on whether that's the type of
solution they want you to use, though.
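
To make the idea a bit more concrete, here is a rough Python sketch of that
pattern (explicit pipeline-level options first, then default locations),
using only the standard library. This is not Beam's actual API: the flag
names, the helper functions, and the S3FileSystemSketch class are all
hypothetical and only there for illustration. The only real names are the
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables that boto
already picks up.

```python
import argparse
import os


def parse_pipeline_options(argv):
    """Parse global, pipeline-level flags (flag names are hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--s3_access_key_id", default=None)
    parser.add_argument("--s3_secret_access_key", default=None)
    # Flags meant for other parts of the pipeline are ignored here.
    options, _unknown = parser.parse_known_args(argv)
    return options


def resolve_s3_config(options, environ=None):
    """Prefer explicit pipeline options, then fall back to the environment."""
    environ = os.environ if environ is None else environ
    return {
        "access_key_id": options.s3_access_key_id
        or environ.get("AWS_ACCESS_KEY_ID"),
        "secret_access_key": options.s3_secret_access_key
        or environ.get("AWS_SECRET_ACCESS_KEY"),
    }


class S3FileSystemSketch:
    """Hypothetical filesystem configured once from pipeline-level options,
    so individual transforms need no extra constructor arguments."""

    def __init__(self, options, environ=None):
        self.config = resolve_s3_config(options, environ)
```

With something like this, a call such as ReadFromText('s3://...') would not
need an aws_config argument; the filesystem would look up its settings from
the options when it is created. Per-operation credentials would still need a
separate mechanism.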

I hope this helps!

Stephen


On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk <[email protected]>
wrote:

> I also stumbled upon a problem that I can't really pass additional
> configuration to a filesystem, e.g.
>
> lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
> aws_config=AWSConfig())
>
> because the ReadFromText class relies on PTransform's constructor, which
> has a pre-defined set of arguments.
>
> This is probably becoming a cross-topic for the dev list (have I added it
> in the right way?)
>
> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <[email protected]>
> wrote:
>
>> Hi folks,
>>
>> I'm working on an S3 filesystem for the Python SDK, which already works
>> in case of a happy path for both reading and writing, but I feel like there
>> are quite a few edge cases that I'm likely missing.
>>
>> So far, my approach has been: "look at the generic FileSystem
>> implementation, look at how gcsio.py and gcsfilesystem.py are written, try
>> to copy their approach as much as possible, at least for getting to the
>> proof of concept".
>>
>> That said, I'd like to know a few things:
>>
>> 1. Are there any official or non-official guidelines or docs on writing
>> filesystems? Even Java-specific ones may be really useful.
>>
>> 2. Are there any existing generic test suites that every filesystem is
>> supposed to pass? Again, even if they exist only in Java world, I'd still
>> be down for trying to adopt them in Python SDK too.
>>
>> 3. Are there any established ideas of how to pass AWS credentials to Beam
>> for making the S3 filesystem actually work? I currently rely on the
>> existing environment variables, which boto just picks up, but sounds like
>> setting them up in runners like Dataflow or Spark would be troublesome.
>> I've seen this discussion a couple times in the list, but couldn't tell if
>> any closure was found. My personal preference would be having AWS settings
>> passed in some global context (pipeline options, perhaps?), but there may
>> be exceptions to that (say, people want to use different credentials for
>> different AWS operations).
>>
>> Thanks!
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>
