Currently we don't have official documentation or a testing guide for adding new FileSystems. The best source here would be the existing FileSystem implementations, as you mentioned.
I don't think parameters for initiating FileSystems should be passed when creating a read transform. Can you try to get any config parameters from the environment instead? Note that for distributed runners, you will have to register environment variables on workers in a runner-specific way (for example, for the Dataflow runner, this could be done through an additional package that gets installed on workers). I think +Sourabh Bajaj <[email protected]> was looking into providing a better solution for this.

- Cham

On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk <[email protected]> wrote:

> I also stumbled upon a problem: I can't really pass additional
> configuration to a filesystem, e.g.
>
>     lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
>                                                aws_config=AWSConfig())
>
> because the ReadFromText class relies on PTransform's constructor, which
> has a pre-defined set of arguments.
>
> This is probably becoming a cross-topic for the dev list (have I added it
> in the right way?)
>
> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk <[email protected]>
> wrote:
>
>> Hi folks,
>>
>> I'm working on an S3 filesystem for the Python SDK, which already works
>> for the happy path for both reading and writing, but I feel like there
>> are quite a few edge cases that I'm likely missing.
>>
>> So far, my approach has been: "look at the generic FileSystem
>> implementation, look at how gcsio.py and gcsfilesystem.py are written,
>> and try to copy their approach as much as possible, at least to get to a
>> proof of concept".
>>
>> That said, I'd like to know a few things:
>>
>> 1. Are there any official or unofficial guidelines or docs on writing
>> filesystems? Even Java-specific ones would be really useful.
>>
>> 2. Are there any existing generic test suites that every filesystem is
>> supposed to pass? Again, even if they exist only in the Java world, I'd
>> still be up for trying to adopt them in the Python SDK too.
>>
>> 3. Are there any established ideas on how to pass AWS credentials to
>> Beam to make the S3 filesystem actually work? I currently rely on the
>> existing environment variables, which boto just picks up, but it sounds
>> like setting them up in runners like Dataflow or Spark would be
>> troublesome. I've seen this discussed a couple of times on the list, but
>> couldn't tell whether any closure was reached. My personal preference
>> would be to have AWS settings passed in some global context (pipeline
>> options, perhaps?), but there may be exceptions to that (say, people
>> wanting to use different credentials for different AWS operations).
>>
>> Thanks!
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
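
To sketch the approach Cham describes, a filesystem implementation could read its AWS settings from environment variables on each worker instead of taking them as transform arguments. The snippet below is only an illustration under that assumption: AWSConfig and get_aws_config are hypothetical names, not part of the Beam API, and the variables used are the standard ones boto already picks up on its own.

    import os

    class AWSConfig(object):
        """Hypothetical container for the settings an S3 filesystem needs."""

        def __init__(self, access_key_id, secret_access_key, region):
            self.access_key_id = access_key_id
            self.secret_access_key = secret_access_key
            self.region = region


    def get_aws_config():
        """Builds an AWSConfig from environment variables set on each worker."""
        return AWSConfig(
            access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
            secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
            region=os.environ.get('AWS_DEFAULT_REGION', 'us-east-1'))

With something along these lines, the filesystem can build its S3 client lazily at first use on the worker, and ReadFromText('s3://...') needs no extra constructor arguments.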

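For the Dataflow case Cham mentions, one possible (and unofficial) way to get such configuration onto workers is a setup.py with custom commands, shipped with the --setup_file pipeline option, in the spirit of Beam's juliaset example. The sketch below only writes a placeholder boto config file while the package is installed on each worker; the package name and the commands are illustrative, and fetching real credentials securely is deliberately left out.

    # setup.py for a hypothetical package shipped to workers via --setup_file.
    import subprocess

    import setuptools
    from distutils.command.build import build as _build


    # Commands run on each worker while the package is being installed.
    # Writing a placeholder ~/.boto file is purely illustrative.
    CUSTOM_COMMANDS = [
        ['bash', '-c', 'echo "[Credentials]" > ~/.boto'],
    ]


    class build(_build):
        """Standard build step, extended to also run the custom commands."""
        sub_commands = _build.sub_commands + [('CustomCommands', None)]


    class CustomCommands(setuptools.Command):
        """Runs CUSTOM_COMMANDS during installation on the worker."""

        user_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            for command in CUSTOM_COMMANDS:
                subprocess.check_call(command)


    setuptools.setup(
        name='my-pipeline-deps',  # hypothetical package name
        version='0.0.1',
        packages=setuptools.find_packages(),
        cmdclass={'build': build, 'CustomCommands': CustomCommands})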