Re: Docs/guidelines on writing filesystem sources and sinks

2017-07-06 Thread Stephen Sisk
Hi Dmitry,

I'm excited to hear that you'd like to do this work. If you haven't
already, I'd first suggest that you open a JIRA issue to make sure other
folks know you're working on this.

I was involved in working on the recent Java HDFS file system
implementation, so I'll try to share what I know. I suspect knowledge
about this is scattered around a bit, so hopefully others will chime in as
well.

> 1. Are there any official or unofficial guidelines or docs on writing
filesystems? Even Java-specific ones may be really useful.
I don't know of any guides for writing IOs. Folks here on the mailing list
should be helpful with specific questions, but there aren't many experts in
file system implementations. Writing one isn't expected to be a frequent
task, so no one has tried to document it (which also means your contribution
will have a wide impact!). If you wanted to write up your notes from the
process, they would likely be very helpful to others.

https://issues.apache.org/jira/browse/BEAM-2005 documents the work we did
to add the Java Hadoop FileSystem implementation, so it might be a good
guide - it has links to the PRs, and you can find out about the design
questions that came up there. The Hadoop FileSystem support is relatively
new, so reviewing its commit history may be very informative.

> 2. Are there any existing generic test suites that every filesystem is
supposed to pass? Again, even if they exist only in the Java world, I'd still
be down for trying to adopt them in the Python SDK too.

I don't know of any. If you put together a test plan, we'd be happy to
discuss it. The tests for the Java Hadoop FileSystem represent the current
thinking, but they could likely be expanded on.
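
For what it's worth, a generic happy-path round trip through the FileSystems
API might look roughly like this in Python (the bucket and path are
hypothetical, and a real suite would also cover match/copy/rename/delete and
error cases):

import unittest

from apache_beam.io.filesystems import FileSystems

class S3FileSystemRoundTripTest(unittest.TestCase):

    def test_write_then_read(self):
        # Hypothetical location; a real test should create and clean up
        # its own temporary path.
        path = 's3://my-test-bucket/round_trip.txt'
        with FileSystems.create(path) as f:
            f.write(b'hello beam')
        with FileSystems.open(path) as f:
            self.assertEqual(f.read(), b'hello beam')
        FileSystems.delete([path])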

> 3. Are there any established ideas on how to pass AWS credentials to Beam
for making the S3 filesystem actually work?

It looks like you already found the past discussions of this on the mailing
list; those are what I would refer you to.

> I also stumbled upon a problem: I can't really pass additional
configuration to a filesystem,
We had a similar problem with the Hadoop configuration object - inside the
Hadoop filesystem registrar, we read the pipeline options to see if there is
configuration info there, as well as some default Hadoop configuration file
locations. See
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L45

The Python folks will have to comment on whether that's the type of solution
they want you to use, though.
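
For illustration, a rough Python analogue of that pattern could look like the
sketch below; S3Options and its flags are hypothetical, purely to show the
shape:

from apache_beam.options.pipeline_options import PipelineOptions

class S3Options(PipelineOptions):
    """Hypothetical options class, mirroring HadoopFileSystemOptions."""

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument(
            '--s3_access_key_id', default=None,
            help='AWS access key id; if unset, fall back to the environment.')
        parser.add_argument(
            '--s3_secret_access_key', default=None,
            help='AWS secret access key.')

# An S3 filesystem could then read pipeline_options.view_as(S3Options) when
# it is constructed, the way the HDFS registrar reads its options.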

I hope this helps!

Stephen


On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk 
wrote:

> I also stumbled upon a problem: I can't really pass additional
> configuration to a filesystem, e.g.
>
> lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
> aws_config=AWSConfig())
>
> because the ReadFromText class relies on PTransform's constructor, which
> has a pre-defined set of arguments.
>
> This is probably becoming a cross-list topic for the dev list (have I added
> it in the right way?)
>
> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk 
> wrote:
>
>> Hi folks,
>>
>> I'm working on an S3 filesystem for the Python SDK, which already works
>> for the happy path of both reading and writing, but I feel like there
>> are quite a few edge cases that I'm likely missing.
>>
>> So far, my approach has been: "look at the generic FileSystem
>> implementation, look at how gcsio.py and gcsfilesystem.py are written, and
>> try to copy their approach as much as possible, at least to get to a
>> proof of concept".
>>
>> That said, I'd like to know a few things:
>>
>> 1. Are there any official or unofficial guidelines or docs on writing
>> filesystems? Even Java-specific ones may be really useful.
>>
>> 2. Are there any existing generic test suites that every filesystem is
>> supposed to pass? Again, even if they exist only in the Java world, I'd
>> still be down for trying to adopt them in the Python SDK too.
>>
>> 3. Are there any established ideas on how to pass AWS credentials to Beam
>> for making the S3 filesystem actually work? I currently rely on the
>> existing environment variables, which boto just picks up, but it sounds
>> like setting them up in runners like Dataflow or Spark would be
>> troublesome. I've seen this discussion a couple of times on the list, but
>> couldn't tell whether any conclusion was reached. My personal preference
>> would be to have AWS settings passed in some global context (pipeline
>> options, perhaps?), but there may be exceptions to that (say, people
>> wanting to use different credentials for different AWS operations).
>>
>> Thanks!
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>


Re: Docs/guidelines on writing filesystem sources and sinks

2017-07-06 Thread Chamikara Jayalath
Currently we don't have official documentation or a testing guide for
adding new FileSystems. The best sources here would be the existing
FileSystem implementations, as you mentioned.

I don't think parameters for initializing FileSystems should be passed when
creating a read transform. Can you try to get any config parameters from
the environment instead? Note that for distributed runners, you will have
to register environment variables on the workers in a runner-specific way
(for example, for the Dataflow runner, this could be through an additional
package that gets installed on the workers). I think +Sourabh Bajaj
was looking into providing a better solution for this.
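
As a minimal sketch of that environment-based approach (assuming the standard
variable names that boto already honors):

import os

def aws_credentials_from_env():
    # Standard variables that boto/boto3 already pick up. On distributed
    # runners these must be present on each worker, not just the machine
    # that launches the pipeline.
    return {
        'aws_access_key_id': os.environ.get('AWS_ACCESS_KEY_ID'),
        'aws_secret_access_key': os.environ.get('AWS_SECRET_ACCESS_KEY'),
    }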

- Cham

On Thu, Jul 6, 2017 at 4:42 PM Dmitry Demeshchuk 
wrote:

> I also stumbled upon a problem: I can't really pass additional
> configuration to a filesystem, e.g.
>
> lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
> aws_config=AWSConfig())
>
> because the ReadFromText class relies on PTransform's constructor, which
> has a pre-defined set of arguments.
>
> This is probably becoming a cross-list topic for the dev list (have I added
> it in the right way?)
>
> On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk 
> wrote:
>
>> Hi folks,
>>
>> I'm working on an S3 filesystem for the Python SDK, which already works
>> for the happy path of both reading and writing, but I feel like there
>> are quite a few edge cases that I'm likely missing.
>>
>> So far, my approach has been: "look at the generic FileSystem
>> implementation, look at how gcsio.py and gcsfilesystem.py are written, and
>> try to copy their approach as much as possible, at least to get to a
>> proof of concept".
>>
>> That said, I'd like to know a few things:
>>
>> 1. Are there any official or unofficial guidelines or docs on writing
>> filesystems? Even Java-specific ones may be really useful.
>>
>> 2. Are there any existing generic test suites that every filesystem is
>> supposed to pass? Again, even if they exist only in the Java world, I'd
>> still be down for trying to adopt them in the Python SDK too.
>>
>> 3. Are there any established ideas on how to pass AWS credentials to Beam
>> for making the S3 filesystem actually work? I currently rely on the
>> existing environment variables, which boto just picks up, but it sounds
>> like setting them up in runners like Dataflow or Spark would be
>> troublesome. I've seen this discussion a couple of times on the list, but
>> couldn't tell whether any conclusion was reached. My personal preference
>> would be to have AWS settings passed in some global context (pipeline
>> options, perhaps?), but there may be exceptions to that (say, people
>> wanting to use different credentials for different AWS operations).
>>
>> Thanks!
>>
>> --
>> Best regards,
>> Dmitry Demeshchuk.
>>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>


Re: Docs/guidelines on writing filesystem sources and sinks

2017-07-06 Thread Dmitry Demeshchuk
I also stumbled upon a problem: I can't really pass additional
configuration to a filesystem, e.g.

lines = pipeline | 'read' >> ReadFromText('s3://my-bucket/kinglear.txt',
aws_config=AWSConfig())

because the ReadFromText class relies on PTransform's constructor, which
has a pre-defined set of arguments.

This is probably becoming a cross-list topic for the dev list (have I added
it in the right way?)

On Thu, Jul 6, 2017 at 1:27 PM, Dmitry Demeshchuk 
wrote:

> Hi folks,
>
> I'm working on an S3 filesystem for the Python SDK, which already works
> for the happy path of both reading and writing, but I feel like there
> are quite a few edge cases that I'm likely missing.
>
> So far, my approach has been: "look at the generic FileSystem
> implementation, look at how gcsio.py and gcsfilesystem.py are written, and
> try to copy their approach as much as possible, at least to get to a
> proof of concept".
>
> That said, I'd like to know a few things:
>
> 1. Are there any official or unofficial guidelines or docs on writing
> filesystems? Even Java-specific ones may be really useful.
>
> 2. Are there any existing generic test suites that every filesystem is
> supposed to pass? Again, even if they exist only in the Java world, I'd
> still be down for trying to adopt them in the Python SDK too.
>
> 3. Are there any established ideas on how to pass AWS credentials to Beam
> for making the S3 filesystem actually work? I currently rely on the
> existing environment variables, which boto just picks up, but it sounds
> like setting them up in runners like Dataflow or Spark would be
> troublesome. I've seen this discussion a couple of times on the list, but
> couldn't tell whether any conclusion was reached. My personal preference
> would be to have AWS settings passed in some global context (pipeline
> options, perhaps?), but there may be exceptions to that (say, people
> wanting to use different credentials for different AWS operations).
>
> Thanks!
>
> --
> Best regards,
> Dmitry Demeshchuk.
>



-- 
Best regards,
Dmitry Demeshchuk.


Docs/guidelines on writing filesystem sources and sinks

2017-07-06 Thread Dmitry Demeshchuk
Hi folks,

I'm working on an S3 filesystem for the Python SDK, which already works for
the happy path of both reading and writing, but I feel like there are quite
a few edge cases that I'm likely missing.

So far, my approach has been: "look at the generic FileSystem
implementation, look at how gcsio.py and gcsfilesystem.py are written, and
try to copy their approach as much as possible, at least to get to a proof
of concept".

That said, I'd like to know a few things:

1. Are there any official or unofficial guidelines or docs on writing
filesystems? Even Java-specific ones may be really useful.

2. Are there any existing generic test suites that every filesystem is
supposed to pass? Again, even if they exist only in the Java world, I'd
still be down for trying to adopt them in the Python SDK too.

3. Are there any established ideas on how to pass AWS credentials to Beam
for making the S3 filesystem actually work? I currently rely on the
existing environment variables, which boto just picks up, but it sounds like
setting them up in runners like Dataflow or Spark would be troublesome.
I've seen this discussion a couple of times on the list, but couldn't tell
whether any conclusion was reached. My personal preference would be to have
AWS settings passed in some global context (pipeline options, perhaps?), but
there may be exceptions to that (say, people wanting to use different
credentials for different AWS operations).
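
(For concreteness, the environment-variable path I mean is roughly the
following, shown with boto3, though plain boto resolves credentials
similarly:)

import boto3

# boto3 resolves credentials from AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY,
# then ~/.aws/credentials, then instance profile metadata.
session = boto3.Session()
credentials = session.get_credentials()  # None if nothing could be resolved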

Thanks!

-- 
Best regards,
Dmitry Demeshchuk.


Re: Providing HTTP client to DoFn

2017-07-06 Thread Lukasz Cwik
#1: For all runners, the side input needs to be ready (data needs to exist
for the given window) before the main input is executed, which means that in
your case the whole side input will be materialized before the main input
runs.

#2: For Dataflow, a map/multimap-based side input is loaded lazily, in
parts, based upon which key is being accessed. Each segment of the map is
cached in memory (using an LRU policy), and loading the data remotely is the
largest cost in such a system. Depending on how large your main input is,
performing a group-by-key on your access key will speed up your lookups
(because you'll get a lot more cache hits), but you have to weigh the cost
of doing the GBK against the speedup in side input usage.

What do you mean by "expanding the tuples to the expanded data"?
* Are you trying to say that you'll typically look up the same value 100+
times from the side input?
** In this case, performing a GBK on your lookup key may be of benefit (see
the sketch below)
* Are you trying to say that you could store the data itself within the side
input instead of just the index, but it would be 100 times larger?
** A map-based side input whose values are 4 bytes vs. 400 bytes isn't going
to change much in lookup cost
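
To make that tradeoff concrete, here is a minimal Python SDK sketch of the
lookup-once-per-key idea (the thread above concerns the Java SDK, and all
names here are placeholders):

import apache_beam as beam

def lookup_with_gbk(keys, index):
    # keys:  main input, a PCollection of lookup keys with many repeats.
    # index: a PCollection of (key, value) pairs used as a map side input.
    index_map = beam.pvalue.AsDict(index)
    # Grouping by key first means each distinct key hits the side input
    # only once, trading one GBK for far fewer cache-missing lookups.
    return (keys
            | 'KeyByLookup' >> beam.Map(lambda k: (k, None))
            | 'GroupByLookupKey' >> beam.GroupByKey()
            | 'LookupOncePerKey' >> beam.Map(
                lambda kv, m: (kv[0], m.get(kv[0])), index_map))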



On Wed, Jul 5, 2017 at 6:22 PM, Randal Moore  wrote:

> Based on my understanding so far, I'm targeting Dataflow with a batch
> pipeline. Just starting to experiment with the setup/teardown with the
> local runner - that might work fine.
>
> Somewhat intrigued with the side inputs, though.  The pipeline might
> iterate over 1,000,000 tuples of two integers.  The integers are indices
> into a database of data. A given integer will be repeated in the inputs
> many times.  Am I prematurely optimizing by ruling out expanding the tuples
> into the full data, given that each value might be expanded 100 or more
> times? As side inputs, it might expand to ~100GB.  Expanding the input
> would be significantly bigger.
>
> #1 how does Dataflow schedule the pipeline with a map side input - does it
> wait until the whole map is collected?
> #2 can the DoFn specify that it depends on only specific keys of the side
> input map?  does that affect the scheduling of the DoFn?
>
> Thanks for any pointers...
> rdm
>
> On Wed, Jul 5, 2017 at 4:58 PM Lukasz Cwik  wrote:
>
>> That should have said:
>> ~100s MiBs per window in streaming pipelines
>>
>> On Wed, Jul 5, 2017 at 2:58 PM, Lukasz Cwik  wrote:
>>
>>> #1, supported side input sizes and performance are specific to a
>>> runner. For example, I know that Dataflow supports side inputs which are
>>> 1+ TiB (aggregate) in batch pipelines and ~100s MiBs per window, because
>>> there have been several one-off benchmarks/runs. What kinds of sizes/use
>>> cases do you want to support? Some runners will do a much better job with
>>> really small side inputs, while others will be better with really large ones.
>>>
>>> #2, this depends on which library you're using to perform the REST calls
>>> and whether it is thread-safe. DoFns can be shared across multiple
>>> bundles and can contain methods marked with @Setup/@Teardown, which only
>>> get invoked once per DoFn instance (which is relatively infrequent), so
>>> you could store an instance per DoFn instead of a singleton if the REST
>>> library is not thread-safe (see the sketch at the end of this message).
>>>
>>> On Wed, Jul 5, 2017 at 2:45 PM, Randal Moore 
>>> wrote:
>>>
 I have a step in my beam pipeline that needs some data from a rest
 service. The data acquired from the rest service is dependent on the
 context of the data being processed and relatively large. The rest client I
 am using isn't serializable - nor is it likely possible to make it so
 (background threads, etc.).

 #1 What are the practical limits to the size of side inputs (e.g., I
 could try to gather all the data from the rest service and provide it as a
 side-input)?

 #2 Assuming that using the rest client is the better option, would a
 singleton instance be safe way to instantiate the rest client?

 Thanks,
 rdm

>>>
>>>
>>
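
A minimal sketch of the per-DoFn-instance client pattern discussed in this
thread, written for the Python SDK (assuming an SDK version with
DoFn.setup/teardown; the Java equivalent is @Setup/@Teardown, and
my_rest_client is a placeholder for a non-serializable REST library):

import apache_beam as beam

class CallRestService(beam.DoFn):

    def setup(self):
        # Runs once per DoFn instance on the worker, so the client itself
        # never needs to be pickled.
        import my_rest_client  # placeholder library
        self._client = my_rest_client.Client()

    def teardown(self):
        self._client.close()

    def process(self, element):
        yield self._client.lookup(element)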


Re: Slack Invite

2017-07-06 Thread Durairaj, Saravanakumar
Thank you. I have joined the channel.

From: Manu Zhang 
Reply-To: "user@beam.apache.org" 
Date: Thursday, July 6, 2017 at 9:47 AM
To: "user@beam.apache.org" 
Subject: Re: Slack Invite

Hi Saravana,

You've been invited.

On Thu, Jul 6, 2017 at 9:43 PM Durairaj, Saravanakumar
wrote:
Hi,

I would like to join the slack channel.

Thank you,
Saravana





Re: Slack Invite

2017-07-06 Thread Manu Zhang
Hi Saravana,

You've been invited.

On Thu, Jul 6, 2017 at 9:43 PM Durairaj, Saravanakumar <
saravanakumar_durai...@homedepot.com> wrote:

> Hi,
>
>
>
> I would like to join the slack channel.
>
>
>
> Thank you,
>
> Saravana
>


Slack Invite

2017-07-06 Thread Durairaj, Saravanakumar
Hi,

I would like to join the slack channel.

Thank you,
Saravana





Re: Join slack

2017-07-06 Thread Dennis Mårtensson
Hi JB,

Thank you! I have accepted and am in.

Best

Dennis

On Thu, 6 Jul 2017 at 08:31 Jean-Baptiste Onofré  wrote:

> You should have received an invite.
>
> Welcome !
>
> Regards
> JB
>
> On 07/06/2017 08:29 AM, Dennis Mårtensson wrote:
> > Hi,
> >
> > I would like to join the slack channel.
> >
> > Thank you
> >
> > Dennis
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Join slack

2017-07-06 Thread Jean-Baptiste Onofré

You should have received an invite.

Welcome !

Regards
JB

On 07/06/2017 08:29 AM, Dennis Mårtensson wrote:

Hi,

I would like to join the slack channel.

Thank you

Dennis


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Join slack

2017-07-06 Thread Dennis Mårtensson
Hi,

I would like to join the slack channel.

Thank you

Dennis