Re: How to use "PortableRunner" in Python SDK?

2018-11-14 Thread Maximilian Michels

Hi Ruoyun,

I just ran the wordcount locally using the instructions on the page. 
I've tried the local file system and GCS. Both times it ran successfully 
and produced valid output.


I'm assuming there is some problem with your setup. Which platform are 
you using? I'm on MacOS.


Could you expand on the planned merge? From my understanding we will 
always need PortableRunner in Python to be able to submit against the 
Beam JobServer.
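For reference, a minimal sketch of assembling the flags such a submission
needs (the helper name is hypothetical; the flag set mirrors the wordcount
invocation discussed in this thread):

```python
def portable_runner_args(job_endpoint, input_path, output_path,
                         parallelism=1, streaming=True):
    """Assemble pipeline flags for submitting against a Beam JobServer.

    Hypothetical helper for illustration only; the flags mirror the
    wordcount command shared elsewhere in this thread.
    """
    args = [
        "--runner=PortableRunner",
        f"--job_endpoint={job_endpoint}",
        f"--input={input_path}",
        f"--output={output_path}",
        f"--parallelism={parallelism}",
    ]
    if streaming:
        args.append("--streaming")
    return args

args = portable_runner_args("localhost:8099", "/etc/profile",
                            "/tmp/py-wordcount-direct")
```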


Thanks,
Max

On 14.11.18 00:39, Ruoyun Huang wrote:

A quick follow-up on using current PortableRunner.

I followed the exact three steps as Ankur and Maximilian shared in 
https://beam.apache.org/roadmap/portability/#python-on-flink. The 
wordcount example keeps hanging after 10 minutes.  I also tried 
specifying explicit input/output args, either using gcs folder or local 
file system, but none of them works.


Spent some time looking into it, but no conclusion yet.  At this point 
though, I guess it does not matter much any more, given we already have 
the plan of merging PortableRunner into using java reference runner 
(i.e. :beam-runners-reference-job-server).


Still appreciated if someone can try out the python-on-flink 
instructions 
in case it is just due to my local machine setup.  Thanks!




On Thu, Nov 8, 2018 at 5:04 PM Ruoyun Huang wrote:


Thanks Maximilian!

I am working on migrating existing PortableRunner to using java ULR
(Link to Notes

).
If this issue is non-trivial to solve, I would vote for removing
this default behavior as part of the consolidation.

On Thu, Nov 8, 2018 at 2:58 AM Maximilian Michels wrote:

In the long run, we should get rid of the Docker-inside-Docker
approach,
which was only intended for testing anyways. It would be cleaner to
start the SDK harness container alongside with JobServer container.

Short term, I think it should be easy to either fix the
permissions of
the mounted "docker" executable or use a Docker image for the
JobServer
which comes with Docker pre-installed.

JIRA: https://issues.apache.org/jira/browse/BEAM-6020

Thanks for reporting this Ruoyun!

-Max

On 08.11.18 00:10, Ruoyun Huang wrote:
 > Thanks Ankur and Maximilian.
 >
 > Just for reference in case other people encountering the same
error
 > message, the "permission denied" error in my original email
is exactly
 > due to the docker-inside-docker issue that Ankur mentioned. 
Thanks Ankur!

 > Didn't make the link when you said it, had to discover that
in a hard
 > way (I thought it is due to my docker installation messed up).
 >
 > On Tue, Nov 6, 2018 at 1:53 AM Maximilian Michels
mailto:m...@apache.org>
 > >> wrote:
 >
 >     Hi,
 >
 >     Please follow
 > https://beam.apache.org/roadmap/portability/#python-on-flink
 >
 >     Cheers,
 >     Max
 >
 >     On 06.11.18 01:14, Ankur Goenka wrote:
 >      > Hi,
 >      >
 >      > The Portable Runner requires a job server uri to work
with. The
 >     current
 >      > default job server docker image is broken because of
docker inside
 >      > docker issue.
 >      >
 >      > Please refer to
 >      >
https://beam.apache.org/roadmap/portability/#python-on-flink for
 >     how to
 >      > run a wordcount using Portable Flink Runner.
 >      >
 >      > Thanks,
 >      > Ankur
 >      >
 >      > On Mon, Nov 5, 2018 at 3:41 PM Ruoyun Huang
mailto:ruo...@google.com>
 >     >
 >      > 
      >
 >      >     Hi, Folks,
 >      >
 >      >           I want to try out Python PortableRunner, by
using following
 >      >     command:
 >      >
 >      >     *sdk/python: python -m apache_beam.examples.wordcount
 >      >       --output=/tmp/test_output   --runner PortableRunner*
 >      >
 >      >           It complains with following error message:
 >      >
 >      >     Caused by: java.lang.Exception: The user defined
'open()' method
 >      >     caused an exception: java.io.IOException: Cannot
run program
 >      >     "docker": error=13, Permission denied
  

Re: contributor permission for Beam Jira tickets

2018-11-14 Thread Craig Chambers
Resending.

On Wed, Nov 14, 2018 at 1:06 PM Craig Chambers  wrote:

> Hi, I'm working on the Dataflow runner and the portable APIs.  I'd like to
> be granted contributor permission for Beam JIRA tickets.  My user id is
> CraigChambersG.  Thanks.
>
> -- Craig
>
>


Re: contributor permission for Beam Jira tickets

2018-11-14 Thread Lukasz Cwik
Welcome, I have granted you contributor access to Beam's JIRA project.

On Wed, Nov 14, 2018 at 3:09 PM Craig Chambers  wrote:

> Resending.
>
> On Wed, Nov 14, 2018 at 1:06 PM Craig Chambers 
> wrote:
>
>> Hi, I'm working on the Dataflow runner and the portable APIs.  I'd like
>> to be granted contributor permission for Beam JIRA tickets.  My user id is
>> CraigChambersG.  Thanks.
>>
>> -- Craig
>>
>>


Please do not merge Python PRs

2018-11-14 Thread Udi Meiri
It seems that Gradle is not getting the correct exit status from test runs.
Possible culprit: https://github.com/apache/beam/pull/6903
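As a generic illustration of this failure mode (not Beam's actual build
code): a wrapper that launches tests in a subprocess but does not propagate
the child's exit status makes failing runs look green.

```python
import subprocess
import sys

# A child process that fails, standing in for a failing test run.
bad = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
# The child really did fail; ignoring bad.returncode would hide that.
assert bad.returncode == 1

def run_and_propagate(cmd):
    """Run a command and return its exit code so the caller can surface it."""
    return subprocess.run(cmd).returncode

# A passing child yields 0; the caller should exit with this code.
ok = run_and_propagate([sys.executable, "-c", "raise SystemExit(0)"])
```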




Re: Please do not merge Python PRs

2018-11-14 Thread Udi Meiri
https://github.com/apache/beam/pull/7048 is the rollback PR

On Wed, Nov 14, 2018 at 5:28 PM Ahmet Altay  wrote:

> Thank you Udi. Could you send a rollback PR?
>
> I believe this is https://issues.apache.org/jira/browse/BEAM-6048
>
> On Wed, Nov 14, 2018 at 5:16 PM, Udi Meiri  wrote:
>
>> It seems that Gradle is not getting the correct exit status from test
>> runs.
>> Possible culprit: https://github.com/apache/beam/pull/6903
>>
>
>




Re: Please do not merge Python PRs

2018-11-14 Thread Ahmet Altay
Thank you Udi. Could you send a rollback PR?

I believe this is https://issues.apache.org/jira/browse/BEAM-6048

On Wed, Nov 14, 2018 at 5:16 PM, Udi Meiri  wrote:

> It seems that Gradle is not getting the correct exit status from test runs.
> Possible culprit: https://github.com/apache/beam/pull/6903
>


Re: Please do not merge Python PRs

2018-11-14 Thread Udi Meiri
Recreated locally: https://gradle.com/s/psqgcywnc3h2m

On Wed, Nov 14, 2018 at 5:16 PM Udi Meiri  wrote:

> It seems that Gradle is not getting the correct exit status from test runs.
> Possible culprit: https://github.com/apache/beam/pull/6903
>




Re: Design review for supporting AutoValue Coders and conversions to Row

2018-11-14 Thread Reuven Lax
We already have a framework for ByteBuddy codegen for JavaBean Row
interfaces, which should hopefully be easy to extend to AutoValue (and more
efficient than using reflection). I'm working on adding constructor support
to this right now.

On Wed, Nov 14, 2018 at 12:29 AM Jeff Klukas  wrote:

> Sounds, then, like we need to define a new `AutoValueSchema extends
> SchemaProvider` and users would opt-in to this via the DefaultSchema
> annotation:
>
> @DefaultSchema(AutoValueSchema.class)
> @AutoValue
> public abstract class MyClass ...
>
> Since we already have the JavaBean and JavaField reflection-based schema
> providers to use as a guide, it sounds like it may be best to try to
> implement this using reflection rather than implementing an AutoValue
> extension.
>
> A reflection-based approach here would hinge on being able to discover the
> package-private constructor for the concrete class and read its types.
> Those types would define the schema, and the fromRow implementation would
> call the discovered constructor.
>
> On Mon, Nov 12, 2018 at 10:02 AM Reuven Lax  wrote:
>
>>
>>
>> On Mon, Nov 12, 2018 at 11:38 PM Jeff Klukas  wrote:
>>
>>> Reuven - A SchemaProvider makes sense. It's not clear to me, though,
>>> whether that's more limited than a Coder. Do all values of the schema have
>>> to be simple types, or does Beam SQL support nested schemas?
>>>
>>
>> Nested schemas, collection types (lists and maps), and collections of
>> nested types are all supported.
>>
>>>
>>> Put another way, would a user be able to create an AutoValue class
>>> comprised of simple types and then use that as a field inside another
>>> AutoValue class? I can see how that's possible with Coders, but not clear
>>> whether that's possible with Row schemas.
>>>
>>
>> Yes, this is explicitly supported.
>>
>>>
>>> On Fri, Nov 9, 2018 at 8:22 PM Reuven Lax  wrote:
>>>
 Hi Jeff,

 I would suggest a slightly different approach. Instead of generating a
 coder, write a SchemaProvider that generates a schema for AutoValue. Once
 a PCollection has a schema, a coder is not needed (as Beam knows how to
 encode any type with a schema), and it will work seamlessly with Beam SQL
 (in fact you don't need to write a transform to turn it into a Row if a
 schema is registered).

 We already do this for POJOs and basic JavaBeans. I'm happy to help do
 this for AutoValue.

 Reuven

 On Sat, Nov 10, 2018 at 5:50 AM Jeff Klukas 
 wrote:

> Hi all - I'm looking for some review and commentary on a proposed
> design for providing built-in Coders for AutoValue classes. There's
> existing discussion in BEAM-1891 [0] about using AvroCoder, but that's
> blocked on incompatibility between AutoValue and Avro's reflection
> machinery that doesn't look resolvable.
>
> I wrote up a design document [1] that instead proposes using
> AutoValue's extension API to automatically generate a Coder for each
> AutoValue class that users generate. A similar technique could be used to
> generate conversions to and from Row for use with BeamSql.
>
> I'd appreciate review of the design and thoughts on whether this seems
> feasible to support within the Beam codebase. I may be missing a simpler
> approach.
>
> [0] https://issues.apache.org/jira/browse/BEAM-1891
> [1]
> https://docs.google.com/document/d/1ucoik4WzUDfilqIz3I1AuMHc1J8DE6iv7gaUCDI42BI/edit?usp=sharing
>


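The reflection-based approach discussed in this thread — discover the
concrete class's constructor, read its parameter types to define the schema,
and have fromRow call that constructor — can be illustrated with a small
Python analogue (the Java version would use java.lang.reflect; the class and
helper names here are made up):

```python
import inspect
from typing import get_type_hints

class ConcreteMyClass:
    """Stand-in for the generated concrete AutoValue class."""
    def __init__(self, name: str, count: int):
        self.name = name
        self.count = count

def schema_from_constructor(cls):
    """Derive a (field, type) schema by inspecting the constructor."""
    hints = get_type_hints(cls.__init__)
    sig = inspect.signature(cls.__init__)
    return [(p, hints[p]) for p in sig.parameters if p != "self"]

def from_row(cls, row):
    """Build an instance by calling the discovered constructor."""
    fields = [name for name, _ in schema_from_constructor(cls)]
    return cls(**{name: row[name] for name in fields})

schema = schema_from_constructor(ConcreteMyClass)  # [("name", str), ("count", int)]
obj = from_row(ConcreteMyClass, {"name": "a", "count": 3})
```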

Re: Bigquery streaming TableRow size limit

2018-11-14 Thread Reuven Lax
Generally I would agree, but the consequences here of a mistake are severe.
Not only will the beam pipeline get stuck for 24 hours, _anything_ else in
the user's GCP project that tries to load data into BigQuery will also fail
for the next 24 hours. Given the severity, I think it's best to make the
user opt into this behavior rather than do it magically.

On Wed, Nov 14, 2018 at 4:24 AM Lukasz Cwik  wrote:

> I would rather not have the builder method and run into the quota issue
> than require the builder method and still run into quota issues.
>
> On Mon, Nov 12, 2018 at 5:25 PM Reuven Lax  wrote:
>
>> I'm a bit worried about making this automatic, as it can have unexpected
>> side effects on BigQuery load-job quota. This is a 24-hour quota, so if
>> it's accidentally exceeded all load jobs for the project may be blocked for
>> the next 24 hours. However, if the user opts in (possibly via a builder
>> method), this seems like it could be automatic.
>>
>> Reuven
>>
>> On Tue, Nov 13, 2018 at 7:06 AM Lukasz Cwik  wrote:
>>
>>> Having data ingestion work without needing to worry about how big the
>>> blobs are would be nice if it was automatic for users.
>>>
>>> On Mon, Nov 12, 2018 at 1:03 AM Wout Scheepers <
>>> wout.scheep...@vente-exclusive.com> wrote:
>>>
 Hey all,



 The TableRow size limit is 1mb when streaming into bigquery.

 To prevent data loss, I’m going to implement a TableRow size check and
 add a fan out to do a bigquery load job in case the size is above the 
 limit.

 Of course this load job would be windowed.



 I know it doesn’t make sense to stream data bigger than 1mb, but as
 we’re using pub sub and want to make sure no data loss happens whatsoever,
 I’ll need to implement it.



 Is this functionality any of you would like to see in BigqueryIO
 itself?

 Or do you think my use case is too specific and implementing my
 solution around BigqueryIO will suffice?



 Thanks for your thoughts,

 Wout





>>>
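The guard Wout describes can be sketched as follows. The 1 MB figure is
BigQuery's per-row streaming-insert limit; the function name is made up, and
a production check would measure the encoded insert request rather than
plain JSON:

```python
import json

STREAMING_ROW_LIMIT_BYTES = 1_000_000  # BigQuery's per-row streaming limit (~1 MB)

def exceeds_streaming_limit(row, limit=STREAMING_ROW_LIMIT_BYTES):
    """Approximate a TableRow's size by its JSON encoding and compare it
    against the streaming-insert limit. Rows over the limit would be
    fanned out to a windowed load job instead of being streamed."""
    return len(json.dumps(row).encode("utf-8")) > limit

small = {"id": 1, "payload": "x" * 100}
big = {"id": 2, "payload": "x" * 2_000_000}
# small would be streamed; big would take the load-job path
```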


Re: [Call for items] November Beam Newsletter

2018-11-14 Thread Rose Nguyen
Thank you, thank you! :)

On Wed, Nov 14, 2018 at 10:34 AM Maximilian Michels  wrote:

> Updated the Flink Runner section. Just in time for the deadline :)
>
> On 14.11.18 00:37, Rui Wang wrote:
> > Hi,
> >
> > I just added something related to BeamSQL.
> >
> > -Rui
> >
> > On Tue, Nov 13, 2018 at 3:26 AM Etienne Chauchot  > > wrote:
> >
> > Hi,
> > I just added some things that were done.
> >
> > Etienne
> >
> > Le lundi 12 novembre 2018 à 12:22 +, Matthias Baetens a écrit :
> >> Looks great, thanks for the effort and for including the Summit
> >> blogpost, Rose!
> >>
> >> On Thu, 8 Nov 2018 at 22:55 Rose Nguyen  >> > wrote:
> >>> Hi Beamers:
> >>> Time to sync with the community on all the awesome stuff we've
> >>> been doing!
> >>> *Add the highlights from October to now (or planned events and
> >>> talks) that you want to share by 11/14 11:59 p.m. PDT.*
> >>>
> >>> We will collect the notes via Google docs but send out the final
> >>> version directly to the user mailing list. If you do not know how
> >>> to format something, it is OK to just put down the info and I
> >>> will edit. I'll ship out the newsletter on 11/15.
> >>>
> >>> [1]
> >>>
> https://docs.google.com/document/d/1kKQ4a9RdptB6NwYlqmI9tTcdLAUzDnWi2dkvUi0J_Ww
> >>> --
> >>> Rose Thị Nguyễn
> >> --
> >
>


-- 
Rose Thị Nguyễn


Re: How to use "PortableRunner" in Python SDK?

2018-11-14 Thread Thomas Weise
Works for me on macOS as well.

In case you don't launch the pipeline through Gradle, this would be the
command:

python -m apache_beam.examples.wordcount \
  --input=/etc/profile \
  --output=/tmp/py-wordcount-direct \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --parallelism=1 \
  --streaming \
  --flink_master=localhost:8081  # optional

We talked about adding the wordcount to pre-commit.

Regarding using ULR vs. Flink runner: There seems to be confusion between
PortableRunner using the user supplied endpoint vs. trying to launch a job
server. I commented in the doc.

Thomas



On Wed, Nov 14, 2018 at 3:30 AM Maximilian Michels  wrote:

> Hi Ruoyun,
>
> I just ran the wordcount locally using the instructions on the page.
> I've tried the local file system and GCS. Both times it ran successfully
> and produced valid output.
>
> I'm assuming there is some problem with your setup. Which platform are
> you using? I'm on MacOS.
>
> Could you expand on the planned merge? From my understanding we will
> always need PortableRunner in Python to be able to submit against the
> Beam JobServer.
>
> Thanks,
> Max
>
> On 14.11.18 00:39, Ruoyun Huang wrote:
> > A quick follow-up on using current PortableRunner.
> >
> > I followed the exact three steps as Ankur and Maximilian shared in
> > https://beam.apache.org/roadmap/portability/#python-on-flink. The
> > wordcount example keeps hanging after 10 minutes.  I also tried
> > specifying explicit input/output args, either using gcs folder or local
> > file system, but none of them works.
> >
> > Spent some time looking into it, but no conclusion yet.  At this point
> > though, I guess it does not matter much any more, given we already have
> > the plan of merging PortableRunner into using java reference runner
> > (i.e. :beam-runners-reference-job-server).
> >
> > Still appreciated if someone can try out the python-on-flink
> > instructions
>
> > in case it is just due to my local machine setup.  Thanks!
> >
> >
> >
> > On Thu, Nov 8, 2018 at 5:04 PM Ruoyun Huang  > > wrote:
> >
> > Thanks Maximilian!
> >
> > I am working on migrating existing PortableRunner to using java ULR
> > (Link to Notes
> > <
> https://docs.google.com/document/d/1S86saZqiDaE_M5wxO0zOQ_rwC6QHv7sp1BmGTm0dLNE/edit#
> >).
> > If this issue is non-trivial to solve, I would vote for removing
> > this default behavior as part of the consolidation.
> >
> > On Thu, Nov 8, 2018 at 2:58 AM Maximilian Michels  > > wrote:
> >
> > In the long run, we should get rid of the Docker-inside-Docker
> > approach,
> > which was only intended for testing anyways. It would be cleaner
> to
> > start the SDK harness container alongside with JobServer
> container.
> >
> > Short term, I think it should be easy to either fix the
> > permissions of
> > the mounted "docker" executable or use a Docker image for the
> > JobServer
> > which comes with Docker pre-installed.
> >
> > JIRA: https://issues.apache.org/jira/browse/BEAM-6020
> >
> > Thanks for reporting this Ruoyun!
> >
> > -Max
> >
> > On 08.11.18 00:10, Ruoyun Huang wrote:
> >  > Thanks Ankur and Maximilian.
> >  >
> >  > Just for reference in case other people encountering the same
> > error
> >  > message, the "permission denied" error in my original email
> > is exactly
> > due to the docker-inside-docker issue that Ankur mentioned.
> > Thanks Ankur!
> >  > Didn't make the link when you said it, had to discover that
> > in a hard
> >  > way (I thought it is due to my docker installation messed up).
> >  >
> >  > On Tue, Nov 6, 2018 at 1:53 AM Maximilian Michels
> > mailto:m...@apache.org>
> >  > >> wrote:
> >  >
> >  > Hi,
> >  >
> >  > Please follow
> >  > https://beam.apache.org/roadmap/portability/#python-on-flink
> >  >
> >  > Cheers,
> >  > Max
> >  >
> >  > On 06.11.18 01:14, Ankur Goenka wrote:
> >  >  > Hi,
> >  >  >
> >  >  > The Portable Runner requires a job server uri to work
> > with. The
> >  > current
> >  >  > default job server docker image is broken because of
> > docker inside
> >  >  > docker issue.
> >  >  >
> >  >  > Please refer to
> >  >  >
> > https://beam.apache.org/roadmap/portability/#python-on-flink for
> >  > how to
> >  >  > run a wordcount using Portable Flink Runner.
> >  >  >
> >  >  > Thanks,
> >  >  > Ankur
> >  > 

Re: [DISCUSS] Publish vendored dependencies independently

2018-11-14 Thread Lukasz Cwik
It's a small hassle, but it could be checked in with some changes; my example
commit was so that people could try it out as is.

I'll work towards getting it checked in and then start a release for gRPC
and guava.

On Wed, Nov 14, 2018 at 11:45 AM Scott Wegner  wrote:

> Thanks for pushing this forward Luke.
>
> My understanding is that these vendored grpc artifacts will only be
> consumed directly by Beam internal components (as opposed to Beam user
> projects). So there should be a fairly low bar for publishing them. But
> perhaps we should have some short checklist for releasing them for
> consistency.
>
> One item I would suggest for such a checklist would be to publish
> artifacts from checked-in apache/beam sources and then tag the release
> commit. Is it possible to get your changes merged in first, or is there a
> chicken-and-egg problem that artifacts need to be published and available
> for consumption?
>
> On Wed, Nov 14, 2018 at 10:51 AM Lukasz Cwik  wrote:
>
>> Note, I could also release the vendored version of guava 20 in
>> preparation for us to start consuming it. Any concerns?
>>
>> On Tue, Nov 13, 2018 at 3:59 PM Lukasz Cwik  wrote:
>>
>>> I have made some incremental progress on this and wanted to release our
>>> first vendored dependency of gRPC 1.13.1 since I was able to fix a good
>>> number of the import/code completion errors that Intellij was experiencing.
>>> I have published an example of what the jar/pom looks like in the Apache
>>> Staging repo:
>>>
>>> https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-vendor-grpc-1_13_1/
>>>
>>> You can also checkout[1] and from a clean workspace run:
>>> ./gradlew :beam-vendor-grpc-1_13_1:publishToMavenLocal -PisRelease
>>> -PvendoredDependenciesOnly
>>> which will build a vendored version of gRPC that is published to your
>>> local maven repository. All the projects that depended on the gradle
>>> beam-vendor-grpc-1_13_1 project are now pointing at the Maven artifact
>>> org.apache.beam:beam-vendor-grpc-1_13_1:0.1
>>>
>>> I was planning to follow the Apache Beam release process but only for
>>> this specific artifact and start a vote thread if there aren't any concerns.
>>>
>>> 1:
>>> https://github.com/lukecwik/incubator-beam/commit/4b1b7b40ef316559f81c42dfdd44da988db201e9
>>>
>>>
>>> On Thu, Oct 25, 2018 at 10:59 AM Lukasz Cwik  wrote:
>>>
 That's a good point Thomas, hadn't considered the lib/ case. I also am
 recommending what Thomas is suggesting as well.

 On Thu, Oct 25, 2018 at 10:52 AM Maximilian Michels 
 wrote:

> On 25.10.18 19:23, Lukasz Cwik wrote:
> >
> >
> > On Thu, Oct 25, 2018 at 9:59 AM Maximilian Michels  > > wrote:
> >
> > Question: How would a user end up with the same shaded dependency
> > twice?
> > The shaded dependencies are transitive dependencies of Beam and
> thus,
> > this shouldn't happen. Is this a safe-guard when running
> different
> > versions of Beam in the same JVM?
> >
> >
> > What I was referring to was that they aren't exactly the same
> dependency
> > but slightly different versions of the same dependency. Since we are
> > planning to vendor each dependency and its transitive dependencies
> as
> > part of the same jar, we can have  vendor-A that contains shaded
> > transitive-C 1.0 and vendor-B that contains transitive-C 2.0 both
> with
> > different package prefixes. It can be that transitive-C 1.0 and
> > transitive-C 2.0 can't be on the same classpath because they can't
> be
> > perfectly shaded due to JNI, java reflection, magical property
> > files/strings, ...
> >
>
> Ah yes. Get it. Thanks!
>

>
> --
>
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>


Re: [DISCUSS] Publish vendored dependencies independently

2018-11-14 Thread Scott Wegner
Thanks for pushing this forward Luke.

My understanding is that these vendored grpc artifacts will only be
consumed directly by Beam internal components (as opposed to Beam user
projects). So there should be a fairly low bar for publishing them. But
perhaps we should have some short checklist for releasing them for
consistency.

One item I would suggest for such a checklist would be to publish artifacts
from checked-in apache/beam sources and then tag the release commit. Is it
possible to get your changes merged in first, or is there a chicken-and-egg
problem that artifacts need to be published and available for consumption?

On Wed, Nov 14, 2018 at 10:51 AM Lukasz Cwik  wrote:

> Note, I could also release the vendored version of guava 20 in preparation
> for us to start consuming it. Any concerns?
>
> On Tue, Nov 13, 2018 at 3:59 PM Lukasz Cwik  wrote:
>
>> I have made some incremental progress on this and wanted to release our
>> first vendored dependency of gRPC 1.13.1 since I was able to fix a good
>> number of the import/code completion errors that Intellij was experiencing.
>> I have published an example of what the jar/pom looks like in the Apache
>> Staging repo:
>>
>> https://repository.apache.org/content/groups/snapshots/org/apache/beam/beam-vendor-grpc-1_13_1/
>>
>> You can also checkout[1] and from a clean workspace run:
>> ./gradlew :beam-vendor-grpc-1_13_1:publishToMavenLocal -PisRelease
>> -PvendoredDependenciesOnly
>> which will build a vendored version of gRPC that is published to your
>> local maven repository. All the projects that depended on the gradle
>> beam-vendor-grpc-1_13_1 project are now pointing at the Maven artifact
>> org.apache.beam:beam-vendor-grpc-1_13_1:0.1
>>
>> I was planning to follow the Apache Beam release process but only for
>> this specific artifact and start a vote thread if there aren't any concerns.
>>
>> 1:
>> https://github.com/lukecwik/incubator-beam/commit/4b1b7b40ef316559f81c42dfdd44da988db201e9
>>
>>
>> On Thu, Oct 25, 2018 at 10:59 AM Lukasz Cwik  wrote:
>>
>>> That's a good point Thomas, hadn't considered the lib/ case. I also am
>>> recommending what Thomas is suggesting as well.
>>>
>>> On Thu, Oct 25, 2018 at 10:52 AM Maximilian Michels 
>>> wrote:
>>>
 On 25.10.18 19:23, Lukasz Cwik wrote:
 >
 >
 > On Thu, Oct 25, 2018 at 9:59 AM Maximilian Michels >>> > > wrote:
 >
 > Question: How would a user end up with the same shaded dependency
 > twice?
 > The shaded dependencies are transitive dependencies of Beam and
 thus,
 > this shouldn't happen. Is this a safe-guard when running different
 > versions of Beam in the same JVM?
 >
 >
 > What I was referring to was that they aren't exactly the same
 dependency
 > but slightly different versions of the same dependency. Since we are
 > planning to vendor each dependency and its transitive dependencies as
 > part of the same jar, we can have  vendor-A that contains shaded
 > transitive-C 1.0 and vendor-B that contains transitive-C 2.0 both
 with
 > different package prefixes. It can be that transitive-C 1.0 and
 > transitive-C 2.0 can't be on the same classpath because they can't be
 > perfectly shaded due to JNI, java reflection, magical property
 > files/strings, ...
 >

 Ah yes. Get it. Thanks!

>>>

-- 




Got feedback? tinyurl.com/swegner-feedback


Re: How to use "PortableRunner" in Python SDK?

2018-11-14 Thread Ruoyun Huang
Thanks Thomas!

My desktop runs Linux.  I was using Gradle to run wordcount, and that was
how I got the job hanging. Since both of you got it working, I guess it is
more likely that something is wrong with my setup.


By using Thomas's python command line exactly as is, I am able to see the
job run succeeds, however two questions:

1)  Did you check whether the output file "/tmp/py-wordcount-direct" exists
or not?  I expect there to be text output, but I don't see this file
afterwards.  (I am still at the stage of building confidence in telling what
a successful run looks like.  Maybe I will try DataflowRunner and cross-check
outputs).
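[On question 1: Beam's text sink normally writes sharded files such as
py-wordcount-direct-00000-of-00001 rather than the bare output path, so
globbing for the prefix is the reliable check. A self-contained sketch,
using a fake shard for demonstration:]

```python
import glob
import os
import tempfile

def find_output_shards(prefix):
    """Return files written under an output prefix (Beam's text sink
    shards output as <prefix>-NNNNN-of-MMMMM)."""
    return sorted(glob.glob(prefix + "*"))

# Demonstration with a fake shard file rather than a real pipeline run.
outdir = tempfile.mkdtemp()
base = os.path.join(outdir, "py-wordcount-direct")
open(base + "-00000-of-00001", "w").close()
shards = find_output_shards(base)
```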

2)  Why does it need a "--streaming" arg?  Isn't this a static batch input,
since it feeds a txt file?  In fact, I get a failure message if I remove
'--streaming'; not sure if it is due to my setup again.


On Wed, Nov 14, 2018 at 7:51 AM Thomas Weise  wrote:

> Works for me on macOS as well.
>
> In case you don't launch the pipeline through Gradle, this would be the
> command:
>
> python -m apache_beam.examples.wordcount \
>   --input=/etc/profile \
>   --output=/tmp/py-wordcount-direct \
>   --runner=PortableRunner \
>   --job_endpoint=localhost:8099 \
>   --parallelism=1 \
>   --streaming \
>   --flink_master=localhost:8081  # optional
>
> We talked about adding the wordcount to pre-commit.
>
> Regarding using ULR vs. Flink runner: There seems to be confusion between
> PortableRunner using the user supplied endpoint vs. trying to launch a job
> server. I commented in the doc.
>
> Thomas
>
>
>
> On Wed, Nov 14, 2018 at 3:30 AM Maximilian Michels  wrote:
>
>> Hi Ruoyun,
>>
>> I just ran the wordcount locally using the instructions on the page.
>> I've tried the local file system and GCS. Both times it ran successfully
>> and produced valid output.
>>
>> I'm assuming there is some problem with your setup. Which platform are
>> you using? I'm on MacOS.
>>
>> Could you expand on the planned merge? From my understanding we will
>> always need PortableRunner in Python to be able to submit against the
>> Beam JobServer.
>>
>> Thanks,
>> Max
>>
>> On 14.11.18 00:39, Ruoyun Huang wrote:
>> > A quick follow-up on using current PortableRunner.
>> >
>> > I followed the exact three steps as Ankur and Maximilian shared in
>> > https://beam.apache.org/roadmap/portability/#python-on-flink. The
>> > wordcount example keeps hanging after 10 minutes.  I also tried
>> > specifying explicit input/output args, either using gcs folder or local
>> > file system, but none of them works.
>> >
>> > Spent some time looking into it, but no conclusion yet.  At this point
>> > though, I guess it does not matter much any more, given we already have
>> > the plan of merging PortableRunner into using java reference runner
>> > (i.e. :beam-runners-reference-job-server).
>> >
>> > Still appreciated if someone can try out the python-on-flink
>> > instructions
>>
>> > in case it is just due to my local machine setup.  Thanks!
>> >
>> >
>> >
>> > On Thu, Nov 8, 2018 at 5:04 PM Ruoyun Huang > > > wrote:
>> >
>> > Thanks Maximilian!
>> >
>> > I am working on migrating existing PortableRunner to using java ULR
>> > (Link to Notes
>> > <
>> https://docs.google.com/document/d/1S86saZqiDaE_M5wxO0zOQ_rwC6QHv7sp1BmGTm0dLNE/edit#
>> >).
>> > If this issue is non-trivial to solve, I would vote for removing
>> > this default behavior as part of the consolidation.
>> >
>> > On Thu, Nov 8, 2018 at 2:58 AM Maximilian Michels > > > wrote:
>> >
>> > In the long run, we should get rid of the Docker-inside-Docker
>> > approach,
>> > which was only intended for testing anyways. It would be
>> cleaner to
>> > start the SDK harness container alongside with JobServer
>> container.
>> >
>> > Short term, I think it should be easy to either fix the
>> > permissions of
>> > the mounted "docker" executable or use a Docker image for the
>> > JobServer
>> > which comes with Docker pre-installed.
>> >
>> > JIRA: https://issues.apache.org/jira/browse/BEAM-6020
>> >
>> > Thanks for reporting this Ruoyun!
>> >
>> > -Max
>> >
>> > On 08.11.18 00:10, Ruoyun Huang wrote:
>> >  > Thanks Ankur and Maximilian.
>> >  >
>> >  > Just for reference in case other people encountering the same
>> > error
>> >  > message, the "permission denied" error in my original email
>> > is exactly
>> > due to the docker-inside-docker issue that Ankur mentioned.
>> > Thanks Ankur!
>> >  > Didn't make the link when you said it, had to discover that
>> > in a hard
>> >  > way (I thought it is due to my docker installation messed
>> up).
>> >  >
>> >  > On Tue, Nov 6, 2018 at 1:53 AM Maximilian Michels
>> > mailto:m...@apache.org>
>> >  > 

Re: How to use "PortableRunner" in Python SDK?

2018-11-14 Thread Ruoyun Huang
To answer Maximilian's question:

I am using Linux, Debian distribution.

It probably sounded too much when I used the word 'planned merge'. What I
really meant entails less change than it sounds. More specifically:

1) The default behavior, where PortableRunner starts a Flink server, is
confusing to new users.
2) All the related docs and inline comments.  Similarly, it can be very
confusing to connect PortableRunner to a Flink server.
3) [Probably no longer an issue.]  I couldn't get the Flink server example
working, and I could not get the example working on the Java ULR either.
Both would require debugging to resolve.  Thus I figured maybe we should
focus on one single thing, the Java ULR part, without worrying about the
Flink server.  Again, it looks like this may not be a valid concern, given
the Flink part is most likely due to my setup.


On Wed, Nov 14, 2018 at 3:30 AM Maximilian Michels  wrote:

> Hi Ruoyun,
>
> I just ran the wordcount locally using the instructions on the page.
> I've tried the local file system and GCS. Both times it ran successfully
> and produced valid output.
>
> I'm assuming there is some problem with your setup. Which platform are
> you using? I'm on MacOS.
>
> Could you expand on the planned merge? From my understanding we will
> always need PortableRunner in Python to be able to submit against the
> Beam JobServer.
>
> Thanks,
> Max
>
> On 14.11.18 00:39, Ruoyun Huang wrote:
> > A quick follow-up on using current PortableRunner.
> >
> > I followed the exact three steps as Ankur and Maximilian shared in
> > https://beam.apache.org/roadmap/portability/#python-on-flink. The
> > wordcount example keeps hanging after 10 minutes.  I also tried
> > specifying explicit input/output args, either using gcs folder or local
> > file system, but none of them works.
> >
> > Spent some time looking into it, but no conclusion yet.  At this point
> > though, I guess it does not matter much any more, given we already have
> > the plan of merging PortableRunner into using java reference runner
> > (i.e. :beam-runners-reference-job-server).
> >
> > Still appreciated if someone can try out the python-on-flink
> > instructions
>
> > in case it is just due to my local machine setup.  Thanks!
> >
> >
> >
> > On Thu, Nov 8, 2018 at 5:04 PM Ruoyun Huang  > > wrote:
> >
> > Thanks Maximilian!
> >
> > I am working on migrating existing PortableRunner to using java ULR
> > (Link to Notes
> > <
> https://docs.google.com/document/d/1S86saZqiDaE_M5wxO0zOQ_rwC6QHv7sp1BmGTm0dLNE/edit#
> >).
> > If this issue is non-trivial to solve, I would vote for removing
> > this default behavior as part of the consolidation.
> >
> > On Thu, Nov 8, 2018 at 2:58 AM Maximilian Michels  > > wrote:
> >
> > In the long run, we should get rid of the Docker-inside-Docker
> > approach,
> > which was only intended for testing anyways. It would be cleaner
> to
> > start the SDK harness container alongside with JobServer
> container.
> >
> > Short term, I think it should be easy to either fix the
> > permissions of
> > the mounted "docker" executable or use a Docker image for the
> > JobServer
> > which comes with Docker pre-installed.
> >
> > JIRA: https://issues.apache.org/jira/browse/BEAM-6020
> >
> > Thanks for reporting this Ruoyun!
> >
> > -Max
> >
> > On 08.11.18 00:10, Ruoyun Huang wrote:
> >  > Thanks Ankur and Maximilian.
> >  >
> >  > Just for reference in case other people encountering the same
> > error
> >  > message, the "permission denied" error in my original email
> > is exactly
> > due to the docker-inside-docker issue that Ankur mentioned.
> > Thanks Ankur!
> >  > Didn't make the link when you said it, had to discover that
> > in a hard
> >  > way (I thought it is due to my docker installation messed up).
> >  >
> >  > On Tue, Nov 6, 2018 at 1:53 AM Maximilian Michels
> > mailto:m...@apache.org>
> >  > >> wrote:
> >  >
> >  > Hi,
> >  >
> >  > Please follow
> >  > https://beam.apache.org/roadmap/portability/#python-on-flink
> >  >
> >  > Cheers,
> >  > Max
> >  >
> >  > On 06.11.18 01:14, Ankur Goenka wrote:
> >  >  > Hi,
> >  >  >
> >  >  > The Portable Runner requires a job server uri to work
> > with. The
> >  > current
> >  >  > default job server docker image is broken because of
> > docker inside
> >  >  > docker issue.
> >  >  >
> >  >  > Please refer to
> >  >  >
> > 

Re: [Call for items] November Beam Newsletter

2018-11-14 Thread Maximilian Michels

Updated the Flink Runner section. Just in time for the deadline :)

On 14.11.18 00:37, Rui Wang wrote:

Hi,

I just added something related to BeamSQL.

-Rui

On Tue, Nov 13, 2018 at 3:26 AM Etienne Chauchot wrote:


Hi,
I just added some things that were done.

Etienne

Le lundi 12 novembre 2018 à 12:22 +, Matthias Baetens a écrit :

Looks great, thanks for the effort and for including the Summit
blogpost, Rose!

On Thu, 8 Nov 2018 at 22:55 Rose Nguyen wrote:

Hi Beamers:
Time to sync with the community on all the awesome stuff we've
been doing!
*Add the highlights from October to now (or planned events and
talks) that you want to share by 11/14 11:59 p.m. PDT.*

We will collect the notes via Google docs but send out the final
version directly to the user mailing list. If you do not know how
to format something, it is OK to just put down the info and I
will edit. I'll ship out the newsletter on 11/15.

[1]

https://docs.google.com/document/d/1kKQ4a9RdptB6NwYlqmI9tTcdLAUzDnWi2dkvUi0J_Ww
-- 
Rose Thị Nguyễn
--