Re: Beam Job Server Errors Out: No filesystem found for scheme s3

2021-08-05 Thread Navdeep Poonia
unsubscribe

On Fri, Aug 6, 2021 at 10:36 AM Jeremy Lewi  wrote:


Beam Job Server Errors Out: No filesystem found for scheme s3

2021-08-05 Thread Jeremy Lewi
Hi Folks,

I'm running Beam Python (2.31.0) on Flink on Kubernetes, using the
PortableRunner and a Beam job server. I'm using S3 for the artifacts
dir. The job server is throwing exceptions like the one below,
complaining that no filesystem is registered for the scheme s3.

I'm using the stock job server container
apache/beam_flink1.13_job_server:2.31.0. It doesn't look like the jar
that ships in that container includes the AWS filesystem classes.

So I tried adding the jar beam-sdks-java-io-amazon-web-services-2.31.0.jar.
Now when the job server loads I get a NoClassDefFoundError for
AWSCredentialsProvider, because I'm missing some of that jar's transitive
dependencies.

Does anyone have recommendations on the easiest path to obtaining the
jar(s) needed to support S3 for the artifact service? Is there an uber jar
published somewhere or will I have to build it myself?

Thanks.
J



Aug 06, 2021 1:52:51 AM org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$2 finishStaging
SEVERE: Error staging artifacts
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: No filesystem found for scheme s3
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$2.finishStaging(ArtifactStagingService.java:461)
    at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$2.resolveNextEnvironment(ArtifactStagingService.java:439)
    at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$2.onNext(ArtifactStagingService.java:417)
    at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$2.onNext(ArtifactStagingService.java:303)
    at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:255)
    at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
    at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
    at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailableInternal(ServerCallImpl.java:309)
    at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:292)
    at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:765)
    at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at org.apache.beam.vendor.grpc.v1p36p0.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: No filesystem found for scheme s3
    at org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:497)
    at org.apache.beam.sdk.io.FileSystems.matchNewResource(FileSystems.java:567)
    at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$1.stagingDir(ArtifactStagingService.java:193)
    at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$1.getDestination(ArtifactStagingService.java:169)
    at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$StoreArtifact.call(ArtifactStagingService.java:271)
    at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$StoreArtifact.call(ArtifactStagingService.java:247)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more




---


Here's the exception after adding the amazon-web-services jar:


Module could not be instantiated
    at java.util.ServiceLoader.fail(ServiceLoader.java:232)
    at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at com.fasterxml.jackson.databind.ObjectMapper.findModules(ObjectMapper.java:1081)
    at org.apache.beam.sdk.options.PipelineOptionsFactory.<clinit>(PipelineOptionsFactory.java:478)
    at org.apache.beam.runners.flink.FlinkJobServerDriver.main(FlinkJobServerDriver.java:72)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/auth/AWSCredentialsProvider
    at org.apache.beam.sdk.io.aws.options.AwsModule.<init>(AwsModule.java:86)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at 
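For anyone hitting the same wall: Beam does not appear to publish a job-server uber jar with the S3 filesystem baked in, so one option is to shade it yourself. The Gradle Kotlin DSL build below is an untested sketch — the artifact coordinates and shadow-plugin version are assumptions on my part — that bundles the Flink job server together with beam-sdks-java-io-amazon-web-services, which pulls in the AWS SDK classes (including com.amazonaws.auth.AWSCredentialsProvider) transitively:

```kotlin
// build.gradle.kts -- untested sketch; verify coordinates on Maven Central.
plugins {
    java
    id("com.github.johnrengelman.shadow") version "7.0.0" // assumed version
}

repositories {
    mavenCentral()
}

dependencies {
    // The Flink 1.13 job server driver (assumed artifact id).
    implementation("org.apache.beam:beam-runners-flink-1.13-job-server:2.31.0")
    // Registers the s3:// FileSystem and pulls in the AWS SDK transitively.
    implementation("org.apache.beam:beam-sdks-java-io-amazon-web-services:2.31.0")
}

tasks.shadowJar {
    isZip64 = true
    // Merge META-INF/services files so the s3 FileSystemRegistrar stays
    // discoverable via ServiceLoader in the shaded jar.
    mergeServiceFiles()
    manifest {
        attributes("Main-Class" to "org.apache.beam.runners.flink.FlinkJobServerDriver")
    }
}
```

Running `gradle shadowJar` should then produce a single jar you can run in place of the stock job server container's entrypoint.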

Re: Spark Structured Streaming runner migrated to Spark 3

2021-08-05 Thread Austin Bennett
Hooray!  Thanks, Etienne!

On Thu, Aug 5, 2021 at 3:11 AM Etienne Chauchot 
wrote:

> Hi all,
>
> Just to let you know that the Spark Structured Streaming runner was migrated
> to Spark 3.
>
> Enjoy !
>
> Etienne
>
>


Re: Allyship workshops for open source contributors

2021-08-05 Thread Aizhamal Nurmamat kyzy
Sorry. Here is the link:
https://doodle.com/poll/d57tcpt46tkvtvay?utm_source=poll&utm_medium=link

On Wed, Aug 4, 2021 at 3:32 PM Aizhamal Nurmamat kyzy 
wrote:

> Hi all,
>
> I will get this workshop scheduled for September. I am trying to figure
> out which day/time works best considering the US and EU timezone
> participants. I suggest we start at 8am PST, which is 5pm CEST. Would that
> work for most of you? Could you please indicate days you would prefer by
> voting in this poll too?
>
> Thanks a lot!
>
> On Mon, Jun 14, 2021 at 12:00 PM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> Thank you all! Based on the feedback, I will set up a session for a
>> couple open source groups. Will share more details soon. Stay tuned.
>>
>> On Mon, Jun 7, 2021 at 4:42 PM Kenneth Knowles  wrote:
>>
>>> Yes please!
>>>
>>> On Thu, Jun 3, 2021, 18:32 Ratnakar Malla  wrote:
>>>
 +1


 --
 *From:* Austin Bennett 
 *Sent:* Thursday, June 3, 2021 6:20:25 PM
 *To:* user@beam.apache.org 
 *Cc:* dev 
 *Subject:* Re: Allyship workshops for open source contributors

 +1, assuming timing can work.

 On Wed, Jun 2, 2021 at 2:07 PM Aizhamal Nurmamat kyzy <
 aizha...@apache.org> wrote:

 If we have a good number of people who express interest in this thread,
 I will set up training for the Airflow community.


 I meant Beam ^^' I am organizing it for the Airflow community as well.




Re: Perf issue with Beam on spark (spark runner)

2021-08-05 Thread Tao Li
Hi Alexey,

It was a great presentation!

Regarding my perf testing, I was not doing any aggregation, filtering,
projection, or joins. I was simply reading all the fields of the Parquet
files and then immediately saving the PCollection back to Parquet.

Regarding the SDF translation, is it enabled by default?

I will check out ParquetIO splittable. Thanks!

Re: Perf issue with Beam on spark (spark runner)

2021-08-05 Thread Alexey Romanenko
It's very likely that Spark SQL has much better performance because of SQL
push-downs and because it avoids additional ser/deser operations.

At the same time, did you try to leverage "withProjection()" in ParquetIO and
project only the fields that you need?

Did you use ParquetIO splittable (it's not enabled by default; fixed in [1])?

Also, using the SDF translation for Read on the Spark runner can cause
performance degradation as well (we noticed that in our experiments). Try to
use the non-SDF read (if not yet) [2].

PS: Yesterday, at Beam Summit, we (Ismaël and I) gave a related talk. I'm not
sure if a recording is available yet, but you can find the slides here [3];
they may be helpful.


—
Alexey

[1] https://issues.apache.org/jira/browse/BEAM-12070
[2] https://issues.apache.org/jira/browse/BEAM-10670
[3] 
https://drive.google.com/file/d/17rJC0BkxpFFL1abVL01c-D0oHvRRmQ-O/view?usp=sharing
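For readers following along, the two suggestions above might look roughly like this in a Java pipeline. This is an untested sketch: the record schema, field names, and input path are invented for illustration; `withProjection` takes the projection schema plus an encoder schema per the ParquetIO javadoc; and `use_deprecated_read` is the experiment flag for the non-SDF read translation:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ProjectedParquetRead {
  public static void main(String[] args) {
    // Hypothetical projection: only the columns the job actually touches.
    Schema projection = SchemaBuilder.record("Bid").fields()
        .requiredLong("auction")
        .requiredLong("price")
        .requiredLong("bidder")
        .endRecord();

    // "use_deprecated_read" asks the runner for the non-SDF Read
    // translation discussed above (see BEAM-10670).
    PipelineOptions options = PipelineOptionsFactory.fromArgs(
        "--runner=SparkRunner",
        "--experiments=use_deprecated_read").create();

    Pipeline p = Pipeline.create(options);
    p.apply(ParquetIO.read(projection)
        // First schema: the fields to read; second: the schema used to
        // encode the resulting records.
        .withProjection(projection, projection)
        .from("s3://some-bucket/input/*.parquet")); // hypothetical path
    p.run().waitUntilFinish();
  }
}
```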
 

> On 5 Aug 2021, at 03:07, Tao Li <t...@zillow.com> wrote:
> 
> @Alexey Romanenko @Ismaël Mejía I assume you are experts on the Spark
> runner. Can you please take a look at this thread and confirm this jira
> covers the causes: https://issues.apache.org/jira/browse/BEAM-12646 ?
> 
> This perf issue is currently a blocker to me.
> 
> Thanks so much!
> 
> From: Tao Li <t...@zillow.com>
> Reply-To: "user@beam.apache.org" <user@beam.apache.org>
> Date: Friday, July 30, 2021 at 3:53 PM
> To: Andrew Pilloud <apill...@google.com>, "user@beam.apache.org"
> <user@beam.apache.org>
> Cc: Kyle Weaver <kcwea...@google.com>, Yuchu Cao <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
> Thanks everyone for your help.
> 
> We actually did another round of perf comparison between Beam (on Spark)
> and native Spark, without any projection/filtering in the query (to rule
> out the "predicate pushdown" factor).
> 
> The time spent on Beam with the Spark runner is still 3-5x that of native
> Spark, and the cause is https://issues.apache.org/jira/browse/BEAM-12646
> according to the Spark metrics. The Spark runner is pretty much the
> bottleneck.
> 
> From: Andrew Pilloud <apill...@google.com>
> Date: Thursday, July 29, 2021 at 2:11 PM
> To: "user@beam.apache.org" <user@beam.apache.org>
> Cc: Tao Li <t...@zillow.com>, Kyle Weaver <kcwea...@google.com>, Yuchu Cao
> <yuc...@trulia.com>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
> Actually, ParquetIO got pushdown in Beam SQL starting in v2.29.0.
> 
> Andrew
> 
> On Mon, Jul 26, 2021 at 10:05 AM Andrew Pilloud wrote:
>> Beam SQL doesn't currently have projection pushdown for ParquetIO (we are
>> working to expand this to more IOs). Using ParquetIO withProjection
>> directly will produce better results.
>> 
>> On Mon, Jul 26, 2021 at 9:46 AM Robert Bradshaw wrote:
>>> Could you try using Beam SQL [1] and see if that gives a more similar
>>> result to your Spark SQL query? I would also be curious whether the
>>> performance is sufficient using withProjection [2] to only read the
>>> auction, price, and bidder columns.
>>> 
>>> [1] https://beam.apache.org/documentation/dsls/sql/overview/
>>> [2] https://beam.apache.org/releases/javadoc/2.25.0/org/apache/beam/sdk/io/parquet/ParquetIO.Read.html#withProjection-org.apache.avro.Schema-org.apache.avro.Schema-

Re: [Question] Invitation to Join Beam Slack Channel

2021-08-05 Thread Asif Iqbal
I also want to be invited :)

On Thu, Aug 5, 2021 at 1:45 AM Koosha Hosseiny 
wrote:

> Hello there
> Can I also be invited?
>
> Many thanks.
>
>
>
>
>
>
>
>
>
> --
> *From:* Bergmeier, Andreas 
> *Sent:* Thursday, August 5, 2021 06:48
> *To:* user@beam.apache.org 
> *Subject:* Re: [Question] Invitation to Join Beam Slack Channel
>
> Would not mind an invite either 
> --
> *From:* Kyle Weaver 
> *Sent:* Wednesday, 4 August 2021 23:44
> *To:* user@beam.apache.org 
> *Subject:* Re: [Question] Invitation to Join Beam Slack Channel
>
> Hi Jo,
>
> I invited you to join. It looks like Slack invite links expire after a
> couple days, so the one you were using may be out of date.
>
> On Wed, Aug 4, 2021 at 2:36 PM Jo Alex  wrote:
>
> Hi, currently I'm joining Beam Summit 2021 and would like to learn
> more by joining the Beam Slack channel. But it seems I can't join with my
> personal email, and I don't have an @apache.org email either. I would
> like to ask if I can get invited to the Slack channel, or maybe there is
> another way to join? Thanks! Regards
>
>

-- 
Asif Iqbal
Senior Software Engineer
BenchSci
www.benchsci.com
E: aiq...@benchsci.com
BenchSci helps scientists and organizations improve the speed and quality
of their research, learn how


Spark Structured Streaming runner migrated to Spark 3

2021-08-05 Thread Etienne Chauchot

Hi all,

Just to let you know that the Spark Structured Streaming runner was migrated
to Spark 3.

Enjoy!

Etienne