Re: Beam Website Feedback

2022-02-27 Thread Ismaël Mejía
Hello Abe,

Can you check if you are subscribed to the user mailing list? It seems
you are subscribed to dev@, so maybe the issue is the missing user@ subscription.
You can do this by sending an email to:
user-subscr...@beam.apache.org

Ahmet and the others, it might be a good idea to mention the user@
mailing list in the contribute doc too. I just quickly checked and did
not see it there.
https://beam.apache.org/contribute/

Regards,
Ismaël

On Mon, Feb 28, 2022 at 7:05 AM Ahmet Altay  wrote:
>
> Forwarding to user@ list. And adding relevant folks who could reply @Pablo 
> Estrada @Robert Burke
>
> (Folks, if you respond please reply to the list and Abe together. Abe is 
> having problems with replying back to the user list.)
>
> -- Forwarded message -
> From: 阿部智大 
> Date: Sun, Feb 27, 2022 at 10:00 PM
> Subject: Re: Beam Website Feedback
> To: Ahmet Altay 
>
>
> Thank you for your reply.
>
> The content I tried to send is as follows.
>
>
> ーーー
> In the following steps, we are creating a process to write data from MySQL
> to a BQ table.
>
> 1. Get data from MySQL with "SELECT * FROM table" using the Query function of
> the databaseio package
> 2. Write it to the BQ table specified by the Write function of the bigquery
> package
>
> However, there is a difference between the number of rows actually
> returned by MySQL and the number inserted into the BQ table.
> Is this a bug?
>
> I have tried both Apache Beam versions 2.30.0 and 2.36.0 with the same 
> results.
>
> On Mon, Feb 28, 2022 at 2:58 PM Ahmet Altay  wrote:
>>
>> Hi,
>>
>> What is your question?
>>
>> Ahmet
>>
>> On Sun, Feb 27, 2022 at 9:56 PM 阿部智大  wrote:
>>>
>>> Hi Beam Users,
>>>
>>>
>>> I'm a Japanese developer.
>>>
>>> I have a question about using the Go SDK, so I want to send an email to 
>>> user@beam.apache.org, but I can't because MAILER-DAEMON is returned.
>>> Could you tell me how to send it?
>>>
>>> Thanks & Regards


Re: Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype

2021-06-05 Thread Ismaël Mejía
Hello, seems to be a known issue:
https://issues.apache.org/jira/browse/BEAM-10430

I don't know, however, if someone has already found a proper fix or workaround.


On Fri, Jun 4, 2021 at 8:22 PM Trevor Kramer  wrote:
>
> Relating to my earlier message I sometimes get this error instead.
>
> java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: 
> Provider com.fasterxml.jackson.module.jaxb.JaxbAnnotationModule not a subtype
>
> Searching google I have found some indications that this might be a version 
> conflict with Jackson. I don't see how to resolve this with Python code. Does 
> anyone have Beam running on Flink on EMR?
>
> Thanks,
>
> Trevor
>


Re: No filesystem found for scheme hdfs

2021-05-31 Thread Ismaël Mejía
You probably need to include the beam-sdks-java-io-hadoop-file-system
module.
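
For a Gradle build that is roughly the following dependency (a sketch; the
beamVersion property is an assumption and should match your Beam release):

  // Registers the "hdfs" scheme with Beam's FileSystems via ServiceLoader.
  implementation "org.apache.beam:beam-sdks-java-io-hadoop-file-system:${beamVersion}"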


On Mon, May 31, 2021 at 11:41 AM Gershi, Noam  wrote:

> Hi
>
>
>
> I am using Spark-runner, and when I am using Apache Beam TextIO to read a
> file from HDFS:
>
>
>
> .apply(TextIO.read().from("hdfs://path-to-file"))
>
>
>
> I get:
>
>
>
> Caused by: java.lang.IllegalArgumentException: No filesystem found for
> scheme hdfs
>
>  at
> org.apache.beam.sdk.io.FileSystems.getFileSystemInternal(FileSystems.java:463)
>
>  at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:120)
>
>  at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:141)
>
>  at org.apache.beam.sdk.io.FileSystems.match(FileSystems.java:153)
>
>  at
> org.apache.beam.sdk.io.FileBasedSource.split(FileBasedSource.java:262)
>
>  at
> org.apache.beam.runners.spark.io.SourceRDD$Bounded.getPartitions(SourceRDD.java:115)
>
>
>
>
>
> What is missing to read files from hdfs ?
>
>
>
>
>
> Noam Gershi
>
> ICG Technology – TLV Lab
>
>
>
>
>
>


[DISCUSS] Drop support for Flink 1.10

2021-05-28 Thread Ismaël Mejía
Hello,

With Beam support for Flink 1.13 just merged, it is time to discuss the end of
support for Flink 1.10, following the agreed policy of supporting only the
latest three Flink releases [1].

I would like to propose that for Beam 2.31.0 we stop supporting Flink 1.10 [2].
I prepared a PR for this [3], but of course I wanted to bring the subject here
(and to user@) for your attention, in case someone has a different opinion or
a reason to still support the older version.

WDYT?

Regards,
Ismael

[1]
https://lists.apache.org/thread.html/rfb5ac9d889d0e3f4400471de3c25000a15352bde879622c899d97581%40%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-12281
[3] https://github.com/apache/beam/pull/14906


Re: Extremely Slow DirectRunner

2021-05-08 Thread Ismaël Mejía
Can you try running the direct runner with the option
`--experiments=use_deprecated_read`?

Seems like an instance of
https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
also reported in
https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E

We should rollback using the SDF wrapper by default because of the
usability and performance issues reported.
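
For reference, a minimal sketch of passing that experiment when building the
pipeline options (only the flag itself comes from this thread; the rest of the
snippet is illustrative):

  PipelineOptions options =
      PipelineOptionsFactory.fromArgs(
              "--runner=DirectRunner",
              "--experiments=use_deprecated_read") // opt out of the SDF read wrapper
          .create();
  Pipeline p = Pipeline.create(options);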


On Sat, May 8, 2021 at 12:57 AM Evan Galpin  wrote:

> Hi all,
>
> I’m experiencing very slow performance and startup delay when testing a
> pipeline locally. I’m reading data from a Google PubSub subscription as the
> data source, and before each pipeline execution I ensure that data is
> present in the subscription (readable from GCP console).
>
> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
> min). Is that expected? I did find a Jira ticket[1] that at first seemed
> related, but I think it has more to do with BQ than DirectRunner.
>
> I’ve run the pipeline with a debugger connected and confirmed that it’s
> minutes before the first DoFn in my pipeline receives any data. Is there a
> way I can profile the direct runner to see what it’s churning on?
>
> Thanks,
> Evan
>
> [1]
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>


Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Ismaël Mejía
+dev

Since we all agree that we should return something different from
PDone, the real question is what we should return.
As a reminder, we had a pretty interesting discussion about this
already in the past, but the uniformization of our return values has not
happened.
This thread is worth reading for Vincent or anyone who wants to
contribute Write transforms that return something:
https://lists.apache.org/thread.html/d1a4556a1e13a661cce19021926a5d0997fbbfde016d36989cf75a07%40%3Cdev.beam.apache.org%3E

> Returning PDone is an anti-pattern that should be avoided, but changing it 
> now would be backwards incompatible.

A periodic reminder: most IOs are still Experimental, so I suppose it is
up to the maintainers to judge whether the upgrade to return something
different from PDone is worth it. In that case we can deprecate and remove
the previous signature in a short time (2 releases was the average for
previous cases).
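
For context, a rough sketch of the Wait-based pattern discussed below, assuming
a hypothetical write variant ("SomeIO") that returns a signal PCollection
instead of PDone (CassandraIO does not offer this today, which is the point of
this thread):

  PCollection<Row> rows = ...;
  // Assumed: a write that returns a result/signal PCollection instead of PDone.
  PCollection<?> writeSignal =
      rows.apply("WriteToDb", SomeIO.<Row>write().withResults());
  rows.apply(Wait.on(writeSignal))   // holds elements until the write signal is ready
      .apply(MapElements.into(TypeDescriptors.strings()).via((Row r) -> r.toString()))
      .apply(PubsubIO.writeStrings().to("projects/<project>/topics/<topic>"));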


On Wed, Mar 24, 2021 at 10:24 PM Alexey Romanenko
 wrote:
>
> I thought that was said about returning a PCollection of write results as 
> it’s done in other IOs (as I mentioned as examples) that have _additional_ 
> write methods, like “withWriteResults()” etc, that return PTransform<…, 
> PCollection>.
> In this case, we keep backwards compatibility and just add new functionality.
> Though, we need to follow the same pattern for user API and maybe even naming 
> for this feature across different IOs (like we have for "readAll()” methods).
>
>  I agree that we have to avoid returning PDone for such cases.
>
> On 24 Mar 2021, at 20:05, Robert Bradshaw  wrote:
>
> Returning PDone is an anti-pattern that should be avoided, but changing it 
> now would be backwards incompatible. PRs to add non-PDone returning variants 
> (probably as another option to the builders) that compose well with Wait, 
> etc. would be welcome.
>
> On Wed, Mar 24, 2021 at 11:14 AM Alexey Romanenko  
> wrote:
>>
>> In this way, I think “Wait” PTransform should work for you but, as it was 
>> mentioned before, it doesn’t work with PDone, only with PCollection as a 
>> signal.
>>
>> Since you already adjusted your own writer for that, it would be great to 
>> contribute it back to Beam in the way as it was done for other IOs (for 
>> example, JdbcIO [1] or BigtableIO [2])
>>
>> In general, I think we need to have it for all IOs, at least to use with 
>> “Wait” because this pattern it's quite often required.
>>
>> [1] 
>> https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L1078
>> [2] 
>> https://github.com/apache/beam/blob/ab1dfa13a983d41669e70e83b11f58a83015004c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java#L715
>>
>> On 24 Mar 2021, at 18:01, Vincent Marquez  wrote:
>>
>> No, it only needs to ensure that one record seen on Pubsub has successfully 
>> written to a database.  So "record by record" is fine, or even "bundle".
>>
>> ~Vincent
>>
>>
>> On Wed, Mar 24, 2021 at 9:49 AM Alexey Romanenko  
>> wrote:
>>>
>>> Do you want to wait until ALL records are written to Cassandra and then
>>> write all successfully written records to PubSub, or should it be performed
>>> "record by record"?
>>>
>>> On 24 Mar 2021, at 04:58, Vincent Marquez  wrote:
>>>
>>> I have a common use case where my pipeline looks like this:
>>> CassandraIO.readAll -> Aggregate -> CassandraIO.write -> PubSubIO.write
>>>
>>> I do NOT want my pipeline to look like the following:
>>>
>>> CassandraIO.readAll -> Aggregate -> CassandraIO.write
>>>                                  |
>>>                                  -> PubsubIO.write
>>>
>>> Because I need to ensure that only items written to Pubsub have 
>>> successfully finished a (quorum) write.
>>>
>>> Since CassandraIO.write is a PTransform that returns PDone, I can't actually use it
>>> here, so I often roll my own 'writer', but maybe there is a recommended way
>>> of doing this?
>>>
>>> Thanks in advance for any help.
>>>
>>> ~Vincent
>>>
>>>
>>
>


Re: [DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-12 Thread Ismaël Mejía
> Do we now support 1.8 through 1.12?

Yes, and that's clearly too much given that the Flink community only
supports the two latest releases.
It also hits us because we run tests for all those versions on precommit.

On Fri, Mar 12, 2021 at 7:27 PM Robert Bradshaw  wrote:
>
> Do we now support 1.8 through 1.12?
>
> Unless there are specific objections, makes sense to me.
>
> On Fri, Mar 12, 2021 at 8:29 AM Alexey Romanenko  
> wrote:
>>
>> +1 too but are there any potential objections for this?
>>
>> On 12 Mar 2021, at 11:21, David Morávek  wrote:
>>
>> +1
>>
>> D.
>>
>> On Thu, Mar 11, 2021 at 8:33 PM Ismaël Mejía  wrote:
>>>
>>> +user
>>>
>>> > Should we add a warning or something to 2.29.0?
>>>
>>> Sounds like a good idea.
>>>
>>>
>>>
>>>
>>> On Thu, Mar 11, 2021 at 7:24 PM Kenneth Knowles  wrote:
>>> >
>>> > Should we add a warning or something to 2.29.0?
>>> >
>>> > On Thu, Mar 11, 2021 at 10:19 AM Ismaël Mejía  wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> We have been supporting older versions of Flink that we had agreed in 
>>> >> previous
>>> >> discussions where we said we will be supporting only the latest three 
>>> >> releases
>>> >> [1].
>>> >>
>>> >> I would like to propose that for Beam 2.30.0 we stop supporting Flink 
>>> >> 1.8 and
>>> >> 1.9 [2].  I prepared a PR for this [3] but of course I wanted to bring 
>>> >> the
>>> >> subject here (and to user@) for your attention and in case someone has a
>>> >> different opinion or reason to still support the older versions.
>>> >>
>>> >> WDYT?
>>> >>
>>> >> Regards,
>>> >> Ismael
>>> >>
>>> >> [1] 
>>> >> https://lists.apache.org/thread.html/rfb5ac9d889d0e3f4400471de3c25000a15352bde879622c899d97581%40%3Cdev.beam.apache.org%3E
>>> >> [2] https://issues.apache.org/jira/browse/BEAM-11948
>>> >> [3] https://github.com/apache/beam/pull/14203
>>
>>


Re: [DISCUSS] Drop support for Flink 1.8 and 1.9

2021-03-11 Thread Ismaël Mejía
+user

> Should we add a warning or something to 2.29.0?

Sounds like a good idea.




On Thu, Mar 11, 2021 at 7:24 PM Kenneth Knowles  wrote:
>
> Should we add a warning or something to 2.29.0?
>
> On Thu, Mar 11, 2021 at 10:19 AM Ismaël Mejía  wrote:
>>
>> Hello,
>>
>> We have been supporting older versions of Flink that we had agreed in 
>> previous
>> discussions where we said we will be supporting only the latest three 
>> releases
>> [1].
>>
>> I would like to propose that for Beam 2.30.0 we stop supporting Flink 1.8 and
>> 1.9 [2].  I prepared a PR for this [3] but of course I wanted to bring the
>> subject here (and to user@) for your attention and in case someone has a
>> different opinion or reason to still support the older versions.
>>
>> WDYT?
>>
>> Regards,
>> Ismael
>>
>> [1] 
>> https://lists.apache.org/thread.html/rfb5ac9d889d0e3f4400471de3c25000a15352bde879622c899d97581%40%3Cdev.beam.apache.org%3E
>> [2] https://issues.apache.org/jira/browse/BEAM-11948
>> [3] https://github.com/apache/beam/pull/14203


Re: Apache Beam's UX Research Findings Readout

2021-02-12 Thread Ismaël Mejía
Is there a recorded version of this presentation,
for those of us who missed it?

On Thu, Feb 11, 2021 at 6:06 PM Carlos Camacho
 wrote:
>
> Hi everyone,
> This is a friendly reminder to join the UX Research Findings Readout.
>
> We are live now! Join us: ⁨https://meet.google.com/xfc-majk-byk⁩
>
> --
>
> Carlos Camacho | WIZELINE
>
> UX Designer
>
> carlos.cama...@wizeline.com
>
> Amado Nervo 2200, Esfera P6, Col. Jardines del Sol, 45050 Zapopan, Jal.
>
> Follow us @WizelineGlobal | Facebook | LinkedIn
>
> Share feedback on Clutch
>
>
> This email and its contents (including any attachments) are being sent to
> you on the condition of confidentiality and may be protected by legal
> privilege. Access to this email by anyone other than the intended recipient
> is unauthorized. If you are not the intended recipient, please immediately
> notify the sender by replying to this message and delete the material
> immediately from your system. Any further use, dissemination, distribution
> or reproduction of this email is strictly prohibited. Further, no
> representation is made with respect to any content contained in this email.


Re: Quick question regarding ParquetIO

2021-01-18 Thread Ismaël Mejía
Catching up on this thread, sorry if late to the party :) and my excuses because
this is going to be long but worth it.

> It does look like BEAM-11460 could work for you. Note that relies on a dynamic
> object which won't work with schema-aware transforms and SqlTransform. It's
> likely this isn't a problem for you, I just wanted to point it out.

We may be missing in this discussion the existence of the
`withBeamSchemas(true)` method on the IOs that produce Avro objects. This method
sets up a Schema-based coder for the output of the PCollection generated by the
read. This allows both SQL and Schema-based transforms just afterwards by
auto-inferring the Beam Row schema and auto-transforming everything into Rows
when needed.

PCollection<GenericRecord> input =
    p.apply(
        ParquetIO.read(SCHEMA)
            .from(path)
            .withBeamSchemas(true));

Now input can be used by SQL/Schema-based PTransforms.
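
For example, a minimal sketch with a hypothetical query (this needs the
beam-sdks-java-extensions-sql module on the classpath):

  PCollection<Row> result =
      input.apply(SqlTransform.query(
          "SELECT field1, COUNT(*) AS c FROM PCOLLECTION GROUP BY field1"));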

> @Kobe Feng thank you so much for the insights. Agree that it may be a good
> practice to read all sorts of file formats (e.g. parquet, avro etc) into a
> PCollection and then perform the schema aware transforms that you are
> referring to.

This is not the case at the moment because most IOs precede the schema-based
APIs, but more and more PTransforms are supporting it. Notice that for dynamic
objects or Schema-aware PCollections you don't even need them to produce
PCollection<Row>. You can take a PCollection<GenericRecord> (like above) and
connect it directly to schema-aware transformations as if it was a PCollection<Row>;
the transformation is done automatically for the user because of the
Schema-based coder.

You can do this manually if you have a non-schema PCollection of GenericRecords
by setting explicitly a Schema-based coder for the PCollection:

mycollection.setCoder(AvroUtils.schemaCoder(schema));

Beam also includes the schema-based `Convert` transform to convert different
types from/to Rows so this could be handy for cases when you need to transform
in both directions and it is not supported. Beam 2.28.0 introduces an
improvement that allows converting from any Schema-based PCollection (Rows or
others) into GenericRecords. This is really useful because Avro/Parquet-based
writes expect a PCollection<GenericRecord>, not one of Rows, and now you can just
transform a schema-based PCollection (e.g. PCollection<Row> or of other objects)
into a PCollection<GenericRecord> like this:


myrowcollection.apply(Convert.to(GenericRecord.class)).apply(AnAvroBasedSinkIO.write(...))

https://issues.apache.org/jira/browse/BEAM-11571

So now the full scenario is covered for reads via .withBeamSchemas(true) or by
setting manually an AvroCoder for schemas and for writes by preceding the Sinks
with `Convert.to`. That's the beauty of Beam's bidirectional Schema coders.

Note that this probably can be better documented in the programming guide or in
the javadocs so contributions welcome!

And now back to the initial question:

> Quick question about ParquetIO. Is there a way to avoid specifying the avro
> schema when reading parquet files?

No, you cannot at the moment. BEAM-11460 allows you to parametrize the
transformation from a GenericRecord (with a schema you expect in advance even if
you don't specify it) into your own type of objects.

In Parquet/Avro the schema you use to write can differ from the schema you use
to read, this is done to support schema evolution, so the most general use case
is to allow users to read from specific versions of the Schema provided into
their objects. That's probably one of the reasons why this is not supported.

Since the Schema is part of the Parquet file metadata I suppose we could somehow
use it and produce the Schema for the output collection; notice however that if
the schema differs across files this will break at runtime.

Filed https://issues.apache.org/jira/browse/BEAM-11650 to track this.

On Wed, Jan 13, 2021 at 7:42 PM Tao Li  wrote:
>
> @Kobe Feng thank you so much for the insights. Agree that it may be a good 
> practice to read all sorts of file formats (e.g. parquet, avro etc) into a 
> PCollection and then perform the schema aware transforms that you are 
> referring to.
>
>
>
> The new dataframe APIs for Python SDK sound pretty cool and I can imagine it 
> will save a lot of hassles during a beam app development. Hopefully it will 
> be added to Java SDK as well.
>
>
>
> From: Kobe Feng 
> Reply-To: "user@beam.apache.org" 
> Date: Friday, January 8, 2021 at 11:39 AM
> To: "user@beam.apache.org" 
> Subject: Re: Quick question regarding ParquetIO
>
>
>
> Tao,
> I'm not an expert, and good intuition, all you want is schema awareness 
> transformations or let's say schema based transformation in Beam not only for 
> IO but also for other DoFn, etc, and possibly have schema revolution in 
> future as well.
>
>
> This is how I try to understand and explain in other places before:  Not like 
> spark, flink to leverage internal/built-in types (e.g, catalyst struct type)  
> for built-in operators as more as possible to infer the 

Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-28 Thread Ismaël Mejía
It seems we are not publishing the latest versions of the Flink Job
Server (Flink 1.11 and 1.12) docker images

These do not exist:

https://hub.docker.com/r/apache/beam_flink1.11_job_server
https://hub.docker.com/r/apache/beam_flink1.12_job_server

but this does and has the good 2.27.0_rc1 tag:
https://hub.docker.com/r/apache/beam_flink1.10_job_server

I wonder if the issue might be related to the fact that we need to
request the repo to be created or if it is something different.

On Thu, Dec 24, 2020 at 5:33 PM Brian Hulette  wrote:
>
> +Boyuan Zhang helped me get to the bottom of the sql_taxi issue. The problem 
> is with the WriteStringsToPubSub API, which is deprecated since 2.7.0, but 
> used in the example. Boyuan has [1] out to fix WriteStringsToPubSub and I 
> just sent [2] to replace WriteStringsToPubSub with WriteToPubSub in example 
> code. Issue is tracked in [3].
>
> [1] https://github.com/apache/beam/pull/13614
> [2] https://github.com/apache/beam/pull/13615
> [3] https://issues.apache.org/jira/browse/BEAM-11524
>
> On Thu, Dec 24, 2020 at 8:26 AM Pablo Estrada  wrote:
>>
>> Alright! Thanks everyone for your validations. I'm cancelling this RC, and 
>> I'll perform cherry picks to prepare the next one.
>>
>> Please update this thread with any other cherry pick requests!
>> -P.
>>
>> On Thu, Dec 24, 2020, 3:17 AM Ismaël Mejía  wrote:
>>>
>>> It might be a good idea to include also:
>>>
>>> [BEAM-11403] Cache UnboundedReader per UnboundedSourceRestriction in
>>> SDF Wrapper DoFn
>>> https://github.com/apache/beam/pull/13592
>>>
>>> So Java development experience is less affected (as with 2.26.0) (There
>>> is a flag to exclude but defaults matter).
>>>
>>> On Thu, Dec 24, 2020 at 2:56 AM Valentyn Tymofieiev  
>>> wrote:
>>> >
>>> > We discovered a regression on CombineFn.from_callable() started in 
>>> > 2.26.0. Even though it's not a regression in 2.27.0, I strongly prefer we 
>>> > fix it in 2.27.0 as it leads to buggy behavior, so I vote -1.
>>> >
>>> > The fix to release branch is in flight: 
>>> > https://github.com/apache/beam/pull/13613.
>>> >
>>> >
>>> >
>>> > On Wed, Dec 23, 2020 at 3:38 PM Brian Hulette  wrote:
>>> >>
>>> >> -1 (non-binding)
>>> >> Good news: I validated a dataframe pipeline on Dataflow which looked 
>>> >> good (with expected performance improvements!)
>>> >> Bad news: I also tried to run the sql_taxi example pipeline (streaming 
>>> >> SQL in python) on Dataflow and ran into PubSub IO related issues. The 
>>> >> example fails in the same way with 2.26.0, but it works in 2.25.0. It's 
>>> >> possible this is a Dataflow bug and not a Beam one, but I'd like to 
>>> >> investigate further to make sure.
>>> >>
>>> >> On Wed, Dec 23, 2020 at 12:25 PM Kyle Weaver  wrote:
>>> >>>
>>> >>> +1 (non-binding) Validated wordcount with Python source + Flink and 
>>> >>> Spark job server jars. Also checked that the ...:sql:udf jar was added 
>>> >>> and includes our cherry-picks. Thanks Pablo :)
>>> >>>
>>> >>> On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay  wrote:
>>> >>>>
>>> >>>> +1 (binding).
>>> >>>>
>>> >>>> I validated python quickstarts. Thank you Pablo.
>>> >>>>
>>> >>>> On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre 
>>> >>>>  wrote:
>>> >>>>>
>>> >>>>> +1 (binding)
>>> >>>>>
>>> >>>>> Regards
>>> >>>>> JB
>>> >>>>>
>>> >>>>> Le 23 déc. 2020 à 06:46, Pablo Estrada  a écrit :
>>> >>>>>
>>> >>>>> Hi everyone,
>>> >>>>> Please review and vote on the release candidate #1 for the version 
>>> >>>>> 2.27.0, as follows:
>>> >>>>> [ ] +1, Approve the release
>>> >>>>> [ ] -1, Do not approve the release (please provide specific comments)
>>> >>>>>
>>> >>>>>
>>> >>>>> Reviewers are encouraged to test their own use cases with the release 
>>> >>>>> candidate, and vote +1
>>> >>>>>  if no issues are found.
>>>

Re: beam and compatible Flink Runner versions

2020-12-28 Thread Ismaël Mejía
It seems the website has not been updated with the latest changes. I
just opened https://github.com/apache/beam/pull/13620/files for this

Just to clarify the newer versions of Flink are supported in these versions:

2.21.0 - 2.24.0 Support for Flink 1.10.0
2.25.0 - 2.26.0 Support for Flink 1.11.0
2.27.0 Support for Flink 1.12.0

Ismaël

On Sun, Dec 27, 2020 at 2:12 PM Günter Hipler  wrote:
>
> Hi,
>
> the compatibility matrix on
> https://beam.apache.org/documentation/runners/flink/
>
> says, beam versions >= 2.21.0 are only valid for Flink versions <= 1.10
> and I get an error using a version higher than 1.10
> mas_zb_demo_marc_author_count.py: error: argument --flink_version:
> invalid choice: '1.11' (choose from '1.7', '1.8', '1.9', '1.10')
>
> On the other hand issues like this
> https://issues.apache.org/jira/browse/BEAM-11191
>
> are referencing Flink version 1.11 and the upcoming 2.27 release is
> referencing even Flink's version 1.12
> https://issues.apache.org/jira/browse/BEAM-11419
>
> Thanks for hints for some clarification
>
> Günter
>
>


Re: [VOTE] Release 2.27.0, release candidate #1

2020-12-24 Thread Ismaël Mejía
It might be a good idea to include also:

[BEAM-11403] Cache UnboundedReader per UnboundedSourceRestriction in
SDF Wrapper DoFn
https://github.com/apache/beam/pull/13592

So Java development experience is less affected (as with 2.26.0) (There
is a flag to exclude but defaults matter).

On Thu, Dec 24, 2020 at 2:56 AM Valentyn Tymofieiev  wrote:
>
> We discovered a regression on CombineFn.from_callable() started in 2.26.0. 
> Even though it's not a regression in 2.27.0, I strongly prefer we fix it in 
> 2.27.0 as it leads to buggy behavior, so I vote -1.
>
> The fix to release branch is in flight: 
> https://github.com/apache/beam/pull/13613.
>
>
>
> On Wed, Dec 23, 2020 at 3:38 PM Brian Hulette  wrote:
>>
>> -1 (non-binding)
>> Good news: I validated a dataframe pipeline on Dataflow which looked good 
>> (with expected performance improvements!)
>> Bad news: I also tried to run the sql_taxi example pipeline (streaming SQL 
>> in python) on Dataflow and ran into PubSub IO related issues. The example 
>> fails in the same way with 2.26.0, but it works in 2.25.0. It's possible 
>> this is a Dataflow bug and not a Beam one, but I'd like to investigate 
>> further to make sure.
>>
>> On Wed, Dec 23, 2020 at 12:25 PM Kyle Weaver  wrote:
>>>
>>> +1 (non-binding) Validated wordcount with Python source + Flink and Spark 
>>> job server jars. Also checked that the ...:sql:udf jar was added and 
>>> includes our cherry-picks. Thanks Pablo :)
>>>
>>> On Wed, Dec 23, 2020 at 12:02 PM Ahmet Altay  wrote:

 +1 (binding).

 I validated python quickstarts. Thank you Pablo.

 On Tue, Dec 22, 2020 at 10:04 PM Jean-Baptiste Onofre  
 wrote:
>
> +1 (binding)
>
> Regards
> JB
>
> Le 23 déc. 2020 à 06:46, Pablo Estrada  a écrit :
>
> Hi everyone,
> Please review and vote on the release candidate #1 for the version 
> 2.27.0, as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> Reviewers are encouraged to test their own use cases with the release 
> candidate, and vote +1
>  if no issues are found.
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org 
> [2], which is signed with the key with fingerprint 
> C79DDD47DAF3808F0B9DDFAC02B2D9F742008494 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.27.0-RC1" [5],
> * website pull request listing the release [6], publishing the API 
> reference manual [7], and the blog post [8].
> * Python artifacts are deployed along with the source release to the 
> dist.apache.org [2].
> * Validation sheet with a tab for 2.27.0 release to help with validation 
> [9].
> * Docker images published to Docker Hub [10].
>
> The vote will be open for at least 72 hours, but given the holidays, we 
> will likely extend for a few more days. The release will be adopted by 
> majority approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> -P.
>
> [1] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12349380
> [2] https://dist.apache.org/repos/dist/dev/beam/2.27.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1145/
> [5] https://github.com/apache/beam/tree/v2.27.0-RC1
> [6] https://github.com/apache/beam/pull/13602
> [7] https://github.com/apache/beam-site/pull/610
> [8] https://github.com/apache/beam/pull/13603
> [9] 
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=194829106
> [10] https://hub.docker.com/search?q=apache%2Fbeam=image
>
>


Re: Combine with multiple outputs case Sample and the rest

2020-12-23 Thread Ismaël Mejía
Thanks for the answer Robert. Producing a combiner with two lists as
outputs was one idea I was considering too, but I was afraid of
OutOfMemory issues. I had not thought much about the consequences on
combining state, thanks for pointing that out. For the particular sampling
use case it might not be an issue, or am I missing something?

I am still curious if for Sampling there could be another approach to
achieve the same goal of producing the same result (uniform sample +
the rest) but without the issues of combining.
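
For reference, a rough sketch of the side-input approach mentioned above (the
element type and sample size are placeholders, and as Robert notes below,
duplicate elements would need extra care):

  PCollection<String> input = ...;
  PCollectionView<Iterable<String>> sampleView =
      input.apply(Sample.<String>fixedSizeGlobally(1000))
           .apply(View.asSingleton());  // the sample is small, a singleton side input is fine
  PCollection<String> rest =
      input.apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void process(ProcessContext c) {
          boolean sampled = false;
          for (String s : c.sideInput(sampleView)) {
            if (s.equals(c.element())) { sampled = true; break; }
          }
          if (!sampled) { c.output(c.element()); }  // keep only the non-sampled elements
        }
      }).withSideInputs(sampleView));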

On Mon, Dec 21, 2020 at 7:23 PM Robert Bradshaw  wrote:
>
> There are two ways to emit multiple outputs: either to multiple distinct 
> PCollections (e.g. withOutputTags) or multiple (including 0) outputs to a 
> single PCollection (the difference between Map and FlatMap). In full 
> generality, one can always have a CombineFn that outputs lists (say <key, result>*) followed by a DoFn that emits to multiple places based on this
> result.
>
> One other con of emitting multiple values from a CombineFn is that they are
> used in other contexts as well, e.g. combining state, and trying to make 
> sense of a multi-outputting CombineFn in that context is trickier.
>
> Note that for Sample in particular, it works as a CombineFn because we throw 
> most of the data away. If we kept most of the data, it likely wouldn't fit 
> into one machine to do the final sampling. The idea of using a side input to 
> filter after the fact should work well (unless there's duplicate elements, in 
> which case you'd have to uniquify them somehow to filter out only the "right" 
> copies).
>
> - Robert
>
>
>
> On Fri, Dec 18, 2020 at 8:20 AM Ismaël Mejía  wrote:
>>
>> I had a question today from one of our users about Beam’s Sample
>> transform (a Combine with an internal top-like function to produce a
>> uniform sample of size n of a PCollection). They wanted to obtain also
>> the rest of the PCollection as an output (the non sampled elements).
>>
>> My suggestion was to use the sample (since it was little) as a side
>> input and then reprocess the collection to filter its elements,
>> however I wonder if this is the ‘best’ solution.
>>
>> I was thinking also if Combine is essentially GbK + ParDo why we don’t
>> have a Combine function with multiple outputs (maybe an evolution of
>> CombineWithContext). I know this sounds weird and I have probably not
>> thought much about issues or the performance of the translation but I
>> wanted to see what others thought, does this make sense, do you see
>> some pros/cons or other ideas.
>>
>> Thanks,
>> Ismaël


Combine with multiple outputs case Sample and the rest

2020-12-18 Thread Ismaël Mejía
I had a question today from one of our users about Beam’s Sample
transform (a Combine with an internal top-like function to produce a
uniform sample of size n of a PCollection). They wanted to obtain also
the rest of the PCollection as an output (the non sampled elements).

My suggestion was to use the sample (since it was little) as a side
input and then reprocess the collection to filter its elements,
however I wonder if this is the ‘best’ solution.

I was thinking also if Combine is essentially GbK + ParDo why we don’t
have a Combine function with multiple outputs (maybe an evolution of
CombineWithContext). I know this sounds weird and I have probably not
thought much about issues or the performance of the translation but I
wanted to see what others thought, does this make sense, do you see
some pros/cons or other ideas.

Thanks,
Ismaël


Re: Support streaming side-inputs in the Spark runner

2020-10-05 Thread Ismaël Mejía
The limitation of non being able to have side inputs in streaming has been
pending since a long time ago, and there is sadly not anyone to my
knowledge working on this.

One extra aspect to have in mind is that the support for streaming in the
Spark runner uses the Spark DStream API which does not have explicit
watermark handling so the runner has to implement this, this is where the
issues of lacking holds exist, this hopefully could be fixed in the future.

One important detail is that the native Spark Structured Streaming API has
the concept of Watermark but it is more limited than the definition of
Watermark on Beam so even the Spark runner that uses the new API would
require to implement Watermark handling too.

Notice that if you go with native Spark Streaming instead of Beam please
pay attention to the fact that the support for streaming is still limited
for example you cannot do two aggregations in sequence (this is a
limitation that affects the ongoing work on the Structured Streaming
runner) https://issues.apache.org/jira/browse/SPARK-26655 also less
important but still good to know you cannot do continuous processing on
Spark save for  map-like operations, so you should default to the
micro-batch based approach if you any aggregation.
https://issues.apache.org/jira/browse/SPARK-20928



On Fri, Oct 2, 2020 at 9:05 PM Luke Cwik  wrote:

> Support for watermark holds is missing for both Spark streaming
> implementations (DStream and structured streaming) so watermark based
> triggers don't produce the correct output.
>
> Excluding the direct runner, Flink is the OSS runner with the most people
> working on it adding features and fixing bugs in it.
> Spark batch is in a good state but streaming development is still ongoing
> and also has a small group of folks.
>
>
> On Fri, Oct 2, 2020 at 10:16 AM  wrote:
>
>> For clarification, is it just streaming side inputs that present an issue
>> for SparkRunner or are there other areas that need work?  We've started
>> work on a Beam-based project that includes both streaming and batch
>> oriented work and a Spark cluster was our choice due to the perception that
>> it could handle both types of applications.
>>
>> However, that would have to be reevaluated if SparkRunner isn't up for
>> streaming deployments.  And it seems that SparkStructuredStreamingRunner
>> still needs some time before it's a fully-featured solution.  I guess I'm
>> trying to get a sense of whether these runners are still being actively
>> developed or were they donated by a third-party and are now suffering from
>> bit-rot.
>>
>> Oct 1, 2020, 10:54 by lc...@google.com:
>>
>> I would suggest trying FlinkRunner as it is a much more complete
>> streaming implementation.
>> SparkRunner has several key things that are missing that won't allow your
>> pipeline to function correctly.
>> If you're really invested in getting SparkRunner working though feel free
>> to contribute the necessary implementations for watermark holds and
>> broadcast state necessary for side inputs.
>>
>> On Tue, Sep 29, 2020 at 9:06 AM Rajagopal, Viswanathan <
>> rajagop...@dnb.com> wrote:
>>
>> Hi Team,
>>
>>
>>
>> I have a streaming pipeline (built using Apache Beam with Spark
>> Runner)which consumes events tagged with timestamps from Unbounded source
>> (Kinesis Stream) and batch them into FixedWindows of 5 mins each and then,
>> write all events in a window into a single / multiple files based on shards.
>>
>> We are trying to achieve the following through Apache Beam constructs
>>
>> *1.   **Create a PCollectionView from unbounded source and pass it
>> as a side-input to our main pipeline.*
>>
>> *2.   **Have a hook method that invokes per window that enables us
>> to do some operational activities per window.*
>>
>> *3.   **Stop the stream processor (graceful stop) from external
>> system.*
>>
>>
>>
>> *Approaches that we tried for 1).*
>>
>> ·Creating a PCollectionView from unbounded source and pass it as
>> a side-input to our main pipeline.
>>
>> ·Input Pcollection goes through FixedWindow transform.
>>
>> ·Created custom CombineFn that takes combines all inputs for a
>> window and produce single value Pcollection.
>>
>> ·Output of Window transform it goes to CombineFn (custom fn) and
>> creates a PCollectionView from CombineFn (using
>> Combine.Globally().asSingletonView() as this output would be passed as a
>> side-input for our main pipeline.
>>
>> o   Getting the following exception (while running with streaming option
>> set to true)
>>
>> ·java.lang.IllegalStateException: No TransformEvaluator
>> registered for UNBOUNDED transform View.CreatePCollectionView
>>
>>- Noticed that SparkRunner doesn’t support the streaming side-inputs
>>   in the Spark runner
>>   -
>>  
>> https://www.javatips.net/api/beam-master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/streaming/StreamingTransformTranslator.java
>>  

Re: [DISCUSS] Deprecation of AWS SDK v2 IO connectors

2020-09-15 Thread Ismaël Mejía
The reason why most people are using the AWSv1 IOs is probably that they have
been in Beam since 2017, whereas the AWSv2 ones were only added in the last
year.

Alexey mentions that maintaining both versions is becoming painful, and I would
like to expand on that because we now have duplicated work for new features:
for example, someone contributing a small improvement does it in one of the two
versions, we try to encourage them to do it in both, and general confusion and
lots of extra work go into keeping them aligned. For more complex IOs
like Kinesis this might prove even harder in the future.

Notice that the migration path is incremental because users can have both Amazon
SDKs in the same classpath without conflicts. And Alexey's proposal is about
deprecating AWSv1 IOs to reduce the maintenance burden, not removing them from
the codebase. This could help to raise awareness about the AWSv2 IOs so users
migrate and diminish the extra overhead for contributors and maintainers.

One minor comment on the proposal is that if we proceed with this plan we should
deprecate a v1 IO ONLY when we have full feature parity in the v2 version.
I think we don't have a replacement for AWSv1 S3 IO so that one should not be
deprecated.

On Tue, Sep 15, 2020 at 6:07 PM Robert Bradshaw  wrote:
>
> The 10x-100x ratio looks like an answer right there about (non-)suitability 
> for deprecation. The new question would be *why* people are using the v1 
> APIs. Is it because it was the original, or that it's been around longer, or 
> it has more features?
>


Re: Running Nexmark for Flink Streaming

2020-04-28 Thread Ismaël Mejía
Max, would it make sense to make the RocksDB dependency runtimeOnly at the runner
level, just to hint at its use? I assume that most Flink users might want
to have RocksDB as the default state backend.

runtimeOnly
"org.apache.flink:flink-statebackend-rocksdb_2.11:$flink_version"


On Tue, Apr 28, 2020 at 1:14 PM Sruthi Sree Kumar
 wrote:
>
> Hello,
>
> I have looked the Beam code. The statebackend that we pass should be an 
> instance of  FlinkStateBackendFactory . But there is no implementation for 
> the Interface.
> None of the FlinkStateBackend Implements this interface.
>
> So even when I try to pass the default MemoryStateBackend as an argument, it 
> throws an error.
>
> But I could manage to specify the statebackend (filesystem) using the config 
> file in the config directory specified by the env variable ENV_FLINK_CONF_DIR.
>
> Regards,
> Sruthi
>
> On Tue, Apr 28, 2020 at 11:35 AM Maximilian Michels  wrote:
>>
>> Hi Sruthi,
>>
>> Not possible out-of-the-box at the moment. You'll have to add the
>> RocksDB Flink dependency in flink_runner.gradle, e.g.:
>>
>>   compile "org.apache.flink:flink-statebackend-rocksdb_2.11:$flink_version"
>>
>> Also in the Flink config you have to set
>>
>>   state.backend: rocksdb
>>
>> Then you can run Nexmark against the cluster, e.g.
>>
>> ./gradlew :sdks:java:testing:nexmark:run \
>> -Pnexmark.runner=":runners:flink:1.10" \
>> -Pnexmark.args="
>> --runner=FlinkRunner
>> --flinkMaster=
>> --streaming=true
>> --shutdownSourcesOnFinalWatermark=true
>> --query=12
>> --suite=SMOKE
>> --manageResources=false
>> --monitorJobs=true
>> --enforceEncodability=true
>> --enforceImmutability=true"
>>
>>
>> Admittedly, this is a bit complicated. We could make that easier without
>> much work.
>>
>> Cheers,
>> Max
>>
>> On 28.04.20 10:26, Sruthi Sree Kumar wrote:
>> > Hello,
>> >
>> > Is it possible to run the nexmark queries by specifying a
>> > state-backed(Ex: RocksDB) ?
>> >
>> >
>> > Regards,
>> > Sruthi


Re: Stateful & Timely Call

2020-04-24 Thread Ismaël Mejía
Sounds like a good addition to the Beam patterns page Reza :)

On Fri, Apr 24, 2020 at 3:22 AM Aniruddh Sharma  wrote:
>
> Thanks Robert,
>
> This is a life saver and its a great help :). It works like a charm.
>
> Thanks
> Aniruddh
>
> On Thu, Apr 23, 2020 at 4:45 PM Robert Bradshaw  wrote:
>>
>> I may have misinterpreted your email, I thought you didn't have a need for 
>> keys at all. If this is actually the case, you don't need a GroupByKey, just 
>> have your DoFn take Rows as input, and emit List<Row> as output. That is,
>> it's a DoFn<Row, List<Row>>.
>>
>> You can buffer multiple Rows in an instance variable between process element 
>> calls. For example,
>>
>> class MyBufferingDoFn<T> extends DoFn<T, List<T>> {
>>   List<T> buffer = new ArrayList<>();
>>   @ProcessElement public void processElement(@Element T elt, OutputReceiver<List<T>> out) {
>>     buffer.add(elt);               // buffer the element, not the receiver
>>     if (buffer.size() > 100) {
>>       out.output(buffer);          // emit a full batch
>>       buffer = new ArrayList<>();
>>     }
>>   }
>>   @FinishBundle public void finishBundle(FinishBundleContext ctx) {
>>     // flush the remainder; FinishBundleContext needs an explicit timestamp and
>>     // window (the global window is assumed here)
>>     ctx.output(buffer, Instant.now(), GlobalWindow.INSTANCE);
>>     buffer = new ArrayList<>();
>>   }
>> }
>>
>> See 
>> https://beam.apache.org/releases/javadoc/2.20.0/org/apache/beam/sdk/transforms/ParDo.html
>>  for more information on the lifetime of DoFns.
>>
>> As for why your GBK is taking so long, yes, this can be a bottleneck. 
>> However, it should be noted that Dataflow (like most other runners) executes 
>> this step in conjunction with other steps as part of a "fused stage." So if 
>> your pipeline looks like
>>
>> Read -> DoFnA -> GBK -> DoFnB -> Write
>>
>> then Read, DoFnA, and GBK[part1] will execute concurrently (all starting up 
>> almost immediately), one element at at time, and when that's finished, 
>> GBK[part2, DoFnB, Write will execute concurrently, one element at a time, so 
>> you can't just look at the last unfinished stage to determine where the 
>> bottleneck is. (One helpful tool, however, is looking at the amount of time 
>> spent on each step in the UI.)
>>
>> Hopefully that helps.
>>
>> - Robert
>>
>>
>> On Thu, Apr 23, 2020 at 12:43 PM Aniruddh Sharma  
>> wrote:
>>>
>>> Thanks Robert and Luke
>>>
>>> This approach seems good to me. I am trying that , i have to include a 
>>> GroupBy to make Iterable available to do ParDo function to do same. 
>>> Now GroupBy is a bottleneck, its working for last 2 hours and proceed only 
>>> 40 GB data (still waiting for rest of 100's of GB of data).
>>>
>>> Currently I used GroupByKey.Create()
>>>
>>> What's recommended way to use what key to make it execute faster like same 
>>> key for all rows, vs different key for each row vs same row for a group of 
>>> keys.
>>>
>>> Thanks
>>> Aniruddh
>>>
>>> On Thu, Apr 23, 2020 at 12:47 PM Luke Cwik  wrote:

 As Robert suggested, what prevents you from doing:
 ReadFromBQ -> ParDo(BatchInMemory) -> DLP
 where BatchInMemory stores elements in the @ProcessElement method in an in 
 memory list and produce output every time the list is large enough with a 
 final output in the @FinishBundle method?

 On Thu, Apr 23, 2020 at 9:42 AM Aniruddh Sharma  
 wrote:
>
> Hi Luke
>
> Sorry forgot to mention the functions. Dataflow adds following function 
> and ["PartitionKeys", new GroupByKeyAndSortValuesOnly] this is super 
> slow, How to choose keys to make it faster ?
>
>  .apply("ReifyWindows", ParDo.of(new ReifyWindowedValueFn<>()))
>   .setCoder(
>   KvCoder.of(
>   keyCoder,
>   KvCoder.of(InstantCoder.of(), 
> WindowedValue.getFullCoder(kvCoder, windowCoder
>
>   // Group by key and sort by timestamp, dropping windows as they 
> are reified
>   .apply("PartitionKeys", new GroupByKeyAndSortValuesOnly<>())
>
>   // The GBKO sets the windowing strategy to the global default
>   .setWindowingStrategyInternal(inputWindowingStrategy);
>
> THanks
> ANiruddh
>
> On 2020/04/23 16:35:58, Aniruddh Sharma  wrote:
> > Thanks Luke for your response.
> >
> > My use case is following.
> > a) I read data from BQ (TableRow)
> > b) Convert it into (Table.Row) for DLP calls.
> > c) have to batch Table.Row collection up to a max size of 512 KB (i.e 
> > fit may rows from BQ into a single DLP table) and call DLP.
> >
> > Functionally, I don't have a need of key and window. As I just want to 
> > fit rows in DLP table up to a max size.
> >
> > In batch mode, when I call StateFulAPI,
> > it adds a "BatchStatefulParDoOverrides.GroupByKeyAndSortValuesOnly" 
> > step and this step is super slow. Like it is running on 50 node cluster 
> > for 800 GB data for last 10 hours.
> >
> > This step is not added when I call Dataflow in streaming mode. But I 
> > can't call it in Streaming mode for other reasons.
> >
> > So I am trying to understand following
> > a) 

Re: Running NexMark Tests

2020-04-21 Thread Ismaël Mejía
You need to instruct the Flink runner to shut down the source,
otherwise it will stay waiting.
You can do this by adding the extra
argument `--shutdownSourcesOnFinalWatermark=true`.
And if that works and you want to open a PR to update our
documentation, that would be greatly appreciated.
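
For example, the invocation from the quoted message below becomes (only the
extra flag is added):

  ./gradlew :sdks:java:testing:nexmark:run \
      -Pnexmark.runner=":runners:flink:1.10" \
      -Pnexmark.args="
          --runner=FlinkRunner
          --suite=SMOKE
          --streamTimeout=60
          --streaming=true
          --manageResources=false
          --monitorJobs=true
          --shutdownSourcesOnFinalWatermark=true
          --flinkMaster=[local]"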

Regards,
Ismaël


On Tue, Apr 21, 2020 at 10:04 PM Sruthi Sree Kumar
 wrote:
>
> Hello,
>
> I am trying to run nexmark queries using flink runner streaming. Followed the 
> documentation and used the command
> ./gradlew :sdks:java:testing:nexmark:run \
>
> -Pnexmark.runner=":runners:flink:1.10" \
> -Pnexmark.args="
> --runner=FlinkRunner
> --suite=SMOKE
> --streamTimeout=60
> --streaming=true
> --manageResources=false
> --monitorJobs=true
> --flinkMaster=[local]"
>
>
> But after the events are read from the source, there is no further progress 
> and the job is always stuck at 99%. Is there any configuration that I am 
> missing?
>
> Regards,
> Sruthi


Re: Meetups

2020-03-25 Thread Ismaël Mejía
That sounds like a great idea. We can do (and hopefully record) a
virtual meetup. Have you checked the 'logistics' for this?
I think Alex Van Boxel was looking for something similar so maybe he
can share his findings
https://twitter.com/alexvb/status/1239816659763953664?s=20

Any volunteers for sessions / subjects?

On Tue, Mar 24, 2020 at 1:24 AM Austin Bennett
 wrote:
>
> Seems we won't be convening in-person in about any city anytime soon.
>
> Seems like a chance to come together virtually.
>
> WHO CAN SHARE?
>
> Seeking:
> * Use Cases
> * Developing Beam/Components
> * Other
>
> If anything particular, also, what would you like to hear -- can see if we 
> can track such speakers down.


Re: Hello Beam Community!

2020-03-13 Thread Ismaël Mejía
Welcome !

On Fri, Mar 13, 2020 at 3:00 PM Connell O'Callaghan 
wrote:

> Welcome Brittany
>
> On Fri, Mar 13, 2020 at 6:45 AM Rustam Mehmandarov 
> wrote:
>
>> Welcome Brittany!
>>
>> Cheers,
>> Rustam
>> @rmehmandarov 
>>
>> On Fri, Mar 13, 2020 at 2:31 AM Brittany Hermann 
>> wrote:
>>
>>> Hello Beam Community!
>>>
>>> My name is Brittany Hermann and I recently joined the Open Source team
>>> in Data Analytics at Google. As a Program Manager, I will be focusing on
>>> community engagement while getting to work on Apache Beam and Airflow
>>> projects! I have always thrived on creating healthy, diverse, and overall
>>> happy communities and am excited to bring that to the team. For a fun fact,
>>> I am a big Wisconsin Badgers Football fan and have a goldendoodle puppy
>>> named Ollie!
>>>
>>> I look forward to collaborating with you all!
>>>
>>> Kind regards,
>>>
>>> Brittany Hermann
>>>
>>> --
>>>
>>> Brittany Hermann
>>>
>>> Open Source Program Manager (Provided by Adecco Staffing)
>>>
>>> 1190 Bordeaux Drive , Building 4, Sunnyvale, CA 94089
>>> 
>>>
>>>
>>>


Re: Problem with Classgraph in Beam 2.19

2020-03-05 Thread Ismaël Mejía
BEAM-9452 has been solved today; the fix will be included in the next
release, 2.20.0.
Thanks again for reporting it, Péter.

On Thu, Mar 5, 2020 at 5:25 PM Ismaël Mejía  wrote:

> Oh we were not aware of that issue, next time do not hesitate to let us
> know in advance.
>
> On Thu, Mar 5, 2020 at 5:20 PM Kjetil Halvorsen <
> kjetil.halvor...@cognite.com> wrote:
>
>> Thanks for the quick response--it is much appreciated!
>>
>> Also, great news on the one fix already done and that you guys are
>> looking into the other issue. These have so far been blockers for us to
>> move forward from 2.17.
>>
>> Best,
>> Kjetil
>>
>> On Thu, Mar 5, 2020 at 4:27 PM Ismaël Mejía  wrote:
>>
>>> Thanks Péter for bringing this info and creating the issue, thanks
>>> Kjetil too for confirming that you can also reproduce it.
>>> For the opencensus issue it is already fixed, for ref
>>> https://issues.apache.org/jira/browse/BEAM-9304
>>>
>>>
>>> On Thu, Mar 5, 2020 at 4:14 PM Kjetil Halvorsen <
>>> kjetil.halvor...@cognite.com> wrote:
>>>
>>>> I have observed the same thing.
>>>>
>>>> Also, quite a bit more dependency challenges started from 2.18. when
>>>> using the Dataflow runner and GCP IO extensions. In particular
>>>> io.opencensus.
>>>>
>>>> Cheers,
>>>> Kjetil
>>>>
>>>> On Thu, Mar 5, 2020 at 3:55 PM Péter Farkas 
>>>> wrote:
>>>>
>>>>> BEAM-9452
>>>>>
>>>>> On Thu, 5 Mar 2020 at 15:04, Ismaël Mejía  wrote:
>>>>>
>>>>>> Can you please create a JIRA for this issue.
>>>>>>
>>>>>> On Thu, Mar 5, 2020 at 2:35 PM Péter Farkas 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to upgrade to version 2.19, but when I try to test it I
>>>>>>> keep getting this error:
>>>>>>> Exception in thread "main" java.lang.RuntimeException: Failed to
>>>>>>> construct instance from factory method 
>>>>>>> DataflowRunner#fromOptions(interface
>>>>>>> org.apache.beam.sdk.options.PipelineOptions)
>>>>>>> at
>>>>>>> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:224)
>>>>>>> at
>>>>>>> org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:155)
>>>>>>> at
>>>>>>> org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:55)
>>>>>>> at org.apache.beam.sdk.Pipeline.create(Pipeline.java:147)
>>>>>>> ...
>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>> at
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>> at
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>> at
>>>>>>> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:214)
>>>>>>> ... 8 more
>>>>>>> Caused by: java.lang.IllegalArgumentException: No files to stage has
>>>>>>> been found.
>>>>>>> at
>>>>>>> org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:281)
>>>>>>> ... 13 more
>>>>>>>
>>>>>>> Process finished with exit code 1
>>>>>>>
>>>>>>> Additional information:
>>>>>>> - The same pipeline works well with 2.18
>>>>>>> - The problem only occurs on Windows machines, on Linux it works
>>>>>>> perfectly fine
>>>>>>> - After digging into the code, the library seems to think that my
>>>>>>> local jars are remote jars (the remote jar regex matches to windows 
>>>>>>> paths
>>>>>>> as well):
>>>>>>>   -
>>>>>>> nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler:257
>>>>>>>   - io.github.classgraph.ClasspathElementZip:163
>>>>>>>
>>>>>>> Did anyone else run into this? How did you solve it?
>>>>>>> --
>>>>>>>
>>>>>>> Peter Farkas
>>>>>>>
>>>>>>> Lead Data Architect
>>>>>>>
>>>>>>>
>>>>>>> www.aliz.ai
>>>>>>>
>>>>>>> LinkedIn <https://www.linkedin.com/company/alizcompany/>| Facebook
>>>>>>> <https://www.facebook.com/aliztechnologies/>| Blog
>>>>>>> <https://medium.com/@aliz_ai>
>>>>>>>
>>>>>>> <http://www.aliz.ai/>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Peter Farkas
>>>>>
>>>>> Lead Data Architect
>>>>>
>>>>>
>>>>> www.aliz.ai
>>>>>
>>>>> LinkedIn <https://www.linkedin.com/company/alizcompany/>| Facebook
>>>>> <https://www.facebook.com/aliztechnologies/>| Blog
>>>>> <https://medium.com/@aliz_ai>
>>>>>
>>>>> <http://www.aliz.ai/>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> *Kjetil Halvorsen*
>>>> Chief Architect, Enterprise Integration
>>>> +47 48 01 13 75 | kjetil.halvor...@cognite.com
>>>> www.cognite.com | LIBERATE YOUR DATA™
>>>>
>>>>
>>
>> --
>>
>> *Kjetil Halvorsen*
>> Chief Architect, Enterprise Integration
>> +47 48 01 13 75 | kjetil.halvor...@cognite.com
>> www.cognite.com | LIBERATE YOUR DATA™
>>
>>


Re: Problem with Classgraph in Beam 2.19

2020-03-05 Thread Ismaël Mejía
Oh we were not aware of that issue, next time do not hesitate to let us
know in advance.

On Thu, Mar 5, 2020 at 5:20 PM Kjetil Halvorsen <
kjetil.halvor...@cognite.com> wrote:

> Thanks for the quick response--it is much appreciated!
>
> Also, great news on the one fix already done and that you guys are looking
> into the other issue. These have so far been blockers for us to move
> forward from 2.17.
>
> Best,
> Kjetil
>
> On Thu, Mar 5, 2020 at 4:27 PM Ismaël Mejía  wrote:
>
>> Thanks Péter for bringing this info and creating the issue, thanks Kjetil
>> too for confirming that you can also reproduce it.
>> For the opencensus issue it is already fixed, for ref
>> https://issues.apache.org/jira/browse/BEAM-9304
>>
>>
>> On Thu, Mar 5, 2020 at 4:14 PM Kjetil Halvorsen <
>> kjetil.halvor...@cognite.com> wrote:
>>
>>> I have observed the same thing.
>>>
>>> Also, quite a bit more dependency challenges started from 2.18. when
>>> using the Dataflow runner and GCP IO extensions. In particular
>>> io.opencensus.
>>>
>>> Cheers,
>>> Kjetil
>>>
>>> On Thu, Mar 5, 2020 at 3:55 PM Péter Farkas 
>>> wrote:
>>>
>>>> BEAM-9452
>>>>
>>>> On Thu, 5 Mar 2020 at 15:04, Ismaël Mejía  wrote:
>>>>
>>>>> Can you please create a JIRA for this issue.
>>>>>
>>>>> On Thu, Mar 5, 2020 at 2:35 PM Péter Farkas 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to upgrade to version 2.19, but when I try to test it I
>>>>>> keep getting this error:
>>>>>> Exception in thread "main" java.lang.RuntimeException: Failed to
>>>>>> construct instance from factory method 
>>>>>> DataflowRunner#fromOptions(interface
>>>>>> org.apache.beam.sdk.options.PipelineOptions)
>>>>>> at
>>>>>> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:224)
>>>>>> at
>>>>>> org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:155)
>>>>>> at
>>>>>> org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:55)
>>>>>> at org.apache.beam.sdk.Pipeline.create(Pipeline.java:147)
>>>>>> ...
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>> at
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>> at
>>>>>> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:214)
>>>>>> ... 8 more
>>>>>> Caused by: java.lang.IllegalArgumentException: No files to stage has
>>>>>> been found.
>>>>>> at
>>>>>> org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:281)
>>>>>> ... 13 more
>>>>>>
>>>>>> Process finished with exit code 1
>>>>>>
>>>>>> Additional information:
>>>>>> - The same pipeline works well with 2.18
>>>>>> - The problem only occurs on Windows machines, on Linux it works
>>>>>> perfectly fine
>>>>>> - After digging into the code, the library seems to think that my
>>>>>> local jars are remote jars (the remote jar regex matches to windows paths
>>>>>> as well):
>>>>>>   - nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler:257
>>>>>>   - io.github.classgraph.ClasspathElementZip:163
>>>>>>
>>>>>> Did anyone else run into this? How did you solve it?
>>>>>> --
>>>>>>
>>>>>> Peter Farkas
>>>>>>
>>>>>> Lead Data Architect
>>>>>>
>>>>>>
>>>>>> www.aliz.ai
>>>>>>
>>>>>> LinkedIn <https://www.linkedin.com/company/alizcompany/>| Facebook
>>>>>> <https://www.facebook.com/aliztechnologies/>| Blog
>>>>>> <https://medium.com/@aliz_ai>
>>>>>>
>>>>>> <http://www.aliz.ai/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>>
>>>> Peter Farkas
>>>>
>>>> Lead Data Architect
>>>>
>>>>
>>>> www.aliz.ai
>>>>
>>>> LinkedIn <https://www.linkedin.com/company/alizcompany/>| Facebook
>>>> <https://www.facebook.com/aliztechnologies/>| Blog
>>>> <https://medium.com/@aliz_ai>
>>>>
>>>> <http://www.aliz.ai/>
>>>>
>>>
>>>
>>> --
>>>
>>> *Kjetil Halvorsen*
>>> Chief Architect, Enterprise Integration
>>> +47 48 01 13 75 | kjetil.halvor...@cognite.com
>>> www.cognite.com | LIBERATE YOUR DATA™
>>>
>>>
>
> --
>
> *Kjetil Halvorsen*
> Chief Architect, Enterprise Integration
> +47 48 01 13 75 | kjetil.halvor...@cognite.com
> www.cognite.com | LIBERATE YOUR DATA™
>
>


Re: Problem with Classgraph in Beam 2.19

2020-03-05 Thread Ismaël Mejía
Thanks Péter for bringing this info and creating the issue, thanks Kjetil
too for confirming that you can also reproduce it.
For the opencensus issue it is already fixed, for ref
https://issues.apache.org/jira/browse/BEAM-9304


On Thu, Mar 5, 2020 at 4:14 PM Kjetil Halvorsen <
kjetil.halvor...@cognite.com> wrote:

> I have observed the same thing.
>
> Also, quite a bit more dependency challenges started from 2.18. when using
> the Dataflow runner and GCP IO extensions. In particular io.opencensus.
>
> Cheers,
> Kjetil
>
> On Thu, Mar 5, 2020 at 3:55 PM Péter Farkas  wrote:
>
>> BEAM-9452
>>
>> On Thu, 5 Mar 2020 at 15:04, Ismaël Mejía  wrote:
>>
>>> Can you please create a JIRA for this issue.
>>>
>>> On Thu, Mar 5, 2020 at 2:35 PM Péter Farkas 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to upgrade to version 2.19, but when I try to test it I keep
>>>> getting this error:
>>>> Exception in thread "main" java.lang.RuntimeException: Failed to
>>>> construct instance from factory method DataflowRunner#fromOptions(interface
>>>> org.apache.beam.sdk.options.PipelineOptions)
>>>> at
>>>> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:224)
>>>> at
>>>> org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:155)
>>>> at
>>>> org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:55)
>>>> at org.apache.beam.sdk.Pipeline.create(Pipeline.java:147)
>>>> ...
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>> at
>>>> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:214)
>>>> ... 8 more
>>>> Caused by: java.lang.IllegalArgumentException: No files to stage has
>>>> been found.
>>>> at
>>>> org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:281)
>>>> ... 13 more
>>>>
>>>> Process finished with exit code 1
>>>>
>>>> Additional information:
>>>> - The same pipeline works well with 2.18
>>>> - The problem only occurs on Windows machines, on Linux it works
>>>> perfectly fine
>>>> - After digging into the code, the library seems to think that my local
>>>> jars are remote jars (the remote jar regex matches to windows paths as
>>>> well):
>>>>   - nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler:257
>>>>   - io.github.classgraph.ClasspathElementZip:163
>>>>
>>>> Did anyone else run into this? How did you solve it?
>>>> --
>>>>
>>>> Peter Farkas
>>>>
>>>> Lead Data Architect
>>>>
>>>>
>>>> www.aliz.ai
>>>>
>>>> LinkedIn <https://www.linkedin.com/company/alizcompany/>| Facebook
>>>> <https://www.facebook.com/aliztechnologies/>| Blog
>>>> <https://medium.com/@aliz_ai>
>>>>
>>>> <http://www.aliz.ai/>
>>>>
>>>
>>
>> --
>>
>> Peter Farkas
>>
>> Lead Data Architect
>>
>>
>> www.aliz.ai
>>
>> LinkedIn <https://www.linkedin.com/company/alizcompany/>| Facebook
>> <https://www.facebook.com/aliztechnologies/>| Blog
>> <https://medium.com/@aliz_ai>
>>
>> <http://www.aliz.ai/>
>>
>
>
> --
>
> *Kjetil Halvorsen*
> Chief Architect, Enterprise Integration
> +47 48 01 13 75 | kjetil.halvor...@cognite.com
> www.cognite.com | LIBERATE YOUR DATA™
>
>


Re: Problem with Classgraph in Beam 2.19

2020-03-05 Thread Ismaël Mejía
Can you please create a JIRA for this issue.

On Thu, Mar 5, 2020 at 2:35 PM Péter Farkas  wrote:

> Hi,
>
> I'm trying to upgrade to version 2.19, but when I try to test it I keep
> getting this error:
> Exception in thread "main" java.lang.RuntimeException: Failed to construct
> instance from factory method DataflowRunner#fromOptions(interface
> org.apache.beam.sdk.options.PipelineOptions)
> at
> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:224)
> at org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:155)
> at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:55)
> at org.apache.beam.sdk.Pipeline.create(Pipeline.java:147)
> ...
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:214)
> ... 8 more
> Caused by: java.lang.IllegalArgumentException: No files to stage has been
> found.
> at
> org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:281)
> ... 13 more
>
> Process finished with exit code 1
>
> Additional information:
> - The same pipeline works well with 2.18
> - The problem only occurs on Windows machines, on Linux it works perfectly
> fine
> - After digging into the code, the library seems to think that my local
> jars are remote jars (the remote jar regex matches to windows paths as
> well):
>   - nonapi.io.github.classgraph.fastzipfilereader.NestedJarHandler:257
>   - io.github.classgraph.ClasspathElementZip:163
>
> Did anyone else run into this? How did you solve it?
> --
>
> Peter Farkas
>
> Lead Data Architect
>
>
> www.aliz.ai
>
> LinkedIn | Facebook
> | Blog
> 
>
> 
>


Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-02-26 Thread Ismaël Mejía
Since it was merged yesterday you can test with the 2.20.0-SNAPSHOT until
the first candidate is out.

On Wed, Feb 26, 2020 at 4:37 PM Kaymak, Tobias 
wrote:

> If I am not running in detached mode (so that my pipeline starts) I am
> unable to Stop it in the webinterface. The only option available is to
> cancel it. Is this expected?
>
> [image: Screenshot 2020-02-26 at 16.34.08.png]
>
> On Wed, Feb 26, 2020 at 4:16 PM Kaymak, Tobias 
> wrote:
>
>> Hello,
>>
>> we fixed the issue and are ready to test :) - is there a RC already
>> available?
>>
>> Best,
>> Tobi
>>
>> On Wed, Feb 26, 2020 at 12:59 PM Kaymak, Tobias 
>> wrote:
>>
>>> Hello,
>>>
>>> happy to help testing! I am currently fixing a networking issue between
>>> our dev cluster for integration tests and the Kafka it is consuming from.
>>> After that I would be ready to spin it up and test
>>>
>>> Best,
>>> Tobi
>>>
>>> On Mon, Feb 24, 2020 at 10:13 PM Maximilian Michels 
>>> wrote:
>>>
>>>> Thank you for reporting / filing / collecting the issues.
>>>>
>>>> There is a fix pending: https://github.com/apache/beam/pull/10950
>>>>
>>>> As for the upgrade issues, the 1.8 and 1.9 upgrade is trivial. I will
>>>> check out the Flink 1.10 PR tomorrow.
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> On 24.02.20 09:26, Ismaël Mejía wrote:
>>>> > We are cutting the release branch for 2.20.0 next wednesday, so not
>>>> sure
>>>> > if these tickets will make it, but hopefully.
>>>> >
>>>> > For ref,
>>>> > BEAM-9295 Add Flink 1.10 build target and Make FlinkRunner compatible
>>>> > with Flink 1.10
>>>> > BEAM-9299 Upgrade Flink Runner to 1.8.3 and 1.9.2
>>>> >
>>>> > In any case if you have cycles to help test any of the related
>>>> tickets
>>>> > PRs that would help too.
>>>> >
>>>> >
>>>> > On Mon, Feb 24, 2020 at 8:47 AM Kaymak, Tobias <
>>>> tobias.kay...@ricardo.ch
>>>> > <mailto:tobias.kay...@ricardo.ch>> wrote:
>>>> >
>>>> > Hi Kyle,
>>>> >
>>>> > thank you for creating the JIRA ticket, I think my best option
>>>> right
>>>> > now is to wait for a Beam version that is running on Flink 1.10
>>>> then
>>>> > - unless there is a new Beam release around the corner :)
>>>> >
>>>> > Best,
>>>> > Tobi
>>>> >
>>>> > On Thu, Feb 20, 2020 at 11:52 PM Kyle Weaver >>> > <mailto:kcwea...@google.com>> wrote:
>>>> >
>>>> > Hi Tobi,
>>>> >
>>>> > This seems like a bug with Beam 2.19. I filed
>>>> > https://issues.apache.org/jira/browse/BEAM-9345 to track the
>>>> issue.
>>>> >
>>>> >  > What puzzles me is that the session cluster should be
>>>> allowed
>>>> > to have multiple environments in detached mode - or am I
>>>> wrong?
>>>> >
>>>> > It looks like that check is removed in Flink 1.10:
>>>> > https://issues.apache.org/jira/browse/FLINK-15201
>>>> >
>>>> > Thanks for reporting.
>>>> > Kyle
>>>> >
>>>> > On Thu, Feb 20, 2020 at 4:10 AM Kaymak, Tobias
>>>> > mailto:tobias.kay...@ricardo.ch>>
>>>> wrote:
>>>> >
>>>> > Hello,
>>>> >
>>>> > I am trying to upgrade from a Flink session cluster 1.8 to
>>>> > 1.9 and from Beam 2.16.0 to 2.19.0.
>>>> > Everything went quite smoothly, the local runner and the
>>>> > local Flink runner work flawlessly.
>>>> >
>>>> > However when I:
>>>> >1. Generate a Beam jar for the FlinkRunner via maven
>>>> (mvn
>>>> > package -PFlinkRunner)
>>>> >2. Glue that into a Flink 1.9 docker image
>>>> >3. Start the image as a Standalone Session Cluster
>>>&

Re: Unbounded input join Unbounded input then write to Bounded Sink

2020-02-25 Thread Ismaël Mejía
Hello,

Sinks are not bounded or unbounded, they are just normal ParDos (DoFns)
that behave consistently with the pipeline data, so if your pipeline deals
with unbounded data the sink will write that data correspondingly (when
windows close, triggers fire, etc., so the data is ready to be emitted).

One pattern that was reported with a similar pipeline, but with JdbcIO as
the sink, was connection exhaustion of the database. This happened because
the pipeline was producing lots of windows / materializations that ended up
creating thousands of DoFn writes to the sink, and every write to the sink
was requesting a new connection, so the database got overwhelmed. This was
fixed for JdbcIO via a pool of connections and smarter instantiation. If
this is happening on HCatalogIO it is an issue that we Beam devs should
fix, but this should not be a concern for your pipeline design.
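
For illustration, a minimal sketch (not the actual JdbcIO code) of the
connection-reuse pattern described above: the client is created once per
DoFn instance in @Setup and shared across bundles and windows, instead of
opening a new connection for every windowed write. The SinkClient class
and its methods are hypothetical placeholders for whatever client your
sink uses.

import org.apache.beam.sdk.transforms.DoFn;

class WriteToSinkFn extends DoFn<String, Void> {
  private transient SinkClient client; // hypothetical client, created on the worker

  @Setup
  public void setup() {
    // One connection per DoFn instance, reused across bundles and windows.
    client = SinkClient.connect("sink-host:1234");
  }

  @ProcessElement
  public void processElement(@Element String record) {
    client.insert(record); // reuse the same connection for every element
  }

  @Teardown
  public void teardown() {
    if (client != null) {
      client.close();
    }
  }
}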

Regards,
Ismaël


On Tue, Feb 25, 2020 at 3:30 AM rahul patwari 
wrote:

> Hi Kenn, Rui,
>
> The pipeline that we are trying is exactly what Kenn has mentioned above
> i.e.
> Read From Kafka => Apply Fixed Windows of 1 Min => SqlTransform => Write
> to Hive using HcatalogIO
>
> We are interested in understanding the behaviour when the source is
> Unbounded and Sink is bounded as this pipeline is being used for ETL.
> Does the same pipeline work for any other Bounded Sink, instead of
> HcatalogIO?
> What are the features required to be supported by the Bounded Sink, for it
> to be used along with an Unbounded Source?
>
> Are there any best practices (or) pipeline patterns for these kinds of
> pipelines? Will there be any performance hits?
>
> Regards,
> Rahul
>
> On Tue, Feb 25, 2020 at 6:57 AM Rui Wang  wrote:
>
>> Sorry please remove " .apply(Window.into(FixedWindows.of(1 minute))"
>> from the query above.
>>
>>
>>
>> -Rui
>>
>> On Mon, Feb 24, 2020 at 5:26 PM Rui Wang  wrote:
>>
>>> I see. So I guess I wasn't fully understand the requirement:
>>>
>>> Do you want to have a 1-min window join on two unbounded sources and
>>> write to sink when the window closes ? Or there is an extra requirement
>>> such that you also want to write to sink every minute per window?
>>>
>>> For the former, you can do it by SQL:
>>>
>>> pipeline.apply(KafkaIO.read() ... )
>>> .apply(Window.into(FixedWindows.of(1 minute))
>>> .apply(SqlTransform(
>>>   "SELECT ... FROM
>>> (select TUMBLE_START() as window_start, * FROM
>>> stream1 GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE))  as table_a
>>>   JOIN
>>> (select TUMBLE_START() as window_start, * FROM
>>> stream2 GROUP BY TUMBLE(ts, INTERVAL '1' MINUTE))  as table_b
>>>on table_a.window_start = table_b.window_start ...")
>>> .apply(HCatalogIO.write() ...)
>>>
>>> But as Kenneth mentioned HCatalogIO might not work as expected.
>>>
>>>
>>>
>>> For the latter, the mixed Java and SQL pipeline won't help you.
>>>
>>>
>>>
>>> -Rui
>>>
>>> On Mon, Feb 24, 2020 at 5:04 PM Kenneth Knowles  wrote:
>>>
 I think actually it depends on the pipeline. You cannot do it all in
 SQL, but if you mix Java and SQL I think you can do this. If you write 
 this:

 pipeline.apply(KafkaIO.read() ... )
 .apply(Window.into(FixedWindows.of(1 minute))
 .apply(SqlTransform("SELECT ... FROM stream1 JOIN stream2 ...")
 .apply(HCatalogIO.write() ...)

 This should apply the SQL on each window. When the SQL does not do any
 windowing, it is required to be a "per-window" SQL execution. That is the
 spec for SqlTransform. If that does not work, please report your 
 experience.

 But the SQL semantics do not require waiting. Today the
 stream-to-stream join will do a CoGroupByKey so it will wait. But SQL may
 in the future adopt a better join for this case that can output records
 with lower latency.

 It may be a bigger question whether HCatalogIO.write() has all the
 knobs you would like.

 Kenn

 On Mon, Feb 24, 2020 at 12:14 PM Rui Wang  wrote:

> SQL does not support such joins with your requirement: write to sink
> after every 1 min after window closes.
>
> You might can use state and timer API to achieve your goal.
>
>
>
> -Rui
>
> On Mon, Feb 24, 2020 at 9:50 AM shanta chakpram <
> shantachakp...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to join inputs from Unbounded Sources then write to
>> Bounded Sink.
>> The pipeline I'm trying is:
>> Kafka Sources -> SqlTransform -> HCatalogIO  Sink
>>  And, a FixedWindow of 1 minute duration is applied.
>>
>> I'm expecting the inputs from unbounded sources joined within the
>> current window to be written to the HCatalogIO Sink after every 1 min i.e
>> after each window interval.
>>
>> Can someone please tell if this is a valid scenario and what is the
>> 

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-02-24 Thread Ismaël Mejía
We are cutting the release branch for 2.20.0 next Wednesday, so not sure if
these tickets will make it, but hopefully.

For ref,
BEAM-9295 Add Flink 1.10 build target and Make FlinkRunner compatible with
Flink 1.10
BEAM-9299 Upgrade Flink Runner to 1.8.3 and 1.9.2

In any case, if you have cycles to help test any of the related tickets'
PRs, that would help too.


On Mon, Feb 24, 2020 at 8:47 AM Kaymak, Tobias 
wrote:

> Hi Kyle,
>
> thank you for creating the JIRA ticket, I think my best option right now
> is to wait for a Beam version that is running on Flink 1.10 then - unless
> there is a new Beam release around the corner :)
>
> Best,
> Tobi
>
> On Thu, Feb 20, 2020 at 11:52 PM Kyle Weaver  wrote:
>
>> Hi Tobi,
>>
>> This seems like a bug with Beam 2.19. I filed
>> https://issues.apache.org/jira/browse/BEAM-9345 to track the issue.
>>
>> > What puzzles me is that the session cluster should be allowed to have
>> multiple environments in detached mode - or am I wrong?
>>
>> It looks like that check is removed in Flink 1.10:
>> https://issues.apache.org/jira/browse/FLINK-15201
>>
>> Thanks for reporting.
>> Kyle
>>
>> On Thu, Feb 20, 2020 at 4:10 AM Kaymak, Tobias 
>> wrote:
>>
>>> Hello,
>>>
>>> I am trying to upgrade from a Flink session cluster 1.8 to 1.9 and from
>>> Beam 2.16.0 to 2.19.0.
>>> Everything went quite smoothly, the local runner and the local Flink
>>> runner work flawlessly.
>>>
>>> However when I:
>>>   1. Generate a Beam jar for the FlinkRunner via maven (mvn package
>>> -PFlinkRunner)
>>>   2. Glue that into a Flink 1.9 docker image
>>>   3. Start the image as a Standalone Session Cluster
>>>
>>> When I try to launch the first pipeline I get the following exception
>>>
>>> org.apache.flink.client.program.ProgramInvocationException: The main
>>> method caused an error: Failed to construct instance from factory method
>>> FlinkRunner#fromOptions(interface
>>> org.apache.beam.sdk.options.PipelineOptions)
>>> at
>>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:593)
>>> at
>>> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438)
>>> at
>>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>>> at
>>> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>>> at
>>> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>>> at
>>> org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>>> at
>>> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
>>> at
>>> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
>>> at
>>> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>>> at
>>> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
>>> Caused by: java.lang.RuntimeException: Failed to construct instance from
>>> factory method FlinkRunner#fromOptions(interface
>>> org.apache.beam.sdk.options.PipelineOptions)
>>> at
>>> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:224)
>>> at
>>> org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:155)
>>> at
>>> org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:55)
>>> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:309)
>>> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:301)
>>> at
>>> ch.ricardo.di.beam.KafkaToBigQuery.main(KafkaToBigQuery.java:180)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at
>>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576)
>>> ... 9 more
>>> Caused by: java.lang.reflect.InvocationTargetException
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at
>>> org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:214)
>>> ... 19 more
>>> Caused by: org.apache.flink.api.common.InvalidProgramException: Multiple
>>> environments cannot be created in detached mode
>>> at
>>> org.apache.flink.client.program.ContextEnvironmentFactory.createExecutionEnvironment(ContextEnvironmentFactory.java:67)
>>> at java.util.Optional.map(Optional.java:215)
>>> at
>>> 

Re: Running a Beam Pipeline on GCP Dataproc Flink Cluster

2020-02-07 Thread Ismaël Mejía
+user@beam.apache.org 


On Fri, Feb 7, 2020 at 12:54 AM Xander Song  wrote:

> I am attempting to run a Beam pipeline on a GCP Dataproc Flink cluster. I
> have followed the instructions at this repo
> 
>  to
> create a Flink cluster on Dataproc using an initialization action. However,
> the resulting cluster uses version 1.5.6 of Flink, and my project requires
> a more recent version (version 1.7, 1.8, or 1.9) for compatibility with
> Beam .
>
> Inside of the flink.sh script in the linked repo, there is a line for
> installing Flink from a snapshot URL instead of apt
> .
> Is this the correct mechanism for installing a different version of Flink
> using the initialization script? If so, how is it meant to be used?
>
> Thank you in advance.
>


[CVE-2020-1929] Apache Beam MongoDB IO connector disables certificate trust verification

2020-01-15 Thread Ismaël Mejía
CVE-2020-1929 Apache Beam MongoDB IO connector disables certificate trust
verification

Severity: Major
Vendor: The Apache Software Foundation

Versions Affected:
Apache Beam 2.10.0 to 2.16.0

Description:
The Apache Beam MongoDB connector in versions 2.10.0 to 2.16.0 has an
option to
disable SSL trust verification. However this configuration is not respected
and
the certificate verification disables trust verification in every case. This
exclusion also gets registered globally which disables trust checking for
any
code running in the same JVM.

Mitigation:
Users of the affected versions should apply one of the following
mitigations:
- Upgrade to Apache Beam 2.17.0 or later

Acknowledgements:
This issue was reported (and fixed) by Colm Ó hÉigeartaigh.


Re: Beam KinesisIO Migration V1 to V2

2019-10-01 Thread Ismaël Mejía
Thanks a lot Cam for bringing this document to the mailing list (I left
some comments there). There was a recent proposal doc about supporting
async on Beam, so you may be interested in taking a look at the
evolution of that [1]. It is definitely interesting for the
implications it has for IO authors.

I am piggybacking a bit on this email to make other members of the
community aware of the ongoing work by Cam. He has been a contributor for
a while on the Java IO front; he contributed the IO for DynamoDB, and due
to his work there he started working on the update of all the Amazon
Web Services (AWS) related IOs (SNS, SQS, etc.) to use the new AWS SDK
(v2), which is how he got to the Kinesis migration. By the way, the AWS IO
migration is an ongoing effort, so if other people are interested in
contributing to this area, do not hesitate to contact him (or me).
See BEAM-7555 [2] for more info. (Sorry for the ad :P)

Ismaël

[1] 
https://lists.apache.org/thread.html/26d0b7b4f89dcc265fd5deb9cdca3f3bc6daa7cdf2fe56e09b5e7a36@%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-7555


On Mon, Sep 30, 2019 at 2:02 PM Cam Mach  wrote:
>
> Hello Beam Dev,
>
> I have discussed with a couple of Beam dev regarding this topic. We found 
> something interesting in the new AWS Kinesis SDK and Libraries V2, so like to 
> propose a design for this migration.
>
> Here is the design doc: 
> https://docs.google.com/document/d/1XeIVbiDHBReZY8rEI2OWA3cTEQuaR7RPdwGAup6S1DM
>
> I would love to hear from you, your feedback and comments
>
> Thanks,
> Cam
>
>


Re: Beam KinesisIO Migration V1 to V2

2019-10-01 Thread Ismaël Mejía
+dev

On Tue, Oct 1, 2019 at 8:35 PM Ismaël Mejía  wrote:
>
> Thanks a lot Cam for bringing this document to the mailing list (I let
> some comments there). There was a recent proposal doc about supporting
> async on Beam so you can be interested on taking a look at the
> evolution of that [1]. It is definitely interesting for the
> implications for IO authors.
>
> I piggyback a bit this email to make other members of the community
> aware of the ongoing work by Cam. He has been a contributor for a
> while in the Java IO front, he contributed the IO for DynamoDB and due
> to his work there he started working on the update of all the Amazon
> Web Services (AWS) related IOs (SNS, SQS, etc) to use the new AWS SDK
> (v2), that’s how he got into the Kinesis migration point, btw AWS IO
> migration is an ongoing effort so if other people are interested on
> contributing into this area, do not hesitate to contact him (or me).
> BEAM-7555 [2] for more info. (Sorry for the ad :P)
>
> Ismaël
>
> [1] 
> https://lists.apache.org/thread.html/26d0b7b4f89dcc265fd5deb9cdca3f3bc6daa7cdf2fe56e09b5e7a36@%3Cdev.beam.apache.org%3E
> [2] https://issues.apache.org/jira/browse/BEAM-7555
>
>
> On Mon, Sep 30, 2019 at 2:02 PM Cam Mach  wrote:
> >
> > Hello Beam Dev,
> >
> > I have discussed with a couple of Beam dev regarding this topic. We found 
> > something interesting in the new AWS Kinesis SDK and Libraries V2, so like 
> > to propose a design for this migration.
> >
> > Here is the design doc: 
> > https://docs.google.com/document/d/1XeIVbiDHBReZY8rEI2OWA3cTEQuaR7RPdwGAup6S1DM
> >
> > I would love to hear from you, your feedback and comments
> >
> > Thanks,
> > Cam
> >
> >


Re: [Python] Read Hadoop Sequence File?

2019-07-03 Thread Ismaël Mejía
That's great. I can help whenever you need. We just need to choose its
destination. Both the `hadoop-format` and `hadoop-file-system` modules
are good candidates; I would even feel inclined to put it in its own
module `sdks/java/extensions/sequencefile` to make it easier for end
users to discover.

A thing to consider is the SeekableByteChannel adapters; we can move
those into hadoop-common if needed and refactor the modules to share
code. It is worth taking a look at
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
to see if some of it could be useful.
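
In the meantime, a rough sketch of the Java-side workaround mentioned
further down this thread: reading a Hadoop SequenceFile through
HadoopFormatIO until a dedicated sequencefile extension exists. The
key/value classes and the input path are assumptions; reading gs:// paths
this way also requires the GCS Hadoop connector on the classpath.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class ReadSequenceFileExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Tell HadoopFormatIO which InputFormat and key/value classes to use.
    conf.set("mapreduce.job.inputformat.class", SequenceFileInputFormat.class.getName());
    conf.set("key.class", Text.class.getName());
    conf.set("value.class", BytesWritable.class.getName());
    conf.set("mapreduce.input.fileinputformat.inputdir", "gs://my-bucket/sequence-files");

    Pipeline p = Pipeline.create();
    PCollection<KV<Text, BytesWritable>> records =
        p.apply(HadoopFormatIO.<Text, BytesWritable>read().withConfiguration(conf));
    // ... further transforms on `records` ...
    p.run().waitUntilFinish();
  }
}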

On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein  wrote:
>
> Hi all,
>
> I wrote those classes with the intention of upstreaming them to Beam. I can 
> try to make some time this quarter to clean them up. I would need a bit of 
> guidance from a beam expert in how to make them coexist with HadoopFormatIO 
> though.
>
>
> On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis  wrote:
>>
>> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
>>
>> Solomon Duskis | Google Cloud clients | sdus...@google.com | 914-462-0531
>>
>>
>> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía  wrote:
>>>
>>> (Adding dev@ and Solomon Duskis to the discussion)
>>>
>>> I was not aware of these thanks for sharing David. Definitely it would
>>> be a great addition if we could have those donated as an extension in
>>> the Beam side. We can even evolve them in the future to be more FileIO
>>> like. Any chance this can happen? Maybe Solomon and his team?
>>>
>>>
>>>
>>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek  wrote:
>>> >
>>> > Hi, you can use SequenceFileSink and Source, from a BigTable client. 
>>> > Those works nice with FileIO.
>>> >
>>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>>> >
>>> > It would be really cool to move these into Beam, but that's up to 
>>> > Googlers to decide, whether they want to donate this.
>>> >
>>> > D.
>>> >
>>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan 
>>> >  wrote:
>>> >>
>>> >> It's not outside the realm of possibilities. For now I've created an 
>>> >> intermediary step of a hadoop job that converts from sequence to text 
>>> >> file.
>>> >>
>>> >> Looking into better options.
>>> >>
>>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath  
>>> >> wrote:
>>> >>>
>>> >>> Java SDK has a HadoopInputFormatIO using which you should be able to 
>>> >>> read Sequence files: 
>>> >>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>> >>> I don't think there's a direct alternative for this for Python.
>>> >>>
>>> >>> Is it possible to write to a well-known format such as Avro instead of 
>>> >>> a Hadoop specific format which will allow you to read from both 
>>> >>> Dataproc/Hadoop and Beam Python SDK ?
>>> >>>
>>> >>> Thanks,
>>> >>> Cham
>>> >>>
>>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan 
>>> >>>  wrote:
>>> >>>>
>>> >>>> That's a pretty big hole for a missing source/sink when looking at 
>>> >>>> transitioning from Dataproc to Dataflow using GCS as storage buffer 
>>> >>>> instead of a traditional hdfs.
>>> >>>>
>>> >>>> From what I've been able to tell from source code and documentation, 
>>> >>>> Java is able to but not Python?
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Shannon
>>> >>>>
>>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath 
>>> >>>>  wrote:
>>> >>>>>
>>> >>>>> I don't think we have a source/sink for reading Hadoop sequence 
>>> >>>>> files. Y

Re: [Python] Read Hadoop Sequence File?

2019-07-02 Thread Ismaël Mejía
(Adding dev@ and Solomon Duskis to the discussion)

I was not aware of these, thanks for sharing, David. It would definitely
be a great addition if we could have those donated as an extension on
the Beam side. We can even evolve them in the future to be more
FileIO-like. Any chance this can happen? Maybe Solomon and his team?



On Tue, Jul 2, 2019 at 9:39 AM David Morávek  wrote:
>
> Hi, you can use SequenceFileSink and Source, from a BigTable client. Those 
> works nice with FileIO.
>
> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>
> It would be really cool to move these into Beam, but that's up to Googlers to 
> decide, whether they want to donate this.
>
> D.
>
> On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan  
> wrote:
>>
>> It's not outside the realm of possibilities. For now I've created an 
>> intermediary step of a hadoop job that converts from sequence to text file.
>>
>> Looking into better options.
>>
>> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath  wrote:
>>>
>>> Java SDK has a HadoopInputFormatIO using which you should be able to read 
>>> Sequence files: 
>>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>> I don't think there's a direct alternative for this for Python.
>>>
>>> Is it possible to write to a well-known format such as Avro instead of a 
>>> Hadoop specific format which will allow you to read from both 
>>> Dataproc/Hadoop and Beam Python SDK ?
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan  
>>> wrote:

 That's a pretty big hole for a missing source/sink when looking at 
 transitioning from Dataproc to Dataflow using GCS as storage buffer 
 instead of a traditional hdfs.

 From what I've been able to tell from source code and documentation, Java 
 is able to but not Python?

 Thanks,
 Shannon

 On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath  
 wrote:
>
> I don't think we have a source/sink for reading Hadoop sequence files. 
> Your best bet currently will probably be to use FileSystem abstraction to 
> create a file from a ParDo and read directly from there using a library 
> that can read sequence files.
>
> Thanks,
> Cham
>
> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan 
>  wrote:
>>
>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google 
>> Cloud Storage via a " gs://bucket/link/SequenceFile-* " via the Python 
>> SDK.
>>
>> I cannot locate any good adapters for this, and the one Hadoop 
>> Filesystem reader seems to only read from a "hdfs://" url.
>>
>> I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam 
>> pipelines with our current Hadoop Pipelines.
>>
>> Is this a feature that is supported or will be supported in the future?
>> Does anyone have any good suggestions for this that is performant?
>>
>> I'd also like to be able to write back out to a SequenceFile if possible.
>>
>> Thanks!
>>


Re: gRPC method to get a pipeline definition?

2019-06-26 Thread Ismaël Mejía
+1 don't hesitate to create a JIRA + PR. You may be interested in [1].
This is a simple util class that takes a proto pipeline object and
converts it into its graph representation in .dot format. You can
easily reuse the code or the idea as a first approach to show what the
pipeline is about.

[1] 
https://github.com/apache/beam/blob/2df702a1448fa6cbd22cd225bf16e9ffc4c82595/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/renderer/PortablePipelineDotRenderer.java#L29
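
A small sketch of how the linked utility could be wired up to dump a
portable pipeline as Graphviz .dot text. The static toDotString entry
point is an assumption on my side; check the linked source for the exact
method name and signature.

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.beam.runners.core.construction.renderer.PortablePipelineDotRenderer;

public class RenderPipelineDot {
  public static void main(String[] args) throws Exception {
    // A pipeline proto, e.g. obtained via JobService.Prepare or dumped to disk.
    RunnerApi.Pipeline pipelineProto =
        RunnerApi.Pipeline.parseFrom(Files.readAllBytes(Paths.get(args[0])));
    String dot = PortablePipelineDotRenderer.toDotString(pipelineProto); // assumed entry point
    System.out.println(dot); // pipe into `dot -Tpng` to render the graph
  }
}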

On Wed, Jun 26, 2019 at 10:27 AM Robert Bradshaw  wrote:
>
> Yes, offering a way to get a pipeline from the job service directly
> would be a completely reasonable thing to do (and likely not hard at
> all). We welcome pull requests.
>
> Alternative UIs built on top of this abstraction would be an
> interesting project to explore.
>
> On Wed, Jun 26, 2019 at 8:44 AM Chad Dombrova  wrote:
> >
> > Hi all,
> > I've been poking around the beam source code trying to determine whether 
> > it's possible to get the definition of a pipeline via beam's gPRC-based 
> > services.   It looks like the message types are there for describing a 
> > Pipeline but as far as I can tell, they're only used by 
> > JobService.Prepare() for submitting a new job.
> >
> > If I were to create a PR to add support for a JobService.GetPipeline() 
> > method, would that be interesting to others?  Is it technically feasible?  
> > i.e. is the pipeline definition readily available to the job service after 
> > the job has been prepared and sent to the runner?
> >
> > Bigger picture, what I'm thinking about is writing a UI that's designed to 
> > view and monitor Beam pipelines via the portability abstraction, rather 
> > than using the (rather clunky) UIs that come with runners like Flink and 
> > Dataflow.  My thinking is that using beam's abstractions would future proof 
> > the UI by allowing it to work with any portable runner.  Right now it's 
> > just an idea, so I'd love to know what others think of this.
> >
> > thanks!
> > -chad
> >


Re: [ANNOUNCE] Spark portable runner (batch) now available for Java, Python, Go

2019-06-18 Thread Ismaël Mejía
I have been thrilled to watch this happen from the front row.

Thanks a lot Kyle. Excellent work!


On Mon, Jun 17, 2019 at 9:15 PM Ankur Goenka  wrote:
>
> Thanks Kyle!
> This is a great addition towards supporting portability on Beam.
>
> On Mon, Jun 17, 2019 at 9:21 AM Ahmet Altay  wrote:
>>
>> Thank you Kyle! This is great news :)
>>
>> On Mon, Jun 17, 2019 at 6:40 AM Andres Angel <
ingenieroandresan...@gmail.com> wrote:
>>>
>>> Really great achievement!!! congrats.
>>>
>>> On Mon, Jun 17, 2019 at 7:49 AM Robert Bradshaw 
wrote:

 Excellent work, very excited to see this!

 On Fri, Jun 14, 2019 at 11:02 PM Kyle Weaver 
wrote:
>
> Hello Beamers,
>
> I'm happy to announce that the portable Spark runner is now mostly
feature-complete [0] for BATCH processing (STREAMING is not yet available).
This means you can run your new or existing Beam Python and Go pipelines
using Apache Spark as the underlying execution engine.
>
> "Portable," you ask? Essentially, it shares a lot of the same code as
the existing Spark runner, but also leverages Beam's portability APIs [1]
to add Python and Go support, in addition to Java (note that the Go SDK
itself is still considered experimental [2]).
>
> Instructions on how to run pipelines on the portable Spark runner are
available on the Beam website [3].
>
> While we are passing Beam's fairly comprehensive test suites
[4][5][6], the portable Spark runner has yet to be tested in a production
environment, so please feel free to file a Jira and tag me if you have
issues or feature requests (username: ibzib).
>
> Thanks,
> Kyle
>
> [0] https://s.apache.org/apache-beam-portability-support-table
> [1] https://beam.apache.org/roadmap/portability/
> [2]
https://lists.apache.org/thread.html/8f729da2d3009059d7a8b2d8624446be161700dcfa953939dd3530c6@%3Cdev.beam.apache.org%3E
> [3] https://beam.apache.org/documentation/runners/spark/
> [4] https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch
> [5] https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/
> [6] https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
kcwea...@google.com | +1650203


Re: AvroIO read SpecificRecord with custom reader schema?

2019-06-14 Thread Ismaël Mejía
> There is an alternative approach with getting PCollection<Row> using AvroIO
> and then converting Row to SpecificRecord using new Schemas APIs

Yes, but this will have a higher runtime, no? Maybe worth a JIRA + PR
for this feature.
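
In the meantime, a hedged sketch of the workaround Chamikara suggests
further down the thread: read with the custom reader schema as
GenericRecords, then convert to the compiled class in a ParDo. MyRecord,
its fields, the reader schema constant, and the input path are all
hypothetical placeholders.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

public class ReadWithReaderSchema {
  // Hypothetical subset of MyRecord's full schema, as an Avro JSON string.
  private static final String READER_SCHEMA_JSON = "{ ... }";

  public static void main(String[] args) {
    Schema readerSchema = new Schema.Parser().parse(READER_SCHEMA_JSON);

    Pipeline p = Pipeline.create();
    PCollection<MyRecord> records =
        p.apply(AvroIO.readGenericRecords(readerSchema).from("gs://bucket/data/*.avro"))
            .apply(MapElements.via(new SimpleFunction<GenericRecord, MyRecord>() {
              @Override
              public MyRecord apply(GenericRecord input) {
                MyRecord out = new MyRecord();         // hypothetical generated class
                out.setId(input.get("id").toString()); // copy only the projected fields
                return out;
              }
            }))
            .setCoder(AvroCoder.of(MyRecord.class));
    p.run().waitUntilFinish();
  }
}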

On Fri, Jun 14, 2019 at 10:15 AM Gleb Kanterov  wrote:
>
> There is an alternative approach with getting PCollection<Row> using AvroIO
> and then converting Row to SpecificRecord using new Schemas APIs.
>
> On Fri, Jun 14, 2019 at 1:58 AM Chamikara Jayalath  
> wrote:
>>
>> I see. I don't think this is supported by AvroIO currently so your best bet 
>> will probably be to do a GenericRecord to SpecificRecord conversion in a 
>> ParDo that follows the read transform.
>>
>> On Thu, Jun 13, 2019 at 4:32 PM Neville Li  wrote:
>>>
>>> That gives me a GenericRecord which is not type safe.
>>> In my case I have the compiled SpecificRecord class i.e. MyRecord 
>>> available, but would like to pass in a schema other than 
>>> MyRecord.getClassSchema() to say populate a subset of the fields.
>>>
>>> On Thu, Jun 13, 2019 at 6:18 PM Chamikara Jayalath  
>>> wrote:

 Does AvroIO.readGenericRecords() work ?
 https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L333

 Thanks,
 Cham

 On Thu, Jun 13, 2019 at 1:46 PM Neville Li  wrote:
>
> Hi,
>
> Is it just me or is there no way for AvroIO to read SpecificRecords with 
> a custom reader schema?
>
> AvroIO.read(Class recordClass) will use the schema of T and there's no 
> way to override it.
>
> Cheers,
> Neville
>
>
>
> --
> Cheers,
> Gleb


Re: Why is my RabbitMq message never acknowledged ?

2019-06-14 Thread Ismaël Mejía
Is there a JIRA for this? If this solves an issue for multiple users,
maybe it is worth integrating the patch.
Would you be up for doing this, Augustin?

On Fri, Jun 14, 2019 at 10:35 AM Augustin Lafanechere
 wrote:
>
> Hello Nicolas,
> I also encountered the same problem.
> RabbitMQIo indeed acknowledges messages on finalizeCheckpoint calls but this 
> was not clear to me on when this method is called because no message were ack 
> on pipeline runtime.
> I finally decided to implement a patch of the RabbitMqIO to set auto ack of 
> received messages, this is fine for my current use case but is not the safest 
> way of consuming messages.
>
> If someone has a cleaner solution I’ll be happy to hear it.
>
> Augustin
>
>
>
>
> > On 13 June 2019 at 15:47, Nicolas Delsaux  wrote:
> >
> > I'm having big troubles reading data from RabbitMQ.
> >
> > To understand my troubles, i've simplified my previous code to the extreme :
> >
> >
> > Pipeline pipeline = Pipeline.create(options);
> >
> > PCollection wat = (PCollection) 
> > pipeline.apply("read_from_rabbit",
> > RabbitMqIO.read()
> > .withUri(options.getRabbitMQUri())
> > .withQueue(options.getRabbitMQQueue())
> > )
> > .apply("why not", RabbitMqIO.write()
> > .withQueue("written_in_rabbit")
> > .withQueueDeclare(true)
> > .withUri(options.getRabbitMQUri())
> > )
> >
> >
> > So if I put a simple message in my input queue, it should be "moved" 
> > (quotes are here since new message is not the original one, but has same 
> > content) into my "written_in_rabbit" message.
> >
> > Unfortunatly, for reasons I don't understand, the original message stays in 
> > input queue.
> >
> > It seems to be due to the fact that 
> > RabbitMQCheckpointMark#finalizeCheckpoint() method is never called. So 
> > where is the finalizeCheckpoint method called ?
> >
> > And how can I understand why this method is never called in my case ?
> >
> > Thanks
> >
> >
>


Re: Beam's HCatalogIO for Hive Parquet Data

2019-05-02 Thread Ismaël Mejía
Hello,

Support for Parquet in HCatalog (Hive) started in version 3.0.0:
HIVE-8838 Support Parquet through HCatalog
https://issues.apache.org/jira/browse/HIVE-8838?attachmentSortBy=dateTime

The current version used in Beam is 2.1.0. I filed a new JIRA to tackle it:
BEAM-7209 - Update HCatalogIO to use HCatalog version 3
https://issues.apache.org/jira/browse/BEAM-7209

Thanks for reporting.
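
For anyone who wants to reproduce the report, a minimal sketch of reading
a Hive table through HCatalogIO; the metastore URI, database, and table
name are placeholders.

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hcatalog.HCatalogIO;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hive.hcatalog.data.HCatRecord;

public class ReadHiveParquetTable {
  public static void main(String[] args) {
    Map<String, String> configProperties = new HashMap<>();
    configProperties.put("hive.metastore.uris", "thrift://metastore-host:9083");

    Pipeline p = Pipeline.create();
    PCollection<HCatRecord> records =
        p.apply(HCatalogIO.read()
            .withConfigProperties(configProperties)
            .withDatabase("default")         // database containing the table
            .withTable("my_parquet_table")); // the Parquet-backed table from the report
    // With the HCatalog 2.1.0 dependency the Parquet-backed rows come back as
    // NULLs, as described above; BEAM-7209 tracks moving to HCatalog 3.
    p.run().waitUntilFinish();
  }
}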

On Thu, May 2, 2019 at 3:26 PM  wrote:
>
> Hi,
>
> Am trying to read a table in CDH's Hive which is in Parquet Format.
> I am able to see the data using beeline.
> However, using HCatalogIO, all I get during read are NULL values for all rows.
>
> I see this problem only with Parquet Format. Other File Formats work as 
> expected.
> Has someone else encountered this issue and knows how to solve this problem?
>
> Regards,
> Utkarsh


Re: Couchbase

2019-04-08 Thread Ismaël Mejía
Hello,

Guobao is working on this, but he is OOO at least until the end of next
week, so if you can wait it will be available 'soon'.
If you need this urgently and you decide to write your own write
implementation, it would be a valuable contribution that I will be happy
to review.
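
If you go the "own implementation" route, a simple ParDo is probably
enough for the upsert-JsonObjects case. Below is a minimal sketch using
the Couchbase Java SDK 2.x; the cluster address, bucket name, and the
assumption that each JSON document carries an "id" field are placeholders.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;
import org.apache.beam.sdk.transforms.DoFn;

class WriteToCouchbaseFn extends DoFn<String, Void> {
  private transient Cluster cluster;
  private transient Bucket bucket;

  @Setup
  public void setup() {
    cluster = CouchbaseCluster.create("couchbase-host"); // one client per DoFn instance
    bucket = cluster.openBucket("my-bucket");
  }

  @ProcessElement
  public void processElement(@Element String json) {
    JsonObject content = JsonObject.fromJson(json);
    String id = content.getString("id"); // assumes each document carries its own id
    bucket.upsert(JsonDocument.create(id, content));
  }

  @Teardown
  public void teardown() {
    if (cluster != null) {
      cluster.disconnect();
    }
  }
}

You would apply it with ParDo.of(new WriteToCouchbaseFn()) on a
PCollection of JSON strings.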

Regards,
Ismaël



On Mon, Apr 8, 2019 at 2:16 PM Joshua Fox  wrote:

> Note that  the Read part  has recently been developed. I need a
> very simply write functionality -- simply inserting JsonObjects to
> Couchbase.
>
> On Mon, Apr 8, 2019 at 3:13 PM Joshua Fox  wrote:
>
>> I am looking for the equivalent of
>> org.apache.beam.sdk.io.gcp.datastore.DatastoreV1.Write for Couchbase
>>
>> What is the status? From this
>>  and this
>> , it does not seem to
>> be in-progress.
>>
>>
>
> --
>
>
> *JOSHUA FOX*
> Director, Software Architecture | Freightos
>
>
>
> *T (Israel): *+972-545691165 | *T (US)*:  +1-3123400953
> Smooth shipping.
>
>
>
>


Re: Do I have any control over bundle sizes?

2019-04-04 Thread Ismaël Mejía
It seems you can 'hack' it with the State API. See the discussion on
this ticket:
https://issues.apache.org/jira/browse/BEAM-6886
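
For reference, a rough sketch of the State API "hack" discussed in that
ticket: buffer elements per key and emit a batch once a target size is
reached, which decouples the write size from the runner's bundle size. The
batch size is an arbitrary example value, and a production version would
also need a timer to flush incomplete batches (GroupIntoBatches packages
essentially this pattern).

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class BatchBySizeFn extends DoFn<KV<String, String>, Iterable<String>> {
  private static final int BATCH_SIZE = 500; // example value, tune for your sink

  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag(StringUtf8Coder.of());

  @StateId("count")
  private final StateSpec<ValueState<Integer>> countSpec = StateSpecs.value(VarIntCoder.of());

  @ProcessElement
  public void processElement(
      @Element KV<String, String> element,
      @StateId("buffer") BagState<String> buffer,
      @StateId("count") ValueState<Integer> count,
      OutputReceiver<Iterable<String>> out) {
    int current = count.read() == null ? 0 : count.read();
    buffer.add(element.getValue());
    count.write(current + 1);
    if (current + 1 >= BATCH_SIZE) {
      List<String> batch = new ArrayList<>();
      buffer.read().forEach(batch::add);
      out.output(batch); // one batch of BATCH_SIZE elements, e.g. to write as one GCS object
      buffer.clear();
      count.clear();
    }
  }
}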

On Thu, Apr 4, 2019 at 9:42 PM Jeff Klukas  wrote:
>
> As far as I can tell, Beam expects runners to have full control over 
> separation of individual elements into bundles and this is something users 
> have no control over. Is that true? Or are there any ways that I might exert 
> some influence over bundle sizes?
>
> My main interest at the moment is investigating lighter-weight alternatives 
> to FileIO for a simple but high-throughput Dataflow job that batches up 
> messages from Pub/Sub and sinks them to GCS. I'm imagining a ParDo that 
> buffers incoming messages and then writes them all as an object to GCS in a 
> @FinalizeBundle method, avoiding the multiple GroupByKey operations needed 
> for writing sharded output from FileIO.
>
> The problem is that bundles in practice look to be far too small to make this 
> feasible. I deployed a 4 node test job that simply reads ~30k messages per 
> second from Pub/Sub and passes them to a transform that publishes some 
> metrics about the bundles passing through. I found a mean bundle size of ~500 
> elements corresponding to ~10 MB of data, which is too small for the proposed 
> approach to be feasible. Are there any tricks I could use to coerce Dataflow 
> to increase the size of bundles?
>
> I realize this is basically an abuse of the Beam programming model, but the 
> alternative I'm looking at is having to write a custom application using the 
> google-cloud APIs and deploying it on Kubernetes.


Re: joda-time dependency version

2019-03-21 Thread Ismaël Mejía
Does anyone have any context on why we have such an old version of
Joda-Time (2.4, released in 2014!) and whether there is any possible issue
upgrading it? If not, maybe we can try to upgrade it.
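
For reference, the behaviour difference Rahul describes in the quoted
thread below can be reproduced with plain Joda-Time (no Beam involved); a
minimal sketch, with the expected outputs taken from his report:

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

public class JodaZoneFormatCheck {
  public static void main(String[] args) {
    DateTimeFormatter in = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
        .withZone(DateTimeZone.forID("America/Los_Angeles"));
    DateTimeFormatter out = DateTimeFormat.forPattern("hh:mm:ss yy-MMM-dd z");

    DateTime parsed = in.parseDateTime("2019-03-15 13:56:12");
    // Reported output on joda-time 2.4:   "01:56:12 19-Mar-15 -07:00"
    // Reported output on joda-time 2.9.3: "01:56:12 19-Mar-15 PDT"
    System.out.println(out.print(parsed));
  }
}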

On Thu, Mar 21, 2019 at 5:35 PM Ismaël Mejía  wrote:
>
> Mmmm interesting issue. There is also a plan to use a vendored version
> of joda-time not sure on the progress on that one.
> https://issues.apache.org/jira/browse/BEAM-5827
>
> For Beam 3 that's the idea but  so far there is not at ETA for Beam 3.
> https://issues.apache.org/jira/browse/BEAM-5530
>
> On Thu, Mar 21, 2019 at 4:15 PM rahul patwari
>  wrote:
> >
> > Hi David,
> >
> > The only incompatibility we have come across is this:
> > We have some timestamp format conversions in our project, where we are 
> > converting from a timestamp format to another.
> >
> > With joda-time 2.4:
> > If we convert "2019-03-15 13:56:12" which is in "yyyy-MM-dd HH:mm:ss"
> > format, to "hh:mm:ss yy-MMM-dd z" format, the converted value is "01:56:12 
> > 19-Mar-15 -07:00".
> >
> > Whereas with joda-time 2.9.3:
> > If we convert "2019-03-15 13:56:12" which is in "yyyy-MM-dd HH:mm:ss"
> > format, to "hh:mm:ss yy-MMM-dd z" format, the converted value is "01:56:12 
> > 19-Mar-15 PDT".
> >
> > The javadoc for both the versions doesn't seem different though, for 'z' 
> > DateTimeFormat.
> >
> > Even though the javadoc says - Zone names: Time zone names ('z') cannot be 
> > parsed for both the versions, we are able to parse it in joda-time 2.9.3.
> >
> > Also, joda-time will be replaced with java time with Beam 3?
> >
> > Thanks,
> > Rahul
> >
> > On Thu, Mar 21, 2019 at 5:37 PM David Morávek  
> > wrote:
> >>
> >> Hello Rahul, are there any incompatibilities you are running into with 
> >> spark version? These versions should be backward compatible.
> >>
> >> For jodatime doc:
> >> The main public API will remain backwards compatible for both source and 
> >> binary in the 2.x stream.
> >>
> >> This means you should be able to safely use Spark's version.
> >>
> >> D.
> >>
> >> On Thu, Mar 21, 2019 at 5:45 AM rahul patwari  
> >> wrote:
> >>>
> >>> Hi Ismael,
> >>>
> >>> We are using Beam with Spark Runner and Spark 2.4 has joda-time 2.9.3 as 
> >>> a dependency. So, we have used joda-time 2.9.3 in our shaded artifact 
> >>> set. As Beam has joda-time 2.4 as a dependency, I was wondering whether 
> >>> it would break anything in Beam.
> >>>
> >>> Will joda-time be replaced with java time in Beam 3? What is the expected 
> >>> release date of Beam 3?
> >>>
> >>> Thanks,
> >>> Rahul
> >>>
> >>> On Wed, Mar 20, 2019 at 7:23 PM Ismaël Mejía  wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> The long term goal would be to get rid of joda-time but that won't
> >>>> happen until Beam 3.
> >>>> Any 'particular' reason or motivation to push the upgrade?
> >>>>
> >>>> Regards,
> >>>> Ismaël
> >>>>
> >>>> On Wed, Mar 20, 2019 at 11:53 AM rahul patwari
> >>>>  wrote:
> >>>> >
> >>>> > Hi,
> >>>> >
> >>>> > Is there a plan to upgrade the dependency version of joda-time to 
> >>>> > 2.9.3 or latest version?
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Rahul


Re: joda-time dependency version

2019-03-21 Thread Ismaël Mejía
Mmmm, interesting issue. There is also a plan to use a vendored version
of joda-time; not sure of the progress on that one.
https://issues.apache.org/jira/browse/BEAM-5827

For Beam 3 that's the idea, but so far there is no ETA for Beam 3.
https://issues.apache.org/jira/browse/BEAM-5530

On Thu, Mar 21, 2019 at 4:15 PM rahul patwari
 wrote:
>
> Hi David,
>
> The only incompatibility we have come across is this:
> We have some timestamp format conversions in our project, where we are 
> converting from a timestamp format to another.
>
> With joda-time 2.4:
> If we convert "2019-03-15 13:56:12" which is in "yyyy-MM-dd HH:mm:ss" format,
> to "hh:mm:ss yy-MMM-dd z" format, the converted value is "01:56:12 19-Mar-15 
> -07:00".
>
> Whereas with joda-time 2.9.3:
> If we convert "2019-03-15 13:56:12" which is in "yyyy-MM-dd HH:mm:ss" format,
> to "hh:mm:ss yy-MMM-dd z" format, the converted value is "01:56:12 19-Mar-15 
> PDT".
>
> The javadoc for both the versions doesn't seem different though, for 'z' 
> DateTimeFormat.
>
> Even though the javadoc says - Zone names: Time zone names ('z') cannot be 
> parsed for both the versions, we are able to parse it in joda-time 2.9.3.
>
> Also, joda-time will be replaced with java time with Beam 3?
>
> Thanks,
> Rahul
>
> On Thu, Mar 21, 2019 at 5:37 PM David Morávek  wrote:
>>
>> Hello Rahul, are there any incompatibilities you are running into with spark 
>> version? These versions should be backward compatible.
>>
>> For jodatime doc:
>> The main public API will remain backwards compatible for both source and 
>> binary in the 2.x stream.
>>
>> This means you should be able to safely use Spark's version.
>>
>> D.
>>
>> On Thu, Mar 21, 2019 at 5:45 AM rahul patwari  
>> wrote:
>>>
>>> Hi Ismael,
>>>
>>> We are using Beam with Spark Runner and Spark 2.4 has joda-time 2.9.3 as a 
>>> dependency. So, we have used joda-time 2.9.3 in our shaded artifact set. As 
>>> Beam has joda-time 2.4 as a dependency, I was wondering whether it would 
>>> break anything in Beam.
>>>
>>> Will joda-time be replaced with java time in Beam 3? What is the expected 
>>> release date of Beam 3?
>>>
>>> Thanks,
>>> Rahul
>>>
>>> On Wed, Mar 20, 2019 at 7:23 PM Ismaël Mejía  wrote:
>>>>
>>>> Hello,
>>>>
>>>> The long term goal would be to get rid of joda-time but that won't
>>>> happen until Beam 3.
>>>> Any 'particular' reason or motivation to push the upgrade?
>>>>
>>>> Regards,
>>>> Ismaël
>>>>
>>>> On Wed, Mar 20, 2019 at 11:53 AM rahul patwari
>>>>  wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > Is there a plan to upgrade the dependency version of joda-time to 2.9.3 
>>>> > or latest version?
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Rahul


Re: joda-time dependency version

2019-03-20 Thread Ismaël Mejía
Hello,

The long term goal would be to get rid of joda-time but that won't
happen until Beam 3.
Any 'particular' reason or motivation to push the upgrade?

Regards,
Ismaël

On Wed, Mar 20, 2019 at 11:53 AM rahul patwari
 wrote:
>
> Hi,
>
> Is there a plan to upgrade the dependency version of joda-time to 2.9.3 or 
> latest version?
>
>
> Thanks,
> Rahul


Re: Beam Python streaming pipeline on Flink Runner

2019-01-31 Thread Ismaël Mejía
> Fortunately, there is already a pending PR for cross-language pipelines which
> will allow us to use Java IO like PubSub in Python jobs.

In addition to having support in the runners, this will require a
rewrite of PubsubIO to use the new SDF API.

On Thu, Jan 31, 2019 at 12:23 PM Maximilian Michels  wrote:
>
> Hi Matthias,
>
> This is already reflected in the compatibility matrix, if you look under SDF.
> There is no UnboundedSource interface for portable pipelines. That's a legacy
> abstraction that will be replaced with SDF.
>
> Fortunately, there is already a pending PR for cross-language pipelines which
> will allow us to use Java IO like PubSub in Python jobs.
>
> Thanks,
> Max
>
> On 31.01.19 12:06, Matthias Baetens wrote:
> > Hey Ankur,
> >
> > Thanks for the swift reply. Should I change this in the capability matrix
> >  then?
> >
> > Many thanks.
> > Best,
> > Matthias
> >
> > On Thu, 31 Jan 2019 at 09:31, Ankur Goenka  > > wrote:
> >
> > Hi Matthias,
> >
> > Unfortunately, unbounded reads including pubsub are not yet supported 
> > for
> > portable runners.
> >
> > Thanks,
> > Ankur
> >
> > On Thu, Jan 31, 2019 at 2:44 PM Matthias Baetens 
> >  > > wrote:
> >
> > Hi everyone,
> >
> > Last few days I have been trying to run a streaming pipeline (code 
> > on
> > Github ) on a Flink Runner.
> >
> > I am running a Flink cluster locally (v1.5.6
> > )
> > I have built the SDK Harness Container: /./gradlew
> > :beam-sdks-python-container:docker/
> > and started the JobServer: /./gradlew
> > :beam-runners-flink_2.11-job-server:runShadow
> > -PflinkMasterUrl=localhost:8081./
> >
> > I run my pipeline with: /env/bin/python streaming_pipeline.py
> > --runner=PortableRunner --job_endpoint=localhost:8099 --output xxx
> > --input_subscription xxx --output_subscription xxx/
> > /
> > /
> > All this is running inside a Ubuntu (Bionic) in a Virtualbox.
> >
> > The job submits fine, but unfortunately fails after a few seconds 
> > with
> > the error attached.
> >
> > Anything I am missing or doing wrong?
> >
> > Many thanks.
> > Best,
> > Matthias
> >
> >


Re: Suggestion or Alternative simples to read file from FTP

2019-01-08 Thread Ismaël Mejía
Some time ago there was a PR that added support for VFS; with this, at
least theoretically, we could support FTP and even HTTP listings.
However, it needs some love to rebase and update it. If you are
interested in hacking on this, don't hesitate to contact me and I can
mentor you to do so, or I can eventually try to get some time to finish
that work.

https://github.com/apache/beam/pull/4803
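
Until something like that lands, a plain DoFn using Apache Commons Net
(essentially the approach Henrique describes below) is a workable stopgap.
A minimal sketch, where the host, credentials, and error handling are
simplified placeholders; you would feed it a PCollection of remote paths,
e.g. Create.of("path/to/file.csv").

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

class ReadCsvFromFtpFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(@Element String remotePath, OutputReceiver<String> out)
      throws Exception {
    FTPClient ftp = new FTPClient();
    try {
      ftp.connect("ftp.example.com");
      ftp.login("user", "password");
      ftp.enterLocalPassiveMode();
      ftp.setFileType(FTP.BINARY_FILE_TYPE);
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(ftp.retrieveFileStream(remotePath), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          out.output(line); // emit each CSV line downstream instead of staging a local file
        }
      }
      ftp.completePendingCommand();
    } finally {
      if (ftp.isConnected()) {
        ftp.logout();
        ftp.disconnect();
      }
    }
  }
}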

On Fri, Jan 4, 2019 at 1:09 AM Rui Wang  wrote:
>
> For the calling external service, it's described in [1] as a pattern which 
> has a small sample of code instruction.
>
> However, why not write a script to prepare the data first and then write a 
> pipeline to process it?
>
> 1. 
> https://cloud.google.com/blog/products/gcp/guide-to-common-cloud-dataflow-use-case-patterns-part-1
>
> On Thu, Jan 3, 2019 at 6:13 AM Henrique Molina  
> wrote:
>>
>> Hi Folks ,
>> I'm newbie in Beam, but I looking for some way to read an File stored at FTP
>>
>> First of all, I could create ParDo, using FTPClient (Commons-net) and access 
>> returning an Byte[] of File *.csv.
>> second ParDO create the csv
>> third  PardDo using the TextIO to read lines
>>
>> Somebody could share some sources and materials about that ?
>> Or Do you have another alternative more simple ?
>> Thanks & Regards


Re: Recordings and presentations from Beam Summit London 2018

2018-12-21 Thread Ismaël Mejía
Thanks for sharing, Matthias. I could not attend either and I really want
to see some of the presentations.
Great timing just before the holidays!

Regards,
Ismaël



On Fri, Dec 21, 2018 at 2:54 PM OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> Thank you for sharing. will definitely watch!!!
> Happy New Year to all of you,
> Eila
>
> On Thu, Dec 20, 2018 at 6:49 PM Manu Zhang 
> wrote:
>
>> Thanks for sharing. The YouTube channel link is
>> https://www.youtube.com/channel/UChNnb_YO_7B0HlW6FhAXZZQ
>>
>> Thanks,
>> Manu Zhang
>> On Nov 2, 2018, 11:06 PM +0800, Matthias Baetens <
>> baetensmatth...@gmail.com>, wrote:
>>
>> Hi everyone,
>>
>> Very happy to be sharing the great materials from the speakers and
>> contributors at the Beam Summit we had in London last month in the *new
>> blogpost* on the website:
>> https://beam.apache.org/blog/2018/10/31/beam-summit-aftermath.html
>>
>> You can find the full list of presentations and recordings of the
>> sessions there, or you can head over to our YouTube channel
>> .
>>
>> Thanks for everyone who helped, contributed, was a speaker or sponsor at
>> the event.
>>
>> Best,
>> Matthias
>> --
>>
>>
>>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetu 
> p.co 
> m/Deep-Learning-In-Production/
> 
>
>
>


Re: Moving to spark 2.4

2018-12-07 Thread Ismaël Mejía
It seems that Cloudera has it now; not sure if it is worth waiting for
Hortonworks, or maybe worth waiting for EMR.
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.11/

An argument in favor of moving to Spark 2.4.0 is for the future-oriented
(non-Hadoop) friends, because support for Kubernetes has improved a lot
in this release.

On Fri, Dec 7, 2018 at 4:56 PM David Morávek  wrote:
>
> +1 for waiting for HDP and CDH adoption
>
> Sent from my iPhone
>
> On 7 Dec 2018, at 16:38, Alexey Romanenko  wrote:
>
> I agree with Ismael and I’d wait until the new Spark version will be 
> supported by major BigData distributors.
>
> On 7 Dec 2018, at 14:57, Vishwas Bm  wrote:
>
> Hi Ismael,
>
> We have upgraded the spark to 2.4.
> In our setup we had run few basic tests and found it to be pretty stable.
>
>
> Thanks & Regards,
> Vishwas
>
>
> On Fri, Dec 7, 2018 at 2:53 PM Ismaël Mejía  wrote:
>>
>> Hello Vishwas,
>>
>> The spark dependency in the spark runner is provided so you can
>> already pass the dependencies of spark 2.4 and it should work out of
>> the box.
>>
>> JB did a PR to upgrade the version of Spark in the runner, but maybe
>> it is worth to wait a bit before merging it, at least until some of
>> the Big Data distributions has spark 2.4.x support available, so far
>> nobody has upgraded it (well apart of databricks).
>>
>> What do others think, should we move ahead or are you aware of any
>> issue introduced by version 2.4.0? (Notice that the PR just updates
>> the version so code compatibility should be ok).
>>
>> Ismaël
>>
>> On Thu, Dec 6, 2018 at 12:14 PM Jean-Baptiste Onofré  
>> wrote:
>> >
>> > Hi Vishwas
>> >
>> > Yes, I already started the update.
>> >
>> > Regards
>> > JB
>> >
>> > On 06/12/2018 07:39, Vishwas Bm wrote:
>> > > Hi,
>> > >
>> > > Currently I see that the spark version dependency used in Beam is
>> > > //"2.3.2".
>> > > As spark 2.4 is released now, is there a plan to upgrade Beam spark
>> > > dependency ?
>> > >
>> > >
>> > > *Thanks & Regards,*
>> > > *Vishwas
>> > > *
>> > > *Mob : 9164886653*
>> >
>> > --
>> > Jean-Baptiste Onofré
>> > jbono...@apache.org
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>
>


Re: Latin America Community

2018-12-07 Thread Ismaël Mejía
Hello,

It is a great idea to try to grow the community in the region. Notice that
there are already multiple Latino members in the dev community (e.g. Pablo,
Gris and me). However, no Brazilians so far, so glad that you want to be
part of it.

I suppose that given Sao Paulo's size it is probably the 'easiest' place to
find people interested. Wondering if there are others on the mailing list.
In any case don't hesitate to contact us for questions / support.

Regards,
Ismaël





On Tue, Dec 4, 2018 at 3:42 PM Eryx  wrote:

> Hi Leonardo,
>
> I'm Héctor Eryx from Guadalajara, México. I'm currently using Beam for
> personal projects, plus giving some training/mentoring on how to use to
> local communities.
>
> Also, I'm in touch with some friends at IBM Mexico who are using Beam to
> run data storage events analysis.
>
> We are few, but we are strong , hehe.
>
> Kind regards,
> Héctor Eryx Paredes Camacho
>
> On Tue, Dec 4, 2018 at 6:20 AM, Leonardo Miguel <
> leonardo.mig...@arquivei.com.br> wrote:
>
>> Hi guys,
>>
>> Just want to check if there is someone using Beam and/or Scio at this
>> side of the globe.
>> I'd like to know also if there is any event near or some related
>> community.
>> If you are using Beam and/or Scio, please let me know.
>>
>> Let me start first:
>> I'm located at Sao Carlos, Sao Paulo, Brazil.
>> We use Beam and Scio running on Google Dataflow to serve data products
>> (streaming and batch) over fiscal documents.
>>
>> Thanks!
>>
>>
>> --
>> []s
>>
>> Leonardo Alves Miguel
>> Data Engineer
>> (16) 3509-5515 | www.arquivei.com.br
>>
>


Re: Moving to spark 2.4

2018-12-07 Thread Ismaël Mejía
Hello Vishwas,

The spark dependency in the spark runner is provided so you can
already pass the dependencies of spark 2.4 and it should work out of
the box.

JB did a PR to upgrade the version of Spark in the runner, but maybe
it is worth waiting a bit before merging it, at least until some of
the Big Data distributions have Spark 2.4.x support available; so far
nobody has upgraded (well, apart from Databricks).

What do others think, should we move ahead or are you aware of any
issue introduced by version 2.4.0? (Notice that the PR just updates
the version so code compatibility should be ok).

Ismaël

On Thu, Dec 6, 2018 at 12:14 PM Jean-Baptiste Onofré  wrote:
>
> Hi Vishwas
>
> Yes, I already started the update.
>
> Regards
> JB
>
> On 06/12/2018 07:39, Vishwas Bm wrote:
> > Hi,
> >
> > Currently I see that the spark version dependency used in Beam is
> > //"2.3.2".
> > As spark 2.4 is released now, is there a plan to upgrade Beam spark
> > dependency ?
> >
> >
> > *Thanks & Regards,*
> > *Vishwas
> > *
> > *Mob : 9164886653*
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: RabbitMqIO missing in Maven Central

2018-11-08 Thread Ismaël Mejía
Hello,

RabbitMqIO was merged into master after the 2.8.0 release, so you will
have to wait until 2.9.0 is released, or compile/package it by
yourself.

Regards,
Ismaël
On Thu, Nov 8, 2018 at 10:10 PM Jeroen Steggink | knowsy
 wrote:
>
> Hi guys,
>
> I tried getting the new RabbitMqIO, however, it's not present in Maven
> Central (http://central.maven.org/maven2/org/apache/beam/), while other
> 2.8.0 sdks are.
>
> Any reason why?
>
> Best,
>
> Jeroen
>


Re: Issue with GroupByKey in BeamSql using SparkRunner

2018-10-10 Thread Ismaël Mejía
Are you trying this on a particular Spark distribution or just locally?
I ask this because there was a data corruption issue with Spark 2.3.1
(the previous version used by Beam):
https://issues.apache.org/jira/browse/SPARK-23243

Current Beam master (and next release) moves Spark to version 2.3.2
and that should fix some of the data correctness issues (maybe yours
too).
Can you give it a try and report back whether this fixes your issue?


On Tue, Oct 9, 2018 at 6:45 PM Vishwas Bm  wrote:
>
> Hi Kenn,
>
> We are using Beam 2.6 and using Spark_submit to submit jobs to Spark 2.2 
> cluster on Kubernetes.
>
>
> On Tue, Oct 9, 2018, 9:29 PM Kenneth Knowles  wrote:
>>
>> Thanks for the report! I filed 
>> https://issues.apache.org/jira/browse/BEAM-5690 to track the issue.
>>
>> Can you share what version of Beam you are using?
>>
>> Kenn
>>
>> On Tue, Oct 9, 2018 at 3:18 AM Vishwas Bm  wrote:
>>>
>>> We are trying to setup a pipeline with using BeamSql and the trigger used 
>>> is default (AfterWatermark crosses the window).
>>> Below is the pipeline:
>>>
>>>KafkaSource (KafkaIO) ---> Windowing (FixedWindow 1min) ---> BeamSql 
>>> ---> KafkaSink (KafkaIO)
>>>
>>> We are using Spark Runner for this.
>>> The BeamSql query is:
>>>  select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY 
>>> Col3
>>>
>>> We are grouping by Col3 which is a string. It can hold values string[0-9].
>>>
>>> The records are getting emitted out at 1 min to kafka sink, but the output 
>>> record in kafka is not as expected.
>>> Below is the output observed: (WST and WET are indicators for window start 
>>> time and window end time)
>>>
>>> {"count_col1":1,"Col3":"string5","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":3,"Col3":"string7","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":2,"Col3":"string8","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":1,"Col3":"string2","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":1,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>> {"count_col1":0,"Col3":"string6","WST":"2018-10-09  09-55-00   
>>> +","WET":"2018-10-09  09-56-00   +"}
>>>
>>> We ran the same pipeline using direct and flink runner and we dont see 0 
>>> entries for count_col1.
>>>
>>> As per beam matrix page 
>>> (https://beam.apache.org/documentation/runners/capability-matrix/#cap-summary-what),
>>>  GroupBy is not fully supported,is this one of those cases ?
>>> Thanks & Regards,
>>> Vishwas
>>>


Modular IO presentation at Apachecon

2018-09-26 Thread Ismaël Mejía
Hello, today Eugene and I gave a talk about modular APIs for IO
at ApacheCon. The talk introduces some common patterns that we have
found while creating IO connectors and also presents recent ideas like
dynamic destinations and sequential writes, among others, using FileIO as a
use case.

In case you guys want to take a look, here is a copy of the slides, we
will probably add this to the IO authoring documentation too.

https://s.apache.org/beam-modular-io-talk


Re: [Discuss] Upgrade story for Beam's execution engines

2018-09-17 Thread Ismaël Mejía
In the Spark runner the user provides the core spark dependencies at runtime and
we assume that backwards compatibility is kept (in upstream Spark). We support
the whole 2.x line but we try to keep the version close to the latest stable
release.

Notice however that we lack tests to validate that all versions do work; I
remember some issues with metrics during the migration to Spark 2.x with older
versions of Spark (<= 2.1). Those worked flawlessly with more recent versions.

I don't know if Flink could do something like this (become a provided
dependency), in particular for the current case where there seem to be no
API-breaking changes.

In any case, +1 to getting our act together a bit on this.

On Mon, Sep 17, 2018 at 9:31 AM Robert Bradshaw  wrote:
>
> On Mon, Sep 17, 2018 at 2:02 AM Austin Bennett  
> wrote:
>>
>> Do we currently maintain a finer grained list of compatibility between 
>> execution/runner versions and beam versions?  Is this only really a concern 
>> with recent Flink (sounded like at least Spark jump, too)?  I see the 
>> capability matrix:  
>> https://beam.apache.org/documentation/runners/capability-matrix/, but some 
>> sort of compatibility between runner versions with beam releases might be 
>> useful.
>>
>> I see compatibility matrix as far as beam features, but not for underlying 
>> runners.  Ex: something like this would save a user trying to get Beam 
>> working on recent Flink 1.6 and then subsequently hitting a (potentially not 
>> well documented) wall given known issues.
>
>
> +1. I was bitten by this as well.
>
> I don't know if it's worth having a compatibility matrix for each version (as 
> the overlap is likely to be all or nothing in most cases), but it should be 
> prominently displayed here and elsewhere. Want to send out a PR?


Re: [ANNOUNCE] Apache Beam 2.6.0 released!

2018-08-09 Thread Ismaël Mejía
Two really interesting features in 2.6.0 not mentioned in the announcement
email:


- Bounded SplittableDoFn support is available now in all runners (SDF is
the new IO connector API).

- HBaseIO was updated to be the first IO supporting Bounded SDF (using
readAll).




On Fri, Aug 10, 2018 at 12:14 AM Connell O'Callaghan 
wrote:

> Pablo and all involved thank you for working to get this release
> completed!!!
>
> On Thu, Aug 9, 2018 at 3:06 PM Pablo Estrada  wrote:
>
>> Of course, I messed that link up again. The release notes should be here:
>>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12343392
>>
>> On Thu, Aug 9, 2018 at 10:11 AM Pablo Estrada  wrote:
>>
>>> As was pointed out to me by Ryan Williams, the link I posted with the
>>> release notes is not right. Here is a link with Beam 2.6.0 release notes
>>> for those interested:
>>>
>>> https://issues.apache.org/jira/projects/BEAM/versions/12343392
>>>
>>> On Thu, Aug 9, 2018 at 9:41 AM Alexey Romanenko <
>>> aromanenko@gmail.com> wrote:
>>>
 Great news!
 Thanks to Pablo for driving this and to all contributors of this
 release!


 On 9 Aug 2018, at 00:48, Pablo Estrada  wrote:

 The Apache Beam team is pleased to announce the release of 2.6.0
 version!

 Apache Beam is an open source unified programming model to define and
 execute data processing pipelines, including ETL, batch and stream
 (continuous) processing. See https://beam.apache.org

 You can download the release here:

 https://beam.apache.org/get-started/downloads/

 This release includes the following major new features & improvements,
 among others:
 - Improvements for internal Context Management in Python SDK
 - A number of improvements to the Portability Framework
 - A Universal Local Runner has been added to Beam. This runner runs in
 a single machine using portability, and containerized SDK harnesses.
 - Increased the coverage of ErrorProne analysis of the codebase.
 - Updates to various dependency versions
 - Updates to stability, performance, and documentation.
 - SQL - improvements: support exists operator, implemented sum()
 aggregations, fixes to CASE expression, support for date comparison,
 support LIMIT on Unbounded Data
 - Provide automatic schema registration for POJOs

 You can take a look at the Release Notes for more details:

 *https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343392=12319527
 *

 Thanks to everyone who participated in this release, and we hope you'll
 have a good time using Beam 2.6.0.
 --
 Pablo Estrada, on behalf of The Apache Beam team
 --
 Got feedback? go/pabloem-feedback
 


 --
>>> Got feedback? go/pabloem-feedback
>>> 
>>>
>> --
>> Got feedback? go/pabloem-feedback
>> 
>>
>


Re: about PCollection process

2018-07-06 Thread Ismaël Mejía
Hello,

If I understood correctly you read from a file some parameters that you are
going to use to prepare an HBase Scan. If this is the case you cannot do
this with the current HBaseIO API, but there is ongoing work to support
this transparently with the new SDF API. If you want to track the progress
this is the JIRA https://issues.apache.org/jira/browse/BEAM-4020
Hopefully it will be ready in the following days/weeks.

In the meantime you can work around this by applying a ParDo after you
extract the scan parameters from the file, with a DoFn that requests the
data directly, something similar to what SDF is doing, for reference:
https://github.com/iemejia/beam/blob/2f9b54c6efa1c97c4b030a9b1af44b1327541e5f/sdks/java/io/hbase/src/main/java/org/apache/beam/sdk/io/hbase/HBaseReadSplittableDoFn.java#L37
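
A rough sketch of what that DoFn could look like for your case (just to
illustrate the idea; the class/field names and key format are taken from
your snippet but otherwise made up, it uses the plain HBase client API, and
I have not compiled it, so adapt it to your code):

import java.io.IOException;
import org.apache.beam.sdk.io.hadoop.SerializableConfiguration;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

class ReadVinFromHBaseFn extends DoFn<String, Result> {
  private final SerializableConfiguration conf; // HBase Configuration is not serializable
  private final String tableId;
  private final String startTime;
  private final String endTime;
  private transient Connection connection;

  ReadVinFromHBaseFn(SerializableConfiguration conf, String tableId,
      String startTime, String endTime) {
    this.conf = conf;
    this.tableId = tableId;
    this.startTime = startTime;
    this.endTime = endTime;
  }

  @Setup
  public void setup() throws IOException {
    // Open the connection once per DoFn instance, not once per element.
    connection = ConnectionFactory.createConnection(conf.get());
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    String vin = c.element();
    Scan scan = new Scan()
        .setStartRow(Bytes.toBytes(String.format("%s-%s", vin, startTime)))
        .setStopRow(Bytes.toBytes(String.format("%s-%s", vin, endTime)));
    try (Table table = connection.getTable(TableName.valueOf(tableId));
        ResultScanner scanner = table.getScanner(scan)) {
      for (Result result : scanner) {
        c.output(result);
      }
    }
  }

  @Teardown
  public void teardown() throws IOException {
    if (connection != null) {
      connection.close();
    }
  }
}

With this, lines.apply(ParDo.of(new ReadVinFromHBaseFn(...))) gives you a
flat PCollection<Result> that you can pass to your
Result2RecordNoModifyDataFn, instead of creating a new Pipeline inside
processElement (which does not work). You may also need to make sure a
coder for Result is registered (the hbase module provides one).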

Hope this helps,
Ismaël

On Thu, Jul 5, 2018 at 4:53 AM Frank Li  wrote:

> hello,every
> I'm running a Beam pipeline which uses the TextIO read same text
> from text file, PTransform  peer line search hbase.  result is   
> PCollection RecordData>>>
>
>
> @Override
> public PCollection>>
> expand(PCollection lines) {
>
> PCollection>> results = lines
> .apply(ParDo.of(new DoFn>>() {
> @ProcessElement
> public void processElement(ProcessContext c) {
> String vin = c.element();
>
> Pipeline pipelineHbase = Pipeline.create(c.getPipelineOptions());
>
> HBaseIO.Read read =
> HBaseIO.read().withConfiguration(conf).withTableId(hbaseTable).withKeyRange(
> Bytes.toBytes(String.format("%s-%s", vin, startTime)),
> Bytes.toBytes(String.format("%s-%s", vin, endTime)));
> PCollection results = pipelineHbase.apply(read);
>
> PCollection> recordResults = results
> .apply(ParDo.of(new Result2RecordNoModifyDataFn()));
>
> c.output(recordResults);
> }
>
> }));
>
> return results;
> }
>
>
> what process PCollection>> 
>
>


Re: Go SDK: How are re-starts handled?

2018-06-27 Thread Ismaël Mejía
Eduardo, can you please create a JIRA on the Go SDK to track this issue?
Thanks.

On Mon, Jun 25, 2018 at 10:22 PM Lukasz Cwik  wrote:

> Ah, sorry for the confusion. The SDK is meant to handle that for you as I
> described. You'll want to use the fact that the 409 was returned until that
> is implemented within the Go SDK.
>
> On Mon, Jun 25, 2018 at 1:13 PM eduardo.mora...@gmail.com <
> eduardo.mora...@gmail.com> wrote:
>
>> Nope. It returns an error.
>>
>> 2018/06/25 20:10:46 Failed to execute job: googleapi: Error 409:
>> (acbae89877e14d87): The workflow could not be created. Causes:
>> (590297f494c27357): There is already an active job named xxx-yyy_zzz. If
>> you want to  submit a second job, try again by setting a different name.,
>> alreadyExists
>>
>> On 2018/06/25 16:18:55, Lukasz Cwik  wrote:
>> > It should be that beamx.Run won't return an error if a job already
>> exists
>> > with the same job name.
>> >
>> > On Mon, Jun 25, 2018 at 9:15 AM eduardo.mora...@gmail.com <
>> > eduardo.mora...@gmail.com> wrote:
>> >
>> > > I am sorry. I am not expressing myself correctly. Let me do it though
>> code:
>> > >
>> > > if err := beamx.Run(ctx, pipe); err != nil {
>> > >   // How do I know 'err' is the result of a pipeline already running,
>> as
>> > > opposed to some
>> > >   // other problem that may need  special attention.
>> > > }
>> > >
>> > > On 2018/06/22 23:10:13, Lukasz Cwik  wrote:
>> > > > The job name is a user chosen value[1]. If you don't specify
>> something, a
>> > > > job name is generated for you automatically[2].
>> > > >
>> > > > 1:
>> > > >
>> > >
>> https://github.com/apache/beam/blob/c1927cd339c57125e29a651e614fb5105abf6d33/sdks/go/pkg/beam/options/jobopts/options.go#L38
>> > > > 2:
>> > > >
>> > >
>> https://github.com/apache/beam/blob/c1927cd339c57125e29a651e614fb5105abf6d33/sdks/go/pkg/beam/options/jobopts/options.go#L71
>> > > >
>> > > > On Fri, Jun 22, 2018 at 3:28 PM eduardo.mora...@gmail.com <
>> > > > eduardo.mora...@gmail.com> wrote:
>> > > >
>> > > > >
>> > > > >
>> > > > > On 2018/06/22 21:35:29, Lukasz Cwik  wrote:
>> > > > > > There can only be one pipeline in Dataflow with the same job
>> name so
>> > > if
>> > > > > you
>> > > > > > attempt to submit another job with the same job name you'll get
>> back
>> > > an
>> > > > > > identifier for the currently executing pipeline.
>> > > > >
>> > > > > But beam.Run() only returns an error. How do I get the job name
>> back?
>> > > > > My guess is that I have to use a different API (
>> > > > > https://godoc.org/google.golang.org/api/dataflow/v1b3). Is that
>> the
>> > > > > correct way to detect job name collisions?
>> > > > >
>> > > > > Thanks again.
>> > > > >
>> > > > > > On Fri, Jun 22, 2018 at 2:27 PM eduardo.mora...@gmail.com <
>> > > > > > eduardo.mora...@gmail.com> wrote:
>> > > > > >
>> > > > > > > If I have a k8s process launching dataflow pipelines, what
>> happens
>> > > when
>> > > > > > > the process is restarted? Can Apache Beam detect a running
>> > > pipeline and
>> > > > > > > join accordingly? or will the pipeline be duplicated?
>> > > > > > >
>> > > > > > > Thanks in advance.
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>


Re: [FYI] New Apache Beam Swag Store!

2018-06-13 Thread Ismaël Mejía
Great! Thanks Gris and Matthias for putting this in place.
I hope to get that hoodie soon. As a suggestion, more colors too, and
eventually a t-shirt with just the big B logo.
On Mon, Jun 11, 2018 at 6:50 PM Mikhail Gryzykhin  wrote:
>
> That's nice!
>
> More colors are appreciated :)
>
> --Mikhail
>
>
> On Sun, Jun 10, 2018 at 8:20 PM Kenneth Knowles  wrote:
>>
>> Sweet! Agree with Raghu :-)
>>
>> Kenn
>>
>> On Sun, Jun 10, 2018 at 6:06 AM Matthias Baetens  
>> wrote:
>>>
>>> Great news, big thanks for all the work, Gris! Looking forward to people 
>>> wearing this around the globe ;)
>>>
>>> On Sat, 9 Jun 2018 at 01:28 Ankur Goenka  wrote:

 Awesome!


 On Fri, Jun 8, 2018 at 4:24 PM Pablo Estrada  wrote:
>
> Nice : D
>
> On Fri, Jun 8, 2018, 3:43 PM Raghu Angadi  wrote:
>>
>> Woo-hoo! This is terrific.
>>
>> If we are increasing color choices I would like black or charcoal... 
>> Beam logo would really pop on a dark background.
>>
>> On Fri, Jun 8, 2018 at 3:32 PM Griselda Cuevas  wrote:
>>>
>>> Hi Beam Community,
>>>
>>> I just want to share with you the exciting news about our brand new 
>>> Apache Beam Swag Store!
>>>
>>> You can find it here: https://store-beam.myshopify.com/
>>>
>>> How does it work?
>>>
>>> You can just select the items you want and check-out. Our Vendor ships 
>>> to anywhere in the world and normally can have swag to be delivered 
>>> within 1 week. Each company or user will need to pay for their own swag.
>>> If you are hosting an event or representing Beam at one, reach out to 
>>> me or the beam-events-meetups slack channel, I'll be happy to review 
>>> your event and see if we can sponsor the swag. We'll have codes for 
>>> this occasions thanks to Google, who has sponsored an initial inventory.
>>>
>>> If you have feedback, ideas on new swag, questions or suggestions, 
>>> reach out to me and/or Matthias Baetens.
>>>
>>> Happy Friday!
>>> G
>>>
>>>
> --
> Got feedback? go/pabloem-feedback
>>>
>>> --
>>>


Re: kafkaIO Run with Spark Runner: "streaming-job-executor-0"

2018-06-13 Thread Ismaël Mejía
Can you please update the version of Beam to at least version 2.2.0?
There were some important fixes in streaming after the 2.0.0 release,
so this could be related. Ideally you should use the latest released
version (2.4.0). Remember that starting with Beam 2.3.0 the Spark
runner is based on Spark 2.

On Wed, Jun 13, 2018 at 5:11 PM Raghu Angadi  wrote:
>
> Can you check the logs on the worker?
>
> On Wed, Jun 13, 2018 at 2:26 AM  wrote:
>>
>> Dear all,
>>
>>
>>
>> I am using the kafkaIO in my project (Beam 2.0.0 with Spark runner).
>>
>> My running environment is:
>>
>> OS: Ubuntn 14.04.3 LTS
>>
>> The different version for these tools:
>>
>> JAVA: JDK 1.8
>>
>> Beam 2.0.0 (Spark runner with Standalone mode)
>>
>> Spark 1.6.0
>>
>> Standalone mode :One driver node: ubuntu7; One master node: ubuntu8; Two 
>> worker nodes: ubuntu8 and ubuntu9
>>
>> Kafka: 2.10-0.10.1.1
>>
>>
>>
>> The java code of my project is:
>>
>> ==
>>
>> SparkPipelineOptions options = 
>> PipelineOptionsFactory.as(SparkPipelineOptions.class);
>>
>> options.setRunner(SparkRunner.class);
>>
>> options.setSparkMaster("spark://ubuntu8:7077");
>>
>> options.setAppName("App kafkaBeamTest");
>>
>> options.setJobName("Job kafkaBeamTest");
>>
>> options.setMaxRecordsPerBatch(1000L);
>>
>>
>>
>> Pipeline p = Pipeline.create(options);
>>
>>
>>
>> System.out.println("Beamtokafka");
>>
>> PCollection> readData = p.apply(KafkaIO.read()
>>
>> .withBootstrapServers(ubuntu7:9092)
>>
>> .withTopic("kafkasink")
>>
>> .withKeyDeserializer(LongDeserializer.class)
>>
>> .withValueDeserializer(StringDeserializer.class)
>>
>>.withoutMetadata()
>>
>>);
>>
>>
>>
>> PCollection> readDivideData = readData.
>>
>> apply(Window.>into(FixedWindows.of(Duration.standardSeconds(1)))
>>
>>  .triggering(AfterWatermark.pastEndOfWindow()
>>
>>
>> .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.ZERO)))
>>
>>  .withAllowedLateness(Duration.ZERO) .discardingFiredPanes());
>>
>>
>>
>> System.out.println("CountData");
>>
>>
>>
>> PCollection> countData = readDivideData.apply(Count.perKey());
>>
>>
>>
>> p.run();
>>
>> ==
>>
>>
>>
>> The message of error is:
>>
>> ==
>>
>> Exception in thread "streaming-job-executor-0" java.lang.Error: 
>> java.lang.InterruptedException
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>
>> at java.lang.Thread.run(Thread.java:748)
>>
>> Caused by: java.lang.InterruptedException
>>
>> at java.lang.Object.wait(Native Method)
>>
>> at java.lang.Object.wait(Object.java:502)
>>
>> at 
>> org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
>>
>> at 
>> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:612)
>>
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
>>
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
>>
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
>>
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:912)
>>
>> at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:910)
>>
>> at 
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>>
>> at 
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>>
>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>>
>> at org.apache.spark.rdd.RDD.foreach(RDD.scala:910)
>>
>> …
>>
>> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
>>
>> at 
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
>>
>> at 
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
>>
>> at 
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
>>
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>>
>> at 
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
>>
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>
>> ... 2 more
>>
>> ==
>>
>>
>>
>> Maven 3.5.0, in which related dependencies are listed in my project’s 
>> pom.xml:
>>
>> 
>>
>> org.apache.beam
>>
>>   beam-sdks-java-core
>>
>>   2.0.0
>>
>> 
>>
>> 
>>
>> org.apache.beam
>>
>>beam-sdks-java-io-kafka
>>
>>2.0.0

Re: Documentation for Beam on Windows

2018-06-01 Thread Ismaël Mejía
Is there a JIRA for this? Can you create one Udi?
On Wed, May 23, 2018 at 11:32 PM Lukasz Cwik  wrote:
>
> There is none to my knowledge.
>
> On Wed, May 23, 2018 at 1:49 PM Udi Meiri  wrote:
>>
>> Hi all,
>>
>> I was looking yesterday for a quickstart guide on how to use Beam on Windows 
>> but saw that those guides are exclusively for Linux users.
>>
>> What documentation is available for people wanting to use Beam on Windows 
>> machines?
>>
>> Thanks!


Re: I'm back and ready to help grow our community!

2018-05-22 Thread Ismaël Mejía
I missed somehow this email thread.
Congratulations Gris and welcome back!

On Fri, May 18, 2018 at 5:34 AM Jesse Anderson 
wrote:

> Congrats!

> On Thu, May 17, 2018, 6:44 PM Robert Burke  wrote:

>> Congrats & welcome back!

>> On Thu, May 17, 2018, 5:44 PM Huygaa Batsaikhan 
wrote:

>>> Welcome back, Gris! Congratulations!

>>> On Thu, May 17, 2018 at 4:24 PM Robert Bradshaw 
wrote:

 Congratulations, Gris! And welcome back!
 On Thu, May 17, 2018 at 3:30 PM Robin Qiu  wrote:

 > Congratulations! Welcome back!

 > On Thu, May 17, 2018 at 3:23 PM Reuven Lax  wrote:

 >> Congratulations! Good to see you back!

 >> Reuven

 >> On Thu, May 17, 2018 at 2:24 PM Griselda Cuevas 
wrote:

 >>> Hi Everyone,


 >>> I was absent from the mailing list, slack channel and our Beam
 community for the past six weeks, the reason was that I took a leave to
 focus on finishing my Masters Degree, which I finally did on May 15th.


 >>> I graduated as a Masters of Engineering in Operations Research
with a
 concentration in Data Science from UC Berkeley. I'm glad to be part of
this
 community and I'd like to share this accomplishment with you so I'm
adding
 two pictures of that day :)


 >>> Given that I've seen so many new folks around, I'd like to use this
 opportunity to re-introduce myself. I'm Gris Cuevas and I work at
Google.
 Now that I'm back, I'll continue to work on supporting our community
in two
 main streams: Contribution Experience & Events, Meetups, and
Conferences.


 >>> It's good to be back and I look forward to collaborating with you.


 >>> Cheers,

 >>> Gris


Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-16 Thread Ismaël Mejía
Hello,

Answers to the questions inline:

> 1. Are there any limitations in terms of implementations, functionalities
or performance if we want to run streaming on Beam with Spark runner vs
streaming on Spark-Streaming directly ?

At this moment the Spark runner does not support some parts of the Beam
model in streaming mode, e.g. side inputs and the state/timer API.
Comparing this with pure Spark Streaming is not easy given the semantic
differences of Beam.

> 2. Spark features like checkpointing, kafka offset management, how are
they supported in Apache Beam? Do we need to do some extra work for them?

Checkpointing is supported. Kafka offset management (if I understand what
you mean) is handled by the KafkaIO connector + the runner, so this should
be ok.
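
For example, something along these lines (a quick sketch from memory with
imports omitted; the checkpoint dir, brokers, topic and group id are
placeholders, and commitOffsetsInFinalize() requires a recent KafkaIO, so
double-check the exact option/method names against the version you use):

SparkPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
options.setRunner(SparkRunner.class);
// Directory where the runner persists its checkpoint data for recovery.
options.setCheckpointDir("hdfs:///beam/checkpoints/my-job");

Pipeline pipeline = Pipeline.create(options);
pipeline.apply(KafkaIO.<Long, String>read()
    .withBootstrapServers("broker1:9092")
    .withTopic("events")
    .withKeyDeserializer(LongDeserializer.class)
    .withValueDeserializer(StringDeserializer.class)
    .updateConsumerProperties(
        ImmutableMap.<String, Object>of(ConsumerConfig.GROUP_ID_CONFIG, "my-group"))
    // Commit the consumed offsets back to Kafka when the checkpoint is
    // finalized, so the consumer group position survives restarts.
    .commitOffsetsInFinalize()
    .withoutMetadata());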

> 3. with spark 2.x structured streaming , if we want to switch across
different modes like from micro-batching to continuous streaming mode, how
it can be done while using Beam?

To do this the Spark runner would need to translate the Beam pipeline using
the Structured Streaming API, which is not the case today; it uses the
RDD-based API, but we expect to tackle this in the not-so-far future.
However, even if we did, Spark's continuous mode is quite limited at this
moment because it does not support aggregation functions.

https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing

Don't hesitate to give Beam and the Spark runner a try and reach out to us
if you have questions or find any issues.

Regards,
Ismaël

On Tue, May 15, 2018 at 2:22 PM chandan prakash 
wrote:

> Also,

> 3. with spark 2.x structured streaming , if we want to switch across
different modes like from micro-batching to continuous streaming mode, how
it can be done while using Beam?

> These are some of the initial questions which I am not able to understand
currently.


> Regards,
> Chandan

> On Tue, May 15, 2018 at 5:45 PM, chandan prakash <
chandanbaran...@gmail.com> wrote:

>> Hi Everyone,
>> I have just started exploring and understanding Apache Beam for new
project in my firm.
>> In particular, we have to take decision whether to implement our product
over spark streaming (as spark batch is already in our eco system) or
should we use Beam over spark runner to have future liberty of changing
underline runner.

>> Couple of questions, after going through beam docs and examples, I have
is:

>> Are there any limitations in terms of implementations, functionalities
or performance if we want to run streaming on Beam with Spark runner vs
streaming on Spark-Streaming directly ?

>> Spark features like checkpointing, kafka offset management, how are they
supported in Apache Beam? Do we need to do some extra work for them?


>> Any answer or link to like wise discussion will be really appreciable.
>> Thanks in advance.

>> Regards,
>> --
>> Chandan Prakash




> --
> Chandan Prakash


Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-03-08 Thread Ismaël Mejía
Excellent, loved the 'Nobody writes a paper about their IO API'. IO is
such an important but less valued part of Big Data, kind of ironic.
Great work Eugene !

On Thu, Mar 8, 2018 at 9:40 PM, Kenneth Knowles  wrote:
> Love it. Great flashy title, too :-)
>
> On Thu, Mar 8, 2018 at 12:16 PM Eugene Kirpichov 
> wrote:
>>
>> Oops that's just the template I used. Thanks for noticing, will regenerate
>> the PDF and reupload when I get to it.
>>
>> On Thu, Mar 8, 2018, 11:59 AM Dan Halperin  wrote:
>>>
>>> Looks like it was a good talk! Why is it Google Confidential &
>>> Proprietary, though?
>>>
>>> Dan
>>>
>>> On Thu, Mar 8, 2018 at 11:49 AM, Eugene Kirpichov 
>>> wrote:

 Hey all,

 The slides for my yesterday's talk at Strata San Jose
 https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696
 have been posted on the talk page. They may be of interest both to users 
 and
 IO authors.

 Thanks.
>>>
>>>
>


Re: Get file from S3 bucket

2018-02-22 Thread Ismaël Mejía
Hello,

Beam 2.3.0 introduced a native reader for S3, see the module
https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-amazon-web-services/2.3.0

You should add this dependency to your project and then you can read
using the usual read transforms (e.g. TextIO.read()).
It supports authentication, so you can read both public and private buckets.
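
A minimal sketch of how a read can look once the module is on the classpath
(the bucket, region and paths are made up, imports are omitted, and the
credentials part depends on your setup):

AwsOptions options = PipelineOptionsFactory.fromArgs(args).as(AwsOptions.class);
options.setAwsRegion("eu-west-1");
// For private buckets the credentials come from the default AWS provider
// chain, or you can set them explicitly, e.g.:
// options.setAwsCredentialsProvider(new AWSStaticCredentialsProvider(
//     new BasicAWSCredentials("accessKey", "secretKey")));

Pipeline pipeline = Pipeline.create(options);
pipeline.apply(TextIO.read().from("s3://my-bucket/path/to/input-*.csv"))
    .apply(TextIO.write().to("s3://my-bucket/path/to/output"));
pipeline.run().waitUntilFinish();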

You can find an example here
https://github.com/jbonofre/beam-samples/blob/master/amazon-web-services/src/main/java/org/apache/beam/samples/ingest/amazon/IngestToS3.java

Regards,
Ismaël


On Thu, Feb 22, 2018 at 3:12 PM, Fabian Strunz
 wrote:
> Hello!
>
>
>
> Is there already a how-to on how to read a file from a S3 bucket?
>
>
>
> Can you only read public buckets or private buckets as well?
>
>
>
>
>
>
>
> Regards
>
>


Re: Regarding Beam SlackChannel

2018-02-15 Thread Ismaël Mejía
Done, you will receive and email, welcome!

On Thu, Feb 15, 2018 at 2:33 PM, Willy Lulciuc  wrote:
> Hello:
>
> Can someone please add me to the Beam slackchannel?
>
> Thanks.


Re: Deprecate and remove support for Kafka 0.9.x and older version

2018-02-06 Thread Ismaël Mejía
Agree with JB, showing deprecation is ok, but I think it is worth to
support Kafka 0.9.x for some extra time. Users tend to stay in old
data stores because migrating these clusters isn't always so easy.


On Tue, Feb 6, 2018 at 3:56 PM, Jean-Baptiste Onofré  wrote:
> +1 to flag as deprecated, but I would wait a bit before simply removing it.
>
> Regards
> JB
>
> On 02/03/2018 01:12 AM, Raghu Angadi wrote:
>> Is anyone using Apache Beam with Kafka 0.9.x and older?
>>
>> I am thinking of deprecating 0.9.x and 0.10.0 in Beam 2.4 and remove support 
>> in
>> 2.5 or later. 0.10.1 and up will be supported. 0.10.1+ includes much better
>> timestamp support.
>>
>> By deprecation I mean KafkaIO would continue to work with an older version at
>> runtime, but would not build with it (e.g. `mvn 
>> -Dkafka.clients.version=0.9.1`
>> fails).  We can print a deprecation warning at runtime.
>>
>> [1]: 
>> https://github.com/apache/kafka/commit/23c69d62a0cabf06c4db8b338f0ca824dc6d81a7
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Strata Conference this March 6-8

2018-01-18 Thread Ismaël Mejía
My apologies, I somehow misread the dates and thought you referred to the
London conference, but well, in the end this becomes two good ideas :)

- A meetup in London for the week of May 21
- A meetup in San Jose if someone can organize it for the March dates.


On Thu, Jan 18, 2018 at 12:00 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
> So doing a streaming BoF join in would probably require meeting somewhere
> other than a coffee shop so as not to be jerks in the coffee shop.
>
> On Wed, Jan 17, 2018 at 2:53 PM, Matthias Baetens
> <matthias.baet...@datatonic.com> wrote:
>>
>> Sure, I'd be very happy to organise something. This is about Strata San
>> Jose though right? Maybe we can organise a remote session in which we can
>> join (depending on when you would organise the BoF) or have a channel set-up
>> if the talks would be broadcasted?
>>
>> Also: will there be any Beam talks on Strata London or is this not known
>> yet? Keen to get involved and set things up around that date as well.
>>
>> On Wed, Jan 17, 2018 at 8:37 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>>
>>> That's a great idea ! I'm sure that Matthias (organizer of the Beam
>>> London Meetup) can help us to plan something.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 01/17/2018 08:57 AM, Ismaël Mejía wrote:
>>>>
>>>> Maybe a good idea to try to organize a Beam meetup in london in the
>>>> same dates in case some of the people around can jump in and talk too.
>>>>
>>>> On Wed, Jan 17, 2018 at 2:51 AM, Ron Gonzalez <zlgonza...@yahoo.com>
>>>> wrote:
>>>>>
>>>>> Works for me...
>>>>>
>>>>> On Tuesday, January 16, 2018, 5:45:33 PM PST, Holden Karau
>>>>> <hol...@pigscanfly.ca> wrote:
>>>>>
>>>>>
>>>>> How would folks feel about during the afternoon break (3:20-4:20) on
>>>>> the
>>>>> Wednesday (same day as Eugene's talk)? We could do the Philz which is a
>>>>> bit
>>>>> of a walk but gets us away from the big crowd and also lets folks not
>>>>> attending the conference but in the area join us.
>>>>>
>>>>> On Tue, Jan 16, 2018 at 5:29 PM, Ron Gonzalez <zlgonza...@yahoo.com>
>>>>> wrote:
>>>>>
>>>>> Cool, let me know if you guys finally schedule it. I will definitely
>>>>> try to
>>>>> make it to Eugene's talk but having an informal BoF in the area would
>>>>> be
>>>>> nice...
>>>>>
>>>>> Thanks,
>>>>> Ron
>>>>>
>>>>> On Tuesday, January 16, 2018, 5:06:53 PM PST, Boris Lublinsky
>>>>> <boris.lublin...@lightbend.com > wrote:
>>>>>
>>>>>
>>>>> All for it
>>>>>
>>>>> Boris Lublinsky
>>>>> FDP Architect
>>>>> boris.lublin...@lightbend.com
>>>>> https://www.lightbend.com/
>>>>>
>>>>> On Jan 16, 2018, at 7:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>> +1 to BoF
>>>>>
>>>>> On Tue, Jan 16, 2018 at 5:00 PM, Dmitry Demeshchuk
>>>>> <dmi...@postmates.com>
>>>>> wrote:
>>>>>
>>>>> Probably won't be attending the conference, but totally down for a BoF.
>>>>>
>>>>> On Tue, Jan 16, 2018 at 4:58 PM, Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>> Do interested folks have any timing constraints around a BoF?
>>>>>
>>>>> On Tue, Jan 16, 2018 at 4:30 PM, Jesse Anderson
>>>>> <je...@bigdatainstitute.io>
>>>>> wrote:
>>>>>
>>>>> +1 to BoF. I don't know if any Beam talks will be on the schedule.
>>>>>
>>>>>> We could do an informal BoF at the Philz nearby or similar?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Dmitry Demeshchuk.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>
>>
>>
>>
>> --
>>
>>
>> Matthias Baetens
>>
>>
>> datatonic | data power unleashed
>>
>> office +44 203 668 3680  |  mobile +44 74 918 20646
>>
>> Level24 | 1 Canada Square | Canary Wharf | E14 5AB London
>>
>>
>> We've been announced as one of the top global Google Cloud Machine
>> Learning partners.
>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau


Re: Strata Conference this March 6-8

2018-01-16 Thread Ismaël Mejía
Maybe it is a good idea to try to organize a Beam meetup in London on the
same dates, in case some of the people around can jump in and talk too.

On Wed, Jan 17, 2018 at 2:51 AM, Ron Gonzalez  wrote:
> Works for me...
>
> On Tuesday, January 16, 2018, 5:45:33 PM PST, Holden Karau
>  wrote:
>
>
> How would folks feel about during the afternoon break (3:20-4:20) on the
> Wednesday (same day as Eugene's talk)? We could do the Philz which is a bit
> of a walk but gets us away from the big crowd and also lets folks not
> attending the conference but in the area join us.
>
> On Tue, Jan 16, 2018 at 5:29 PM, Ron Gonzalez  wrote:
>
> Cool, let me know if you guys finally schedule it. I will definitely try to
> make it to Eugene's talk but having an informal BoF in the area would be
> nice...
>
> Thanks,
> Ron
>
> On Tuesday, January 16, 2018, 5:06:53 PM PST, Boris Lublinsky
>  wrote:
>
>
> All for it
>
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>
> On Jan 16, 2018, at 7:01 PM, Ted Yu  wrote:
>
> +1 to BoF
>
> On Tue, Jan 16, 2018 at 5:00 PM, Dmitry Demeshchuk 
> wrote:
>
> Probably won't be attending the conference, but totally down for a BoF.
>
> On Tue, Jan 16, 2018 at 4:58 PM, Holden Karau  wrote:
>
> Do interested folks have any timing constraints around a BoF?
>
> On Tue, Jan 16, 2018 at 4:30 PM, Jesse Anderson 
> wrote:
>
> +1 to BoF. I don't know if any Beam talks will be on the schedule.
>
>> We could do an informal BoF at the Philz nearby or similar?
>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>
>
>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau


Re: [VOTE] [DISCUSSION] Remove support for Java 7

2017-10-18 Thread Ismaël Mejía
Small correction: EOL of Java 8 is Sep. 2018, not Mar. 2018.
http://www.oracle.com/technetwork/java/eol-135779.html

JB, the goal of this thread is to get the opinions/constraints of the users
of all the runners, but we have to reach some consensus and deal with the
tradeoffs for existing users vs the future of the project. So far we don't
have many reports from users on Spark 1.5 or, more importantly, from people
constrained by the need for Java 7 support, but we need to wait and see
before taking a decision.


On Wed, Oct 18, 2017 at 12:59 PM, Srinivas Reddy
<srinivas96all...@gmail.com> wrote:
> +1
>
> -
> Srinivas
>
> - Typed on tiny keys. pls ignore typos.{mobile app}
>
> On 17-Oct-2017 9:47 PM, "Ismaël Mejía" <ieme...@gmail.com> wrote:
>>
>> We have discussed recently in the developer mailing list about the
>> idea of removing support for Java 7 on Beam. There are multiple
>> reasons for this:
>>
>> - Java 7 has not received public updates for almost two years and most
>> companies are moving / have already moved to Java 8.
>> - A good amount of the systems Beam users rely on have decided to drop
>> Java 7 support, e.g. Spark, Flink, Elasticsearch, even Hadoop plans to
>> do it on version 3.
>> - Most Big data distributions and Cloud managed Spark/Hadoop services
>> have already moved to Java 8.
>> - Recent versions of core libraries Beam uses are moving to be Java 8
>> only (or mostly), e.g. Guava, Google Auto, etc.
>> - Java 8 has some nice features that can make Beam code nicer e.g.
>> lambdas, streams.
>>
>> Considering that Beam is a ‘recent’ project we expect users to be
>> already using Java 8. However we wanted first to ask the opinion of
>> the Beam users on this subject. It could be the case that some of the
>> users are still dealing with some old cluster running on Java 7 or
>> have another argument to keep the Java 7 compatibility.
>>
>> So, please vote:
>> +1 Yes, go ahead and move Beam support to Java 8.
>>  0 Do whatever you want. I don’t have a preference.
>> -1 Please keep Java 7 compatibility (if possible add your argument to
>> keep supporting for Java 7).


Re: [VOTE] [DISCUSSION] Remove support for Java 7

2017-10-18 Thread Ismaël Mejía
+1

I forgot to vote yesterday. I don't really think this is a change
worth requiring a major version of Beam; just clear information on the
site/release notes should make it. Also, I am afraid that if we wait
until we have enough changes to switch Beam to a new major version, the
switch to Java 8 will happen too late, probably after Java 8's end of
life. And I am not exaggerating, Java 8 is planned to EOL next March
2018! (of course Oracle usually changes this). In any case, go go Java
8 ASAP!


On Wed, Oct 18, 2017 at 8:08 AM, Prabeesh K. <prabsma...@gmail.com> wrote:
> +1
>
> On 18 October 2017 at 05:16, Griselda Cuevas <g...@google.com> wrote:
>>
>> +1
>>
>> On 17 October 2017 at 16:36, Robert Bradshaw <rober...@google.com> wrote:
>>>
>>> +1 to removing Java 7 support, pending no major user outcry to the
>>> contrary.
>>>
>>> In terms of versioning, I fall into the camp that this isn't
>>> sufficiently incompatible to warrant a major version increase.
>>> Semantic versioning is all about messaging, and upgrading the major
>>> version so soon after GA for such a minor change would IMHO cause more
>>> confusion that clarity. Hitting 3.0 should signal a major improvement
>>> to Beam itself.
>>>
>>> On Tue, Oct 17, 2017 at 3:52 PM, Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>> > +1 to removing Java 7 support.
>>> >
>>> > In terms of release 3.0, we can handle this two ways:
>>> > - Wait until enough other potentially incompatible changes accumulate,
>>> > do
>>> > all of them, and call it a "3.0" release, so that 3.0 will truly differ
>>> > in a
>>> > lot of incompatible and hopefully nice ways from 2.x. This might well
>>> > take a
>>> > year or so.
>>> > - Make a release in which Java 7 support is removed, and call it a
>>> > "3.0"
>>> > release just to signal the incompatibility, and other potentially
>>> > incompatible changes will wait until "4.0" etc.
>>> >
>>> > I suppose the decision depends on whether we have a lot of other
>>> > incompatible changes we would like to do, and whether we have any other
>>> > truly great features enabled by those changes, or at least truly great
>>> > features justifying increasing the major version number. If we go with
>>> > #1,
>>> > I'd say, among the current work happening in Beam, portability comes to
>>> > mind
>>> > as a sufficiently huge milestone, so maybe drop Java 7 in the same
>>> > release
>>> > that offers a sufficient chunk of the portability work?
>>> >
>>> > (There's also a third path: declare that dropping Java7 support is not
>>> > sufficiently "incompatible" to warrant a major version increase,
>>> > because
>>> > people don't have to rewrite their code but only switch their compiler
>>> > version, and people who already use a Java8 compiler won't even notice.
>>> > This
>>> > path could perhaps be considered if we had evidence that switching to a
>>> > Beam
>>> > release without Java7 support would require 0 work for an overwhelming
>>> > majority of users)
>>> >
>>> >
>>> >
>>> > On Tue, Oct 17, 2017 at 3:34 PM Randal Moore <rdmoor...@gmail.com>
>>> > wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Tue, Oct 17, 2017 at 5:21 PM Raghu Angadi <rang...@google.com>
>>> >> wrote:
>>> >>>
>>> >>> +1.
>>> >>>
>>> >>> On Tue, Oct 17, 2017 at 2:11 PM, David McNeill <dav...@mcpond.co.nz>
>>> >>> wrote:
>>> >>>>
>>> >>>> The final version of Beam that supports Java 7 should be clearly
>>> >>>> stated
>>> >>>> in the docs, so those stuck on old production infrastructure for
>>> >>>> other java
>>> >>>> app dependencies know where to stop upgrading.
>>> >>>>
>>> >>>> David McNeill
>>> >>>> 021 721 015
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On 18 October 2017 at 05:16, Ismaël Mejía <ieme...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> We have discussed recently in the devel

Re: Failed to run Wordcount example

2017-08-16 Thread Ismaël Mejía
I just executed the same commands that you pasted in your email and it
worked for me. Can you verify that you are not having network issues
while downloading the dependencies with Maven?

On Wed, Aug 16, 2017 at 10:06 AM, Vincent Wang <fvunic...@gmail.com> wrote:
> Hi Ismaël,
>
>   I'm running the command
>
>  mvn archetype:generate \
>   -DarchetypeGroupId=org.apache.beam \
>   -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
>   -DarchetypeVersion=2.0.0 \
>   -DgroupId=org.example \
>   -DartifactId=word-count-beam \
>   -Dversion="0.1" \
>   -Dpackage=org.apache.beam.examples \
>   -DinteractiveMode=false
>
> and
>
> mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
>  -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner
>
> just the way on the quick start page. It seems that the example somehow
> takes its own version as some beam dependency's version accidentally.
>
> BTW, I'm using the latest master branch.
>
> Thanks,
> Huafeng
>
>
> Ismaël Mejía <ieme...@gmail.com>于2017年8月16日周三 下午3:57写道:
>>
>> Hello,
>>
>> The error message shows that it is looking for the Beam 0.1 version
>> and that version does not exist in maven central.
>> You have to replace the version of Beam in the command you executed
>> with the latest version that means 2.0.0 at this moment and it should
>> work.
>>
>> Regards,
>> Ismaël
>>
>>
>> On Wed, Aug 16, 2017 at 8:21 AM, Vincent Wang <fvunic...@gmail.com> wrote:
>> > Hi guys,
>> >
>> >   I'm trying to run the wordcount example according to the quick start
>> > but I
>> > got following error:
>> >
>> > [INFO]
>> > 
>> > [INFO] Building word-count-beam 0.1
>> > [INFO]
>> > 
>> > [WARNING] The POM for org.eclipse.m2e:lifecycle-mapping:jar:1.0.0 is
>> > missing, no dependency information available
>> > [WARNING] Failed to retrieve plugin descriptor for
>> > org.eclipse.m2e:lifecycle-mapping:1.0.0: Plugin
>> > org.eclipse.m2e:lifecycle-mapping:1.0.0 or one of its dependencies could
>> > not
>> > be resolved: Failure to find org.eclipse.m2e:lifecycle-mapping:jar:1.0.0
>> > in
>> > https://repo.maven.apache.org/maven2 was cached in the local repository,
>> > resolution will not be reattempted until the update interval of central
>> > has
>> > elapsed or updates are forced
>> > [WARNING] The POM for org.apache.beam:beam-sdks-java-core:jar:0.1 is
>> > missing, no dependency information available
>> > [WARNING] The POM for
>> >
>> > org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:jar:0.1
>> > is missing, no dependency information available
>> > [WARNING] The POM for
>> > org.apache.beam:beam-sdks-java-extensions-protobuf:jar:0.1 is missing,
>> > no
>> > dependency information available
>> > [INFO]
>> > 
>> > [INFO] BUILD FAILURE
>> > [INFO]
>> > 
>> > [INFO] Total time: 0.960 s
>> > [INFO] Finished at: 2017-08-16T14:14:42+08:00
>> > [INFO] Final Memory: 18M/309M
>> > [INFO]
>> > 
>> > [ERROR] Failed to execute goal on project word-count-beam: Could not
>> > resolve
>> > dependencies for project org.example:word-count-beam:jar:0.1: The
>> > following
>> > artifacts could not be resolved:
>> >
>> > org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:jar:0.1,
>> > org.apache.beam:beam-sdks-java-extensions-protobuf:jar:0.1: Failure to
>> > find
>> >
>> > org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:jar:0.1
>> > in https://repo.maven.apache.org/maven2 was cached in the local
>> > repository,
>> > resolution will not be reattempted until the update interval of central
>> > has
>> > elapsed or updates are forced -> [Help 1]
>> >
>> >   Any idea?
>> >
>> > Thanks,
>> > Huafeng


Re: Failed to run Wordcount example

2017-08-16 Thread Ismaël Mejía
Hello,

The error message shows that it is looking for the Beam 0.1 version,
and that version does not exist in Maven Central.
You have to replace the version of Beam in the command you executed
with the latest version, which is 2.0.0 at this moment, and it should
work.

Regards,
Ismaël


On Wed, Aug 16, 2017 at 8:21 AM, Vincent Wang  wrote:
> Hi guys,
>
>   I'm trying to run the wordcount example according to the quick start but I
> got following error:
>
> [INFO]
> 
> [INFO] Building word-count-beam 0.1
> [INFO]
> 
> [WARNING] The POM for org.eclipse.m2e:lifecycle-mapping:jar:1.0.0 is
> missing, no dependency information available
> [WARNING] Failed to retrieve plugin descriptor for
> org.eclipse.m2e:lifecycle-mapping:1.0.0: Plugin
> org.eclipse.m2e:lifecycle-mapping:1.0.0 or one of its dependencies could not
> be resolved: Failure to find org.eclipse.m2e:lifecycle-mapping:jar:1.0.0 in
> https://repo.maven.apache.org/maven2 was cached in the local repository,
> resolution will not be reattempted until the update interval of central has
> elapsed or updates are forced
> [WARNING] The POM for org.apache.beam:beam-sdks-java-core:jar:0.1 is
> missing, no dependency information available
> [WARNING] The POM for
> org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:jar:0.1
> is missing, no dependency information available
> [WARNING] The POM for
> org.apache.beam:beam-sdks-java-extensions-protobuf:jar:0.1 is missing, no
> dependency information available
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
> [INFO] Total time: 0.960 s
> [INFO] Finished at: 2017-08-16T14:14:42+08:00
> [INFO] Final Memory: 18M/309M
> [INFO]
> 
> [ERROR] Failed to execute goal on project word-count-beam: Could not resolve
> dependencies for project org.example:word-count-beam:jar:0.1: The following
> artifacts could not be resolved:
> org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:jar:0.1,
> org.apache.beam:beam-sdks-java-extensions-protobuf:jar:0.1: Failure to find
> org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:jar:0.1
> in https://repo.maven.apache.org/maven2 was cached in the local repository,
> resolution will not be reattempted until the update interval of central has
> elapsed or updates are forced -> [Help 1]
>
>   Any idea?
>
> Thanks,
> Huafeng


Re: Beam Slack channel

2017-06-29 Thread Ismaël Mejía
Invitation sent!

On Thu, Jun 29, 2017 at 9:16 AM, Patrick Reames
 wrote:
> Can i also get an invite?
>
> On 2017-06-25 08:51 (-0500), Aleksandr  wrote:
>> Hello,>
>> Can someone  please add me to the slack channel?>
>>
>> Best regards>
>> Aleksandr Gortujev.>
>>
>


Re: Slack channel invite

2017-05-02 Thread Ismaël Mejía
Done.

On Tue, May 2, 2017 at 1:05 PM, Josh Di Fabio  wrote:
> Please will someone kindly invite joshdifa...@gmail.com to the Beam slack
> channel?


Re: Approach to writing to Redis in Streaming Pipeline

2017-03-16 Thread Ismaël Mejía
Hello,

Probably it is not worth the effort to write a new RedisIO from scratch
considering there is an ongoing Pull Request for this.

https://github.com/apache/beam/pull/1687

Maybe you can take a look to see if the current WIP is enough for your
needs, and eventually give a hand there to improve it if it is not.
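
In the meantime, if you keep the ParDo approach, the @Setup/@Teardown idea
that Raghu and Eugene mention below would look more or less like this (a
minimal sketch assuming the Jedis client; the host, port and key/value
layout are placeholders, imports omitted):

class WriteToRedisFn extends DoFn<KV<String, String>, Void> {
  private final String host;
  private final int port;
  private transient Jedis jedis;

  WriteToRedisFn(String host, int port) {
    this.host = host;
    this.port = port;
  }

  @Setup
  public void setup() {
    // One client per DoFn instance, reused across bundles, instead of
    // opening and closing a connection for every element.
    jedis = new Jedis(host, port);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    jedis.set(c.element().getKey(), c.element().getValue());
  }

  @Teardown
  public void teardown() {
    if (jedis != null) {
      jedis.close();
    }
  }
}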

Regards,
Ismaël


On Friday, March 17, 2017, sowmya balasubramanian
 wrote:
>
> Thanks a ton Raghu and Eugene! The Setup and Teardown is what I was looking 
> for. I will try it and see how it goes.
>
> Regards,
> Sowmya
>
> On Thu, Mar 16, 2017 at 6:02 PM, Eugene Kirpichov  
> wrote:
>>
>> Yes, please use a ParDo. The Custom Sink API is not intended for use with 
>> unbounded collections (i.e. in pretty much any streaming pipelines) and it's 
>> generally due for a redesign. ParDo is currently almost always a better 
>> choice when you want to implement a connector writing data to a third-party 
>> system, unless you're just implementing export to a particular file format 
>> (in which case FileBasedSink is appropriate).
>>
>> Concur with what Raghu said about @Setup/@Teardown.
>>
>> On Thu, Mar 16, 2017 at 3:02 PM Raghu Angadi  wrote:
>>>
>>> ParDo is ok.
>>>
>>> Do you open a connection in each processElement() invocation? If you can 
>>> reuse the connection, you can open once in @Setup method and close it in 
>>> @Teardown.
>>>
>>> Raghu.
>>>
>>> On Thu, Mar 16, 2017 at 2:19 PM, sowmya balasubramanian 
>>>  wrote:

 Hi All,

 I am newbie who has recently entered the world of GCP and pipelines.

 I have a streaming pipeline in which I write to a Redis sink at the end. 
 The pipeline writes about 60,000 events per 15 minute window it processes. 
 I implemented the writing to Redis using a ParDo.

 The prototype worked well for small set of streaming events. However, when 
 I tested with my full dataset, every now and then I noticed the Redis 
 client (Jedis) threw a SocketException. (The client opens connection every 
 time it has to write to Redis, then closes the connection)

 Couple of questions I have:

 Is there a preferred Redis client for the pipeline?
 Does it make sense to write a Custom Redis sink instead of a ParDo?

 Thanks,
 Sowmya





>>>
>


Re: New blog post: "Stateful processing with Apache Beam"

2017-02-15 Thread Ismaël Mejía
Great post, I like the use of the previous figure style with geometric
forms and colors, as well as the table analogy that really helps to
understand the concepts. I am still digesting some of the consequences of
the State API, in particular the implications of using state that you
mention at the end. Really good to discuss those also as part of the post.

I found some small typos and formatting issues that I addressed here.
https://github.com/apache/beam-site/pull/156

Thanks for writing,
Ismaël


On Tue, Feb 14, 2017 at 11:50 AM, Jean-Baptiste Onofré 
wrote:

> Hey Ken
>
> Just take a quick look and it's a great post !
>
> Thanks
> Regards
> JB
> On Feb 13, 2017, at 18:44, Kenneth Knowles  wrote:
>>
>> Hi all,
>>
>> I've just published a blog post about Beam's new stateful processing
>> capabilities:
>>
>> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
>>
>> The blog post covers stateful processing from a few angles: how it
>> works, how it fits into the Beam model, what you might use it for, and
>> finally some examples of stateful Beam code.
>>
>> I'd love for you to take a look and see how this feature might apply to
>> your use of Beam. And, of course, I'd also love to hear from you about it.
>>
>> Kenn
>>
>


Re: KafkaIO: reset topic for reading from the start with every run

2017-01-24 Thread Ismaël Mejía
One extra reminder: if you use the DirectRunner you can set the
DirectOptions to relax the validations of the runner (and gain some
speed improvements):

setEnforceImmutability(false)
setEnforceEncodability(false)
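
e.g. (a quick sketch, assuming the DirectOptions interface of the direct
runner module):

DirectOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DirectOptions.class);
options.setEnforceImmutability(false);
options.setEnforceEncodability(false);
Pipeline pipeline = Pipeline.create(options);

As Thomas explains below, those checks exist to catch model violations
early, so it is better to disable them only once your pipeline is stable.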

On Mon, Jan 23, 2017 at 8:22 PM, Gareth Western 
wrote:

> Thanks Thomas. I'll be sure to convey that in the demo. The Flink local
> runner performs nicely. I'm now setting up the Flink cluster for the next
> test.
>
>
>
> On 23. jan. 2017 20:20, Thomas Groh wrote:
>
>> You should also generally expect the DirectRunner to be slower than
>> production runners - the goals of the DirectRunner are primarily ensuring
>> that Pipelines are portable to other production runners and enforcing the
>> Beam Programming Model while enabling local iteration and development, and
>> exposing bugs early. As a result, there is a relatively large amount of
>> additional work done per-element that will slow the Pipeline, and consume
>> additional local resources as the Pipeline executes.
>>
>>
>