Re: [VOTE] JupyterLab Sidepanel extension release v1.0.0 for BEAM-10545 RC #1

2020-10-09 Thread Ning Kang
To Pablo,

The public key in use by NPM can be found in the blog post [3]
https://blog.npmjs.org/post/172999548390/new-pgp-machinery. A direct link:
https://keybase.io/npmregistry/pgp_keys.asc
Quoted from the blog:

> We’ve also chosen to use Keybase to publicize our PGP key and give you
> confidence that the npm registry you install from is the same registry
> that’s signing packages. Our account on Keybase is npmregistry.

Keybase can be found here: https://keybase.io/

Thanks!
Ning.

On Fri, Oct 9, 2020 at 1:53 PM Pablo Estrada  wrote:

> +1
> I installed the extension and reviewed it as well.
>
> I have a question: You mention that NPM will sign the package. What key
> will it use? We may need to upload your pgp key to the Beam list of keys?
> Thanks Ning!
> -P.
>
> On Tue, Oct 6, 2020 at 2:57 PM Ning Kang  wrote:
>
>> Please review the release of the following jupyter labextension
>> (TypeScript node package) for running Beam notebooks in JupyterLab:
>> * apache-beam-jupyterlab-sidepanel
>>
>> Hi everyone,
>> Please review and vote on release candidate #1 for version 1.0.0,
>> as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> The complete staging area is available for your review, which includes:
>> * the assets (only the
>> `sdks/python/apache_beam/runners/interactive/extensions/apache-beam-jupyterlab-sidepanel`
>> subdirectory) to be published to npmjs.com [1]
>> * commit hash "b7ae7bb1dc28a7c8f26e9f48682e781a74e2d3c4" [2]
>> * package will be signed by NPM once published; the pgp machinery [3]
>>
>> Additional details:
>> * to install the package before it is published, install it locally after
>> cloning the Beam repo or downloading the assets:
>>
>> git checkout jupyterlab-sidepanel-v1.0.0 -b some-branch # if cloning the repo, do this step
>>
>> pushd sdks/python/apache_beam/runners/interactive/extensions/apache-beam-jupyterlab-sidepanel
>>
>> jlpm
>>
>> jlpm build
>>
>> jupyter labextension link .
>> * screenshots of the extension [4]
>> * a publish dry run:
>>
>> npm notice === Tarball Details ===
>>
>> npm notice name:  apache-beam-jupyterlab-sidepanel
>>
>> npm notice version:   1.0.0
>>
>> npm notice package size:  19.8 kB
>>
>> npm notice unpacked size: 101.9 kB
>>
>> npm notice shasum: 7f896de0d6e587aab2bef348a6e94f95f75f280f
>>
>> npm notice integrity: sha512-hdkn2Ni2S0roY[...]ShMK2/MAbQvyQ==
>>
>> npm notice total files:   51
>>
>> npm notice
>>
>> + apache-beam-jupyterlab-sidepanel@1.0.0
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Thanks!
>>
>> [1]
>> https://github.com/apache/beam/releases/tag/jupyterlab-sidepanel-v1.0.0
>> [2]
>> https://github.com/apache/beam/commit/b7ae7bb1dc28a7c8f26e9f48682e781a74e2d3c4
>> [3] https://blog.npmjs.org/post/172999548390/new-pgp-machinery
>> [4]
>> https://docs.google.com/document/d/1aKK8TzSrl8WiG0K4v9xZEfLMCinuGqRlMOyb7xOhgy4/edit#heading=h.he7se5yxfo7
>>
>


Re: [VOTE] JupyterLab Sidepanel extension release v1.0.0 for BEAM-10545 RC #1

2020-10-09 Thread Pablo Estrada
+1
I installed the extension and reviewed it as well.

I have a question: You mention that NPM will sign the package. What key
will it use? We may need to upload your pgp key to the Beam list of keys?
Thanks Ning!
-P.

On Tue, Oct 6, 2020 at 2:57 PM Ning Kang  wrote:

> Please review the release of the following jupyter labextension
> (TypeScript node package) for running Beam notebooks in JupyterLab:
> * apache-beam-jupyterlab-sidepanel
>
> Hi everyone,
> Please review and vote on release candidate #1 for version 1.0.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * the assets (only the
> `sdks/python/apache_beam/runners/interactive/extensions/apache-beam-jupyterlab-sidepanel`
> subdirectory) to be published to npmjs.com [1]
> * commit hash "b7ae7bb1dc28a7c8f26e9f48682e781a74e2d3c4" [2]
> * package will be signed by NPM once published; the pgp machinery [3]
>
> Additional details:
> * to install the package before it is published, install it locally after
> cloning the Beam repo or downloading the assets:
>
> git checkout jupyterlab-sidepanel-v1.0.0 -b some-branch # if cloning the repo, do this step
>
> pushd sdks/python/apache_beam/runners/interactive/extensions/apache-beam-jupyterlab-sidepanel
>
> jlpm
>
> jlpm build
>
> jupyter labextension link .
> * screenshots of the extension [4]
> * a publish dry run:
>
> npm notice === Tarball Details ===
>
> npm notice name:  apache-beam-jupyterlab-sidepanel
>
> npm notice version:   1.0.0
>
> npm notice package size:  19.8 kB
>
> npm notice unpacked size: 101.9 kB
>
> npm notice shasum: 7f896de0d6e587aab2bef348a6e94f95f75f280f
>
> npm notice integrity: sha512-hdkn2Ni2S0roY[...]ShMK2/MAbQvyQ==
>
> npm notice total files:   51
>
> npm notice
>
> + apache-beam-jupyterlab-sidepanel@1.0.0
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks!
>
> [1]
> https://github.com/apache/beam/releases/tag/jupyterlab-sidepanel-v1.0.0
> [2]
> https://github.com/apache/beam/commit/b7ae7bb1dc28a7c8f26e9f48682e781a74e2d3c4
> [3] https://blog.npmjs.org/post/172999548390/new-pgp-machinery
> [4]
> https://docs.google.com/document/d/1aKK8TzSrl8WiG0K4v9xZEfLMCinuGqRlMOyb7xOhgy4/edit#heading=h.he7se5yxfo7
>


Requesting contributor permissions for jira tickets

2020-10-09 Thread Dominik Schöneweiß
Hi everyone,

my name is Dominik, and I’m working on several Beam projects at my day job.
I would like to contribute to the code base and am requesting Jira
contributor permissions.

username: nomnom

Thanks!
-- 


advanced store GmbH
Alte Jakobstraße 79/80
D-10179 Berlin

www.advanced-store.com

Tel: +49 (0)30 577 066 020

Fax: +49 (0)30 577 066 029

Limited liability company (GmbH)

with registered office in Berlin, Amtsgericht Charlottenburg
Commercial register: HRB 115601 B
VAT ID: DE261726838

Managing Director: Marc Majewski



Re: Dataflow updates fail with "Coder has changed" error using KafkaIO with SchemaCoder

2020-10-09 Thread Brian Hulette
Hi Cameron,

Thanks for bringing this up on the dev list. I'm quite familiar with Beam
schemas, but I should be clear that I'm not as familiar with
Dataflow's pipeline update; +Reuven Lax may need to
correct me there.

> I am curious if it has been determined what makes a Schema the same as
another schema. From what I have seen in the codebase, it changes.

You're right, schema equality means different things in different contexts,
and we should be more clear about this. As I understand it, for pipeline
update the important thing isn't so much whether the schemas are actually
equal, but whether data encoded with the old schema can be understood by a
SchemaCoder referencing the new schema, because it's probable that the new
SchemaCoder will receive data that was encoded with the old SchemaCoder. In
order to satisfy that requirement, the old and the new schemas must have
the same fields* in the same order.
It might not seem like maintaining the ordering is an issue, but it is for
schemas inferred from Java types. That's because there's no guarantee about
the order in which we'll discover the fields or methods when using
reflection APIs. I believe Reuven did some experiments here and found that
the ordering is essentially random, so when we infer a schema from a Java
type in two different executions it can result in two completely different
field orders.
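
To make that concrete, here is a minimal sketch, with invented class and
field names, of the kind of type where inference order matters:

```
import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Illustrative POJO (not from this thread): Beam infers its schema from the
// public fields via reflection. Nothing pins the field order, so one
// submission may infer [userId, amountCents] and the next
// [amountCents, userId], and bytes encoded under the first order cannot be
// decoded under the second.
@DefaultSchema(JavaFieldSchema.class)
public class Purchase {
  public String userId;
  public long amountCents;
}
```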

There are a couple of things we definitely need to do on the Beam side to
support pipeline update for SchemaCoder with possibly out-of-order fields:
- BEAM-10277: Java's RowCoder needs to respect the encoding_position field
in the schema proto. This provides a layer of indirection for field
ordering that runners can modify to "fix" schemas that have the same fields
in a different order (see the sketch after this list).
- Java's SchemaCoder needs to encode the schema in a portable way, so that
runners will be able to inspect and modify the schema proto as described
above. Currently SchemaCoder is still represented in the pipeline proto as
a serialized Java class, so runners can't easily inspect/modify it.
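
As a rough illustration of the indirection the first item describes (a
conceptual sketch only, not the actual RowCoder or proto API):

```
import java.util.List;
import java.util.Map;

// Conceptual sketch only; the real mechanism lives in the schema proto and
// RowCoder. Bytes on the wire were written in the old schema's field order;
// the new schema declares the fields in a different order but carries
// encoding positions mapping each field to its slot in the stream.
public class EncodingPositionSketch {
  public static void main(String[] args) {
    // Wire values in the old order: [userId, amountCents].
    List<Object> wireValues = List.of("user-42", 1299L);

    // New schema declares [amountCents, userId], with encoding positions
    // pointing back at the old slots.
    Map<String, Integer> encodingPositions = Map.of("amountCents", 1, "userId", 0);

    // Decode by encoding position rather than by declared position.
    Object amount = wireValues.get(encodingPositions.get("amountCents"));
    Object user = wireValues.get(encodingPositions.get("userId"));
    System.out.println(user + " spent " + amount + " cents"); // user-42 spent 1299 cents
  }
}
```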



All that being said, it looks like you may not be using SchemaCoder with a
schema inferred from a Java type. Where is `outputSchema` coming from? Is
it possible to make sure it maintains a consistent field order?
If you can do that, this may be an easier problem. I think then we could
make a change on the Dataflow side to ignore the schema's UUID when
checking for update compatibility.
On the other hand, if you need to support pipeline update for schemas with
out-of-order fields, we'd need to address the above tasks first. If you're
willing to work on them I can help direct you; these are things I've been
hoping to work on but haven't been able to get to.
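
Returning to the consistent-field-order suggestion above, a minimal sketch
(field names are placeholders) of building `outputSchema` explicitly instead
of inferring it:

```
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;

// Building the schema by hand fixes the field sequence across submissions,
// so the old and new SchemaCoders agree on both the fields and their order.
Schema outputSchema =
    Schema.builder()
        .addStringField("id")
        .addInt64Field("eventTimeMillis")
        .addNullableField("payload", FieldType.STRING)
        .build();
```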

Brian

* Looking forward, we don't actually want to require the schemas to have the
same fields; we could allow adding/removing fields, with certain limitations.

On Thu, Oct 8, 2020 at 12:55 PM Cameron Morgan 
wrote:

> Hey everyone,
>
> *Summary:*
>
> There is an issue with the Dataflow runner and the “Update” capability
> while using the Beam-native Row type, which I imagine also blocks the
> snapshots feature (as the docs say snapshots have the same restrictions
> as the Update feature), but I have no experience there.
>
> Currently when reading from KafkaIO with the valueCoder set as a
> SchemaCoder:
>
> ```
> KafkaIO.read()
> .withTopic(topic)
> .withKeyDeserializer(ByteArrayDeserializer::class.java)
> .withValueDeserializerAndCoder([Deserializer], SchemaCoder.of(outputSchema))
> ```
>
> Updates fail consistently with the error:
> ```
> The original job has not been aborted., The Coder or type for step
> ReadInputTopic/Read(KafkaUnboundedSource)/DataflowRunner.StreamingUnboundedRead.ReadWithIds
> has changed
> ```
>
> There is an open issue about this,
> https://issues.apache.org/jira/browse/BEAM-9502, but I have not seen it
> discussed on the mailing list, so I wanted to start that discussion here.
>
> *Investigation so far:*
>
> This failing on Beam 2.20 and below makes sense, as the code path that
> called equals on this Coder first checked that the schemas were equal
> (this part has not changed):
> https://github.com/apache/beam/blob/release-2.25.0/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoder.java#L194
> That in turn called equals on the schema here, which returned false if the
> UUIDs were different:
> https://github.com/apache/beam/blob/release-2.20.0/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L272
>
> This means that, as the above issue suggests, the randomly generated UUID
> meant that no two SchemaCoders were ever equal, causing equals to return
> false.
>
>
>
> In [BEAM-4076] this was changed (PR link:
> https://github.com/apache/beam/pull/11041/files, direct link:
> 

Re: [BEAM-10587] Support Maps in BigQuery #12389

2020-10-09 Thread Andrew Pilloud
BigQuery has no native support for Map types, but I agree that we should be
consistent with how other tools import maps into BigQuery. Is this
something Dataflow templates do? What other tools are there?

Beam ZetaSQL also lacks support for Map types. I like the idea of adding a
configuration parameter to turn this on and retaining the existing behavior
by default.

Thanks for sending this to the list!

Andrew

On Fri, Oct 9, 2020 at 7:20 AM Jeff Klukas  wrote:

> It's definitely desirable to be able to get back Map types from BQ, and
> it's nice that BQ is consistent in representing maps as repeated key/value
> structs. Inferring maps from that specific structure is preferable to
> inventing some new naming convention for the fields, which would hinder
> interoperability with non-Beam applications.
>
> Would it be possible to add a configurable parameter called something like
> withMapsInferred() ? Default behavior would be the status quo, but users
> could opt in to the behavior of inferring maps based on field names. This
> would prevent the PR change from potentially breaking existing
> applications. And it means the least surprising behavior remains the
> default.
>
> On Fri, Oct 9, 2020 at 6:06 AM Worley, Ryan 
> wrote:
>
>> https://github.com/apache/beam/pull/12389
>>
>> Hi everyone, in the above pull request I am attempting to add support for
>> writing Avro records with maps to a BigQuery table (via Beam Schema).  The
>> write portion is fairly straightforward - we convert the map to an array of
>> structs with key and value fields (seemingly the closest possible
>> approximation of a map in BigQuery).  But the read back portion is more
>> controversial because we simply check if a field is an array of structs
>> with exactly two fields - key and value - and assume that should be read
>> into a Schema map field.
>>
>> So the possibility exists that an array of structs with key and value
>> fields, which wasn't originally written from a map, could be unexpectedly
>> read into a map.  In the PR review I suggested a few options for tagging
>> the BigQuery field, so that we could know it was written from a Beam Schema
>> map and should be read back into one, but I'm not very satisfied with any
>> of the options.
>>
>> Andrew Pilloud suggested that I write to this group to get some feedback
>> on the issue.  Should we be concerned that all arrays of structs with
>> exactly 'key' and 'value' fields would be read into a Schema map or could
>> this be considered a feature?  If the former, how would you suggest that we
>> limit reading into a map only those fields that were originally written
>> from a map?
>>
>> Thanks for any feedback to help bump this PR along!
>>
>


Re: [BEAM-10587] Support Maps in BigQuery #12389

2020-10-09 Thread Jeff Klukas
It's definitely desirable to be able to get back Map types from BQ, and
it's nice that BQ is consistent in representing maps as repeated key/value
structs. Inferring maps from that specific structure is preferable to
inventing some new naming convention for the fields, which would hinder
interoperability with non-Beam applications.

Would it be possible to add a configurable parameter called something like
withMapsInferred() ? Default behavior would be the status quo, but users
could opt in to the behavior of inferring maps based on field names. This
would prevent the PR change from potentially breaking existing
applications. And it means the least surprising behavior remains the
default.
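
For illustration, the opt-in might look like this at the call site. Note
that withMapsInferred() is the hypothetical method proposed above, not an
existing BigQueryIO API:

```
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// Sketch of the proposed opt-in. Everything matches the real BigQueryIO API
// except withMapsInferred(), which is hypothetical.
BigQueryIO.TypedRead<TableRow> read =
    BigQueryIO.readTableRows()
        .from("my-project:my_dataset.my_table") // placeholder table
        .withMapsInferred(); // hypothetical: infer Beam map fields from
                             // repeated key/value structs on read
```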

On Fri, Oct 9, 2020 at 6:06 AM Worley, Ryan  wrote:

> https://github.com/apache/beam/pull/12389
>
> Hi everyone, in the above pull request I am attempting to add support for
> writing Avro records with maps to a BigQuery table (via Beam Schema).  The
> write portion is fairly straightforward - we convert the map to an array of
> structs with key and value fields (seemingly the closest possible
> approximation of a map in BigQuery).  But the read back portion is more
> controversial because we simply check if a field is an array of structs
> with exactly two fields - key and value - and assume that should be read
> into a Schema map field.
>
> So the possibility exists that an array of structs with key and value
> fields, which wasn't originally written from a map, could be unexpectedly
> read into a map.  In the PR review I suggested a few options for tagging
> the BigQuery field, so that we could know it was written from a Beam Schema
> map and should be read back into one, but I'm not very satisfied with any
> of the options.
>
> Andrew Pilloud suggested that I write to this group to get some feedback
> on the issue.  Should we be concerned that all arrays of structs with
> exactly 'key' and 'value' fields would be read into a Schema map or could
> this be considered a feature?  If the former, how would you suggest that we
> limit reading into a map only those fields that were originally written
> from a map?
>
> Thanks for any feedback to help bump this PR along!
>


[BEAM-10587] Support Maps in BigQuery #12389

2020-10-09 Thread Worley, Ryan
https://github.com/apache/beam/pull/12389

Hi everyone, in the above pull request I am attempting to add support for 
writing Avro records with maps to a BigQuery table (via Beam Schema).  The 
write portion is fairly straightforward - we convert the map to an array of 
structs with key and value fields (seemingly the closest possible approximation 
of a map in BigQuery).  But the read back portion is more controversial because 
we simply check if a field is an array of structs with exactly two fields - key 
and value - and assume that should be read into a Schema map field.
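
To make the shape concrete, here is a minimal sketch using Beam's schema API
(the field name is invented for illustration):

```
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;

// A Beam schema with a map field...
Schema beamSchema =
    Schema.builder()
        .addMapField("labels", FieldType.STRING, FieldType.STRING)
        .build();

// ...is written to BigQuery as a repeated STRUCT with exactly two fields:
//
//   labels ARRAY<STRUCT<key STRING, value STRING>>
//
// The read path then pattern-matches that shape (an array of structs whose
// only fields are `key` and `value`) back into a map, which is where the
// ambiguity discussed below comes from.
```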

So the possibility exists that an array of structs with key and value fields, 
which wasn't originally written from a map, could be unexpectedly read into a 
map.  In the PR review I suggested a few options for tagging the BigQuery 
field, so that we could know it was written from a Beam Schema map and should 
be read back into one, but I'm not very satisfied with any of the options.

Andrew Pilloud suggested that I write to this group to get some feedback on the 
issue.  Should we be concerned that all arrays of structs with exactly 'key' 
and 'value' fields would be read into a Schema map or could this be considered 
a feature?  If the former, how would you suggest that we limit reading into a 
map only those fields that were originally written from a map?

Thanks for any feedback to help bump this PR along!



Re: Jenkins CI down

2020-10-09 Thread Ismaël Mejía
Sure, I was not sure whether this had changed after the Jenkins migration; I
thought we could control the Jenkins master without INFRA.
If someone detects issues or reports them to INFRA, please remember to post
to dev@ too, for everyone's awareness.


On Wed, Oct 7, 2020 at 8:56 PM Tyson Hamilton  wrote:

> In the future, if this happens again, please create an Apache INFRA ticket
> like the following and update the dev list:
>
> https://issues.apache.org/jira/browse/INFRA-20954
>
> On Wed, Oct 7, 2020 at 2:20 AM Ismaël Mejía  wrote:
>
>> Yes seems to have been broken for some time, but working now. Thanks.
>>
>> On Wed, Oct 7, 2020 at 8:50 AM Michał Walenia 
>> wrote:
>>
>>> Hi,
>>> which URL are you trying to access? https://ci-beam.apache.org works
>>> for me.
>>>
>>> On Wed, Oct 7, 2020 at 8:06 AM Ismaël Mejía  wrote:
>>>
 Can somebody please check what is going on? It seems our Jenkins
 instance is down (503 Service Unavailable).


>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea | Software Engineer
>>>
>>> M: +48 791 432 002
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! 
>>>
>>