Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Israel Herraiz via dev
I think that would be a Reshuffle,
but only within the context of the same job (e.g. if there is a failure and
a retry, the retry would start from the checkpoint created by the
reshuffle). In Dataflow, a GroupByKey, a combine per key, a CoGroupByKey,
stateful DoFns and, I think, splittable DoFns will also have the same
effect of creating a checkpoint (any shuffling operation will always create
a checkpoint).
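
A minimal sketch of the Reshuffle idea, written against the Beam Python SDK. The input values and the `expensive_step` function are hypothetical stand-ins; `beam.Reshuffle()` is placed right after the costly step so that, as described above, a retry within the same Dataflow job could restart from the shuffle boundary. The Beam import sits inside `run()` only so that the helper can be exercised without the SDK installed.

```python
def expensive_step(element):
    """Hypothetical stand-in for a costly transformation."""
    return element * element


def run():
    # Third-party dependency: requires the apache-beam package.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create([1, 2, 3])
            | "Expensive" >> beam.Map(expensive_step)
            # Shuffle boundary: per the reply above, Dataflow checkpoints here,
            # so a retry in the same job need not redo the step before it.
            | "Checkpoint" >> beam.Reshuffle()
            | "Downstream" >> beam.Map(lambda x: x + 1)
            | "Print" >> beam.Map(print)
        )
```

Calling `run()` executes the pipeline on the default local runner; the checkpointing behaviour itself is specific to Dataflow and is not observable locally.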

If you want to start a different job (slightly updated code, starting from
a previous point of a previous job), in Dataflow that would be a snapshot,
I think.
Snapshots only work in streaming pipelines.

On Wed, 19 Oct 2022 at 08:45, Ravi Kapoor wrote:

> Hi Team,
> Can we stage a PCollection or PCollection data? Let's say,
> to avoid repeating the expensive operations between two complex BQ tables
> time and again, and materialize the result in some temp view which will be
> deleted after the session.
>
> Is it possible to do that in the Beam Pipeline?
> We can later use the temp view in another pipeline to read the data from
> and do processing.
>
> Or, in general, I would like to know: do we ever stage the PCollection?
> Let's say I want to create another instance of the same job which has
> complex processing.
> Does the pipeline re-perform the computation, or would it pick up the
> already processed data from the previous instance, which must be staged
> somewhere?
>
> As in Spark, where we have createOrReplaceTempView, which is used
> to create a temp table from a Spark DataFrame or Dataset.
>
> Please advise.
>
> --
> Thanks,
> Ravi Kapoor
> +91-9818764564
> kapoorrav...@gmail.com
>


More details about PR #9852: Add option to set a temp dataset in BigQueryIO

2020-02-02 Thread Israel Herraiz
Hi all,

I have updated and polished a pull request I submitted some time ago, and
I would like to bring it to the attention of this list, to see if I could
get some feedback or review of the code.

The PR is at https://github.com/apache/beam/pull/9852

It adds a new option withQueryTempDataset to BigQueryIO.Read.

Currently, if I want to read from a table with BigQueryIO, I need to assign
the role bigquery.jobUser to the service account of Apache Beam (e.g.
Dataflow).

However, if I try to read from a view using the same role, the pipeline
will fail, because it needs to create a temporary dataset and table. The
name of this dataset is chosen by Apache Beam.

This in practice requires giving the service account the permission to
create datasets (e.g. assigning the role bigquery.user, not
bigquery.jobUser), which is a very broad permission.

With the submitted PR, you can specify the temporary dataset used to read
from queries (e.g. reading from a view). Thus you can just keep the role
bigquery.jobUser in the Beam service account, and just provide additional
permissions in that dataset to create temporary tables (confining any
potential write activity to that dataset only).

The destination dataset can even be in a different project than the data
you are reading (something that is not possible with the currently
available options), so you don't need to give write permissions in the same
project where the data resides. In situations where there is an
"untouchable" data project with authorized views, it is currently
impossible to read from those authorized views with BigQueryIO, unless you
give write permissions to Beam in the "untouchable" project. With this PR,
you could confine those writes to another project and dataset.
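
For illustration only, here is a sketch of the same idea in the Beam Python SDK. The PR itself targets the Java `BigQueryIO.Read`; the snippet below instead relies on the `temp_dataset` argument of `ReadFromBigQuery`, which plays the equivalent role in Python. All project, dataset, and view names are hypothetical, and the Beam imports are kept inside the function so the query helper can be checked without the SDK installed.

```python
def view_query(project, dataset, view):
    """Build a standard SQL query that selects everything from a view."""
    return f"SELECT * FROM `{project}.{dataset}.{view}`"


def read_view(pipeline, query, temp_project, temp_dataset_id):
    # Third-party dependency: requires apache-beam[gcp].
    import apache_beam as beam
    from apache_beam.io.gcp.internal.clients import bigquery

    # An existing dataset, possibly in another project, in which the
    # temporary table holding the query results is created.
    temp_dataset = bigquery.DatasetReference(
        projectId=temp_project, datasetId=temp_dataset_id)

    return pipeline | "ReadView" >> beam.io.ReadFromBigQuery(
        query=query,
        use_standard_sql=True,
        temp_dataset=temp_dataset,
    )
```

A call such as `read_view(p, view_query("data-project", "reporting", "orders_view"), "scratch-project", "beam_temp")` should then confine the temporary table to `scratch-project.beam_temp`, assuming the service account can create tables in that dataset.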

I hope the need for this option makes sense. Any thoughts?

Kind regards,
Israel


BigtableIO: is it really experimental?

2020-01-20 Thread Israel Herraiz
Hi,

I have been working lately quite a lot with Dataflow and Bigtable, using
Beam's BigtableIO.

The documentation (
https://beam.apache.org/releases/javadoc/2.17.0/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.html)
shows that BigtableIO is annotated as experimental.

The same docs also say: *this connector for Cloud Bigtable is
considered experimental and may break or receive backwards-incompatible
changes in future versions of the Apache Beam SDK. Cloud Bigtable is in
Beta, and thus it may introduce breaking changes in future revisions of its
service or APIs.*

However, Bigtable went out of beta in 2016 (
https://cloud.google.com/bigtable/docs/release-notes#June_29_2016) and many
of the client libraries (all?) are in GA too (Java went out of beta in
September 2019:
https://cloud.google.com/bigtable/docs/release-notes#September_25_2019).

So I am wondering if BigtableIO is truly experimental, and whether I should
submit a pull request updating the comment about Cloud Bigtable being in
beta.

Kind regards,
Israel


Re: Contributor permission for Beam Jira tickets

2019-10-22 Thread Israel Herraiz
Thank you!


On Tue, Oct 22, 2019 at 4:52 PM Jean-Baptiste Onofré wrote:

> Hi Israel,
>
> Welcome aboard!
>
> I just added you to the contributors on Jira.
>
> Regards
> JB
>
> On 22/10/2019 21:43, Israel Herraiz wrote:
> > Hi,
> >
> > My name is Israel, and I work in the Professional Services team at
> > Google.
> >
> > I have created a JIRA issue and submitted the corresponding pull request
> > (see https://issues.apache.org/jira/browse/BEAM-8458), and I would like
> > to be able to assign the issue to myself.
> >
> > That's my second pull request sent to Apache Beam.
> >
> > My JIRA username is iht
> > (see https://issues.apache.org/jira/secure/ViewProfile.jspa?name=iht).
> >
> > Please could anyone give me that permission in JIRA?
> >
> > Thanks in advance.
> >
> > Israel
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Contributor permission for Beam Jira tickets

2019-10-22 Thread Israel Herraiz
Hi,

My name is Israel, and I work in the Professional Services team at Google.

I have created a JIRA issue and submitted the corresponding pull request
(see https://issues.apache.org/jira/browse/BEAM-8458), and I would like to
be able to assign the issue to myself.

That's my second pull request sent to Apache Beam.

My JIRA username is iht (see
https://issues.apache.org/jira/secure/ViewProfile.jspa?name=iht).

Please could anyone give me that permission in JIRA?

Thanks in advance.

Israel