Yes, please report documentation issues via JIRA (I added you as a
contributor so that you can create them); also feel free to open a PR
addressing the issue.
On Tue, Jan 24, 2017 at 5:34 AM, Tobias Feldhaus <
tobias.feldh...@localsearch.ch> wrote:
> Hi,
>
> in the Programming Guide, under the
, Mingmin <ming...@ebay.com> wrote:
> Thanks Lukasz. With the provided window function, can I control how the
> watermark moves forward? Or is a customized WindowFn required?
>
> Sent from my iPhone
>
> On Dec 27, 2016, at 10:40 AM, Lukasz Cwik <lc...@google.com> wrot
If the PCollection is small you can just convert it into a
PCollectionView using View.asList and then in another ParDo read
this list in as a side input and iterate over all the elements using
index offsets into the list.
To parallelize the above, you need to break up the List into ranges
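A rough sketch of the View.asList approach (Beam Java; `small` and `main` are
placeholder PCollections, not from the original thread):
```
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// 'small' and 'main' are placeholder PCollection<String>s.
PCollectionView<List<String>> listView = small.apply(View.asList());

PCollection<String> joined =
    main.apply(
        ParDo.of(
                new DoFn<String, String>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    List<String> list = c.sideInput(listView);
                    // Iterate the materialized list by index offset.
                    for (int i = 0; i < list.size(); i++) {
                      c.output(c.element() + "," + list.get(i));
                    }
                  }
                })
            .withSideInputs(listView));
```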
Have you thought of fetching the schema upfront from BigQuery and
prefiltering out any records in a preceding DoFn instead of relying on
BigQuery telling you that the schema doesn't match?
Otherwise you are correct in believing that you will need to update
BigQueryIO to have the retry/error
How do you know when a record in the data pipeline has enough metadata
stored so that it can be processed?
How far behind is the metadata pubsub compared to the main pubsub?
Do you expect late data/metadata, and if so what do you want to do?
Also, side inputs aren't meant to be slow and
BoundedSource is able to report the timestamp[1] for records. It is just
that runners know that it is a fixed dataset so they have a trivial
optimization where the watermark goes from negative infinity to positive
infinity once all the data is read. For bounded splittable DoFns, it's
likely that
But the windows can still be processed out of order.
On Thu, Aug 3, 2017 at 2:10 PM, Lukasz Cwik <lc...@google.com> wrote:
> Yes.
>
> On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang <ef...@stacklighting.com> wrote:
>
>> Thanks Lukasz. If the stream is infinite, I am
There is currently no strict ordering supported within Apache Beam
(timestamp or not); any ordering you may observe is just a side
effect and is not guaranteed in any way.
Since the smallest unit of work is a bundle containing 1 element, the only
way to get ordering is to make one
ledged PubSub
> messages). In this case would Dataflow's autoscaling still scale up? I
> thought the reason the autoscaler scales up is due to the watermark lagging
> behind, but is it also aware of the acknowledged PubSub messages?
>
> On 3 Aug 2017, at 18:58, Lukasz Cwik <lc...@google.co
Yes.
On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang <ef...@stacklighting.com> wrote:
> Thanks Lukasz. If the stream is infinite, I am assuming you mean to window
> the stream into panes and sort the bundle for each trigger to an order?
>
> Eric
>
> On Thu, Aug 3, 2017 at
I have invited you to the slack channel.
2 million data points doesn't seem like it should be an issue.
Have you considered trying a simpler combiner like Count to see if the
bottleneck is with the combiner that you are supplying?
Also, could you share the code for what resample_function does?
Do you have any stack traces or error messages that would provide more
details to the failure?
On Tue, Jul 11, 2017 at 11:28 AM, Will Walters
wrote:
> Hello,
>
> I've recently been successful running the Quickstart on my cluster (in
> Flink through Yarn on Hadoop).
You can do this efficiently with Apache Beam but you would need to write
code which converts a user's expression into a set of PTransforms or create
a few pipeline variants for commonly computed outcomes. There are already
many transforms which can compute things like min, max, average. Take a
look
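For reference, the built-in aggregations look roughly like this (a sketch;
`values` is a placeholder PCollection<Long>):
```
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.Min;
import org.apache.beam.sdk.values.PCollection;

// 'values' is a placeholder PCollection<Long>.
PCollection<Long> min = values.apply(Min.longsGlobally());
PCollection<Long> max = values.apply(Max.longsGlobally());
PCollection<Double> mean = values.apply(Mean.globally());
```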
RK. Will it be easy to do this ?
>
> Thanks again Lukasz !
>
>
> Le 2017-07-23 20:42, Lukasz Cwik a écrit :
>
>> You can do this efficiently with Apache Beam but you would need to
>> write code which converts a user's expression into a set of PTransforms
>> or crea
Welcome.
Sent you an invite.
On Wed, Jul 12, 2017 at 12:16 PM, Matthew Sole
wrote:
> Hi,
>
> Could you add me to slack please?
>
> Thank you,
>
> Matt
>
es it
> wait until the whole map is collected?
> #2 can the DoFn specify that it depends on only specific keys of the side
> input map? does that affect the scheduling of the DoFn?
>
> Thanks for any pointers...
> rdm
>
> On Wed, Jul 5, 2017 at 4:58 PM Lukasz Cwik <lc...@go
#1, supported side input sizes and performance are specific to a runner.
For example, I know that Dataflow supports side inputs which are 1+ TiB
(aggregate) in batch pipelines and ~100s MiBs per window because there have
been several one off benchmarks/runs. What kinds of sizes/use case do you
That should have said:
~100s MiBs per window in streaming pipelines
On Wed, Jul 5, 2017 at 2:58 PM, Lukasz Cwik <lc...@google.com> wrote:
> #1, supported side input sizes and performance are specific to a runner.
> For example, I know that Dataflow supports side inputs whic
It was removed because many of the fields stored in PipelineOptions were
not really cloneable but were used as a way to pass around items such as an
ExecutorService or Credentials for dependency injection reasons.
With the above caveat that you're not getting a true clone, feel free to copy
the code
Done
On Tue, Apr 25, 2017 at 1:24 PM, Alexandre Crayssac <
alexandre.crays...@polynom.io> wrote:
> Same here!
>
> Thanks
>
> On Mon, Apr 24, 2017 at 5:16 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Done
>>
>> On Mon, Apr 24, 2017 at 7:34 AM
Welcome Griselda, Steve, and Apache.
Steve, this has come up before but it is against Slack's free tier policy
to have a bot which sends invites out automatically.
On Wed, Aug 16, 2017 at 10:18 AM, Apache Enthu
wrote:
> Please could you add me too?
>
> Thanks,
> Almas
ython type with the
> CountCombineFn and it's still stuck.
>
> Here is what I can see on my GCP console (this screenshot shows 36 minutes
> but I waited for 5 hours to be sure) :
> [image: Selection_070.png]
>
>
> On Wed, Aug 16, 2017 at 1:08 AM Lukasz Cwik <lc...@google
Invite sent, welcome.
On Thu, Aug 17, 2017 at 5:05 PM, Subramanyam Chitti <
subramanyam.chi...@bigcommerce.com> wrote:
> Hi,
> Could you please add me to the slack channel? My email address is
> subramanyam.chi...@bigcommerce.com
>
This is not expected; reach out to Google Cloud support with some recent
running/killed job ids or e-mail dataflow-feedb...@google.com.
On Mon, Aug 21, 2017 at 2:19 PM, Steve Anderson wrote:
> Has anyone used allowed late data with a session window? Every time I've
> tried to
Moving this to user@beam.apache.org
In the latest snapshot version of Apache Beam, file based sources like
AvroIO/TextIO were updated to support reading from Hadoop, see
HadoopFileSystem
Have you tried installing a logger onto the root JUL logger once the
pipeline starts executing in the worker (so inside one of your DoFn(s)
setup methods)?
// "" is the name of the root JUL logger.
String ROOT_LOGGER_NAME = "";
LogManager.getLogManager().getLogger(ROOT_LOGGER_NAME).addHandler(myCustomHandler);
Also, the logging integration
What are you trying to do by nacking the message?
If you're trying to delay the processing of the message till a future time,
look at using State & Timers with StatefulDoFn to queue the message for
processing at a future time. See this blog for some examples:
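A minimal sketch of that pattern (assuming a recent Beam Java SDK; the state
and timer API details vary across versions):
```
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

// Buffers each message and sets a processing-time timer to revisit it later.
class DelayFn extends DoFn<KV<String, String>, String> {
  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag();

  @TimerId("emit")
  private final TimerSpec emitSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(
      ProcessContext c,
      @StateId("buffer") BagState<String> buffer,
      @TimerId("emit") Timer emit) {
    buffer.add(c.element().getValue());
    // Revisit the buffered messages one minute from now.
    emit.offset(Duration.standardMinutes(1)).setRelative();
  }

  @OnTimer("emit")
  public void onEmit(OnTimerContext c, @StateId("buffer") BagState<String> buffer) {
    for (String msg : buffer.read()) {
      c.output(msg);
    }
    buffer.clear();
  }
}
```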
Filed BEAM-2500 as a feature request.
On Thu, Jun 22, 2017 at 9:00 AM, tarush grover <tarushappt...@gmail.com>
wrote:
> Hi All,
>
> Can we add a module s3-file-system in beam to directly support and have
> integration with s3?
>
> Regards,
> Tarush
>
> On Thu, 22
rent threads, I see a log statement from inside my
> synchronized block for each thread, which shouldn't be possible.
>
> Thoughts?
>
>
> On Thu, Jun 15, 2017 at 6:26 AM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Take a look at DoFn setup/teardown, called only onc
Have you tried an AppEngine flex environment?
I know that users have tried AppEngine standard with the Java SDK and have
hit limitations of the standard environment which are not easy to resolve.
The solution has always been to suggest users try the flex environment (
Take a look at CompressedSource:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java
I feel as though you could follow the same pattern to decompress/decrypt
the data as a wrapper.
Apache Beam supports a concept of dynamic work
Take a look at session windows[1]. As long as the messages you post to
Pubsub aren't spaced out farther than the session gap duration, they will
all get grouped together.
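For example (a sketch; `messages` is a placeholder PCollection and 10 minutes
is an arbitrary gap duration):
```
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Messages arriving within 10 minutes of each other land in one session.
messages.apply(
    Window.into(Sessions.withGapDuration(Duration.standardMinutes(10))));
```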
It seems as though it would be much simpler to just run a separate Apache
Beam job for each internal job you want to process
You want to depend on the Hadoop File System module[1] and configure
HadoopFileSystemOptions[2] with a S3 configuration[3].
1:
https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
2:
Why not use a singleton-like pattern and have a function which either loads
and caches the ML model from a side input or returns the singleton if it
has already been loaded?
You'll want to use some form of locking to ensure that you really only load
the ML model once.
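A sketch of the double-checked locking idea (`Model` and its deserialize
method are hypothetical placeholders):
```
// Hypothetical sketch: 'Model' and Model.deserialize are placeholders.
class ModelHolder {
  private static volatile Model model;

  static Model getOrLoad(byte[] serializedModel) {
    if (model == null) {
      synchronized (ModelHolder.class) {
        if (model == null) {
          // Only the first caller pays the deserialization cost.
          model = Model.deserialize(serializedModel);
        }
      }
    }
    return model;
  }
}
```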
On Wed, May 24, 2017 at 6:18 AM,
Since you're using a small number of shards, add a Partition transform which
uses a deterministic hash of the key to choose one of 4 partitions. Write
each partition with a single shard.
(Fixed width diagram below)
Pipeline -> AvroIO(numShards = 4)
Becomes:
Pipeline -> Partition -->
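A sketch of the Partition step (Beam Java; `keyed` is a placeholder
PCollection<KV<String, String>>):
```
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionList;

// Deterministically route each key to one of 4 partitions.
PCollectionList<KV<String, String>> parts =
    keyed.apply(
        Partition.of(
            4,
            (KV<String, String> kv, int numPartitions) ->
                Math.floorMod(kv.getKey().hashCode(), numPartitions)));
// Each parts.get(i) can then be written with AvroIO using numShards = 1.
```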
What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)?
On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan wrote:
> Sorry that was an autocorrect error. I meant to ask - what dataflow runner
> are you using? If you are using google cloud dataflow then the
> partitions so that Dataflow won't override numShards.
>
> Josh
>
>
> On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Since you're using a small number of shards, add a Partition transform
>> which uses a deterministic hash of the key to
For the compilation error, have you tried?
Long maxLatency = e.get((TupleTag) (TupleTag) finalOutputTags.get(0));
I'm not sure whether I fully understand the problem.
On Sat, May 27, 2017 at 2:15 AM, 郭亚峰(默岭) wrote:
>
> Hi there,
> I'm working with a small DSL
dk.util.SerializableUtils.serializeToByteAr
>>>>> ray(SerializableUtils.java:49)
>>>>> ... 10 more
>>>>>
>>>>> The way I'm trying to use this in the ParDo/DoFn is:
>>>>>
>>>>> (line 138 starts here)
>>
Flattening all the dependencies into one jar is build system dependent. If
using Maven I would look into the Maven Shade Plugin (
https://maven.apache.org/plugins/maven-shade-plugin/).
Jar files are also just zip files so you could merge them manually as well
but you'll need to deal with
hoo-inc.com>
wrote:
> Yeah, we're working on altering the build file to include all dependencies
> in one, huge jar. Is there a better way than this to run Beam jobs on a
> cluster? Putting everything into a jar seems like a clunky solution.
>
>
> On Friday, June 2, 2017 1
morrow.
>>
>> Sorry for the confusing question!
>>
>> Josh
>>
>> On Tue, Jun 6, 2017 at 10:01 PM, Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Based upon your descriptions, it seemed like you wanted limited
>>> parallelism becaus
Combining PubSub + Bigtable is common.
You should try to use the BigtableSession approach because the hbase
approach adds a lot of dependencies (leading to dependency conflicts).
You should use the same version of Bigtable libraries that Apache Beam is
using (Apache Beam 2.0.0 uses Bigtable
Take a look at DoFn setup/teardown, called only once per DoFn instance and
not per element, which makes it easier to write initialization code.
Also if the schema map is shared, have you thought of using a single static
instance of Guava's LoadingCache shared amongst all the DoFn instances?
You can
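A sketch of the shared-cache idea (the schema fetch and the String types are
hypothetical placeholders):
```
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;
import org.apache.beam.sdk.transforms.DoFn;

class SchemaLookupFn extends DoFn<String, String> {
  // One cache per worker JVM, shared by all instances of this DoFn.
  private static final LoadingCache<String, String> SCHEMAS =
      CacheBuilder.newBuilder()
          .expireAfterWrite(10, TimeUnit.MINUTES)
          .build(
              new CacheLoader<String, String>() {
                @Override
                public String load(String table) {
                  // Placeholder for the real remote schema fetch.
                  return fetchSchemaFromService(table);
                }
              });

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    c.output(SCHEMAS.get(c.element()));
  }

  private static String fetchSchemaFromService(String table) {
    return ""; // hypothetical lookup
  }
}
```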
Unfortunately you can't Combine Writes since they return PDone (a terminal
node) during pipeline construction.
On Sun, Jun 11, 2017 at 3:23 PM, Gwilym Evans
wrote:
> I'm not 100% sure as I haven't tried it, but, Combining comes to mind as a
> possible way of doing
Yes, you can use the HadoopFileSystem with the Hadoop S3A connector.
Documentation about options/configuration for S3 Hadoop connectors:
https://wiki.apache.org/hadoop/AmazonS3
Build a valid Hadoop configuration for S3 and set it on the
HadoopFileSystemOptions:
Configuration s3Configuration =
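A sketch of how that configuration might be wired up, assuming the s3a
connector's property names (credentials are placeholders):
```
import java.util.Collections;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

// Point the s3a connector at your credentials.
Configuration s3Configuration = new Configuration();
s3Configuration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
s3Configuration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

// Hand the Hadoop configuration to Beam's HadoopFileSystem.
HadoopFileSystemOptions options =
    PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
options.setHdfsConfiguration(Collections.singletonList(s3Configuration));
```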
The Dataflow implementation, when executing a batch pipeline, does not
parallelize dependent fused segments irrespective of the windowing function,
so #1 will fully execute before #2 starts.
On Sat, Jun 10, 2017 at 3:48 PM, Morand, Sebastien <
sebastien.mor...@veolia.com> wrote:
> Hi again,
>
> So
I believe that if your data from the past can't affect the data of the
future because the windows/state are independent of each other then just
reprocessing the old data using a batch job is simplest and likely to be
the fastest.
About your choices 1, 2, and 3, allowed lateness is relative to the
JB, for your second point it seems as though you may not be setting the
Hadoop configuration on HadoopFileSystemOptions.
Also, I just merged https://github.com/apache/beam/pull/2890 which will
auto detect Hadoop configuration based upon your HADOOP_CONF_DIR and
YARN_CONF_DIR environment variables.
buffering data in windows, and want the shard guarantee to apply across
> windows.
>
>
>
> On Tue, Jun 6, 2017 at 5:42 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Your code looks like what I was describing. My only comment would be to
>> use a deterministic hashing
at the runner won't parallelise the
> final MyDofn across e.g. 8 instances instead of 4? If there are two input
> elements with the same key are they actually guaranteed to be processed on
> the same instance?
>
>
> Thanks,
>
> Josh
>
>
>
>
> On Tue, Jun 6, 2017 at 4:51
Invitation sent. Welcome
On Tue, Oct 3, 2017 at 10:56 AM, Tim <timrobertson...@gmail.com> wrote:
> May I also have one please?
>
> Tim,
> Sent from my iPhone
>
> On 3 Oct 2017, at 19:22, Lukasz Cwik <lc...@google.com> wrote:
>
> Invitation sent, welcome.
>
Can you stream the updates to the keys into the pipeline and then use it as
a side input, performing a join against your main stream that needs the
config data?
You could also use an in memory cache that periodically refreshes keys from
the external source.
A better answer depends on:
* how
onfig.
> 2. The config data shouldn't be change often. It is configured by human
> users.
> 3. The config data per key should be about 10-20 key value pairs.
> 4. Ideally the key number is in the range of a few millions, but a few
> thousands to begin with.
>
> Thanks
> Eric
to five times, if it still fails, maybe
> we'll just write it to logs and then redo this write later.
>
> Do you think that makes sense?
>
> Thanks,
>
> Derek
>
> On Mon, Oct 16, 2017 at 10:31 AM, Lukasz Cwik <lc...@google.com> wrote:
>
>> That source is not a
The `numberOfWorkerHarnessThreads` is worker-wide and not per DoFn. Setting
this value to constrain how many threads are executing will impact all
parts of your pipeline. One idea is to use a Semaphore as a static object
within your DoFn with a fixed number of allowed actors that can enter and
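A sketch of the Semaphore idea (the limit of 4 and the expensive call are
placeholders):
```
import java.util.concurrent.Semaphore;
import org.apache.beam.sdk.transforms.DoFn;

class ThrottledFn extends DoFn<String, String> {
  // Static, so the limit applies across every instance of this DoFn in the JVM.
  private static final Semaphore PERMITS = new Semaphore(4);

  @ProcessElement
  public void processElement(ProcessContext c) throws InterruptedException {
    PERMITS.acquire();
    try {
      c.output(expensiveCall(c.element()));
    } finally {
      PERMITS.release();
    }
  }

  // Placeholder for the rate-limited work.
  private static String expensiveCall(String s) {
    return s;
  }
}
```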
Invite sent, welcome.
On Fri, Oct 13, 2017 at 3:07 PM, NerdyNick wrote:
> Hello
>
> Can someone please add me to the Beam slack channel?
>
> Thanks.
>
Check out https://beam.apache.org/documentation/io/io-toc/ and the
PTransform style guide
https://beam.apache.org/contribute/ptransform-style-guide/.
The PTransform style guide contains a lot of useful points which are
general but still apply to IO authors.
On Mon, Oct 16, 2017 at 3:52 PM, Eric
That is correct, autoscaling for streaming is only supported with the Pubsub
source. What sources were you interested in?
On Mon, Sep 4, 2017 at 12:54 AM, Derek Hao Hu
wrote:
> I've used PubSubIO for autoscaling on a streaming pipeline and it seems to
> be working fine so far.
>
> I
To my knowledge you should use Spark 1.6.3 since that is what is declared
as the spark.version in the project's root pom.xml.
On Wed, Aug 30, 2017 at 2:45 PM, Mahender Devaruppala <
mahend...@apporchid.com> wrote:
> Hello,
>
>
>
> I am running into a spark assertion error when running an apache
This is neat.
On Fri, Sep 29, 2017 at 1:26 PM, Vilhelm von Ehrenheim <
vonehrenh...@gmail.com> wrote:
> Hi Steve!
> I have several pipelines that successfully use both numpy and scikit
> models without any problems. I don't think I use Pandas atm but I'm sure
> that is fine too.
>
> However, you
Invitation sent, welcome.
On Tue, Oct 3, 2017 at 9:14 AM, Jon Brasted wrote:
> Hello,
>
> Please may I have an invitation to the Apache Beam Slack channel?
>
>
> Thanks,
> Jon
>
Using a bounded (batch style) pipeline you should be able to just group all
events by user and ignore windowing completely and produce any information
since you'll have a global view of all events. This scales well since data
for a user is only held up to the point that it is processed and then
Apache Beam supports a fixed number of shards but discourages their use for
auto-tuning/scaling reasons: leaving sharding to the runner simplifies
building good, scalable pipelines.
Some users do require a fixed number of shards and several classes like
TextIO support fixed sharding. If you're trying to always use a fixed
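For example, fixed sharding with TextIO looks roughly like this (`lines` and
the output path are placeholders):
```
import org.apache.beam.sdk.io.TextIO;

// Exactly 4 output files are produced, at the cost of the runner's
// ability to tune sharding itself.
lines.apply(TextIO.write().to("gs://my-bucket/output").withNumShards(4));
```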
* Is code executed within @ProcessElement of a DoFn using State API
guaranteed to be "serialized" per-K and per-window (by "serialized" I mean
that it will produce the same effect as if every execution of the
@ProcessElement method for a given K and window had executed to completion
before the
handles attachments.
>
> Jacob
>
> On Tue, Oct 17, 2017 at 2:54 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Have you considered using a stateful DoFn, buffering/batching based upon
>> a certain number of elements is shown in this blog[1] and could be extended
>
per worker instance,
>> then this creates a backlog and autoscaling might trigger earlier, so
>> technically the overall system lag might actually be better?
>>
>> I haven't tested this hypothesis yet but basically the above is my
>> reasoning.
>>
>> Thanks,
Marble <jmar...@kochava.com> wrote:
> Here's a gist: https://gist.github.com/jacobmarble/
> 6ca40e0a14828e6a0dfe89b9cb2e4b4c
>
> Should I consider StateId value mutations to be non-atomic?
>
> Jacob
>
> On Wed, Oct 18, 2017 at 9:25 AM, Lukasz Cwik <lc...@google.co
Filed https://issues.apache.org/jira/browse/BEAM-3198 for the
IllegalArgumentException.
Do you mind posting a little code snippet of how you build the BQ IO
connector on BEAM-3198?
On Wed, Nov 15, 2017 at 12:18 PM, Arpan Jain wrote:
> Hi,
>
> I am trying to use
It seems like you're trying to use Spark 2.1.0. Apache Beam currently relies
on users using Spark 1.6.3. There is an open pull request[1] to migrate to
Spark 2.2.0.
1: https://github.com/apache/beam/pull/4208/
On Mon, Dec 4, 2017 at 10:58 AM, Opitz, Daniel A
wrote:
> We
I also believe we were still in the investigatory phase for dropping
support for Java 7.
On Mon, Dec 4, 2017 at 2:22 PM, Eugene Kirpichov
wrote:
> Thanks JB for sending the detailed notes about new stuff in 2.2.0! A lot
> of exciting things indeed.
>
> Regarding Java 8: I
Since processing can happen out of order, for example if the input was:
```
{"id": "2", parent_id: "a", "timestamp": 2, "amount": 3}
{"id": "1", parent_id: "a", "timestamp": 1. "amount": 1}
{"id": "1", parent_id: "a", "timestamp": 3, "amount": 2}
```
would the output be 3 and then 5 or would you
Instead of a callback fn, it's most useful if a PCollection is returned
containing the result of the sink so that any arbitrary additional
functions can be applied.
On Fri, Dec 1, 2017 at 7:14 AM, Jean-Baptiste Onofré
wrote:
> Agree, I would prefer to do the callback in the IO
1) Based upon your question, it seems like you want to have one large job
that then spins off really small jobs. Any reason you can't just do it all
within one pipeline?
2. Am I capable of defining flexible window sizes per "device-task"?
Yes, you'll want a custom WindowFn which when going
BigQueryIO has been written in such a way to support emitting failed
records to a "dead letter queue". Not all IO transforms support this but it
is very useful for the ones that do.
WriteResult writeResult = p.apply(PubsubIO.readMessagesWithAttributes()
.fromSubscription(""))
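The continuation might look roughly like this (a sketch, not the original
snippet; `rows`, the table name, and the retry policy are placeholders):
```
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.values.PCollection;

// 'rows' is a placeholder PCollection<TableRow>.
WriteResult writeResult =
    rows.apply(
        BigQueryIO.writeTableRows()
            .to("project:dataset.table")
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

// Rows BigQuery rejected come back as a PCollection for dead letter handling.
PCollection<TableRow> failedRows = writeResult.getFailedInserts();
```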
Invites sent, welcome.
On Tue, Nov 21, 2017 at 8:49 AM, Andrew Jones
wrote:
> Me too, please :)
>
> On Tue, 21 Nov 2017, at 16:19, Dariusz Aniszewski wrote:
> > Hello
> >
> > Can someone please add me to the Beam slack channel?
> >
> > Thanks.
>
Just sent one to you Eric, welcome.
On Tue, Nov 21, 2017 at 8:54 AM, Eric Anderson wrote:
> Me three :)
>
> On Tue, Nov 21, 2017 at 8:49 AM Andrew Jones
> wrote:
>
>> Me too, please :)
>>
>> On Tue, 21 Nov 2017, at 16:19, Dariusz Aniszewski
Are we talking about integration testing or general pipeline execution
metrics?
For integration testing, I would expect users to rely on PAssert and a test
runner to do testing, similar to Apache Beam's @ValidatesRunner or IO
integration tests.
For production pipeline monitoring, the common metric
chava.com> wrote:
>
> Me too, if you don't mind.
>
> Jacob
>
> On Thu, Nov 9, 2017 at 2:09 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Invite sent, welcome.
>>
>> On Thu, Nov 9, 2017 at 2:08 PM, Fred Tsang <ftsan...@hotmail.com> wrote:
&
Invite sent, welcome.
On Thu, Nov 9, 2017 at 2:08 PM, Fred Tsang wrote:
> Hi,
>
> Please add me to the slack channel.
>
> Thanks,
> Fred
>
> Ps. I think "BeamTV" would be a great YouTube channel ;)
>
For joining with external data you have some options:
* Do direct calls to the external datastore, perform your own in memory
caching/expiration. You control exactly what happens and when it happens
but as you have done this in the past you know what this entails.
* Ingest the external data and
To convert a PCollection into an ArrayList locally within your application,
you would need to materialize your PCollection via some sink like
AvroIO/TextIO/... and then, in your program after the pipeline has completed,
read the output files, parsing the records.
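A sketch of that flow (`records` and the output path are placeholders):
```
import org.apache.beam.sdk.io.TextIO;

// Write the PCollection out and block until the pipeline finishes; the
// driver program can then read the /tmp/records* shard files and parse
// each line back into records.
records.apply(TextIO.write().to("/tmp/records"));
pipeline.run().waitUntilFinish();
```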
On Tue, Oct 31, 2017 at 1:29 AM,
a nice way of Accumulating and Aggregating data along ad infinitum :) )
>>
>> Something I'd love to add though (on the Views)
>> * When I use the Singleton, how can I ensure that I'm only getting the
>> value for the Key I want in the DoFn. I see there is
It looks like something failed within your job and the error you're getting
is from your driver program (not the remote execution that is happening
within Google Cloud). You'll want to look at the Stackdriver logs for
details, you can get more details about how to see your Stackdriver logs
here[1].
On Wed, May 16, 2018 at 10:46 PM chandan prakash
wrote:
> Thanks Ismaël.
> Your answers were quite useful for a novice user.
> I guess this answer will help many like me.
>
> *Regarding your answer to point 2 :*
>
>
> *"Checkpointing is supported, Kafka offset
This is not exposed programmatically. Depending on which runner you're using,
you may be able to query for the watermark through its monitoring APIs if
it is exposed as a metric and you know what it is called. This is likely to
be a brittle implementation and the data is likely to be stale.
On
side inputs from
>>>> the main input? By that I mean the scope of the side input would be a per
>>>> window one and it would be different for every window. Is that correct?
>>>>
>>>> Regards,
>>>> Harsh
>>>>
>>>> O
e accounts again.
>
>
> On Tue, May 15, 2018 at 15:11 Lukasz Cwik <lc...@google.com> wrote:
>
>> For each BillingModel you receive over Kafka, how "fresh" should the
>> account information be?
>> Does the account information in the extern
For each BillingModel you receive over Kafka, how "fresh" should the
account information be?
Does the account information in the external store change?
On Tue, May 15, 2018 at 11:22 AM Harshvardhan Agrawal <
harshvardhan.ag...@gmail.com> wrote:
> Hi,
>
> We have certain billing data that arrives
There is none to my knowledge.
On Wed, May 23, 2018 at 1:49 PM Udi Meiri wrote:
> Hi all,
>
> I was looking yesterday for a quickstart guide on how to use Beam on
> Windows but saw that those guides are exclusively for Linux users.
>
> What documentation is available for
I should have clarified that the precision guarantee I was talking about
was timing.
On Thu, May 24, 2018 at 2:21 PM Lukasz Cwik <lc...@google.com> wrote:
> The runner is responsible for scheduling the work anywhere it chooses. It
> can be the same node all the time or dif
The runner is responsible for scheduling the work anywhere it chooses. It
can be the same node all the time or different nodes.
There is no precision guarantee on the upper bound (only the lower bound); the
withRate method states that it will "generate at most a given number of
elements per a
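For example (a sketch; the rate of one element per second is a placeholder):
```
import org.apache.beam.sdk.io.GenerateSequence;
import org.joda.time.Duration;

// At most one element per second; the lower bound on pacing is honored,
// the upper bound is not guaranteed.
pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)));
```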
m
> the main input? By that I mean the scope of the side input would be a per
> window one and it would be different for every window. Is that correct?
>
> Regards,
> Harsh
>
> On Tue, May 15, 2018 at 17:54 Lukasz Cwik <lc...@google.com> wrote:
>
>> Using deduplicate
tiple times across workers in a window.
>>>
>>> What I was thinking was that it might be better to perform the lookup
>>> only once for each account and product in a window and then supply them as
>>> side inputs to the main input.
>>>
>>> On Tu
A watermark is a lower bound on data that is processed and available. It is
specifically a lower bound because we want runners to be able to process
each window in parallel.
In your example, a Runner may choose to compute Aggregate[Pete:09:01,X,Y]
in parallel with Aggregate[Pete:09:02,X,Y] even
User is the correct mailing list.
beam.io.WriteToText takes 'strings' which means that you have to format the
whole line yourself. You'll want to apply another ParDo
after CreateColForSampleFn which takes the 1x164 record and concatenates
each value with ',' in between.
On Mon, Jun 18, 2018 at
Any updates on BEAM-4512?
On Mon, Jun 11, 2018 at 1:42 PM Lukasz Cwik wrote:
> Thanks all, it seems as though only Google needs the grace period. I'll
> wait for the shorter of BEAM-4512 or two weeks before merging
> https://github.com/apache/beam/pull/5571
>
>
> On Wed, Jun
It is currently the latter, where all the data is read and then filtered
within the pipeline. Note that this doesn't mean all the data is loaded
into memory; how the join is done depends on the Runner that is powering
the pipeline.
Kenn had shared this doc[1] which is starting
em is no different, I wanted to explore the space further and
> find a more elegant solution (Not introducing Cycles if there was a better
> way to handle it).
>
>
>
>
>
> On Thu, Jun 7, 2018 at 10:34 PM Lukasz Cwik wrote:
>
>> A watermark is a lower bound on data that is
es the issue:
> https://github.com/calonso/beam_experiments/blob/master/refreshingsideinput/src/test/scala/com/mrcalonso/RefreshingSideInput2Test.scala#L19-L53
>
> Hope it helps
>
> On Mon, Jun 4, 2018 at 9:04 PM Lukasz Cwik wrote:
>
>> Carlos, can you provide a test