Re: Documentation - Programming Guide - Creating a PCollection

2017-01-24 Thread Lukasz Cwik
Yes, please report documentation issues via JIRA (added you as a contributor so that you can create it); also feel free to open a PR addressing the issue. On Tue, Jan 24, 2017 at 5:34 AM, Tobias Feldhaus < tobias.feldh...@localsearch.ch> wrote: > Hi, > > in the Programming Guide, under the

Re: window settings for recovery scenario

2016-12-28 Thread Lukasz Cwik
, Mingmin <ming...@ebay.com> wrote: > Thanks Lukasz. With the provided window function, can I control how the > watermark move forward ? Or a customized WindowFn is required. > > Sent from my iPhone > > On Dec 27, 2016, at 10:40 AM, Lukasz Cwik <lc...@google.com> wrot

Re: howto zip pcollection with index

2017-04-10 Thread Lukasz Cwik
If the PCollection is small you can just convert it into a PCollectionView using View.asList and then in another ParDo read in this list as a side input and iterate over all the elements using the index offset in the list. To parallelize the above, you need to break up the List into ranges
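
A minimal sketch of this idea, assuming `smallCollection` and `indexes` are existing PCollections in your pipeline (names are illustrative):

```java
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Materialize the small PCollection as a List side input.
PCollectionView<List<String>> listView = smallCollection.apply(View.<String>asList());

// In another ParDo, read the list as a side input and index into it.
PCollection<String> looked =
    indexes.apply(
        ParDo.of(
                new DoFn<Integer, String>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    List<String> all = c.sideInput(listView);
                    // The main input carries the index to look up in the list.
                    c.output(all.get(c.element()));
                  }
                })
            .withSideInputs(listView));
```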

Re: How to skip processing on failure at BigQueryIO sink?

2017-04-11 Thread Lukasz Cwik
Have you thought of fetching the schema upfront from BigQuery and prefiltering out any records in a preceding DoFn instead of relying on BigQuery telling you that the schema doesn't match? Otherwise you are correct in believing that you will need to update BigQueryIO to have the retry/error
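
A rough sketch of the prefiltering idea; the `knownFields` set is a stand-in for column names you would fetch from the table schema (e.g. via the BigQuery client library) before constructing the pipeline:

```java
import com.google.api.services.bigquery.model.TableRow;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical: column names fetched from the BigQuery table schema ahead of time.
final Set<String> knownFields = new HashSet<>(Arrays.asList("user_id", "event", "ts"));

PCollection<TableRow> matchingRows =
    rows.apply(
        "FilterToSchema",
        ParDo.of(
            new DoFn<TableRow, TableRow>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                TableRow row = c.element();
                // Keep only rows whose fields all exist in the table schema; the rest
                // could instead be routed to a side output for inspection.
                if (knownFields.containsAll(row.keySet())) {
                  c.output(row);
                }
              }
            }));
```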

Re: dealing with metadata coming in a separate pubsub

2017-04-21 Thread Lukasz Cwik
How do you know when a record in the data pipeline has enough meta information stored so that it can be processed? How far behind is the meta data pubsub compared to the main pubsub? Do you expect late data/metadata, and if so what do you want to do? Also, side inputs aren't meant to be slow and

Re: Using watermarks with bounded sources

2017-04-24 Thread Lukasz Cwik
BoundedSource is able to report the timestamp[1] for records. It is just that runners know that it is a fixed dataset so they have a trivial optimization where the watermark goes from negative infinity to positive infinity once all the data is read. For bounded splittable DoFns, it's likely that

Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
But the windows can still be processed out of order. On Thu, Aug 3, 2017 at 2:10 PM, Lukasz Cwik <lc...@google.com> wrote: > Yes. > > On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang <ef...@stacklighting.com> wrote: > >> Thanks Lukasz. If the stream is infinite, I am

Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
There is currently no strict ordering supported within Apache Beam (timestamp or not), and any ordering which may be occurring is just a side effect and not guaranteed in any way. Since the smallest unit of work is a bundle containing 1 element, the only way to get ordering is to make one

Re: PubSubIO withTimestampAttribute - what are the implications?

2017-08-03 Thread Lukasz Cwik
ledged PubSub > messages). In this case would Dataflow's autoscaling still scale up? I > thought the reason the autoscaler scales up is due to the watermark lagging > behind, but is it also aware of the acknowledged PubSub messages? > > On 3 Aug 2017, at 18:58, Lukasz Cwik <lc...@google.co

Re: Ordering in PCollection

2017-08-03 Thread Lukasz Cwik
Yes. On Thu, Aug 3, 2017 at 2:09 PM, Eric Fang <ef...@stacklighting.com> wrote: > Thanks Lukasz. If the stream is infinite, I am assuming you mean by window > the stream into pans and sort the bundle for each trigger to an order? > > Eric > > On Thu, Aug 3, 2017 at

Re: Resampling a timeserie stuck on a GroupByKey

2017-08-15 Thread Lukasz Cwik
I have invited you to the slack channel. 2 million data points doesn't seem like it should be an issue. Have you considered trying a simpler combiner like Count to see if the bottleneck is with the combiner that you are supplying? Also, could you share the code for what resample_function does?

Re: Running Custom Code on Cluster

2017-07-11 Thread Lukasz Cwik
Do you have any stack traces or error messages that would provide more details to the failure? On Tue, Jul 11, 2017 at 11:28 AM, Will Walters wrote: > Hello, > > I've recently been successful running the Quickstart on my cluster (in > Flink through Yarn on Hadoop).

Re: Feature Generation for Large datasets composed of many time series

2017-07-23 Thread Lukasz Cwik
You can do this efficiently with Apache Beam but you would need to write code which converts a user's expression into a set of PTransforms or create a few pipeline variants for commonly computed outcomes. There are already many transforms which can compute things like min, max, average. Take a look

Re: Feature Generation for Large datasets composed of many time series

2017-07-24 Thread Lukasz Cwik
RK. Will it be easy to do this ? > > Thanks again Lukasz ! > > > Le 2017-07-23 20:42, Lukasz Cwik a écrit : > >> You can do this efficiently with Apache Beam but you would need to >> write code which converts a users expression into a set of PTransforms >> or crea

Re: Slack.

2017-07-12 Thread Lukasz Cwik
Welcome. Sent you an invite. On Wed, Jul 12, 2017 at 12:16 PM, Matthew Sole wrote: > Hi, > > Could you add me to slack please? > > Thank you, > > Matt >

Re: Providing HTTP client to DoFn

2017-07-06 Thread Lukasz Cwik
es it > wait until the whole map is collected? > #2 can the DoFn specify that it depends on only specific keys of the side > input map? does that affect the scheduling of the DoFn? > > Thanks for any pointers... > rdm > > On Wed, Jul 5, 2017 at 4:58 PM Lukasz Cwik <lc...@go

Re: Providing HTTP client to DoFn

2017-07-05 Thread Lukasz Cwik
#1, supported side input sizes and performance are specific to a runner. For example, I know that Dataflow supports side inputs which are 1+ TiB (aggregate) in batch pipelines and ~100s MiBs per window because there have been several one-off benchmarks/runs. What kinds of sizes/use case do you

Re: Providing HTTP client to DoFn

2017-07-05 Thread Lukasz Cwik
That should have said: ~100s MiBs per window in streaming pipelines On Wed, Jul 5, 2017 at 2:58 PM, Lukasz Cwik <lc...@google.com> wrote: > #1, side inputs supported sizes and performance are specific to a runner. > For example, I know that Dataflow supports side inputs whic

Re: Cloning options instances

2017-04-21 Thread Lukasz Cwik
It was removed because many of the fields stored in PipelineOptions were not really cloneable but used as a way to pass around items such as an ExecutorService or Credentials for dependency injection reasons. With the above caveat that you're not getting a true clone, feel free to copy the code

Re: Slack Channel Invite

2017-04-25 Thread Lukasz Cwik
Done On Tue, Apr 25, 2017 at 1:24 PM, Alexandre Crayssac < alexandre.crays...@polynom.io> wrote: > Same here! > > Thanks > > On Mon, Apr 24, 2017 at 5:16 PM, Lukasz Cwik <lc...@google.com> wrote: > >> Done >> >> On Mon, Apr 24, 2017 at 7:34 AM

Re: Slack channel

2017-08-16 Thread Lukasz Cwik
Welcome Griselda, Steve, and Apache. Steve, this has come up before but it is against Slack's free tier policy to have a bot which sends invites out automatically. On Wed, Aug 16, 2017 at 10:18 AM, Apache Enthu wrote: > Please could you add me too? > > Thanks, > Almas

Re: Resampling a timeserie stuck on a GroupByKey

2017-08-16 Thread Lukasz Cwik
ython type with the > CountCombineFn and it's still stucked. > > Here is what I can see on my GCP console (this screenshot shows 36 minutes > by I waited for 5 hours to be sure) : > [image: Selection_070.png] > > > On Wed, Aug 16, 2017 at 1:08 AM Lukasz Cwik <lc...@google

Re: Slack Channel

2017-08-17 Thread Lukasz Cwik
Invite sent, welcome. On Thu, Aug 17, 2017 at 5:05 PM, Subramanyam Chitti < subramanyam.chi...@bigcommerce.com> wrote: > Hi, > Could you please add me to the slack channel? My email address is > subramanyam.chi...@bigcommerce.com >

Re: Question re: Session Windows and Late Data

2017-08-22 Thread Lukasz Cwik
This is not expected, reach out to Google Cloud support with some recent running/killed job ids or e-mail dataflow-feedb...@google.com. On Mon, Aug 21, 2017 at 2:19 PM, Steve Anderson wrote: > Has anyone used allowed late data with a session window? Every time I've > tried to

Re: HDFSFileSource and distributed Apex questions

2017-05-02 Thread Lukasz Cwik
Moving this to user@beam.apache.org. In the latest snapshot version of Apache Beam, file-based sources like AvroIO/TextIO were updated to support reading from Hadoop; see HadoopFileSystem

Re: Custom log appenders with Dataflow runner

2017-05-17 Thread Lukasz Cwik
Have you tried installing a logger onto the root JUL logger once the pipeline starts executing in the worker (so inside one of your DoFn(s) setup methods)? ROOT_LOGGER_NAME = "" LogManager.getLogManager().getLogger(ROOT_LOGGER_NAME).addHandler(myCustomHandler); Also, the logging integration
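
A small sketch of what this could look like, using a ConsoleHandler as a stand-in for your custom Handler:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.LogManager;
import org.apache.beam.sdk.transforms.DoFn;

// Install a handler on the root JUL logger from @Setup, so it runs on the worker
// once the pipeline is executing. Replace ConsoleHandler with your custom Handler.
public class LogInstallingFn extends DoFn<String, String> {
  private static final String ROOT_LOGGER_NAME = "";

  @Setup
  public void setup() {
    Handler myCustomHandler = new ConsoleHandler();
    LogManager.getLogManager().getLogger(ROOT_LOGGER_NAME).addHandler(myCustomHandler);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element());
  }
}
```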

Re: PubsubIO and message retries

2017-06-12 Thread Lukasz Cwik
What are you trying to do by nacking the message? If your trying to delay the processing of the message till a future time, look at using State & Timers with StatefulDoFn to queue the message for processing at a future time. See this blog for some examples:
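
A minimal sketch of the State & Timers idea: buffer the element and set a processing-time timer to emit it again later instead of nacking. The key/value types and the one-minute delay are illustrative assumptions.

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

public class DelayedRetryFn extends DoFn<KV<String, String>, String> {
  @StateId("buffer")
  private final StateSpec<BagState<String>> bufferSpec = StateSpecs.bag(StringUtf8Coder.of());

  @TimerId("retry")
  private final TimerSpec retrySpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("buffer") BagState<String> buffer,
      @TimerId("retry") Timer retry) {
    // Hold the message in state and schedule processing roughly a minute from now.
    buffer.add(c.element().getValue());
    retry.offset(Duration.standardMinutes(1)).setRelative();
  }

  @OnTimer("retry")
  public void onRetry(OnTimerContext c, @StateId("buffer") BagState<String> buffer) {
    for (String element : buffer.read()) {
      c.output(element);
    }
    buffer.clear();
  }
}
```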

Re: reading from s3 file in aws

2017-06-22 Thread Lukasz Cwik
Filed BEAM-2500 as a feature request. On Thu, Jun 22, 2017 at 9:00 AM, tarush grover <tarushappt...@gmail.com> wrote: > Hi All, > > Can we add a module s3-file-system in beam to directly support and have > integration with s3? > > Regards, > Tarush > > On Thu, 22

Re: 答复: Creating side input map with global window

2017-06-23 Thread Lukasz Cwik
rent threads, I see a log statement from inside my > synchronized block for each thread, which shouldn't be possible. > > Thoughts? > > > On Thu, Jun 15, 2017 at 6:26 AM, Lukasz Cwik <lc...@google.com> wrote: > >> Take a look at DoFn setup/teardown, called only onc

Re: AppEngine & beam

2017-06-20 Thread Lukasz Cwik
Have you tried an AppEngine flex environment? I know that users have tried AppEngine standard with the Java SDK and have hit limitations of the standard environment which are not easy to resolve. The solution has always been to suggest users try the flex environment (

Re: Reading encrypted and lz compressed files

2017-06-20 Thread Lukasz Cwik
Take a look at CompressedSource: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java I feel as though you could follow the same pattern to decompress/decrypt the data as a wrapper. Apache Beam supports a concept of dynamic work

Re: Is this a valid usecase for Apache Beam and Google dataflow

2017-06-20 Thread Lukasz Cwik
Take a look at session windows[1]. As long as the messages you post to Pubsub aren't spaced out farther than the session gap duration, they will all get grouped together. It seems as though it would be much simpler to just run a separate Apache Beam job for each internal job you want to process
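
A short sketch of applying session windows; `keyedMessages` and the 10-minute gap are illustrative:

```java
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Messages keyed by job id that arrive within 10 minutes of each other
// end up in the same session window.
PCollection<KV<String, String>> sessioned =
    keyedMessages.apply(
        Window.<KV<String, String>>into(
            Sessions.withGapDuration(Duration.standardMinutes(10))));
```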

Re: reading from s3 file in aws

2017-06-22 Thread Lukasz Cwik
You want to depend on the Hadoop File System module[1] and configure HadoopFileSystemOptions[2] with a S3 configuration[3]. 1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system 2:

Re: Best way to load heavy object into memory on nodes (python sdk)

2017-05-24 Thread Lukasz Cwik
Why not use a singleton like pattern and have a function which either loads and caches the ML model from a side input or returns the singleton if it has been loaded. You'll want to use some form of locking to ensure that you really only load the ML model once. On Wed, May 24, 2017 at 6:18 AM,

Re: How to partition a stream by key before writing with FileBasedSink?

2017-05-24 Thread Lukasz Cwik
Since you're using a small number of shards, add a Partition transform which uses a deterministic hash of the key to choose one of 4 partitions. Write each partition with a single shard. (Fixed width diagram below) Pipeline -> AvroIO(numShards = 4) Becomes: Pipeline -> Partition -->
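
A sketch of the partition-then-write pattern, using TextIO for brevity (the thread uses AvroIO) and assuming `input` is a keyed PCollection:

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollectionList;
import org.apache.beam.sdk.values.TypeDescriptors;

int numPartitions = 4;

// Deterministically hash the key into one of 4 partitions.
PCollectionList<KV<String, String>> partitions =
    input.apply(
        Partition.of(
            numPartitions,
            (Partition.PartitionFn<KV<String, String>>)
                (kv, n) -> Math.floorMod(kv.getKey().hashCode(), n)));

// Write each partition with exactly one shard.
for (int i = 0; i < numPartitions; i++) {
  partitions
      .get(i)
      .apply(
          "Format" + i,
          MapElements.into(TypeDescriptors.strings())
              .via((KV<String, String> kv) -> kv.getKey() + "," + kv.getValue()))
      .apply("Write" + i, TextIO.write().to("gs://my-bucket/part-" + i).withNumShards(1));
}
```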

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Lukasz Cwik
What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)? On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan wrote: > Sorry that was an autocorrect error. I meant to ask - what dataflow runner > are you using? If you are using google cloud dataflow then the

Re: How to partition a stream by key before writing with FileBasedSink?

2017-05-24 Thread Lukasz Cwik
t; partitions so that Dataflow won't override numShards. > > Josh > > > On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik <lc...@google.com> wrote: > >> Since your using a small number of shards, add a Partition transform >> which uses a deterministic hash of the key to

Re: Call CombineFns.compose() with uncertain combinefn

2017-05-30 Thread Lukasz Cwik
For the compilation error, have you tried? Long maxLatency = e.get((TupleTag) (TupleTag) finalOutputTags.get(0)); I'm not sure whether I fully understand the problem. On Sat, May 27, 2017 at 2:15 AM, 郭亚峰(默岭) wrote: > > Hi there, > I'm working with a small DSL

Re: Enriching stream messages based on external data

2017-06-02 Thread Lukasz Cwik
dk.util.SerializableUtils.serializeToByteAr >>>>> ray(SerializableUtils.java:49) >>>>> ... 10 more >>>>> >>>>> The way I'm trying to use this in the ParDo/DoFn is: >>>>> >>>>> (line 138 starts here) >>

Re: Running Beam in Hadoop Cluster

2017-06-02 Thread Lukasz Cwik
To flatten all the dependencies into one jar is build system dependent. If using Maven I would look into the Maven Shade Plugin ( https://maven.apache.org/plugins/maven-shade-plugin/). Jar files are also just zip files so you could merge them manually as well but you'll need to deal with

Re: Running Beam in Hadoop Cluster

2017-06-02 Thread Lukasz Cwik
hoo-inc.com> wrote: > Yeah, we're working on altering the build file to include all dependencies > in one, huge jar. Is there a better way than this to run Beam jobs on a > cluster? Putting everything into a jar seems like a clunky solution. > > > On Friday, June 2, 2017 1

Re: How to partition a stream by key before writing with FileBasedSink?

2017-06-07 Thread Lukasz Cwik
morrow. >> >> Sorry for the confusing question! >> >> Josh >> >> On Tue, Jun 6, 2017 at 10:01 PM, Lukasz Cwik <lc...@google.com> wrote: >> >>> Based upon your descriptions, it seemed like you wanted limited >>> parallelism becaus

Re: Enriching stream messages based on external data

2017-06-01 Thread Lukasz Cwik
Combining PubSub + Bigtable is common. You should try to use the BigtableSession approach because the hbase approach adds a lot of dependencies (leading to dependency conflicts). You should use the same version of Bigtable libraries that Apache Beam is using (Apache Beam 2.0.0 uses Bigtable

Re: 答复: Creating side input map with global window

2017-06-15 Thread Lukasz Cwik
Take a look at DoFn setup/teardown, called only once per DoFn instance and not per element, which makes it easier to write initialization code. Also, if the schema map is shared, have you thought of using a single static instance of Guava's LoadingCache shared amongst all the DoFn instances? You can
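
A sketch of the shared LoadingCache idea; `loadSchemaFromSource(...)` is a hypothetical lookup against your schema store, and the refresh interval is illustrative:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;
import org.apache.beam.sdk.transforms.DoFn;

// One JVM-wide cache shared by every instance of this DoFn on a worker.
public class SchemaAwareFn extends DoFn<String, String> {
  private static final LoadingCache<String, String> SCHEMAS =
      CacheBuilder.newBuilder()
          .refreshAfterWrite(10, TimeUnit.MINUTES)
          .build(
              new CacheLoader<String, String>() {
                @Override
                public String load(String table) throws Exception {
                  return loadSchemaFromSource(table); // hypothetical helper
                }
              });

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    String schema = SCHEMAS.get(c.element());
    c.output(schema);
  }
}
```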

Re: Action in the pipeline after Write

2017-06-11 Thread Lukasz Cwik
Unfortunately you can't Combine Writes since they return PDone (a terminal node) during pipeline construction. On Sun, Jun 11, 2017 at 3:23 PM, Gwilym Evans wrote: > I'm not 100% sure as I haven't tried it, but, Combining comes to mind as a > possible way of doing

Re: S3 support?

2017-06-13 Thread Lukasz Cwik
Yes you can use the HadoopFileSystem and use the Hadoop S3A connector. Documentation about options/configuration for S3 Hadoop connectors: https://wiki.apache.org/hadoop/AmazonS3 Build a valid Hadoop configuration for S3 and set it on the HadoopFileSystemOptions: Configuration s3Configuration =
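
A sketch continuing the truncated snippet above; the S3A property names follow the Hadoop documentation and the bucket/credentials are placeholders, so check them against your Hadoop version:

```java
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

// Build a Hadoop configuration pointing at S3 via the S3A connector.
Configuration s3Configuration = new Configuration();
s3Configuration.set("fs.defaultFS", "s3a://my-bucket");
s3Configuration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
s3Configuration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

// Hand the configuration to Beam's Hadoop file system layer.
HadoopFileSystemOptions options = PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
options.setHdfsConfiguration(Collections.singletonList(s3Configuration));

Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("s3a://my-bucket/path/to/input*"));
```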

Re: Scaling dataflow python SDK 2.0.0

2017-06-10 Thread Lukasz Cwik
The Dataflow implementation when executing a batch pipeline does not parallelize dependent fused segments irrespective of the windowing function so #1 will fully execute before #2 starts. On Sat, Jun 10, 2017 at 3:48 PM, Morand, Sebastien < sebastien.mor...@veolia.com> wrote: > Hi again, > > So

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Lukasz Cwik
I believe that if your data from the past can't affect the data of the future because the windows/state are independent of each other, then just reprocessing the old data using a batch job is simplest and likely to be the fastest. About your choices 1, 2, and 3, allowed lateness is relative to the

Re: [HEADS UP] Using "new" filesystem layer

2017-05-04 Thread Lukasz Cwik
JB, for your second point it seems as though you may not be setting the Hadoop configuration on HadoopFileSystemOptions. Also, I just merged https://github.com/apache/beam/pull/2890 which will auto detect Hadoop configuration based upon your HADOOP_CONF_DIR and YARN_CONF_DIR environment variables.

Re: How to partition a stream by key before writing with FileBasedSink?

2017-06-06 Thread Lukasz Cwik
buffering data in windows, and want the shard guarantee to apply across > windows. > > > > On Tue, Jun 6, 2017 at 5:42 PM, Lukasz Cwik <lc...@google.com> wrote: > >> Your code looks like what I was describing. My only comment would be to >> use a deterministic hashing

Re: How to partition a stream by key before writing with FileBasedSink?

2017-06-06 Thread Lukasz Cwik
at the runner won't parallelise the > final MyDofn across e.g. 8 instances instead of 4? If there are two input > elements with the same key are they actually guaranteed to be processed on > the same instance? > > > Thanks, > > Josh > > > > > On Tue, Jun 6, 2017 at 4:51

Re: Slack invitation request

2017-10-03 Thread Lukasz Cwik
Invitation sent. Welcome On Tue, Oct 3, 2017 at 10:56 AM, Tim <timrobertson...@gmail.com> wrote: > May I also have one please? > > Tim, > Sent from my iPhone > > On 3 Oct 2017, at 19:22, Lukasz Cwik <lc...@google.com> wrote: > > Invitation sent, welcome. >

Re: Best way to inject external per key config into the pipeline

2017-10-05 Thread Lukasz Cwik
Can you stream the updates to the keys into the pipeline and then use it as a side input, performing a join against your main stream that needs the config data? You could also use an in-memory cache that periodically refreshes keys from the external source. A better answer depends on: * how

Re: Best way to inject external per key config into the pipeline

2017-10-05 Thread Lukasz Cwik
onfig. > 2. The config data shouldn't be change often. It is configured by human > users. > 3. The config data per key should be about 10-20 key value pairs. > 4. Ideally the key number is in the range of a few millions, but a few > thousands to begin with. > > Thanks > Eric

Re: How to catch exceptions while using DatastoreV1 API

2017-10-16 Thread Lukasz Cwik
to five times, if it still fails, maybe > we'll just write it to logs and then redo this write later. > > Do you think that makes sense? > > Thanks, > > Derek > > On Mon, Oct 16, 2017 at 10:31 AM, Lukasz Cwik <lc...@google.com> wrote: > >> That source is not a

Re: Limit the number of DoFn instances per worker?

2017-10-17 Thread Lukasz Cwik
The `numberOfWorkerHarnessThreads` is worker wide and not per DoFn. Setting this value to constrain how many threads are executing will impact all parts of your pipeline. One idea is to use a Semaphore as a static object within your DoFn with a fixed number of allowed actors that can enter and
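
A minimal sketch of the Semaphore idea; the permit count and the work being throttled are illustrative:

```java
import java.util.concurrent.Semaphore;
import org.apache.beam.sdk.transforms.DoFn;

// A static Semaphore shared by all instances of this DoFn in the worker JVM,
// capping how many elements it processes concurrently regardless of harness threads.
public class ThrottledFn extends DoFn<String, String> {
  private static final Semaphore PERMITS = new Semaphore(4); // at most 4 concurrent calls

  @ProcessElement
  public void processElement(ProcessContext c) throws InterruptedException {
    PERMITS.acquire();
    try {
      c.output(c.element().toUpperCase()); // stand-in for the expensive work being limited
    } finally {
      PERMITS.release();
    }
  }
}
```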

Re: Regarding Beam Slack Channel

2017-10-13 Thread Lukasz Cwik
Invite sent, welcome. On Fri, Oct 13, 2017 at 3:07 PM, NerdyNick wrote: > Hello > > Can someone please add me to the Beam slack channel? > > Thanks. >

Re: Update doc on custom source and sink

2017-10-16 Thread Lukasz Cwik
Check out https://beam.apache.org/documentation/io/io-toc/ and the PTransform style guide https://beam.apache.org/contribute/ptransform-style-guide/. The PTransform style guide contains a lot of useful points which are general but still apply to IO authors. On Mon, Oct 16, 2017 at 3:52 PM, Eric

Re: Problem with autoscaling

2017-09-05 Thread Lukasz Cwik
That is correct, autoscaling for streaming is only supported in Pubsub. What sources were you interested in? On Mon, Sep 4, 2017 at 12:54 AM, Derek Hao Hu wrote: > I've used PubSubIO for autoscaling on a streaming pipeline and it seems to > be working fine so far. > > I

Re: Apache Beam v2.1.0 - Spark Runner Issue

2017-08-30 Thread Lukasz Cwik
To my knowledge you should use Spark 1.6.3 since that is what is declared as the spark.version in the projects root pom.xml On Wed, Aug 30, 2017 at 2:45 PM, Mahender Devaruppala < mahend...@apporchid.com> wrote: > Hello, > > > > I am running into spark assertion error when running a apache

Re: Beam and Python API: Pandas/Numpy?

2017-10-03 Thread Lukasz Cwik
This is neat. On Fri, Sep 29, 2017 at 1:26 PM, Vilhelm von Ehrenheim < vonehrenh...@gmail.com> wrote: > Hi Steve! > I have several pipelines that successfully use both numpy and scikit > models without any problems. I don't think I use Pandas atm but I'm sure > that is fine too. > > However, you

Re: Slack invitation request

2017-10-03 Thread Lukasz Cwik
Invitation sent, welcome. On Tue, Oct 3, 2017 at 9:14 AM, Jon Brasted wrote: > Hello, > > Please may I have an invitation to the Apache Beam Slack channel? > > > Thanks, > Jon >

Re: Long windows / lagged events

2017-08-28 Thread Lukasz Cwik
Using a bounded (batch style) pipeline you should be able to just group all events by user and ignore windowing completely and produce any information since you'll have a global view of all events. This scales well since data for a user is only held up to the point that it is processed and then

Re: Writing Parquet / Orc files

2017-09-05 Thread Lukasz Cwik
Apache Beam supports a fixed number of shards but discourages their use, both for auto-tuning/scaling reasons and to keep it simple for users to create scalable pipelines. Some users do require a fixed number of shards, and several classes like TextIO support fixed sharding. If you're trying to always use a fixed
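
For reference, a one-line sketch of requesting fixed sharding where the sink supports it (output path and shard count are illustrative):

```java
// Request exactly 10 output shards from TextIO.
records.apply(TextIO.write().to("gs://my-bucket/output").withNumShards(10));
```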

Re: Filtering a PCollection for only monotonically increasing elements

2017-10-11 Thread Lukasz Cwik
* Is code executed within @ProcessElement of a DoFn using State API guaranteed to be "serialized" per-K and per-window (by "serialized" i mean that it will produce the same effect as if every execution of the @ProcessElement method for a given K and window had executed to completion before the

Re: How to window by quantity of data?

2017-10-18 Thread Lukasz Cwik
handles attachments. > > Jacob > > On Tue, Oct 17, 2017 at 2:54 PM, Lukasz Cwik <lc...@google.com> wrote: > >> Have you considered using a stateful DoFn, buffering/batching based upon >> a certain number of elements is shown in this blog[1] and could be extended >

Re: Limit the number of DoFn instances per worker?

2017-10-18 Thread Lukasz Cwik
per worker instance, >> then this creates a backlog and autoscaling might trigger earlier, so >> technically the overall system lag might actually be better? >> >> I haven't tested this hypothesis yet but basically the above is my >> reasoning. >> >> Thanks,

Re: How to window by quantity of data?

2017-10-18 Thread Lukasz Cwik
Marble <jmar...@kochava.com> wrote: > Here's a gist: https://gist.github.com/jacobmarble/ > 6ca40e0a14828e6a0dfe89b9cb2e4b4c > > Should I consider StateId value mutations to be non-atomic? > > Jacob > > On Wed, Oct 18, 2017 at 9:25 AM, Lukasz Cwik <lc...@google.co

Re: Batch loads in streaming pipeline - withNumFileShards

2017-11-15 Thread Lukasz Cwik
Filed https://issues.apache.org/jira/browse/BEAM-3198 for the IllegalArgumentException Do you mind posting a little code snippet of how you build the BQ IO connector on BEAM-3198? On Wed, Nov 15, 2017 at 12:18 PM, Arpan Jain wrote: > Hi, > > I am trying to use

Re: Spark Runner Issues with YARN

2017-12-04 Thread Lukasz Cwik
It seems like you're trying to use Spark 2.1.0. Apache Beam currently relies on users using Spark 1.6.3. There is an open pull request[1] to migrate to Spark 2.2.0. 1: https://github.com/apache/beam/pull/4208/ On Mon, Dec 4, 2017 at 10:58 AM, Opitz, Daniel A wrote: > We

Re: Apache Beam, version 2.2.0

2017-12-04 Thread Lukasz Cwik
I also believe we were still in the investigatory phase for dropping support for Java 7. On Mon, Dec 4, 2017 at 2:22 PM, Eugene Kirpichov wrote: > Thanks JB for sending the detailed notes about new stuff in 2.2.0! A lot > of exciting things indeed. > > Regarding Java 8: I

Re: Global sum of latest help

2017-12-04 Thread Lukasz Cwik
Since processing can happen out of order, for example if the input was: ``` {"id": "2", parent_id: "a", "timestamp": 2, "amount": 3} {"id": "1", parent_id: "a", "timestamp": 1, "amount": 1} {"id": "1", parent_id: "a", "timestamp": 3, "amount": 2} ``` would the output be 3 and then 5 or would you

Re:

2017-12-01 Thread Lukasz Cwik
Instead of a callback fn, it's most useful if a PCollection is returned containing the result of the sink so that any arbitrary additional functions can be applied. On Fri, Dec 1, 2017 at 7:14 AM, Jean-Baptiste Onofré wrote: > Agree, I would prefer to do the callback in the IO

Re: Usecase scenario: Job definition from low frequently changing storage

2017-12-18 Thread Lukasz Cwik
1) Based upon your question, it seems like you want to have one large job that then spins off really small jobs. Any reason you can't just do it all within one pipeline? 2. Am I capable of defining flexible window sizes per "device-task"? Yes, you'll want a custom WindowFn which when going

Re: Handling errors in I/O transformation

2017-11-20 Thread Lukasz Cwik
BigQueryIO has been written in such a way to support emitting failed records to a "dead letter queue". Not all IO transforms support this but it is very useful for the ones that do. WriteResult writeResult = p.apply(PubsubIO.readMessagesWithAttributes() .fromSubscription(“"))
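
A sketch of the dead-letter idea: grab the rows BigQuery rejected from the WriteResult and handle them separately (here they are only logged; the table name is illustrative and `tableRows` is assumed to be a PCollection<TableRow>):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.slf4j.LoggerFactory;

WriteResult writeResult =
    tableRows.apply(
        BigQueryIO.writeTableRows().to("my_project:my_dataset.my_table"));

// Rows that failed to insert come back as a PCollection and can be logged,
// written elsewhere, or retried by a downstream transform.
writeResult
    .getFailedInserts()
    .apply(
        "HandleFailedRows",
        ParDo.of(
            new DoFn<TableRow, Void>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                LoggerFactory.getLogger("FailedInserts").warn("Failed insert: {}", c.element());
              }
            }));
```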

Re: Regarding Beam Slack Channel

2017-11-21 Thread Lukasz Cwik
Invites sent, welcome. On Tue, Nov 21, 2017 at 8:49 AM, Andrew Jones wrote: > Me too, please :) > > On Tue, 21 Nov 2017, at 16:19, Dariusz Aniszewski wrote: > > Hello > > > > Can someone please add me to the Beam slack channel? > > > > Thanks. >

Re: Regarding Beam Slack Channel

2017-11-21 Thread Lukasz Cwik
Just sent one to you Eric, welcome. On Tue, Nov 21, 2017 at 8:54 AM, Eric Anderson wrote: > Me three :) > > On Tue, Nov 21, 2017 at 8:49 AM Andrew Jones > wrote: > >> Me too, please :) >> >> On Tue, 21 Nov 2017, at 16:19, Dariusz Aniszewski

Re: BEAM counters for validation

2017-11-21 Thread Lukasz Cwik
Are we talking about integration testing or general pipeline execution metrics? For integration testing, I would suggest that users rely on PAssert and a test runner to do testing similar to Apache Beam's @ValidatesRunner or IO integration tests. For production pipeline monitoring, the common metric

Re: Slack Channel

2017-11-16 Thread Lukasz Cwik
chava.com> wrote: > > Me too, if you don't mind. > > Jacob > > On Thu, Nov 9, 2017 at 2:09 PM, Lukasz Cwik <lc...@google.com> wrote: > >> Invite sent, welcome. >> >> On Thu, Nov 9, 2017 at 2:08 PM, Fred Tsang <ftsan...@hotmail.com> wrote: &

Re: Slack Channel

2017-11-09 Thread Lukasz Cwik
Invite sent, welcome. On Thu, Nov 9, 2017 at 2:08 PM, Fred Tsang wrote: > Hi, > > Please add me to the slack channel. > > Thanks, > Fred > > Ps. I think "BeamTV" would be a great YouTube channel ;) >

Re: design pattern for enriching data via db lookups?

2017-11-08 Thread Lukasz Cwik
For joining with external data you have some options: * Do direct calls to the external datastore, perform your own in memory caching/expiration. You control exactly what happens and when it happens but as you have done this in the past you know what this entails. * Ingest the external data and

Re: Can we transfrom PCollection to ArrayList?

2017-11-01 Thread Lukasz Cwik
To convert a PCollection into an ArrayList locally within your application, you would need to materialize your PCollection via some sink like AvroIO/TextIO/... and then in your program after the pipeline has completed read the output files parsing the records. On Tue, Oct 31, 2017 at 1:29 AM,
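
A rough sketch of that flow, assuming `lines` is a PCollection<String>, `pipeline` is its Pipeline, and a local output directory is used for illustration:

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.TextIO;

// Materialize the PCollection to text shards.
lines.apply(TextIO.write().to("/tmp/output/result"));

// Wait for the pipeline to finish before touching the output.
pipeline.run().waitUntilFinish();

// Read the shards back into memory in the driver program.
List<String> inMemory = new ArrayList<>();
try (DirectoryStream<Path> shards =
    Files.newDirectoryStream(Paths.get("/tmp/output"), "result*")) {
  for (Path shard : shards) {
    inMemory.addAll(Files.readAllLines(shard));
  }
}
```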

Re: Apache Beam delta between windows.

2018-05-04 Thread Lukasz Cwik
a nice way of Accumulating and Aggregating data along ad infinitum :) ) >> >> Something I'd love to add though (on the Views) >> * When I use the Singleton, how can I ensure that I'm only getting the >> value for the Key I want in the DoFn. I see there is

Re: AssertionError: Job did not reach to a terminal state after waiting indefinitely.

2018-05-10 Thread Lukasz Cwik
It looks like something failed within your job and the error you're getting is from your driver program (not the remote execution that is happening within Google Cloud). You'll want to look at the Stackdriver logs for details; you can get more details about how to see your Stackdriver logs here[1].

Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-17 Thread Lukasz Cwik
On Wed, May 16, 2018 at 10:46 PM chandan prakash wrote: > Thanks Ismaël. > Your answers were quite useful for a novice user. > I guess this answer will help many like me. > > *Regarding your answer to point 2 :* > > > *"Checkpointing is supported, Kafka offset

Re: Access to current watermark

2018-05-17 Thread Lukasz Cwik
This is not exposed programmatically. Depending on which runner you're using, you may be able to query for the watermark through its monitoring APIs if it is exposed as a metric and you know what it is called. This is likely to be a brittle implementation, and the data is also likely to be stale. On

Re: Create SideInputs for PTranform using lookups for each window of data

2018-05-16 Thread Lukasz Cwik
side inputs from >>>> the main input? By that I mean the scope of the side input would be a per >>>> window one and it would be different for every window. Is that correct? >>>> >>>> Regards, >>>> Harsh >>>> >>>> O

Re: Create SideInputs for PTranform using lookups for each window of data

2018-05-15 Thread Lukasz Cwik
e accounts again. > > > On Tue, May 15, 2018 at 15:11 Lukasz Cwik <lc...@google.com> wrote: > >> For each BillingModel you receive over Kafka, how "fresh" should the >> account information be? >> Does the account information in the extern

Re: Create SideInputs for PTranform using lookups for each window of data

2018-05-15 Thread Lukasz Cwik
For each BillingModel you receive over Kafka, how "fresh" should the account information be? Does the account information in the external store change? On Tue, May 15, 2018 at 11:22 AM Harshvardhan Agrawal < harshvardhan.ag...@gmail.com> wrote: > Hi, > > We have certain billing data that arrives

Re: Documentation for Beam on Windows

2018-05-23 Thread Lukasz Cwik
There is none to my knowledge. On Wed, May 23, 2018 at 1:49 PM Udi Meiri wrote: > Hi all, > > I was looking yesterday for a quickstart guide on how to use Beam on > Windows but saw that those guides are exclusively for Linux users. > > What documentation is available for

Re: Understanding GenerateSequence and SideInputs

2018-05-24 Thread Lukasz Cwik
I should have clarified that the precision guarantee I was talking about was timing. On Thu, May 24, 2018 at 2:21 PM Lukasz Cwik <lc...@google.com> wrote: > The runner is responsible for scheduling the work anywhere it chooses. It > can be the same node all the time or dif

Re: Understanding GenerateSequence and SideInputs

2018-05-24 Thread Lukasz Cwik
The runner is responsible for scheduling the work anywhere it chooses. It can be the same node all the time or different nodes. There is no precision guarantee on the upper bound (only the lower bound); the withRate method states that it will "generate at most a given number of elements per a

Re: Create SideInputs for PTranform using lookups for each window of data

2018-05-15 Thread Lukasz Cwik
m > the main input? By that I mean the scope of the side input would be a per > window one and it would be different for every window. Is that correct? > > Regards, > Harsh > > On Tue, May 15, 2018 at 17:54 Lukasz Cwik <lc...@google.com> wrote: > >> Using deduplicate

Re: Create SideInputs for PTranform using lookups for each window of data

2018-05-15 Thread Lukasz Cwik
tiple times across workers in a window. >>> >>> What I was thinking was that it might be better to perform the lookup >>> only once for each account and product in a window and then supply them as >>> side inputs to the main input. >>> >>> On Tu

Re: GlobalWindows, Watermarks and triggers

2018-06-07 Thread Lukasz Cwik
A watermark is a lower bound on data that is processed and available. It is specifically a lower bound because we want runners to be able to process each window in parallel. In your example, a Runner may choose to compute Aggregate[Pete:09:01,X,Y] in parallel with Aggregate[Pete:09:02,X,Y] even

Re: Returning dataframe from parDo and printing its value - advice?

2018-06-18 Thread Lukasz Cwik
User is the correct mailing list. beam.io.WriteToText takes 'strings' which means that you have to format the whole line yourself. You'll want to apply another ParDo after CreateColForSampleFn which takes the 1x164 record and concatenates each value with ',' in between. On Mon, Jun 18, 2018 at

Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-18 Thread Lukasz Cwik
Any updates on BEAM-4512? On Mon, Jun 11, 2018 at 1:42 PM Lukasz Cwik wrote: > Thanks all, it seems as though only Google needs the grace period. I'll > wait for the shorter of BEAM-4512 or two weeks before merging > https://github.com/apache/beam/pull/5571 > > > On Wed, Jun

Re: SQL Filter Pushdowns in Apache Beam SQL

2018-06-13 Thread Lukasz Cwik
It is currently the latter, where all the data is read and then filtered within the pipeline. Note that this doesn't mean that all the data is loaded into memory as the way that the join is done is dependent on the Runner that is powering the pipeline. Kenn had shared this doc[1] which is starting

Re: GlobalWindows, Watermarks and triggers

2018-06-07 Thread Lukasz Cwik
em is no different, I wanted to explore the space further and > find a more elegant solution (Not introducing Cycles if there was a better > way to handle it). > > > > > > On Thu, Jun 7, 2018 at 10:34 PM Lukasz Cwik wrote: > >> A watermark is a lower bound on data that is

Re: Multimap PCollectionViews' values udpated rather than appended

2018-06-11 Thread Lukasz Cwik
es the issue: > https://github.com/calonso/beam_experiments/blob/master/refreshingsideinput/src/test/scala/com/mrcalonso/RefreshingSideInput2Test.scala#L19-L53 > > Hope it helps > > On Mon, Jun 4, 2018 at 9:04 PM Lukasz Cwik wrote: > >> Carlos, can you provide a test
