Re: [PROPOSAL] Preparing for Beam 2.24.0 release

2020-08-27 Thread Eugene Kirpichov
st bug that > was blocking me, and I'm continuing from there today, so hopefully I'll be > able to have the release candidate ready before the week is up. > > Hope the update is helpful, > Daniel Oliveira > > On Thu, Aug 27, 2020 at 12:56 PM Eugene Kirpichov

Re: [PROPOSAL] Preparing for Beam 2.24.0 release

2020-08-27 Thread Eugene Kirpichov
Hi! Just wondering how the progress on 2.24 has been? I see the version in JIRA https://issues.apache.org/jira/projects/BEAM/versions/12347146 is blocked only by https://issues.apache.org/jira/browse/BEAM-10663 which hasn't seen much action in the last week. Is there something specific people can

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-07-01 Thread Eugene Kirpichov
s a good example. I did a quick survey >> and every other property I could find does seem like it fits on the Read, >> and most IOs have a few of these closures for example also extracting >> timestamps. So maybe just a resolution convention of putting them on the >> ReadAll an

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-30 Thread Eugene Kirpichov
tc...). I think it's helpful to think of the parameters >> that are allowed to vary as some "location descriptor", but I imagine IO >> authors may want other parameters to vary across a ReadAll as well. >> >> To me it seems safer to let an IO author "opt-i

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-29 Thread Eugene Kirpichov
s sense in some cases, but I also think not all >>>>>>> IOs >>>>>>> have the same need. I would treat Read as a special type as long as the >>>>>>> Read is schema-aware. >>>>>>> >>> >>>>&g

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Eugene Kirpichov
Hi Ismael, Thanks for taking this on. Have you considered an approach similar (or dual) to FileIO.write(), where we in a sense also have to configure a dynamic number different IO transforms of the same type (file writes)? E.g. how in this example we configure many aspects of many file writes: t

Re: Python IO Connector

2020-01-06 Thread Eugene Kirpichov
Agreed with above, it seems prudent to develop a pure-Python connector for something as common as interacting with a database. It's likely easier to achieve an idiomatic API, familiar to non-Beam Python SQL users, within pure Python. Developing a cross-language connector here might be plain imposs

Re: Per Element File Output Without writeDynamic

2019-12-03 Thread Eugene Kirpichov
tion to a separate file as multiple Documents in a > file will cause compilation errors and/or incorrect code to be generated by > the Thrift compiler. > > Additionally there are some Documents that are so large that we would want > them to be split. > > On Mon, Dec 2, 2019

Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Eugene Kirpichov
Hi Christopher, So, you have a PCollection, and you're writing it to files. FileIO.write/writeDynamic will write several Document's to each file - however, in your use case some of the individual Document's are so large that you want instead each of those large documents to be split into several f

Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-20 Thread Eugene Kirpichov
[ ] Beaver [ ] Hedgehog [X] Lemur [X] Owl [ ] Salmon [ ] Trout [ ] Robot dinosaur [X] Firefly [X] Cuttlefish [ ] Dumbo Octopus [ ] Angler fish On Wed, Nov 20, 2019 at 9:47 AM Valentyn Tymofieiev wrote: > [ ] Beaver > [X] Hedgehog > [ ] Lemur > [ ] Owl > [ ] Salmon > [ ] Trout > [ ] Robot dinosau

Re: RabbitMQ and CheckpointMark feasibility

2019-11-14 Thread Eugene Kirpichov
o-op in finalizeCheckpoint. > This is very different, semantically, and can lead to dropped messages if a > pipeline doesn't finish processing the given message. > > Any help would be much appreciated. > If I'm misunderstanding something above, could you describe in detail a sc

Re: RabbitMQ and CheckpointMark feasibility

2019-11-08 Thread Eugene Kirpichov
s not need to be Java > Seralizable. All that's needed is do return a Coder for the CheckpointMark > in getCheckpointMarkCoder. > > On Thu, Nov 7, 2019 at 7:29 PM Eugene Kirpichov wrote: > >> Hi Daniel, >> >> This is probably insufficiently well documented. The C

Re: RabbitMQ and CheckpointMark feasibility

2019-11-07 Thread Eugene Kirpichov
Hi Daniel, This is probably insufficiently well documented. The CheckpointMark is used for two purposes: 1) To persistently store some notion of how much of the stream has been consumed, so that if something fails we can tell the underlying streaming system where to start reading when we re-create

Re: [Discuss] Beam mascot

2019-11-04 Thread Eugene Kirpichov
Feels like "Beam" would go well with an animal that has glowing bright eyes (with beams of light shooting out of them), such as a lemur [1] or an owl. [1] https://www.cnn.com/travel/article/madagascar-lemurs/index.html On Mon, Nov 4, 2019 at 7:33 PM Kenneth Knowles wrote: > Yes! Let's have a ma

Re: RabbitMqIO issues and open PRs

2019-10-31 Thread Eugene Kirpichov
Regarding review latency, FWIW I'm not super active on Beam these days but I'll be happy to review 1-2 PRs for RabbitMqIO (I'm @jkff). On Thu, Oct 31, 2019 at 8:47 PM Kenneth Knowles wrote: > Yes, thanks for emailing! We very much value sharing your intentions with > the community. For small cha

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Eugene Kirpichov
Yeah - in this case your primary option is to use JdbcIO.readAll() and shard your query, as suggested above. Alternative hypothesis: is the result set of your query actually big enough that it *shouldn't* fit in memory? Or could it be a matter of inefficient storage of its elements? Could you brie

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Eugene Kirpichov
ad things in parallel. On Thu, Oct 24, 2019 at 8:17 AM Eugene Kirpichov wrote: > Hi Josef, > > JdbcIO per se does not require the result set to fit in memory. The issues > come from the limitations of the context in which it runs: > - It indeed uses a DoFn to emit results; a DoFn

Re: JdbcIO read needs to fit in memory

2019-10-24 Thread Eugene Kirpichov
Hi Josef, JdbcIO per se does not require the result set to fit in memory. The issues come from the limitations of the context in which it runs: - It indeed uses a DoFn to emit results; a DoFn is in general allowed to emit an unbounded number of results that doesn't necessarily have to fit in memor

Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Eugene Kirpichov
to BigQuery. > > Em sex, 27 de set de 2019 19:21, Pablo Estrada > escreveu: > >> Hi Lucas! >> Can you share more information about your use case? Java has JdbcIO. >> Maybe that's all you need? Or perhaps you're using Python SDK? >> Best >> -P. >

Re: Reading from RDB, ParDo or BoundedSource

2019-09-27 Thread Eugene Kirpichov
Hi Lucas, Any reason why you can't use JdbcIO? You almost certainly should *not* use BoundedSource, nor Splittable DoFn for this. BoundedSource is obsolete in favor of assembling your connector from regular transforms and/or using an SDF, and SDF is an extremely advanced feature whose primary audie

Re: Collecting feedback for Beam usage

2019-09-24 Thread Eugene Kirpichov
Creating a central place for collecting Beam usage sounds compelling, but we'd have to be careful about several aspects: - It goes without saying that this can never be on-by-default, even for a tiny fraction of pipelines. - For further privacy protection, including the user's PipelineOptions is pr

Re: [VOTE] Release 2.14.0, release candidate #1

2019-07-31 Thread Eugene Kirpichov
I would recommend that the known issue notice about this source at least be strongly worded - this source in the current state should be marked "DO NOT USE" - it will produce data loss in *most* production use cases. That still leaves the risk that people will use it anyway; up to folks driving the

Re: Sort Merge Bucket - Action Items

2019-07-25 Thread Eugene Kirpichov
ing ShardedKey for SMB's >> bucket/shard to reuse WriteShardsIntoTempFilesFn. These transforms are >> private and easy to change/pull out. >> >>> >> >>> OTOH they are somewhat coupled with the package private >> {Avro,Text,TFRecord}Sink and thei

Re: Sort Merge Bucket - Action Items

2019-07-22 Thread Eugene Kirpichov
On Mon, Jul 22, 2019 at 7:49 AM Robert Bradshaw wrote: > On Mon, Jul 22, 2019 at 4:04 PM Neville Li wrote: > > > > Thanks Robert. Agree with the FileIO point. I'll look into it and see > what needs to be done. > > > > Eugene pointed out that we shouldn't build on FileBased{Source,Sink}. So > for

Re: pubsub -> IO

2019-07-17 Thread Eugene Kirpichov
I think full-blown SDF is not needed for this - someone just needs to implement a MongoDbIO.readAll() variant, using a composite transform. The regular pattern for this sort of thing will do (ParDo split, reshuffle, ParDo read). Whether it's worth replacing MongoDbIO.read() with a redirect to readA

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-16 Thread Eugene Kirpichov
that's extra complication. IMO the >>reader work distribution is better solved by better bucket/shard strategy >>in upstream writer. >> >> *References* >> >>1. ReadMatches extends PTransform, >>PCollection> >>2. ReadAllVi

Re: Discussion/Proposal: support Sort Merge Bucket joins in Beam

2019-07-15 Thread Eugene Kirpichov
Quick note: I didn't look through the document, but please do not build on either FileBasedSink or FileBasedReader. They are both remnants of the old, non-composable IO world; and in fact much of the composable IO work emerged from frustration with their limitations and recognizing that many other

Re: pipeline status tracing

2019-06-24 Thread Eugene Kirpichov
t; > > [1] > https://beam.apache.org/documentation/patterns/file-processing-patterns/ > > > > On 20 Jun 2019, at 23:00, Eugene Kirpichov wrote: > > > > If you're writing to files, you can already do this: FileIO.write() > returns WriteFilesResult that you can

Re: pipeline status tracing

2019-06-20 Thread Eugene Kirpichov
If you're writing to files, you can already do this: FileIO.write() returns WriteFilesResult that you can use in combination with Wait.on() and JdbcIO.write() to write something to a database afterwards. Something like: PCollection<..> writeResult = data.apply(FileIO.write()...); Create.of("dummy"

Re: I'm thinking about new features, what do you think?

2019-06-07 Thread Eugene Kirpichov
It looks like you want to take a PCollection of lists of items of the same type (but not necessarily of the same length - in your example you pad them to the same length but that's unnecessary), induce an undirected graph on them where there's an edge between XS and YS if they have an element in co

Re: Wait on JdbcIO write completion

2019-02-20 Thread Eugene Kirpichov
Hi Jonathan, Wait.on() requires a PCollection - it is not possible to change it to wait on PDone because all PDone's in the pipeline are the same so it's not clear what exactly you'd be waiting on. To use the Wait transform with JdbcIO.write(), you would need to change https://github.com/apache/b

Re: [DISCUSS] Should File based IOs implement readAll() or just readFiles()

2019-01-30 Thread Eugene Kirpichov
TextIO.read() and AvroIO.read() indeed perform better than match() + readMatches() + readFiles(), due to DWR - so for these two in particular I would not recommend such a refactoring. However, new file-based IOs that do not support DWR should only provide readFiles(). Those that do, should provide

Re: FileIOTest.testMatchWatchForNewFiles flakey in java presubmit

2019-01-22 Thread Eugene Kirpichov
Yeah the "List expected" is constructed from Files.getLastModifiedTime() calls before the files are actually modified, the code is basically unconditionally broken rather than merely flaky. There's several easy options: 1) Use PAssert.that().satisfies() instead of .contains(), and use assertThat()

Re: SplittableDoFn

2018-10-02 Thread Eugene Kirpichov
Very cool, thanks Alex! On Tue, Oct 2, 2018 at 2:19 PM Alex Van Boxel wrote: > Don't want to crash the tech discussion here, but... I just gave a session > at the Beam Summit about Splittable DoFn's as a users perspective (from > things I could gather from the documentation and experimentation).

Re: Modular IO presentation at Apachecon

2018-09-27 Thread Eugene Kirpichov
Thanks Ismael and everyone else! Unfortunately I do not believe that this session was recorded on video :( Juan - yes, this is some of the important future work, and I think it's not hard to add to many connectors; contributions would be welcome. In terms of a "per-key" Wait transform, yeah, that d

Re: Beam Schemas: current status

2018-08-29 Thread Eugene Kirpichov
Wow, this is really coming together, congratulations and thanks for the great work! On Wed, Aug 29, 2018 at 1:40 AM Reuven Lax wrote: > I wanted to send a quick note to the community about the current status of > schema-aware PCollections in Beam. As some might remember we had a good > discussio

Re: Let's start getting rid of BoundedSource

2018-07-17 Thread Eugene Kirpichov
On Tue, Jul 17, 2018 at 2:49 AM Etienne Chauchot wrote: > Hi Eugene > > Le lundi 16 juillet 2018 à 07:52 -0700, Eugene Kirpichov a écrit : > > Hi Etienne - thanks for catching this; indeed, I somehow missed that > actually several runners do this same thing - it seemed to me a

Re: BiqQueryIO.write and Wait.on

2018-07-17 Thread Eugene Kirpichov
us as part of the collection so that not only the > completion is signaled, but also the result (success/failure) can be > accessed, how does it sound? > > Regards > > On Tue, Jul 17, 2018 at 3:07 AM Eugene Kirpichov > wrote: > >> Hi Carlos, >> >> Any upd

Re: BiqQueryIO.write and Wait.on

2018-07-16 Thread Eugene Kirpichov
Hi Carlos, Any updates / roadblocks you hit? On Tue, Jul 3, 2018 at 7:13 AM Eugene Kirpichov wrote: > Awesome!! Thanks for the heads up, very exciting, this is going to make a > lot of people happy :) > > On Tue, Jul 3, 2018, 3:40 AM Carlos Alonso wrote: > >> + dev@beam.

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Eugene Kirpichov
rn: /runners/google-cloud-dataflow-java*) >>>>>>>> INFO:root:Selected reviewer @lukecwik for: >>>>>>>> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java >>>>>>>> (path_pattern: /sdks/java/co

Re: Let's start getting rid of BoundedSource

2018-07-16 Thread Eugene Kirpichov
/ On Mon, Jul 16, 2018 at 7:56 AM Jean-Baptiste Onofré wrote: > Hi guys, > > I think it's the purpose of SDF to simplify the BoundedSource like writing. > > I agree that extended @SplitRestriction is a good approach. > > Regards > JB > > On 16/07/2018 16:52, Eugene

An update on Eugene

2018-07-16 Thread Eugene Kirpichov
Hi beamers, After 5.5 years working on data processing systems at Google, several of these years working on Dataflow and Beam, I am moving on to do something new (also at Google) in the area of programming models for machine learning. Anybody who worked with me closely knows how much I love buildi

Re: Let's start getting rid of BoundedSource

2018-07-16 Thread Eugene Kirpichov
nline > > Etienne > > Le dimanche 15 juillet 2018 à 14:20 -0700, Eugene Kirpichov a écrit : > > Hey beamers, > > I've always wondered whether the BoundedSource implementations in the Beam > SDK are worth their complexity, or whether they rather could be converted &g

Let's start getting rid of BoundedSource

2018-07-15 Thread Eugene Kirpichov
Hey beamers, I've always wondered whether the BoundedSource implementations in the Beam SDK are worth their complexity, or whether they rather could be converted to the much easier to code ParDo style, which is also more modular and allows you to very easily implement readAll(). There's a handful

Re: CODEOWNERS for apache/beam repo

2018-07-13 Thread Eugene Kirpichov
example, I added myself to runner/core >>> > because I wanted to take a look at the PRs related to >>> > runner/core/metrics but I'm getting assigned to all runner-core PRs. >>> Can >>> > we specify paths like >>> > runners/core-java/src/m

Re: CODEOWNERS for apache/beam repo

2018-07-12 Thread Eugene Kirpichov
Hi Udi, I see that the PR was merged - thanks! However it seems to have some unintended effects. On my PR https://github.com/apache/beam/pull/5940 , I assigned a reviewer manually, but the moment I pushed a new commit, it auto-assigned a lot of other people to it, and I had to remove them. This s

Building the Java SDK container with Jib?

2018-07-09 Thread Eugene Kirpichov
Hi, Apparently a new tool has come out that lets you build Java containers cheaply, without even having Docker installed: https://cloudplatform.googleblog.com/2018/07/introducing-jib-build-java-docker-images-better.html Anyone interested in giving it a shot, to have faster turnaround when making

Re: Performance issue in Beam 2.4 onwards

2018-07-09 Thread Eugene Kirpichov
Hi - If I remember correctly, the reason for this change was to ensure that the state is encodable at all. Prior to the change, there had been situations where the coder specified on a state cell is buggy, absent or set incorrectly (due to some issue in coder inference), but direct runner did not

Re: BiqQueryIO.write and Wait.on

2018-07-03 Thread Eugene Kirpichov
monly used) - that will pave the >> way for future work. >> >> Hope this helps, please ask more if something is unclear! >> >> On Fri, Apr 20, 2018 at 12:48 AM Carlos Alonso >> wrote: >> >>> Hey Eugene!! >>> >>> I’d gladly take a sta

Re: Unbounded source translation for portable pipelines

2018-07-02 Thread Eugene Kirpichov
;> >> Looks as if the Python side runner changed: >> >> Traceback (most recent call last): >> File "flink-example.py", line 7, in >> from apache_beam.runners.portability import universal_local_runner >> ImportError: cannot import name universal_local_

Re: [Design Proposal] Improving Beam code review

2018-06-30 Thread Eugene Kirpichov
Thanks for driving this Huygaa! I'm excited to see this happen, this is going to make a big difference in contributor and reviewer experience. On Fri, Jun 29, 2018 at 5:29 PM Huygaa Batsaikhan wrote: > Thanks everyone who reviewed the doc and suggested good ideas. Here is a > recap of the docume

Re: Unbounded source translation for portable pipelines

2018-06-27 Thread Eugene Kirpichov
Flink sources/sinks at this time.) > > Thanks, > Thomas > > > On Tue, Jun 26, 2018 at 2:13 AM Eugene Kirpichov > wrote: > >> Hi! >> >> Wanted to let you know that I've just merged the PR that adds >> checkpointable SDF support to the portable reference

Re: ErrorProne and -Werror enabled for all Java projects

2018-06-27 Thread Eugene Kirpichov
This is awesome, thanks to everybody involved! It's so good to have ./gradlew compileJava compileTestJava not produce heaps of warnings like it used to. On Wed, Jun 27, 2018 at 9:52 AM Andrew Pilloud wrote: > Looking at the diff I think you can replace "Default Setting" with "Only > Setting". Th

Re: [DISCUSS] Automation for Java code formatting

2018-06-26 Thread Eugene Kirpichov
+1! In some cases the temptation to format code manually can be quite strong, but the ease of just re-running the formatter after any change (especially after global changes like class/method renames) overweighs it. I lost count of the times when I wasted a precommit because some line became >100

Re: bad logger import?

2018-06-26 Thread Eugene Kirpichov
This is definitely a typo. Feel free to send PRs to correct these on sight :) On Tue, Jun 26, 2018 at 8:51 AM Rafael Fernandez wrote: > Filed https://issues.apache.org/jira/browse/BEAM-4644 for this. I > assigned it to +Ankur Goenka because it's the first > name in history :p (please reroute wh

Re: Unbounded source translation for portable pipelines

2018-06-25 Thread Eugene Kirpichov
ested in the Flink portable streaming runner side? It is of course blocked by having the rest of that runner working in streaming mode though (the batch mode is practically done - will send you a separate note about the status of that). On Fri, Mar 23, 2018 at 12:20 PM Eugene Kirpichov wrote: &

Re: "retest this please" no longer working on the beam site repo

2018-06-21 Thread Eugene Kirpichov
It's quite often not working on the main repo either :-/ On Thu, Jun 21, 2018 at 9:59 AM Reuven Lax wrote: > Does anyone know why this functionality isn't working? > > Reuven >

Re: Celebrating Pride... in the Apache Beam Logo

2018-06-15 Thread Eugene Kirpichov
Very cool! On Fri, Jun 15, 2018 at 10:56 AM OrielResearch Eila Arich-Landkof < e...@orielresearch.org> wrote: > 👍👍👍 > > On Fri, Jun 15, 2018 at 1:50 PM, Griselda Cuevas wrote: > >> Someone in my team edited some Open-Source-Projects' logos to celebrate >> pride and Apache Beam was included! >> >

Re: [CANCEL][VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-13 Thread Eugene Kirpichov
FWIW I have a fix to the flaky test in https://github.com/apache/beam/pull/5585 (open) On Wed, Jun 13, 2018 at 5:26 PM Udi Meiri wrote: > +1 to ignoring flaky test. > > FYI there's a fourth cherrypick: https://github.com/apache/beam/pull/5624 > > On Wed, Jun 13, 2018 at 3:45 PM Pablo Estrada wr

Re: Proposing interactive beam runner

2018-06-13 Thread Eugene Kirpichov
This is awesome, thanks Sindy! I hope that the questions related to portability will get resolved in a way that will allow to reuse some of the work for other interactive Beam experiences, including SQL as Andrew says, and providing a REPL e.g. for users of Scala or other JVM-based languages. +Nev

Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-06 Thread Eugene Kirpichov
Is it possible for Dataflow to just keep a copy of the pom.xmls and delete it as soon as Dataflow is migrated? Overall +1, I've been using Gradle without issues for a while and almost forgot pom.xml's still existed. On Wed, Jun 6, 2018, 1:13 PM Pablo Estrada wrote: > I agree that we should dele

Re: Reducing Committer Load for Code Reviews

2018-05-31 Thread Eugene Kirpichov
taken just to get things done quickly. > > I think as Thomas Weise mentioned that mentoring and encouraging more > contributors to become committers is a better long term solution to > this issue. > On Thu, May 31, 2018 at 11:24 PM Eugene Kirpichov > wrote: > > > &

Re: Reducing Committer Load for Code Reviews

2018-05-31 Thread Eugene Kirpichov
Agreed with all said above - as I understand it, we have consensus on the following: Whether you're a committer or not: - Find somebody who's familiar with the code and ask them to review. Use your best judgment in whose review would give you good confidence that your code is actually good. (it's

Re: The full list of proposals / prototype documents

2018-05-31 Thread Eugene Kirpichov
on, the authors will be always welcomed to update this page >>>>>>>>> by >>>>>>>>> themselves. In my turn, I’ll try to keep an eye on this to keep it >>>>>>>>> synced. >>>>>>>>> And of course,

Re: Launching a Portable Pipeline

2018-05-22 Thread Eugene Kirpichov
ent/d/1xOaEEJrMmiSHprd-WiYABegfT129qqF-idUBINjxz8s/edit#heading=h.lky5ef6wxo9x> > ). > Please feel free to provide comments. > Once we agree, I will publish the conclusion on the mailing list. > > On Mon, May 14, 2018 at 1:51 PM Eugene Kirpichov > wrote: > >> Thanks Ankur, this

Re: The full list of proposals / prototype documents

2018-05-22 Thread Eugene Kirpichov
Making it easier to manage indeed would be good. Could someone from PMC please add the following documents of mine to it? SDF related documents: http://s.apache.org/splittable-do-fn http://s.apache.org/sdf-via-source http://s.apache.org/textio-sdf http://s.apache.

Re: Current progress on Portable runners

2018-05-22 Thread Eugene Kirpichov
the contents of this email? (my >>> reasoning is (a) keep the contribution guide concise but (b) all this >>> detail is helpful yet (c) the detail may be ever-changing so making a >>> separate web page is not the best format) >>> >>> Kenn >>> >&

Re: [VOTE] Go SDK

2018-05-22 Thread Eugene Kirpichov
+1! It is particularly exciting to me that the Go support is "portability-first" and does everything in the proper "portability way" from the start, free of legacy non-portable runner support code. On Tue, May 22, 2018 at 11:32 AM Scott Wegner wrote: > +1 (non-binding) > > Having a third langua

Re: What is the future of Reshuffle?

2018-05-18 Thread Eugene Kirpichov
Agreed that it should be undeprecated, many users are getting confused by this. I know that some people are working on a replacement for at least one of its use cases (RequiresStableInput), but the use case of breaking fusion is, as of yet, unaddressed, and there's not much to be gained by keeping

Re: What is the Impulse and why do we need it?

2018-05-18 Thread Eugene Kirpichov
Hi Ismael, Impulse is a primitive necessary for the Portability world, where sources do not exist. Impulse is the only possible root of the pipeline, it emits a single empty byte array, and it's all DoFn's and SDF's from there. E.g. when using Fn API, Read.from(BoundedSource) is translated into: Im

Current progress on Portable runners

2018-05-17 Thread Eugene Kirpichov
Hi all, A little over a month ago, a large group of Beam community members has been working a prototype of a portable Flink runner - that is, a runner that can execute Beam pipelines on Flink via the Portability API . The prototype was developed in a separate

Re: Wait.on() - "Do this, then that" transform

2018-05-17 Thread Eugene Kirpichov
> > > On Thu, May 17, 2018 at 10:13 PM Eugene Kirpichov > wrote: > > > Thanks Kenn, forwarding to user@ is a good idea; just did that. > > > JB - this is orthogonal to SDF, because I'd expect this transform to be > primarily used for waiting on the results of

Re: Wait.on() - "Do this, then that" transform

2018-05-17 Thread Eugene Kirpichov
18 at 10:25 PM Jean-Baptiste Onofré wrote: > Cool !!! > > I guess we can leverage this in IOs with SDF. > > Thanks > Regards > JB > > On 14/05/2018 23:48, Eugene Kirpichov wrote: > > Hi folks, > > > > Wanted to give a heads up about the existence

Re: Checkpointing and Restoring BoundedSource

2018-05-16 Thread Eugene Kirpichov
Hi Shen, The only guarantee made by splitAtFraction is that the primary source + residual source = original source - there are no other guarantees. For checkpointing, you can use the following pattern: When you want to checkpoint, call splitAtFraction(getFractionConsumed() + epsilon) or something l

Wait.on() - "Do this, then that" transform

2018-05-14 Thread Eugene Kirpichov
Hi folks, Wanted to give a heads up about the existence of a commonly requested feature and its first successful production usage. The feature is the Wait.on() transform [1] , and the first successful production usage is in Spanner [2] . The Wait.on() transform allows you to "do this, then that"

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-05-14 Thread Eugene Kirpichov
> Matthias > > On Sat, 14 Apr 2018 at 21:45 Eugene Kirpichov > wrote: > >> Hi all, >> >> The video is now available. I got it from my Strata account and I have >> permission to use and share it freely, so I published it on my own YouTube >> page (where

Re: Eventual PAssert

2018-05-14 Thread Eugene Kirpichov
Thanks Anton, this is a really important topic because currently we have no way at all for unit-testing IOs that emit unbounded output. Regardless of the proposed PAssert API itself, even if we just figure out a way to make the pipeline terminate on some condition from within the pipeline, that'll

Re: Launching a Portable Pipeline

2018-05-14 Thread Eugene Kirpichov
Thanks Ankur, this document clarifies a few points and raises some very important questions. I encourage everybody with a stake in Portability to take a look and chime in. +Aljoscha Krettek +Thomas Weise +Henning Rohde On Mon, May 14, 2018 at 12:34 PM Ankur Goenka wrote: > Updated link >

Re: Graal instead of docker?

2018-05-11 Thread Eugene Kirpichov
and implementation work done by dozens of people in this area. We are doing something that's never done before, and the APIs and implementation are not perfect and will continue evolving, but there are much more effective and friendly ways to point out use cases where they fail or ways in w

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-11 Thread Eugene Kirpichov
I'm not sure if this has been proposed in this thread, but if the common case is that users consume the whole iterator, then you can close resources at !hasNext(). And for cleanup of incompletely consumed iterators, rely on what Kenn suggested. Since you're making your own runner, you can add addit

Re: Graal instead of docker?

2018-05-09 Thread Eugene Kirpichov
On Wed, May 9, 2018 at 1:08 AM Romain Manni-Bucau wrote: > > > Le mer. 9 mai 2018 00:57, Henning Rohde a écrit : > >> There are indeed lots of possibilities for interesting docker >> alternatives with different tradeoffs and capabilities, but in generally >> both the runner as well as the SDK mu

Re: Graal instead of docker?

2018-05-08 Thread Eugene Kirpichov
>> >> > All are good points. >> >> >> > The only "?" I keep is: why beam doesnt uses its visitor api to make >> the >> >> portability transversal to all runners "mutating" the user model before >> >> translation? Tec

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
>> >> I would also prefer that for Flink and other Java based runners we retain >> the option to inline executable stages that are in Java. I would expect a >> good number of use cases to benefit from direct execution in the task >> manager, and it may be good to offer the

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
, come up with a more precise assessment of what parts are actually dependent on Docker's container format and/or on Docker itself, and propose a plan for untangling this dependency and opening the door to other mechanisms of cross-language execution On Sat, May 5, 2018 at 12:50 PM Eugene Kirp

Re: Graal instead of docker?

2018-05-05 Thread Eugene Kirpichov
Graal is a very young project, currently nowhere near the level of maturity or completeness as to be sufficient for Beam to fully bet its portability vision on it: - Graal currently only claims to support Java and Javascript, with Ruby and R in the status of "some applications may run", Python supp

Re: How to create a runtime ValueProvider

2018-05-03 Thread Eugene Kirpichov
There is no way to achieve this using ValueProvider. Its value is either fixed at construction time (StaticValueProvider), or completely dynamic (evaluated every time you call .get()). You'll need to implement this using a side input. E.g. take a look at implementation of BigQueryIO, how it generat

Re: ValidatesRunner test cleanup

2018-05-03 Thread Eugene Kirpichov
independent runner validation > suite. This framework is clever, but a bit deceptive as runner tests look > like unit tests of the primitives. > > Kenn > > On Thu, May 3, 2018 at 9:24 AM Eugene Kirpichov > wrote: > >> Thanks Scott, this is awesome! >> However,

Re: ValidatesRunner test cleanup

2018-05-03 Thread Eugene Kirpichov
Thanks Scott, this is awesome! However, we should be careful when choosing what should be ValidatesRunner and what should be NeedsRunner. Could you briefly describe how you made the call and roughly what are the statistics before/after your PR (number of tests in both categories)? On Thu, May 3, 2

Re: Java compiler OOMs on Jenkins/Gradle

2018-05-01 Thread Eugene Kirpichov
le.com who is currently messing around with tuning some > Gradle flags related to the JVM and its memory usage. > > On Tue, May 1, 2018 at 1:34 PM Eugene Kirpichov > wrote: > >> Hi, >> >> I've seen the same issue twice in a row on PR >> https://github.

Java compiler OOMs on Jenkins/Gradle

2018-05-01 Thread Eugene Kirpichov
Hi, I've seen the same issue twice in a row on PR https://github.com/apache/beam/pull/4264 : the Java precommit fails with messages like: > Task :beam-sdks-java-core:compileTestJava An exception has occurred in the compiler ((version info not available)). Please file a bug against the Java compil

Re: Splittable DoFN in Spark discussion

2018-04-30 Thread Eugene Kirpichov
t;> wrote: >> >> >> Yeah that's been the implied source of being able to be continuous, you >> union with a receiver which produce an infinite number of batches (the >> "never ending queue stream" but not actually a queuestream since they have >> so

Re: Kafka connector for Beam Python SDK

2018-04-30 Thread Eugene Kirpichov
t;that will server as a good example for any users who wish to implement >>>>> new >>>>> Splittable DoFn implementations on top of Beam Python SDK. >>>>>- >>>>> >>>>>Cross-language transform feature is cur

Re: Kafka connector for Beam Python SDK

2018-04-29 Thread Eugene Kirpichov
Thanks Cham, this is great! I left just a couple of comments on the doc. On Fri, Apr 27, 2018 at 10:06 PM Chamikara Jayalath wrote: > Hi All, > > I'm looking into adding a Kafka connector to Beam Python SDK. I think this > will benefits many Python SDK users and will serve as a good example for

Re: Custom URNs and runner translation

2018-04-26 Thread Eugene Kirpichov
I agree with Thomas' sentiment that cross-language IO is very important because of how much work it takes to produce a mature connector implementation in a language. Looking at implementations of BigQueryIO, PubSubIO, KafkaIO, FileIO in Java, only a very daring soul would be tempted to reimplement

Re: Apache Beam Jenkins Machines Upgrade

2018-04-26 Thread Eugene Kirpichov
This sounds awesome, thanks to everybody involved! On Thu, Apr 26, 2018 at 4:28 PM Yifan Zou wrote: > Greetings, > > Most of you already know about upgrades on Jenkins machines this week. I > still want to share some details you might interested in. > > In April 24th, we had 16 new GCE instances

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Eugene Kirpichov
-fetched, but I could be convinced). Does > doing a lightweight analysis and just promoting some things to be some kind > of infinite representation help? > > Kenn > > On Tue, Apr 24, 2018 at 2:37 PM Eugene Kirpichov > wrote: > >> Would like to revive this thread one m

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Eugene Kirpichov
ecause timers suffer from the same problem. On Thu, Apr 12, 2018 at 2:28 PM Eugene Kirpichov wrote: > (resurrecting thread as I'm back from leave) > > I looked at this mode, and indeed as Reuven points out it seems that it > affects execution details, but doesn't off

Re: Plan for a Parquet new release and writing Parquet file with outputstream

2018-04-19 Thread Eugene Kirpichov
Very cool! JB, time to update your PR? On Thu, Apr 19, 2018 at 9:17 AM Alexey Romanenko wrote: > FYI: Apache Parquet 1.10.0 was release recently. > It contains *org.apache.parquet.io.OutputFile *and updated > *org.apache.parquet.hadoop.ParquetFileWriter* > > WBR, > Alexey > > > On 14 Feb 2018, a

Re: "Radically modular data ingestion APIs in Apache Beam" @ Strata - slides available

2018-04-14 Thread Eugene Kirpichov
that. https://www.youtube.com/watch?v=NIn9E5TVoCA On Tue, Mar 13, 2018 at 3:33 AM James wrote: > Very informative, thanks! > > On Fri, Mar 9, 2018 at 4:49 PM Etienne Chauchot > wrote: > >> Great ! >> >> Thanks for sharing. >> >> Etienne >> >>

Re: Gradle Status [April 6]

2018-04-13 Thread Eugene Kirpichov
While we're talking about running tests in IntelliJ with Gradle... Anybody got advice on how to run a single NeedsRunner test in sdks-java-core, say, ParDoTest? With Maven, I used to just run the test in IntelliJ and specify "runners-direct-java" as the classpath; with Gradle, the best I got is to

  1   2   3   4   5   >