Re: [ANNOUNCE] New committer: Ke Wu

2022-05-31 Thread Xinyu Liu
Congrats! Xinyu On Mon, May 30, 2022 at 7:46 AM Evan Galpin wrote: > Congrats Ke! > > - Evan > > > On Mon, May 30, 2022 at 4:11 AM Jan Lukavský wrote: > >> Congrats Ke! >> >> Jan >> On 5/29/22 04:12, Yi Pan wrote: >> >> Congrats, Ke! >> >> -Yi >> >> On Sat, May 28, 2022 at 6:57 PM Robert

Re: [PROPOSAL] Projection pushdown in Beam Java

2021-08-06 Thread Xinyu Liu
Very happy to see we will have pushdown optimizations for Java pipelines! Thanks for sharing the proposal. Thanks, Xinyu On Fri, Aug 6, 2021 at 9:26 AM Alexey Romanenko wrote: > Thanks Kyle, very promising. I left some comments. > > — > Alexey > > On 5 Aug 2021, at 19:59, Luke Cwik wrote: > >

Re: Question about transformOverride

2021-04-21 Thread Xinyu Liu
Runners providing features to make it easier to test the way you describe, > though does sound very useful, but it does require the runner be aware of > each transform to be overridden, possibly increasing the runners dependency > surface. > > On Wed, Apr 21, 2021, 9:31 AM Xinyu

Re: Question about transformOverride

2021-04-21 Thread Xinyu Liu
@Chamikara: Yuhong and I are working on the Samza Runner, and we are looking for a way to swap the transform for ease of use in testing. @Reuven: Your approach will work for this case, but we do have quite a few read transforms here and we have to plug this code into each of them with some testing

Re: Python Dataframe API issue

2021-03-25 Thread Xinyu Liu
> >>>> This could be https://issues.apache.org/jira/browse/BEAM-11929 >>>> >>>> On Thu, Mar 25, 2021 at 4:26 PM Robert Bradshaw >>>> wrote: >>>> >>>>> This is definitely wrong. Looking into what's going on here, but this >

Python Dataframe API issue

2021-03-25 Thread Xinyu Liu
Hi, folks, I am playing around with the Python DataFrame API, and seemingly hit a schema issue when converting a PCollection to a DataFrame. I wrote the following code for a simple test: import apache_beam as beam from apache_beam.dataframe.convert import to_dataframe from

Re: DoFn @Setup with PipelineOptions

2021-03-02 Thread Xinyu Liu
I created a ticket to track this: https://issues.apache.org/jira/browse/BEAM-11914. Thanks everyone for the comments! Thanks, Xinyu On Mon, Mar 1, 2021 at 4:45 PM Xinyu Liu wrote: > The reason for not passing it in directly is that we have a large amount > of configs here at LinkedIn so

Re: DoFn @Setup with PipelineOptions

2021-03-01 Thread Xinyu Liu
oFn constructor or as a > variable in the containing scope? Do you only know the option after the > pipeline is completely constructed so you need to make the switch at > runtime? Makes sense. I think passing options to @Setup is useful and > harmless. > > Kenn > > On Mon, Mar 1

DoFn @Setup with PipelineOptions

2021-03-01 Thread Xinyu Liu
Hi, all, Currently the @Setup method signature in DoFn does not support any arguments. This is a bit cumbersome for use cases such as creating a DB connection or REST client, or fetching some resources, where we would like to read the configs from the PipelineOptions during setup. Shall we
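The request above can be pictured with a plain-Python model. This is only a sketch and not Beam code: `Options`, `ConstructorConfiguredFn`, `SetupAwareFn`, and the `db_url` field are hypothetical names chosen for illustration. It contrasts copying config in through the constructor with a setup hook that receives the options directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Options:
    """Hypothetical stand-in for PipelineOptions; db_url is an invented field."""
    db_url: str


class ConstructorConfiguredFn:
    """Workaround shape: copy the needed config in at construction time."""

    def __init__(self, db_url: str):
        self.db_url = db_url
        self.connection = None

    def setup(self):
        # setup() takes no arguments, so it relies on fields captured earlier.
        self.connection = f"connected:{self.db_url}"


class SetupAwareFn:
    """Requested shape: the framework hands the options to setup()."""

    def __init__(self):
        self.connection = None

    def setup(self, options: Options):
        # Config is read from the options at setup time, not at construction.
        self.connection = f"connected:{options.db_url}"


options = Options(db_url="db://example")

old_style = ConstructorConfiguredFn(options.db_url)
old_style.setup()

new_style = SetupAwareFn()
new_style.setup(options)
```

With many config values, the second shape avoids threading each one through every constructor, which appears to be the motivation in this thread.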

Re: Running Beam pipeline using Spark on YARN

2020-06-23 Thread Xinyu Liu
I am doing some prototyping on this too. I used the spark-submit script instead of the REST API. In my simple setup, I ran SparkJobServerDriver.main() directly in the AM as a Spark job, which will submit the Python job to the default Spark master URL pointing to "local". I also use --files in the

Re: Running Beam python pipeline on Spark

2020-06-03 Thread Xinyu Liu
> > https://github.com/lyft/flinkk8soperator/blob/bb8834d69e8621d636ef2085fdc167a9d2c2bfa3/examples/beam-python/src/beam_example/pipeline.py#L16-L17 > > Thomas > > > On Wed, Jun 3, 2020 at 5:48 PM Xinyu Liu wrote: > >> Hi, folks, >> >> I am trying to do s

Running Beam python pipeline on Spark

2020-06-03 Thread Xinyu Liu
Hi, folks, I am experimenting with running a simple "hello world" Python pipeline on a remote Spark cluster on Hadoop. So far I ran the SparkJobServerDriver on the YARN application master and managed to submit a Python pipeline to it. SparkPipelineRunner was able to run the portable

Re: [Discuss] Propose Calcite Vendor Release (1.22.0)

2020-03-05 Thread Xinyu Liu
Thanks, Rui! We've been waiting for the new version of Calcite which has the fix to unflatten the fields. Seems this version will come with it. Thanks, Xinyu On Thu, Mar 5, 2020 at 12:41 AM Ismaël Mejía wrote: > The calcite vote already passed so this is good to go, thanks for > volunteering

Re: Strict timer ordering in Samza and Portable Flink Runners

2019-10-23 Thread Xinyu Liu
Hi, Jan, Thanks for reporting this. I assigned BEAM-8459 to myself and will take a look soon. Thanks, Xinyu On Wed, Oct 23, 2019 at 2:54 AM Jan Lukavský wrote: > Hi, > > as part of [1] a new set of validatesRunner tests has been introduced. >

Re: [spark structured streaming runner] merge to master?

2019-10-10 Thread Xinyu Liu
+1 for merging to master. It will make it much easier for us to try it out and to contribute the missing features. Thanks, Xinyu On Thu, Oct 10, 2019 at 6:40 AM Alexey Romanenko wrote: > +1 for merging this new runner too (even if it’s not 100% ready for the > moment) in case if it

Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-18 Thread Xinyu Liu
comments are inline: > > On Friday, September 13, 2019 at 12:16 -0700, Xinyu Liu wrote: > > Hi, Etienne, > > The slides are very informative! Thanks for sharing the details about how > the Beam APIs are mapped onto Spark Structured Streaming. > > > Thanks ! > >

Re: Pointers on Contributing to Structured Streaming Spark Runner

2019-09-13 Thread Xinyu Liu
Hi, Etienne, The slides are very informative! Thanks for sharing the details about how the Beam APIs are mapped onto Spark Structured Streaming. We (LinkedIn) are also interested in trying the new SparkRunner to run Beam pipelines in batch, and in contributing to it too. From my understanding, seems the

Re: Integration of python/portable runner tests for Samza runner

2019-04-25 Thread Xinyu Liu
> of issues with Docker on Jenkins and wanted to lower build time for > PreCommit tests. Loopback means that an embedded Python environment will > be started which listens on localhost. It's comparable to Java's > EmbeddedSdkHarness. > > -Max > > On 24.04.19 20:10, Xinyu Liu wrote: >

Re: Integration of python/portable runner tests for Samza runner

2019-04-24 Thread Xinyu Liu
Thanks for the useful pointers! We are looking forward to integrating both Portable and Python-specific tests for Samza runner. A few questions: - For portable running tests: by looking at the portableValidatesRunnerTask in flink_job_server.gradle, it seems it's the same set of Java tests but

Re: New contributor

2019-01-30 Thread Xinyu Liu
Welcome and glad to see you here, Tao! Xinyu On Wed, Jan 30, 2019 at 12:00 PM Kenneth Knowles wrote: > Done. Welcome! > > Kenn > > On Wed, Jan 30, 2019 at 11:53 AM Tao Feng wrote: > >> Hi, >> >> I would like to contribute to beam and work on some tickets in my spare >> time. Could you please

Re: [DISCUSSION] ParDo Async Java API

2019-01-30 Thread Xinyu Liu
ined. > >>>>> >> > >>>>> >> Kenn > >>>>> >> > >>>>> >> On Wed, Jan 23, 2019 at 10:35 AM Scott Wegner > wrote: > >>>>> >>> > >>>>> >>> A related

Re: [DISCUSSION] ParDo Async Java API

2019-01-22 Thread Xinyu Liu
Kenneth Knowles wrote: > > > On Tue, Jan 22, 2019, 17:23 Reuven Lax >> >> >> On Tue, Jan 22, 2019 at 5:08 PM Xinyu Liu wrote: >> >>> @Steve: it's good to see that this is going to be useful in your use >>> cases as well. Thanks for sharing the c

Re: [DISCUSSION] ParDo Async Java API

2019-01-22 Thread Xinyu Liu
ink-docs-release-1.7/dev/stream/operators/asyncio.html On Tue, Jan 22, 2019 at 5:23 PM Reuven Lax wrote: > > > On Tue, Jan 22, 2019 at 5:08 PM Xinyu Liu wrote: > >> @Steve: it's good to see that this is going to be useful in your use >> cases as well. Thanks for s

Re: [DISCUSSION] ParDo Async Java API

2019-01-22 Thread Xinyu Liu
se could be process(@Element InputT element, @Output >> OutputReceiver>). In this way, the process method >> itself will be async chained, rather than counting on the user to do the >> right thing. >> >> We should see how these look in real use cases. The way that p

[DISCUSSION] ParDo Async Java API

2019-01-22 Thread Xinyu Liu
Hi, guys, As more users try out Beam running on the SamzaRunner, we have received many requests for an asynchronous processing API. There are a few reasons for these requests: - The users here are experienced in asynchronous programming. With async frameworks such as Netty and ParSeq and libs like async

Re: [DISCUSS] Structuring Java based DSLs

2018-12-12 Thread Xinyu Liu
Agree with Kenn on this. From our SamzaRunner point of view, we would like Beam SQL to be self-contained and flexible enough for our users to use it in different scenarios, e.g. pure SQL and embedded in different SDKs. We are also extremely interested in the DataFrame-like API mentioned above. To

Re: Bay Area Apache Beam Kickoff!

2018-11-21 Thread Xinyu Liu
This is awesome! Glad to learn about the latest Beam SQL and meet you guys there. Thanks, Xinyu On Tue, Nov 20, 2018 at 9:07 PM Jean-Baptiste Onofré wrote: > Nice !! > > Unfortunately I won't be able to be there. But good luck and I'm sure it > will be a great meetup ! > > Regards > JB > > On

Re: Performance of BeamFnData between Python and Java

2018-11-09 Thread Xinyu Liu
started with single element bundles). Default in the Flink > runner now is to cap bundles at 1000 elements or 1 second, whatever comes > first. With that, I have seen decent throughput for the pipeline above (~ > 5000k elements per second with single SDK worker). > > > > > >
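The bundle cap quoted above (1000 elements or 1 second, whichever is reached first) can be sketched in plain Python. This is a toy model under stated assumptions, not the Flink runner's implementation: `BundleLimiter` and its parameters are hypothetical names, and the time limit is only checked when an element arrives, whereas a real runner would presumably also use a timer.

```python
import time


class BundleLimiter:
    """Decides when a bundle should be finished: by size or by age."""

    def __init__(self, max_elements=1000, max_seconds=1.0, clock=time.monotonic):
        self.max_elements = max_elements
        self.max_seconds = max_seconds
        self.clock = clock          # injectable so the example is deterministic
        self.count = 0
        self.started_at = None      # time the current bundle received its first element

    def add(self):
        """Record one element; return True if the bundle should now be finished."""
        now = self.clock()
        if self.started_at is None:
            self.started_at = now
        self.count += 1
        full = self.count >= self.max_elements
        expired = now - self.started_at >= self.max_seconds
        if full or expired:
            # Reset so the next element starts a fresh bundle.
            self.count = 0
            self.started_at = None
            return True
        return False


# Usage with a fake clock and a small element cap.
fake_now = [0.0]
limiter = BundleLimiter(max_elements=3, max_seconds=1.0, clock=lambda: fake_now[0])

by_count = [limiter.add() for _ in range(3)]  # third element fills the bundle

fake_now[0] = 5.0
first_of_next = limiter.add()                 # starts a new bundle at t=5.0

fake_now[0] = 6.5
by_time = limiter.add()                       # bundle is now 1.5 s old
```

Larger bundles amortize per-bundle overhead between the runner and the SDK worker, which is consistent with the throughput improvement reported in the quoted message.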

Re: Performance of BeamFnData between Python and Java

2018-11-08 Thread Xinyu Liu
Looking at the gRPC dashboard published by the benchmark[1], it seems the streaming ping-pong rate for gRPC in Python is around 2k ~ 3k qps. This seems quite low compared to gRPC performance in other languages, e.g. 600k qps for Java and Go. Is it expected to run multiple
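To put the quoted figures side by side, here is the arithmetic as a short sketch. The numbers are taken only from the message above (roughly 2k to 3k qps for Python, 600k qps for Java and Go); the variable names are illustrative.

```python
# Figures as quoted in the message; treated here as rough estimates.
python_qps_low, python_qps_high = 2_000, 3_000
java_go_qps = 600_000

# How many times faster the Java/Go figure is than the Python range.
gap_low = java_go_qps / python_qps_high
gap_high = java_go_qps / python_qps_low

print(f"Java/Go is about {gap_low:.0f}x to {gap_high:.0f}x the Python rate")
```

On these figures the gap is about two orders of magnitude.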

Re: Beam Samza Runner status update

2018-10-12 Thread Xinyu Liu
> > -Rui > > > > On Wed, Oct 10, 2018 at 11:10 AM Jean-Baptiste Onofré wrote: > > > > Thanks for sharing and congrats for this great work ! >

Beam Samza Runner status update

2018-10-10 Thread Xinyu Liu
Hi, All, It's been over four months since we added the Samza Runner to Beam, and we've been making a lot of progress since then. Here I would like to update you all and share some really good news happening here at LinkedIn: 1) First Beam job in production @LinkedIn! After a few rounds of

Re: Update state after firing

2018-10-09 Thread Xinyu Liu
data. Thanks, Xinyu On Tue, Oct 9, 2018 at 11:54 AM Reuven Lax wrote: > 2) is simply a bug that nobody has ever gotten around to fixing. Stateful > ParDo should support merging windows such as sessions. > > On Tue, Oct 9, 2018 at 11:40 AM Xinyu Liu wrote: > >> We

Re: Update state after firing

2018-10-09 Thread Xinyu Liu
Have you considered using Beam's state API for this? >> >> On Tue, Oct 9, 2018 at 11:03 AM Xinyu Liu wrote: >> >>> Hi, guys, >>> >>> Current triggering allows us to either discard the state or accumulate >>> the state after a window pane is fired. We use the

Update state after firing

2018-10-09 Thread Xinyu Liu
Hi, guys, Current triggering allows us to either discard the state or accumulate the state after a window pane is fired. We use the extractOutput() in CombineFn to return the output value after the firing. All of this has been working well for us. We do have a use case that does not seem to be handled here:
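The two behaviours described above, discarding or accumulating state after a pane fires, can be illustrated with a toy model. This is a sketch in plain Python, not Beam's triggering code: `SumCombiner` only loosely mirrors the create/add/extract shape of a CombineFn, and `run_panes` is a hypothetical helper.

```python
class SumCombiner:
    """Toy combiner with a create/add/extract shape."""

    def create_accumulator(self):
        return 0

    def add_input(self, accumulator, value):
        return accumulator + value

    def extract_output(self, accumulator):
        return accumulator


def run_panes(panes, accumulate):
    """Fire once per pane and return the output emitted at each firing."""
    combiner = SumCombiner()
    accumulator = combiner.create_accumulator()
    outputs = []
    for pane in panes:
        for value in pane:
            accumulator = combiner.add_input(accumulator, value)
        outputs.append(combiner.extract_output(accumulator))
        if not accumulate:
            # Discarding mode: state is reset after every firing.
            accumulator = combiner.create_accumulator()
    return outputs


panes = [[1, 2], [3], [4, 5]]
discarding = run_panes(panes, accumulate=False)
accumulating = run_panes(panes, accumulate=True)
```

In both modes the state after a firing is either the empty accumulator or the full one. There is no hook to replace it with something else, which seems to be the gap this thread raises.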

Re: Donating the Dataflow Worker code to Apache Beam

2018-09-13 Thread Xinyu Liu
Big +1 (non-googler). From the Samza Runner's perspective, we are very happy to see the Dataflow worker code so we can learn and compete :). Thanks, Xinyu On Thu, Sep 13, 2018 at 11:34 AM Suneel Marthi wrote: > +1 (non-googler) > > This is a great move > > Sent from my iPhone > > On Sep 13, 2018,

Re: Status of IntelliJ with Gradle

2018-08-22 Thread Xinyu Liu
We experienced the same issues in IntelliJ after switching to the latest version. I used the trick Luke mentioned before, including the beam-model-fn-execution and beam-model-job-management jars in the dependent modules, to get around compilation. But I cannot get the vendored protobuf working.

Re: Process JobBundleFactory for portable runner

2018-08-22 Thread Xinyu Liu
We are also interested in this Process JobBundleFactory as we are planning to fork a process to run the Python SDK in the Samza runner, instead of using a Docker container. So this change will be helpful to us too. On the same note, we are trying out portable_runner.py to submit a Python job. Seems it will

Re: Samza runner committer support

2018-06-30 Thread Xinyu Liu
may not be >>> the best person. >>> >>> On Thu, Jun 28, 2018 at 12:33 PM, Xinyu Liu >>> wrote: >>> >>>> Hi, All, >>>> >>>> Our Samza runner has recently been merged to master, and Kenn has been >>>> extr

Samza runner committer support

2018-06-28 Thread Xinyu Liu
Hi, All, Our Samza runner has recently been merged to master, and Kenn has been extremely instrumental during the whole process, e.g. design decisions, feature requests and code reviews. We would like to thank him for all the support he has given us! Given Kenn is going to be on leave soon,

Re: Going on leave for a bit

2018-06-26 Thread Xinyu Liu
Congrats! Enjoy the time without sleep. Thanks, Xinyu On Tue, Jun 26, 2018 at 10:12 AM, Griselda Cuevas wrote: > Enjoy the time off Kenn! > > > On Tue, 26 Jun 2018 at 12:14, Kai Jiang wrote: > >> Congrats! Enjoy your family time. >> >> Best, >> Kai >> >> On Tue, Jun 26, 2018, 09:11 Alan

Re: [PROPOSAL] Merge samza-runner to master

2018-06-25 Thread Xinyu Liu
green button so this happens? >>> >>> >>> >>> >>> On Sat, Jun 23, 2018 at 6:43 AM Jean-Baptiste Onofré >>> wrote: >>> >>>> +1 >>>> >>>> As the build is fine, it makes sense to merge pretty fast. >>

Re: [PROPOSAL] Merge samza-runner to master

2018-06-22 Thread Xinyu Liu
ill somewhat of a moving >>> target, and you have ongoing work in this direction, that may not be a >>> hard requirement. >>> >>> I'm a bit concerned that there is are only two contributors (but the >>> git logs): you and Kenn. But you do indicate there a

Re: [PROPOSAL] Merge samza-runner to master

2018-06-21 Thread Xinyu Liu
I updated the merge PR with the Gradle integration (there were some Jenkins Java test failures caused by Google Cloud quota issues, which seem unrelated to this patch). Please feel free to ping me if anything else is needed. Thanks, Xinyu On Mon, Jun 18, 2018 at 5:44 PM, Xinyu Liu wrote: > @Kenn

Re: [PROPOSAL] Merge samza-runner to master

2018-06-18 Thread Xinyu Liu
>> >>>> One thing that will be necessary is porting the build to Gradle. >>>> >>>> Kenn >>>> >>>> On Mon, Jun 18, 2018 at 11:57 AM Xinyu Liu >>>> wrote: >>>> >>>>> Hi, Folks, >>>>

[PROPOSAL] Merge samza-runner to master

2018-06-18 Thread Xinyu Liu
Hi, Folks, On behalf of the Samza team, I would like to propose merging the samza-runner branch into master. The branch was created in January when we first introduced the Samza Runner [1], and we've been adding features and refining it since. Now the runner satisfies the criteria outlined in

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-15 Thread Xinyu Liu
I am happy to chat about it over hangout or slack too. Let's talk offline to set it up if needed. Thanks, Xinyu On Tue, May 15, 2018 at 10:51 AM, Xinyu Liu <xinyuliu...@gmail.com> wrote: > The Samza runner always processes key+window pairs serially. To > answer Luke's questio

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-15 Thread Xinyu Liu
sense of the user-facing per-key-and-window state API, is single >>>>>> threaded and scoped to a single DoFn. There's no one else who can write >>>>>> the >>>>>> state. If a BagState is read and written and read again, the user-facing >>>>>> logi

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-14 Thread Xinyu Liu
he runner's >> implementation path will be phased out. So should we expand this discussion >> to how the portability APIs enable the SDK and runner to collaborate to >> achieve this use case? It seems like the interaction you need is that the >> runner can tell that the SDK can

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-14 Thread Xinyu Liu
tputT>() { >>> @ProcessElement >>> public void process(@Element CompletionStage element, ...) { >>> element.thenApply(...) >>> } >>> } >>> >>> The framework will automatically create the CompletionStage, and the >>>

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-13 Thread Xinyu Liu
require a snapshot? Doesn't a native RocksDb iterator > require a snapshot to have well-defined contents? As you can tell, I don't > know enough about RocksDb details to be sure of my suggestions. > > Kenn > > [1] https://issues.apache.org/jira/browse/BEAM-2980 > [2] https://issues.apac

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-12 Thread Xinyu Liu
the scoping in Java will force a bunch of uninitialized declarations >> outside the try-with-resources block, kind of a lot of boilerplate LOC >> >> One thing that is good about your proposal is that the iterable could >> have some transparent caches that all free up together. >

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-11 Thread Xinyu Liu
elease, and potentially masked errors that are hard to debug. It is less > error-prone than WeakReference, which is asking for trouble when objects > are collected en masse. Anecdotally I have heard that performance of this > kind of approach is poor, but I haven't experienced it myself and I

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-10 Thread Xinyu Liu
r than having 100s or 1000s of users suffering > through a more complicated API. > > On Thu, May 10, 2018 at 3:44 PM Xinyu Liu <xinyuliu...@gmail.com> wrote: > >> Load/evict blocks will help reduce the cache memory footprint, but we >> still won't be able to release th

Re: Support close of the iterator/iterable created from MapState/SetState

2018-05-10 Thread Xinyu Liu
es the resource after a > certain amount of inactivity or uses weak references. > > On Thu, May 10, 2018 at 3:07 PM Xinyu Liu <xinyuliu...@gmail.com> wrote: > >> Hi, folks, >> >> I'm in the middle of implementing the MapState and SetState in our Samza >> runner.

Support close of the iterator/iterable created from MapState/SetState

2018-05-10 Thread Xinyu Liu
Hi, folks, I'm in the middle of implementing the MapState and SetState in our Samza runner. We noticed that the state returns a Java Iterable for reading entries, keys, etc. For state backed by a file-based KV store such as RocksDB, we need to be able to let users explicitly close the iterator/iterable
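The need described above, letting callers explicitly release an iterator that holds a store resource, can be sketched in plain Python. This is an illustration only: `StoreIterator` is a hypothetical class, not part of Beam or RocksDB, and the Java discussion in the thread concerns try-with-resources rather than Python's `with`.

```python
from contextlib import closing


class StoreIterator:
    """Hypothetical iterator over a file-backed KV store that holds a native resource."""

    def __init__(self, entries):
        self._entries = iter(entries)
        self.closed = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.closed:
            raise ValueError("iterator is closed")
        return next(self._entries)

    def close(self):
        # A real store would release its native handle or snapshot here.
        self.closed = True


iterator = StoreIterator([("a", 1), ("b", 2), ("c", 3)])

# closing() calls close() on exit, even though iteration stops early.
with closing(iterator) as entries:
    first = next(entries)
```

The point of the explicit scope is that the resource is released deterministically, rather than whenever garbage collection happens to run.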

Re: Support non-keyed stateful ParDo

2018-04-25 Thread Xinyu Liu
up later and do some computation. The elements will be in the same window, but don't need to have the same key. Thanks, Xinyu On Wed, Apr 25, 2018 at 6:02 PM, Robert Bradshaw <rober...@google.com> wrote: > On Wed, Apr 25, 2018 at 5:45 PM Xinyu Liu <xinyuliu...@gmail.com>

Support non-keyed stateful ParDo

2018-04-25 Thread Xinyu Liu
Hi, I am working on adding the stateful ParDo to the upcoming Beam Samza runner, and realized that the state for each ParDo processElement() is associated not only with the window of the element, but also with the key of the element. Chatted with Kenneth over email about this design decision, which
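The scoping described above, state tied to both the key and the window of an element, can be modelled with a dictionary keyed by the pair. This is a plain-Python sketch, not Beam's state API; `process_element`, the sample keys, and the window labels are hypothetical.

```python
from collections import defaultdict

# (key, window) -> running count; each pair has its own isolated state cell.
state = defaultdict(int)


def process_element(key, window, state):
    """Increment and return the count for this element's (key, window) cell."""
    state[(key, window)] += 1
    return state[(key, window)]


elements = [("user1", "w1"), ("user2", "w1"), ("user1", "w1"), ("user1", "w2")]
seen = [process_element(key, window, state) for key, window in elements]

# What a window-only (non-keyed) scope would have observed instead.
per_window = defaultdict(int)
for (key, window), count in state.items():
    per_window[window] += count
```

Here `user1` and `user2` never see each other's counts within `w1`, while a window-only scope would see all three elements together. That window-only view is the non-keyed behaviour the thread title asks about.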

Re: ***UNCHECKED*** Re: Samza Runner

2018-01-31 Thread xinyu liu
Thanks, Kenneth, for merging the Samza Beam runner to the feature branch! We will work on the other items (docs, example, capability matrix, ...) to get it to master. Thanks, Xinyu On Fri, Jan 26, 2018 at 9:28 AM, Kenneth Knowles wrote: > Regarding merging directly to master, I