On Tue, Apr 24, 2018 at 3:44 PM Henning Rohde wrote:
> > Note that a KafkaDoFn still needs to be provided, but could be a DoFn
> that
> > fails loudly if it's actually called in the short term rather than a full
> > Python implementation.
>
> For configurable runner-native
Could we do this behind the scenes by writing a Receiver that publishes
periodic pings?
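A "Receiver that publishes periodic pings" could, by analogy, be a background thread emitting ticks into a queue so that downstream consumers see an unbounded stream of impulses. This is a hypothetical Python sketch only; an actual runner-native receiver would be implemented against the runner's own API, and all names here are illustrative:

```python
import queue
import threading
import time

def start_ping_receiver(out, interval_s=0.05):
    """Publish periodic pings into `out` until the returned Event is set.

    Hypothetical stand-in for a native runner Receiver that keeps an
    otherwise-finite source looking unbounded.
    """
    stop = threading.Event()

    def run():
        while not stop.is_set():
            out.put(("ping", time.time()))
            stop.wait(interval_s)  # sleep, but wake promptly on shutdown

    threading.Thread(target=run, daemon=True).start()
    return stop  # caller sets this event to shut the receiver down
```

A consumer would drain the queue and treat each ping as a watermark/liveness signal.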
On Tue, Apr 24, 2018 at 10:09 PM Eugene Kirpichov
wrote:
> Kenn - I'm arguing that in Spark, SDF-style computation cannot be
> expressed at all, and neither can Beam's timers.
>
>
> Note that a KafkaDoFn still needs to be provided, but could be a DoFn that
> fails loudly if it's actually called in the short term rather than a full
> Python implementation.
For configurable runner-native IO, for now, I think it is reasonable to use
a URN + special data payload directly
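The "KafkaDoFn that fails loudly" plus "URN + special data payload" idea can be sketched as a placeholder whose body only exists so the pipeline is complete, while the runner matches on the URN and substitutes its native connector. Everything below (class name, URN string, payload shape) is illustrative, not an actual Beam URN or API:

```python
class NativeKafkaReadPlaceholder:
    """Placeholder body for a runner-native Kafka source.

    The transform is identified to the runner by a URN plus an opaque
    payload; this Python body fails loudly if a runner ever executes it
    instead of swapping in the native connector.
    """

    URN = "example:runner:kafka_read:v1"  # hypothetical URN

    def __init__(self, payload):
        # e.g. serialized broker/topic configuration, opaque to the SDK
        self.payload = payload

    def process(self, element):
        raise RuntimeError(
            "Placeholder for native transform %r was executed; the runner "
            "did not substitute its native Kafka connector." % self.URN)
```

The point is that a pipeline containing this transform is still a valid proto to ship to the runner, while any accidental local execution surfaces immediately.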
Kenn - I'm arguing that in Spark, SDF-style computation cannot be expressed
at all, and neither can Beam's timers.
Spark, unlike Flink, does not have a timer facility (only state), and as
far as I can tell its programming model has no other primitive that can map
a finite RDD into an infinite
On Tue, Apr 24, 2018 at 1:14 PM Thomas Weise wrote:
> Hi Cham,
> Thanks for the feedback!
> I should have probably clarified that my POC and questions aren't
> specific to Kafka as a source, but apply to pretty much any other
> source/sink that we use internally as well. We have existing
I don't think I understand what the limitations of timers are that you are
referring to. FWIW I would say implementing other primitives like SDF is an
explicit non-goal for Beam state & timers.
I got lost at some point in this thread, but is it actually necessary that
a bounded PCollection maps
Would like to revive this thread one more time.
At this point I'm pretty certain that Spark can't support this out of the
box and we're gonna have to make changes to Spark.
Holden, could you suggest some Spark experts (yourself included :) ) who
could advise what kind of Spark change
Hi Cham,
Thanks for the feedback!
I should have probably clarified that my POC and questions aren't specific
to Kafka as a source, but apply to pretty much any other source/sink that
we use internally as well. We have existing Flink pipelines that are
written in Java, and we want to use the same connectors
Hi Thomas,
Seems like we are working on similar (partially) things :).
On Tue, Apr 24, 2018 at 9:03 AM Thomas Weise wrote:
> I'm working on a mini POC to enable Kafka as a custom streaming source for
> a Python pipeline executing on the (in-progress) portable Flink runner.
>
>
1. Yep.
2. I would phrase this as "merge is the default, but we may (should?)
squash/rewrite obvious fixup commits."
3. If separable commits belong to the same PR, let's keep them in the same
PR.
Sounds like we made a mistake for this PR, and should err more on the side
of not squashing.
On Tue,
I would like to push back on 2 & 3 and replace it with this:
(not 2, not 3) Don't squash/rebase someone's pull request unless it is
obvious that they are OK with it
If someone has used git in the standard way and you squash their commits,
you may have broken their local branch, and it is (minor) data loss
Kenn: yes, MetaStore is user-facing. Users can choose to implement their
own MetaStore; currently there is only an InMemory implementation in the
Beam codebase.
Andrew: I like the second option, since it "retains the ability for DDL
operations to be processed by a custom MetaStore.", IMO we should have the
I'm working on a mini POC to enable Kafka as a custom streaming source for
a Python pipeline executing on the (in-progress) portable Flink runner.
We eventually want to use the same native Flink connectors for sources and
sinks that we also use in other Flink jobs.
I got a simple example to work
Thanks for the feedback. Sounds like there are a few takeaways from the
discussion:
1. Not everyone is a git power user, squash and merge is commonly used and
can't be disabled.
2. As a non-committer I should expect my commits to be squashed, rewritten,
or otherwise changed at the discretion of
Can you say more about how the metastore is used? I presume it is or will
be user-facing, so are Beam SQL users already providing their own?
I'm sure we want something like that eventually to support things like
Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using
Beam SQL
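The "create if needed" metastore logic discussed above can be sketched as a tiny in-memory store. Beam SQL's real MetaStore interface is Java; this Python sketch and all its names are illustrative, showing only the create-vs-get-or-create distinction:

```python
class InMemoryMetaStore:
    """Hypothetical sketch of a user-provided metastore.

    Mirrors the two behaviors discussed in the thread: strict DDL-style
    creation that fails on duplicates, and "create if needed" lookup.
    """

    def __init__(self):
        self._tables = {}  # table name -> schema

    def create_table(self, name, schema):
        """Strict create: a DDL CREATE TABLE should not clobber state."""
        if name in self._tables:
            raise ValueError("table %r already exists" % name)
        self._tables[name] = schema

    def get_or_create(self, name, schema):
        """'Create if needed': return the existing schema, or register
        the given one when the table is missing."""
        return self._tables.setdefault(name, schema)
```

A backing implementation for Apache Atlas or HCatalog would expose the same two operations against the external catalog instead of a dict.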
What I am trying to figure out is which of these we have:
(a) the "test is broken" scenario
- pass in maven (thread did not leak)
- pass in idea (thread did not leak)
- fail in gradle (thread did not leak, but the test incorrectly thinks one
did)
(b) the "found a bug" scenario
- pass in
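The "test is broken" vs "found a bug" distinction above can be sketched in Python by analogy (the actual Beam check lives in the Java/Gradle test harness; this helper and its names are illustrative): snapshot live threads before the test, then report only genuinely new, still-alive threads afterward.

```python
import threading
import time

def run_with_leak_check(test_fn, grace_s=0.2):
    """Run test_fn and return any threads it left alive.

    An empty result means no leak; a non-empty one is either a real bug
    (scenario b) or, if the harness miscounts its own helper threads, the
    "test is broken" scenario (a).
    """
    before = set(threading.enumerate())
    test_fn()
    time.sleep(grace_s)  # let short-lived threads wind down first
    return [t for t in threading.enumerate()
            if t not in before and t.is_alive()]
```

Taking the before-snapshot (rather than asserting on an absolute thread count) is what keeps runner-owned background threads from producing false positives.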
@Kenn,
Yes, it is through the Elasticsearch test framework that you reference that
I discovered the direct runner counter metrics thread leak. I opened this
ticket https://issues.apache.org/jira/browse/BEAM-3119 at the time.
But it seemed overkill to ship the ES test framework in Romain's PR unit
Hi Kamil,
I did not write or schedule the Spark performance tests, but I see in the log
(https://builds.apache.org/blue/organizations/jenkins/beam_PerformanceTests_Spark/detail/beam_PerformanceTests_Spark/1628/pipeline)
that it uses PerfKit, and also that the job being run is this one:
Hi Etienne,
I was recently playing a lot with BigQuery while working on an anomaly
detection tool and noticed that in the DB schema the timestamp is defined
as FLOAT. Perfkit also produces it as a float:
'timestamp': 1524485484.41655,
so the upload passes.
Probably it was defined as float from the
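For reference, a float value like the one above is just seconds since the Unix epoch, so converting it back to a human-readable instant is a one-liner:

```python
from datetime import datetime, timezone

ts = 1524485484.41655  # the float value Perfkit uploads
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.isoformat())  # an instant on 2018-04-23 at 12:11:24 UTC
```

This is why the upload passes despite the FLOAT schema: epoch seconds round-trip cleanly, at the cost of sub-microsecond precision.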