>> (e.g. some class internals, not exposed through the class
>> interface). I'm not sure if adding methods with alternative signatures only
>> because of the time is the way to go (lots of duplication, low gain?). I
>> think we should hold off on places like this until the 3.0 version.
I've posted a full PR for the Java exception handling API that's ready for
review: https://github.com/apache/beam/pull/6586
It implements new WithErrors nested classes on MapElements,
FlatMapElements, Filter, AsJsons, and ParseJsons.
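(To give a flavor of the intended usage, here's an illustrative sketch;
exact class and method names may differ from what's in the PR:)

WithFailures.Result<PCollection<Integer>, String> result = words.apply(
    MapElements.into(TypeDescriptors.integers())
        .via((String word) -> 100 / word.length())
        // elements whose function application throws are routed to failures
        .exceptionsInto(TypeDescriptors.strings())
        .exceptionsVia(ee -> ee.exception().getMessage()));
PCollection<Integer> output = result.output();
PCollection<String> failures = result.failures();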
On Wed, Oct 3, 2018 at 7:55 PM Jeff Klukas wrote:
> J
> k whether we can implement this on ParDo as well,
> though that might be a bit trickier since it involves hooking into
> DoFnInvoker.
>
> Reuven
>
> On Fri, Oct 5, 2018 at 10:33 AM Jeff Klukas wrote:
>
>> I've posted a full PR for the Java exception handling API that's rea
constructor and thus isn't
AvroCoder-compatible. I'm currently handling this by failing early with a
message that explains how to choose a different exception type.
On Fri, Oct 5, 2018 at 3:59 PM Jeff Klukas wrote:
> It would be ideal to have some higher-level way of wrapping a PTransform
> to
Chaim - If the full list of IDs is able to fit comfortably in memory and
the Mongo collection is small enough that you can read the whole
collection, you may want to fetch the IDs into a Java collection using the
BigQuery API directly, then turn them into a Beam PCollection using
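(A sketch of that approach, assuming the google-cloud-bigquery client
library; the query and field name are illustrative:)

// Fetch the IDs with the BigQuery client, outside of Beam.
BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
List<String> ids = new ArrayList<>();
TableResult result =
    bigquery.query(QueryJobConfiguration.of("SELECT id FROM mydataset.mytable"));
for (FieldValueList row : result.iterateAll()) {
  ids.add(row.get("id").getStringValue());
}
// Turn the in-memory collection into a PCollection.
PCollection<String> idCollection = pipeline.apply(Create.of(ids));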
code for
exception handling is nearly identical between the three classes and could
be consolidated without altering the current public interfaces. Are there
concerns with adding such a base class?
On Thu, Oct 11, 2018 at 4:44 PM Jeff Klukas wrote:
> The PR (https://github.com/apache/beam/p
Max - The website source was indeed merged into the main beam repository a
few weeks ago, separate from this change.
The edit button is a great idea!
On Thu, Oct 25, 2018 at 7:37 AM Maximilian Michels wrote:
> Cool!
>
> I guess the underlying change is that the website can now be edited
>
On Thu, Oct 25, 2018 at 3:17 PM Kenneth Knowles wrote:
> What I haven't figured out is how to get GitHub to create the branch for
> the PR on your fork.
>
GitHub does that work for you. I don't have commit access to apache/beam,
so when I hit the link, it gives me a banner at the top of the
I'm adding a new lastModifiedMillis field to MatchResult.Metadata [0] which
requires also updating MetadataCoder, but it's not clear to me whether
there are guidelines to follow when evolving a type when that changes the
encoding.
Is a user allowed to update Beam library versions as part of
Lukasz - Thanks for those links. That's very helpful context.
It sounds like there's no explicit user contract about evolving Coder
classes in the Java SDK and users might reasonably assume Coders to be
stable between SDK versions. Thus, users of the Dataflow or Flink runners
might reasonably
I just wrote up a JIRA issue proposing that FileSystem implementations
retrieve lastModified time of the files they list:
https://issues.apache.org/jira/browse/BEAM-5910
Any immediate concerns? I'm not intimately familiar with HDFS, but I'm
otherwise confident that GCS, S3, and local filesystems
Anton - Thanks for reading and commenting. I've gone as far as creating a
skeleton AutoValue extension to better understand how that API works, but I
don't yet have a working prototype for either of these proposed additions.
I'll move on to prototyping the Coder generation for AutoValue classes
Hi all - I'm looking for some review and commentary on a proposed design
for providing built-in Coders for AutoValue classes. There's existing
discussion in BEAM-1891 [0] about using AvroCoder, but that's blocked on
incompatibility between AutoValue and Avro's reflection machinery that
don't look
to turn it into a Row if a
> schema is registered).
>
> We already do this for POJOs and basic JavaBeans. I'm happy to help do
> this for AutoValue.
>
> Reuven
>
> On Sat, Nov 10, 2018 at 5:50 AM Jeff Klukas wrote:
>
>> Hi all - I'm looking for some review and commentary
Those types would define the schema, and the fromRow implementation would
call the discovered constructor.
On Mon, Nov 12, 2018 at 10:02 AM Reuven Lax wrote:
>
>
> On Mon, Nov 12, 2018 at 11:38 PM Jeff Klukas wrote:
>
>> Reuven - A SchemaProvider makes sense. It's not c
ated
> >> >>> Metadata object)? In this case, where is the right place to identify
> >> >>> and decide what coder to use?
> >> >>>
> >> >>> Other ideas... ?
> >> >>>
> >> >>> Last thing, th
>
> On Tue, Oct 2, 2018 at 4:49 PM Jeff Klukas wrote:
>
>> I've seen a few Beam users mention the need to handle errors in their
>> transforms by using a try/catch and routing to different outputs based on
>> whether an exception was thrown. This was particularly nicely wr
anges are related. Please open a JIRA for tracking.
>>
>> Have you thought about introducing a similar change to the Python SDK?
>> (doesn't have to be the same PR).
>>
>> - Cham
>>
>> On Wed, Oct 3, 2018 at 8:31 AM Jeff Klukas wrote:
>>
>
I've seen a few Beam users mention the need to handle errors in their
transforms by using a try/catch and routing to different outputs based on
whether an exception was thrown. This was particularly nicely written up in
a post by Vallery Lancey:
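(For reference, that pattern looks roughly like the following; the
transform body here is an illustrative sketch:)

final TupleTag<String> successTag = new TupleTag<String>() {};
final TupleTag<String> failureTag = new TupleTag<String>() {};
PCollectionTuple results = input.apply(ParDo.of(
    new DoFn<String, String>() {
      @ProcessElement
      public void processElement(@Element String element, MultiOutputReceiver out) {
        try {
          out.get(successTag).output(process(element));  // process() is hypothetical
        } catch (Exception e) {
          out.get(failureTag).output(element);
        }
      }
    }).withOutputTags(successTag, TupleTagList.of(failureTag)));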
I'm working on https://issues.apache.org/jira/browse/BEAM-5638 to add
exception handling options to single message transforms in the Java SDK.
MapElements' via() method is overloaded to accept either a SimpleFunction,
a SerializableFunction, or a Contextful, all of which are ultimately stored
as
Jira issues for adding exception handling in Java and Python SDKs:
https://issues.apache.org/jira/browse/BEAM-5638
https://issues.apache.org/jira/browse/BEAM-5639
I'll plan to have a complete PR for the Java SDK put together in the next
few days.
On Wed, Oct 3, 2018 at 1:29 PM Jeff Klukas
> * can't supply display data?
>
> +u...@beam.apache.org, do users think that the
> provided API would be useful enough for it to be added to the core SDK or
> would the addition of the method provide noise/detract from the existing
> API?
>
> On Mon, Sep 17, 2018 at 12:57 P
>>> java.time
>>> with joda-time. See the giant schema PR here:
>>> https://github.com/apache/beam/pull/4964 I don't think this was ever
>>> discussed on the list though.
>>>
>>> Andrew
>>>
>>> On Wed, Sep 26, 2018 at 9:21 AM Jeff Klukas <
I'm building a high-volume Beam pipeline using PubsubIO and running into
some concerns over performance and delivery semantics, prompting me to want
to better understand the implementation. Reading through the library,
PubsubIO appears to be a completely separate implementation of Pubsub
client
html
> [5]
> https://github.com/apache/beam/blob/2e759fecf63d62d110f29265f9438128e3bdc8ab/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubJsonClient.java#L130
>
> The latter seems to be the older library, so I would assume it's for
> legacy reasons.
>
> Reg
How about: "Once the watermark progresses past the end of a window, any
further elements that arrive with a timestamp in that window are considered
late data."
On Thu, Jan 17, 2019 at 1:43 PM Rui Wang wrote:
> Hi Community,
>
> In Beam programming guide [1], there is a sentence: "Data that
> Though it's not tied to the window. You could be in the global window, so the
> watermark never advances past the end of the window, yet still get late
> data.
>
> On Thu, Jan 17, 2019, 11:14 AM Jeff Klukas
>> How about: "Once the watermark progresses past the end of
> add a comment to
> make the test more clear. I am assuming the modified times get copied, not
> re-timestamped on copy, which is why your method works.
> Otherwise looks good to me
>
> On Wed, Jan 23, 2019 at 5:49 AM Jeff Klukas wrote:
>
>> Posted https://github.com/apache/b
to Eugene's suggestion
#1.
On Wed, Jan 23, 2019 at 8:12 AM Jeff Klukas wrote:
> Suggestion #4: Create source files outside the writer thread, and then
> copy them from a source directory to the watched directory. That should
> atomically write the file with the already known lastModific
>> 2) Implement a Matcher that ignores last modified time and use
>> that
>>
>> Jeff - your option #3 is unfortunately also race-prone, because the code
>> may match the files after they have been written but before
>> setLastModifiedTime was called.
>>
>>
Suggestion #4: Create source files outside the writer thread, and then copy
them from a source directory to the watched directory. That should
atomically write the file with the already known lastModificationTime.
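(In test code, that might look like the following sketch, using
java.nio.file; paths are illustrative:)

// COPY_ATTRIBUTES preserves the source file's lastModifiedTime, so the
// file appears in the watched directory with its timestamp already set.
Files.copy(
    sourceDir.resolve("input.txt"),
    watchedDir.resolve("input.txt"),
    StandardCopyOption.COPY_ATTRIBUTES);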
On Wed, Jan 23, 2019 at 7:37 AM Jeff Klukas wrote:
> I'll work on getting a
Thanks to Kenn Knowles for picking up the review here. The PR is merged, so
the new interface ProcessFunction and abstract class InferableFunction
(both of which declare `throws Exception`) are now available in master.
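(In practice, that means a lambda throwing a checked exception can now be
passed directly; a sketch, where parseTimestamp is a hypothetical method
that declares a checked exception:)

PCollection<Instant> timestamps = lines.apply(
    MapElements.into(TypeDescriptor.of(Instant.class))
        // resolves to the via(ProcessFunction) overload, so no try/catch
        // wrapper is needed for the checked exception
        .via((String line) -> parseTimestamp(line)));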
On Fri, Dec 14, 2018 at 12:18 PM Jeff Klukas wrote:
> Check
The general approach in your example looks reasonable to me. I don't think
there's anything built in to Beam to help with parsing the tar file format
and I don't know how robust the method of replacing "^@" and then splitting
on newlines will be. I'd likely use Apache's commons-compress library
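(For example, a sketch using commons-compress; the file name is
illustrative:)

try (TarArchiveInputStream tar = new TarArchiveInputStream(
    new BufferedInputStream(new FileInputStream("events.tar")))) {
  TarArchiveEntry entry;
  while ((entry = tar.getNextTarEntry()) != null) {
    if (!entry.isFile()) {
      continue;
    }
    byte[] content = new byte[(int) entry.getSize()];
    IOUtils.readFully(tar, content);  // org.apache.commons.compress.utils.IOUtils
    // parse the entry's bytes here rather than splitting on control characters
  }
}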
using it either opt-in or, if a good enough case can be made, possibly even
>> opt-out could get this unblocked.
>>
>> On Mon, Nov 26, 2018 at 3:05 PM Jeff Klukas wrote:
>>
>>> Lukasz - Were you able to get any more context on the possibility of
>>> versioni
for developers to
be familiar with rather than 2.
On Fri, Dec 7, 2018 at 6:54 AM Robert Bradshaw wrote:
> How should we move forward on this? The idea looks good, and even
> comes with a PR to review. Any objections to the names?
> On Wed, Dec 5, 2018 at 12:48 PM Jeff Klukas wrote:
You can likely achieve what you want using FileIO with dynamic
destinations, which is described in the "Advanced features" section of the
TextIO docs [0].
Your case might look something like:
PCollection<Event> events = ...;  // sketch: Event and its methods are illustrative
events.apply(FileIO.<String, Event>writeDynamic()
    .by(event -> event.getType())
    .via(Contextful.fn(event -> event.toString()), TextIO.sink())
    .to("output_path")
    .withNaming(type -> FileIO.Write.defaultNaming(type, ".txt")));
>> introduce this giving time for people to
>> migrate, would it make sense to do that now and deprecate the alternatives
>> that it replaces?
>>
>> On Mon, Nov 26, 2018 at 5:59 AM Jeff Klukas wrote:
>>
>>> Picking up this thread again. Based on the
Reminder that I'm looking for review on
https://github.com/apache/beam/pull/7160
On Thu, Nov 29, 2018, 11:48 AM Jeff Klukas wrote:
> I created a JIRA and a PR for this:
>
> https://issues.apache.org/jira/browse/BEAM-6150
> https://github.com/apache/beam/pull/7160
>
> On naming
mes. This has
> everything I love.
>
> Kenn
>
> On Wed, Nov 28, 2018 at 6:43 PM Jeff Klukas wrote:
>
>> It looks like we can make the new interface a superinterface for the
>> existing SerializableFunction while maintaining binary compatibility [0].
kedin.com/in/rmannibucau> | Book
>>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 14, 2018 at 10:49 AM, Reuven Lax wrote:
> a new
> snapshot with the encodings modified.
>
> Reuven
>
> On Tue, Nov 13, 2018 at 3:16 AM Jeff Klukas wrote:
>
>> Conversation here has fizzled, but sounds like there's basically a
>> consensus here on a need for a new concept of Coder versioning that's
>> accessible
ing constructor support
>>> to this right now.
>>>
>>> On Wed, Nov 14, 2018 at 12:29 AM Jeff Klukas
>>> wrote:
>>>
>>>> Sounds, then, like we need to define a new `AutoValueSchema extends
>>>> SchemaProvider` and users
There is also JMESPath (http://jmespath.org/) which is quite similar to
JsonPath, but does have a spec and lacks the leading $ character. The AWS
CLI uses JMESPath for defining queries.
On Mon, Jan 7, 2019 at 1:05 PM Reuven Lax wrote:
>
>
> On Mon, Jan 7, 2019 at 1:44 AM Robert Bradshaw
>
> do with Flattening two PCollections together) with their original trigger.
>> Without this, we also know that you can have three PCollections with
>> identical triggering and you can CoGroupByKey them together but you cannot
>> do this three-way join as a sequence of binary joins.
Lax wrote:
> Ah,
>
> numShards = 0 is explicitly not supported in unbounded mode today, for the
> reason mentioned above. If FileIO doesn't reject the pipeline in that case,
> we should fix that.
>
> Reuven
>
> On Fri, Jan 11, 2019 at 9:23 AM Jeff Klukas
Hello all, I'm a data engineer at Mozilla working on a first project using
Beam. I've been impressed with the usability of the API as there are good
built-in solutions for handling many simple transformation cases with
minimal code, and wanted to discuss one bit of ergonomics that seems to be
I've gone ahead and filed a JIRA Issue and GitHub PR to follow up on this
suggestion and make it more concrete:
https://issues.apache.org/jira/browse/BEAM-5413
https://github.com/apache/beam/pull/6414
On Fri, Sep 14, 2018 at 1:42 PM Jeff Klukas wrote:
> Hello all, I'm a data engin
I don't think I'm fully understanding the input part of your pipeline since
it looks like it's using some custom IO, but I am fairly confident that
FileIO.write() will never produce empty files, so the behavior you describe
sounds expected if you're reading in an empty file.
FileIO doesn't copy
I'm looking for feedback on a new attempt at implementing an exception
handling interface for map transforms as previously discussed on this list
[0] and documented in JIRA [1]. I'd like for users to be able to pass a
function into MapElements, FlatMapElements, etc. that potentially raises an
> where each step can either be "fulfilled" or "rejected". The
> promises can then be chained together like above.
>
>
> On Mon, Feb 11, 2019 at 1:47 PM Jeff Klukas wrote:
>
>> Vallery Lancey's post is definitely one of the viewpoints incorporated
>> in
> like to add another option that adds the A+ Promise spec into the
> apply method. This makes exception handling more general than with only the
> Map method.
>
> On Fri, Feb 8, 2019 at 9:53 AM Jeff Klukas wrote:
>
>> I'm looking for feedback on a new attempt at implementing an except
I haven't needed to do this with Beam before, but I've definitely had
similar needs in the past. Spark, for example, provides an input_file_name
function that can be applied to a dataframe to add the input file as an
additional column. It's not clear to me how that's implemented, though.
Perhaps
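(In Beam, one way to get the equivalent is through FileIO, which exposes
each file's metadata; a sketch that reads each file whole:)

PCollection<KV<String, String>> fileAndContents = pipeline
    .apply(FileIO.match().filepattern("gs://bucket/input/*.txt"))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        FileIO.ReadableFile file = c.element();
        // Key each file's contents by the name of the file it came from.
        c.output(KV.of(file.getMetadata().resourceId().toString(),
                       file.readFullyAsUTF8String()));
      }
    }));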
I would prefer we move towards option [2]. I just tried the following
refactor in my own code from:
return input
.apply(TextIO.read().from(fileSpec));
to:
return input
    .apply(FileIO.match().filepattern(fileSpec))
    .apply(FileIO.readMatches())
    .apply(TextIO.readFiles());  // sketch: readFiles() completes the equivalent chain
> processes each chunk in a ParDo. TextIO.readFiles currently chops up each file into
> 64mb chunks (hardcoded), and then processes each chunk in a ParDo.
>
> Reuven
>
>
> On Wed, Jan 30, 2019 at 9:18 AM Jeff Klukas wrote:
>
>> I would prefer we move towards option [2]. I
I've started experimenting with Beam schemas in the context of creating
custom AutoValue-based classes and using AutoValueSchema to generate
schemas and thus coders.
AFAICT, schemas need to have types fully specified, so it doesn't appear to
be possible to define an AutoValue class with a type
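(For contrast, the fully specified case that does work looks something
like this sketch; field names are illustrative:)

@DefaultSchema(AutoValueSchema.class)
@AutoValue
public abstract class Event {
  public abstract String name();
  public abstract long timestampMillis();

  @SchemaCreate
  public static Event create(String name, long timestampMillis) {
    return new AutoValue_Event(name, timestampMillis);
  }
}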
For getting started reading data, there are some public S3 buckets on which
Amazon hosts data used in tutorials. For example, you should be able to
access s3://awssampledbuswest2 which is referenced in Redshift tutorials
[0].
Amazon also has a free tier for S3 for the first year an account is
Beam devs - I'm looking for review on
https://github.com/apache/beam/pull/8945 which builds on a previous PR by
Wout Sheepers that was never merged due to concerns over evolving the coder
for TableDestination.
The new PR defaults to the existing TableDestinationCoderV2 and ensures the
new
What you propose with a writer per bundle is definitely possible, but I
expect the blocker is that in most cases the runner has control of bundle
sizes and there's nothing exposed to the user to control that. I've wanted
to do similar, but found average bundle sizes in my case on Dataflow to be
so
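(For reference, the shape of a per-bundle writer; a sketch in which
Writer and openWriter() are hypothetical:)

class WritePerBundleFn extends DoFn<String, Void> {
  private transient Writer writer;

  @StartBundle
  public void startBundle() throws IOException {
    writer = openWriter();  // open one writer per bundle
  }

  @ProcessElement
  public void processElement(@Element String element) throws IOException {
    writer.write(element);
  }

  @FinishBundle
  public void finishBundle() throws IOException {
    writer.close();  // flush exactly once when the runner finishes the bundle
  }
}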
The API reference docs (Java and Python at least) are versioned, so we have
a durable reference there and it's possible to link to particular sections
of API docs for particular versions.
For the major bits of introductory documentation (like the Beam Programming
Guide), I think it's a good thing
+1 (non-binding)
On Thu, Dec 12, 2019 at 11:58 PM Kenneth Knowles wrote:
> Please vote on the proposal for Beam's mascot to be the Firefly. This
> encompasses the Lampyridae family of insects, without specifying a genus or
> species.
>
> [ ] +1, Approve Firefly being the mascot
> [ ] -1,
I've had a one-line bug fix PR open since last week that I'd love to get
merged for 2.17.
Would a committer be willing to take a look at it?
https://github.com/apache/beam/pull/9784
Sameer - Also note that there's some prior art for optional components,
particularly compression algorithms. We added support for zstd compression,
but document that it's the user's responsibility to make sure the relevant
library is available on the classpath [0].
There was some interest in a
Jason - I was going to suggest the same approach that you arrived at. I
think using a static method that returns a MapElements instance is an
elegant solution.
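(A sketch of that pattern, with illustrative names:)

static MapElements<String, String> normalize() {
  return MapElements
      .into(TypeDescriptors.strings())
      .via((String s) -> s.trim().toLowerCase());
}

// Applying it reads naturally, and an explicit name can still be supplied:
// input.apply("Normalize", normalize());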
The one drawback I can think of with that approach is that you lose control
over default naming. From your example usage:
Also note that runners in many cases will fuse discrete transforms together
for efficiency, so while you might worry about performance degradation from
breaking your logic into many discrete transforms, that likely won't be an
issue in practice.
Also note that you have the option of defining
Per the Contribution Guide, I'm bringing a PR that's been idle for more
than 3 days to the dev list to find a reviewer.
This change doesn't add any new code, but makes the existing
withMaxBytesPerPartition method public
and adds documentation concerning its use.
I personally like the sound of "Datum" as a name. I also like the idea of
not assigning them a gender.
As a counterpoint on the naming side, one of the slide decks provided while
iterating on the design mentions:
> Mascot can change colors when it is “full of data” or has a “batch of
data” to
Beam is able to infer compression from file extensions for a variety of
formats, but snappy is not among them currently:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Compression.java
Although ParquetIO and AvroIO each look to have support for
It's definitely desirable to be able to get back Map types from BQ, and
it's nice that BQ is consistent in representing maps as repeated key/value
structs. Inferring maps from that specific structure is preferable to
inventing some new naming convention for the fields, which would hinder
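(For concreteness, the repeated key/value struct convention expressed
with the BigQuery model classes; a sketch with illustrative field names:)

TableFieldSchema mapField = new TableFieldSchema()
    .setName("attributes")
    .setType("RECORD")
    .setMode("REPEATED")
    .setFields(ImmutableList.of(
        new TableFieldSchema().setName("key").setType("STRING"),
        new TableFieldSchema().setName("value").setType("STRING")));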
> I don't think BigQuery offers a way to automatically create datasets when
> writing.
This exactly, and it also makes sense from the standpoint of the BQ
permissions model. In BigQuery, table creation requires that the service
account have privileges at the dataset level, but dataset creation
I have certainly had use cases in the past where I've made use of the Kafka
consumer library's ability to consume from a set of topics based on a
regular expression.
Specifically, I worked with a microservices architecture where each service
had a Postgres DB with a logical decoding client for
Ke - You are correct that generally data encoded with a previous coder
version cannot be read with an updated coder. The formats have to match
exactly.
As far as I'm aware, it's necessary to flush a job and start with fresh
state in order to upgrade coders.
On Fri, Nov 6, 2020 at 2:13 PM Ke Wu
On Fri, May 21, 2021 at 5:10 AM Jeff Klukas wrote:
>
>> Beam users,
>>
>> We're attempting to write a Java pipeline that uses Count.perKey() to
>> collect event counts, and then flush those to an HTTP API every ten minutes
>> based on processing time.
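(One way to express that kind of flush is a processing-time trigger; a
sketch of the standard pattern, with illustrative input types:)

// events: PCollection<KV<String, String>> keyed by event type (illustrative)
PCollection<KV<String, Long>> counts = events
    .apply(Window.<KV<String, String>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(10))))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes())
    .apply(Count.perKey());
// each fired pane's counts can then be flushed to the HTTP API downstream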
We have a few transform-level tests where we check counter behavior in
Mozilla's telemetry pipeline codebase. See:
https://github.com/mozilla/gcp-ingestion/blob/fa98ac0c8fa09b5671a961062e6cf0985ec48b0e/ingestion-beam/src/test/java/com/mozilla/telemetry/decoder/GeoIspLookupTest.java#L77-L83
But