Re: Removing DATETIME in Java Schemas

2020-05-15 Thread Reuven Lax
On Fri, May 15, 2020 at 8:10 PM Kenneth Knowles wrote: > On Fri, May 15, 2020 at 5:25 PM Brian Hulette wrote: >> After thinking about this more I've softened on it some, but I'm still a little wary. I like Kenn's suggestion: >> > - it is OK to convert known logical types like

Re: Removing DATETIME in Java Schemas

2020-05-15 Thread Reuven Lax
On Fri, May 15, 2020 at 5:25 PM Brian Hulette wrote: > After thinking about this more I've softened on it some, but I'm still a little wary. I like Kenn's suggestion: > > - it is OK to convert known logical types like MillisInstant or NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL

Re: Removing DATETIME in Java Schemas

2020-05-15 Thread Kenneth Knowles
On Fri, May 15, 2020 at 5:25 PM Brian Hulette wrote: > After thinking about this more I've softened on it some, but I'm still a little wary. I like Kenn's suggestion: > > - it is OK to convert known logical types like MillisInstant or NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL

Re: [Proposal] Apache Beam Fn API - GCP IO Debuggability Metrics

2020-05-15 Thread Alex Amato
Thanks everyone. I was able to collect a lot of good feedback from everyone who contributed. I am going to wrap it up for now and label the design as "Design Finalized (Unimplemented)". I really believe we have made a much better design than I initially wrote up. I couldn't have done it without

Re: [DISCUSS] New process for proposing ApacheBeam tweets

2020-05-15 Thread Robert Bradshaw
Sounds like a good plan to me, but I haven't been the one monitoring this spreadsheet (or twitter for that matter). Spam is a concern, but everything is moderated so I think we try it out and see if the volume is really high enough to be an issue. On Fri, May 15, 2020 at 4:46 PM Kenneth Knowles

Re: Removing DATETIME in Java Schemas

2020-05-15 Thread Brian Hulette
After thinking about this more I've softened on it some, but I'm still a little wary. I like Kenn's suggestion: > - it is OK to convert known logical types like MillisInstant or NanosInstant to a ZetaSQL "TIMESTAMP" or Calcite SQL "TIMESTAMP WITH LOCAL TIME ZONE" > - it is OK to convert unknown

Re: [DISCUSS] New process for proposing ApacheBeam tweets

2020-05-15 Thread Kenneth Knowles
I like having easier notifications. It would be great if the notifications had the content also. I get notifications on the spreadsheet, but since I have to click through to look at them there is a little bit of friction. 1. Is it still easy to add the other columns to record LGTM and when they

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Boyuan Zhang
Hi Steve, Yes that's correct. On Fri, May 15, 2020 at 2:11 PM Steve Niemitz wrote: > ah! ok awesome, I think that was the piece I was misunderstanding. So I _can_ use a SDF to split the work initially (like I was manually doing in #1), but it just won't be further split dynamically on

Re: Apache Beam application to Season of Docs 2020

2020-05-15 Thread Aizhamal Nurmamat kyzy
@all We are receiving a few emails from interested applicants - feel no pressure to respond to them. I will be monitoring the dev list and respond accordingly. If they have specific questions regarding either of the projects, I will direct them to the mentors. Stay safe, Aizhamal On Mon, May

[DISCUSS] New process for proposing ApacheBeam tweets

2020-05-15 Thread Aizhamal Nurmamat kyzy
Hi all, I wanted to propose some improvements to the existing process of proposing tweets for the Apache Beam Twitter account. Currently the process requires people to request edit access, and then add tweets on the spreadsheet [1]. I use it a lot because I know the process well, but I think

Re: Removing DATETIME in Java Schemas

2020-05-15 Thread Reuven Lax
I would not describe the base type as the "wire type." If that were true, then the only base type we should support should be byte array. My simple point is that this is no different than normal schema fields. You will find many normal schemas containing data encoded into other field types. You

Re: Removing DATETIME in Java Schemas

2020-05-15 Thread Andrew Pilloud
My understanding is that the base type is effectively the wire format of the type, where the logical type is the in-memory representation for Java. For org.joda.time.Instant, this is just a wrapper around the underlying Long. However, for the Date logical type, the LocalDate type has struct as the

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
ah! ok awesome, I think that was the piece I was misunderstanding. So I _can_ use a SDF to split the work initially (like I was manually doing in #1), but it just won't be further split dynamically on dataflow v1 right now. Is my understanding there correct? On Fri, May 15, 2020 at 5:03 PM Luke

Re: Removing DATETIME in Java Schemas

2020-05-15 Thread Reuven Lax
On Fri, Apr 24, 2020 at 11:56 AM Brian Hulette wrote: > When we created the portable representation of schemas last summer we intentionally did not include DATETIME as a primitive type [1], even though it already existed as a primitive type in Java [2]. There seemed to be consensus around

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Luke Cwik
#3 is the best when you implement @SplitRestriction on the SDF. The size of each restriction is used to better balance the splits within Dataflow runner v2 so it is less susceptible to the too many or unbalanced split problem. For example, if you have 4 workers and make 20 splits, the splits will

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
Thanks for the replies so far. I should have specifically mentioned above, I am building a bounded source. While I was thinking this through, I realized that I might not actually need any fancy splitting, since I can calculate all my split points up front. I think this goes well with Ismaël's

Re: Removing DATETIME in Java Schemas

2020-05-15 Thread Kenneth Knowles
This seems like a good idea. This stuff is all still marked "experimental" for exactly this reason. This is a case where the name fits perfectly. Both SQL and schemas are new and still working towards a form that can be supported indefinitely without layers of workarounds that will never quiesce.

Re: [DISCUSS] Dealing with @Ignored tests

2020-05-15 Thread Luke Cwik
For the ones without the label, someone would need to use blame and track back to why it was sickbayed. On Fri, May 15, 2020 at 1:08 PM Kenneth Knowles wrote: > There are 101 instances of @Ignore, and I've listed them below. A few takeaways: > - highly concentrated in ZetaSQL, and then

Re: [DISCUSS] Dealing with @Ignored tests

2020-05-15 Thread Kenneth Knowles
There are 101 instances of @Ignore, and I've listed them below. A few takeaways: - highly concentrated in ZetaSQL, and then second tier in various state tests specific to a runner - there are not that many overall, so I'm not sure a report will add much - they do not all have Jiras - they do

Re: Java transform executed as Python transform?

2020-05-15 Thread Brian Hulette
I just started having an issue that looks similar this morning. I'm trying out running the Python SqlTransform tests with fn_runner (currently they only execute continuously on Flink and Spark), but I'm running into occasional failures. The errors always come from either python or java attempting

BEAM-9958: Code Review Wanted for PR 11674

2020-05-15 Thread Tomo Suzuki
Hi Luke and Beam committers, Would you check this PR to use Linkage Checker's exclusion file? https://github.com/apache/beam/pull/11674 This script used to use the "diff" command to identify new linkage errors by comparing line by line. With this PR, it identifies new linkage errors in an appropriate

regarding Google Season of Docs

2020-05-15 Thread Yuvraj Manral
Respected sir/ma'am, I came across the projects proposed by Apache Beam for Season of Docs 2020. I am new to the organisation but really liked the project ideas and would love to start contributing and prepare my proposal for Season of Docs. Please guide me through. Where should I start and

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Ismaël Mejía
For the bounded case, if you do not have a straightforward way to split at fractions, or simply do not care about Dynamic Work Rebalancing, you can get away with implementing a simple DoFn-based implementation (without Restrictions) and evolve from it. More and more IOs at Beam are becoming DoFn

Re: TextIO. Writing late files

2020-05-15 Thread Reuven Lax
Lateness should never be introduced inside a pipeline - generally late data can only come from a source. If data was not dropped as late earlier in the pipeline, it should not be dropped after the file write. I suspect that this is a bug in how the Flink runner handles the Reshuffle transform,

Google Season of Docs

2020-05-15 Thread Amr Maghraby
Dear Apache, My name is Amr Maghraby. I am a new graduate from AAST college, where I ranked first in my class with a CGPA of 3.92. I joined the international ROV competition in the US and placed second worldwide, and last summer I took part in Google Summer of Code 2019 and did good work there as well. I

Re: Python Precommit significantly flaky

2020-05-15 Thread Brian Hulette
Thanks to Kyle we captured some additional logging for https://issues.apache.org/jira/browse/BEAM-9975. I spent a little time looking at it and found two different issues (see details in the comments): https://issues.apache.org/jira/browse/BEAM-10006 - PipelineOptions can pick up definitions from

Re: Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Luke Cwik
If it is an unbounded source then SDF is a winner since you are not giving up anything with it when compared to the legacy UnboundedSource API since Dataflow doesn't support dynamic splitting of unbounded SDFs or UnboundedSources (only initial splitting). You gain the ability to compose sources

Writing a new IO on beam, should I use the source API or SDF?

2020-05-15 Thread Steve Niemitz
I'm going to be writing a new IO (in Java) for reading files in a custom format, and want to make it splittable. It seems like I have a choice between the "legacy" source API and the newer experimental SDF API. Is there any guidance on which I should use? I can likely tolerate some API churn as

Java transform executed as Python transform?

2020-05-15 Thread Paweł Urbanowicz
Hey, I created a transform method in Java and now I want to use it in Python using Cross-language. I got pretty stuck with the following problem: p | GenerateSequence(...) | ExternalTransform(...) => is working like a charm. p | Create(...) | ExternalTransform(...) => getting assert

Re: New Grafana dashboards

2020-05-15 Thread Kamil Wasilewski
Fixed, thanks for spotting that! One of the regexes wasn't properly interpreted in the latest version of Grafana, but now it should be OK. On Thu, May 14, 2020 at 11:58 PM Pablo Estrada wrote: > I noticed that postcommit status dashboard shows 0/1 values - I remember it used to show green/red

Re: TextIO. Writing late files

2020-05-15 Thread Jozef Vilcek
Hi Jose, thank you for putting in the effort to build an example which demonstrates your problem. You are using a streaming pipeline and it seems that the watermark downstream has already advanced further, so when your File pane arrives, it is already late. Since you define that lateness is not tolerated, it