Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Eugene Kirpichov
Hi Christopher, So, you have a PCollection, and you're writing it to files. FileIO.write/writeDynamic will write several Documents to each file - however, in your use case some of the individual Documents are so large that you want instead each of those large documents to be split into several f

Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-12-02 Thread Kenneth Knowles
Hi all, I have tweaked Robert's python* and then applied three filters: All voters, committers, and PMC. Summary: - All voters (46): Firefly (but Owl close behind, no others close) - Committers (24): Owl (but Firefly close behind, no others close) - PMC (6): Cuttlefish (but a many-way tie clo

Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Christopher Larsen
Ideally each element (document) will be written to a .thrift file so that it can be compiled without further manipulation. But in the case of an extremely large file I think it would be nice to split into smaller files. As far as splitting points go I think it could be split at a point in the list

Re: Full stream-stream join semantics

2019-12-02 Thread Kenneth Knowles
I agree that in batch the unbounded disorder will prevent the approach in (1) unless the input is sorted. In streaming it works well using watermarks. This is not a reason to reject (1). (1.1) Instead it might make sense to have an annotation that is a hint for *batch* to timesort the input to a s

Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Reuven Lax
What do you mean by shard the output file? Can it be split at any byte location, or only at specific points? On Mon, Dec 2, 2019 at 2:05 PM Christopher Larsen < christopher.lar...@quantiphi.com> wrote: > Hi Reuven, > > We would like to write each element to one file but still allow the runner > t

Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Christopher Larsen
Hi Reuven, We would like to write each element to one file but still allow the runner to shard the output file which could yield more than one output file per element. On Mon, Dec 2, 2019 at 11:55 AM Reuven Lax wrote: > I'm not sure I completely understand the question. Are you saying that you

Re: Request for review of PR [Beam-8564]

2019-12-02 Thread Luke Cwik
I took a look. My biggest concern is finding a good LZO implementation. Looking for one that preferably: 1) Has an Apache license 2) Has zero transitive dependencies 3) Is small 4) Is performant 5) Is native Java or supports execution on the three main OSs (Windows, Linux, Mac) In your PR you suggest

Request for review of PR [Beam-8564]

2019-12-02 Thread Amogh Tiwari
Hi, I have filed a PR for an extension that will enable Apache Beam to work with LZO/LZOP compression. Please refer . I would love it if someone can take this up and review it. Please feel free to share your thoughts/suggestions. Regards, Amogh

Re: Per Element File Output Without writeDynamic

2019-12-02 Thread Reuven Lax
I'm not sure I completely understand the question. Are you saying that you want each element to write to only one file, guaranteeing that two elements are never written to the same file? On Mon, Dec 2, 2019 at 11:53 AM Christopher Larsen < christopher.lar...@quantiphi.com> wrote: > Hi All, > > TL

Per Element File Output Without writeDynamic

2019-12-02 Thread Christopher Larsen
Hi All, TL;DR: can you extend FileIO.sink to write one or more files per element instead of one or more elements per file? In working with Thrift files we have found that since a .thrift file needs to be compiled to generate code the order of the contents of the file is important (i.e., the namespa
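
To make the requested behavior concrete, here is a minimal sketch in plain Python of "one or more files per element": each element gets its own file, and an oversized element is split across several files that never contain data from another element. This is not Beam's FileIO API and not code from the thread; the names write_element and write_all, the (doc_id, records) element shape, and splitting at record boundaries are all assumptions made for illustration.

```python
import os
import tempfile


def write_element(doc_id, records, out_dir, max_records_per_file=1000):
    """Write ONE element to one or more files (hypothetical helper).

    Files are named <doc_id>-<shard>.thrift and never hold records from
    any other element. Splitting at record boundaries is an assumption.
    """
    paths = []
    # max(..., 1) ensures an empty element still produces one (empty) file.
    starts = range(0, max(len(records), 1), max_records_per_file)
    for shard, start in enumerate(starts):
        path = os.path.join(out_dir, f"{doc_id}-{shard:05d}.thrift")
        chunk = records[start:start + max_records_per_file]
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(record + "\n" for record in chunk)
        paths.append(path)
    return paths


def write_all(elements, out_dir, max_records_per_file=1000):
    """Write every (doc_id, records) pair; return doc_id -> list of paths."""
    return {
        doc_id: write_element(doc_id, records, out_dir, max_records_per_file)
        for doc_id, records in elements
    }
```

Whether a split piece of a .thrift file still compiles on its own depends on where the split falls, so a real implementation would need split points that respect the file's structure.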

Re: [DISCUSS] @Experimental annotations - processes and alternatives

2019-12-02 Thread Alexey Romanenko
Thank you Kenn for starting this discussion. As I see it, for now, the main goal for “@Experimental” annotation is to relive and be useful in the sense as its name says (this is obviously not the case for the moment). I'd suggest a bit more simplified scenario for this: 1. We do a revision of all “

Re: Update on push-down for SQL IOs.

2019-12-02 Thread Kirill Kozlov
> > ParquetIO, CassandraIO/HBaseIO/BigTableIO (all should be about the same), > JdbcIO, IcebergIO (doesn't exist yet, but is basically generalized > schema-aware files as I understand it). I think that adding Jiras with a tag "starter" for implementing push-down for all of the IO interfaces listed

Re: [EXTERNAL] Re: FirestoreIO connector [JavaSDK]

2019-12-02 Thread Chamikara Jayalath
Thanks. Taking a look. It'll probably be helpful if you can write a short doc on design decisions you took including API and client library choice. Also, as for any other sink, we should carefully design this so that duplicate data is not written to the data store when bundles are retried by runner

Beam Dependency Check Report (2019-12-02)

2019-12-02 Thread Apache Jenkins Server
High Priority Dependency Updates Of Beam Python SDK:
Dependency Name | Current Version | Latest Version | Release Date Of the Current Used Version | Release Date Of The Latest Release | JIRA Issue
google-cloud-datastore | 1.7.4 | 1.10.0

RE: [EXTERNAL] Re: FirestoreIO connector [JavaSDK]

2019-12-02 Thread Stefan Djelekar
Hi dev team, I’ve submitted the pull request for BEAM-8376. Getting a review would be nice. All the best, Stefan From: Chamikara Jayalath Sent: Wednesday, November 6, 2019 8:05 PM To: dev Subject: Re: