Re: Proposal and plan: new TextIO features based on SDF

2017-07-20 Thread Eugene Kirpichov
Thanks Ben, the Watch transform (https://github.com/apache/beam/pull/3565, in review) is implemented in a way forward-compatible with your ideas, though I didn't go all the way and implement them - I left a couple of TODOs. In other good news - I have a PR in review for incrementally reading new f

Re: Proposal and plan: new TextIO features based on SDF

2017-07-12 Thread Ben Chambers
Regarding changing the coder -- keep in mind that there may be persisted state somewhere, so we can't just change the coder once this is used. If the processing of scanning for modified and new files reported the last-modified-time, could we use that and have the SDF report KV with the last-modifi

Re: Proposal and plan: new TextIO features based on SDF

2017-07-12 Thread Reuven Lax
Yes, you still need SDF to do the root expansion. However it means that the state storage is now distributed. Garbage collection might be trickier with Distinct. On Tue, Jul 11, 2017 at 10:19 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > Yes, I thought of this, but: > - The disti

Re: Proposal and plan: new TextIO features based on SDF

2017-07-12 Thread Eugene Kirpichov
Yes, I thought of this, but: - The distinct transform needs to apply per input (probably easy) - You still need an SDF to run the set expansion repeatedly - It's not clear when to terminate the repeated expansion in this implementation On Tue, Jul 11, 2017 at 10:14 PM Reuven Lax wrote: > As a th

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Reuven Lax
As a thought experiment: could this be done by expanding the set into a PCollection and running it through a Distinct (in the global window, trigger every element) transform? On Tue, Jul 11, 2017 at 9:48 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > In the current version, the tra

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Eugene Kirpichov
In the current version, the transform is intended to watch a set that is continuously growing; do you mean a GCS bucket that eventually contains more files than can fit in a state tag? I agree that this will eventually become an issue; I can see a couple of solutions: - I suspect many such sets ar

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Reuven Lax
BTW - I am worried about SDF storing everything in a single tag for watch. The problem is that streaming pipeline can run "forever." So someone watching a GCS bucket "forever" will eventually crash due to the value getting too large. Is there any reasonable way to garbage collect this state? On Tu

Re: Proposal and plan: new TextIO features based on SDF

2017-07-11 Thread Eugene Kirpichov
First PR has been submitted - enjoy TextIO.readAll() which reads a PCollection of filenames! I've started working on the SDF-based Watch transform http://s.apache.org/beam-watch-transform, and after that will be able to implement the incremental features in TextIO. On Tue, Jun 27, 2017 at 1:55 PM

Re: Proposal and plan: new TextIO features based on SDF

2017-06-27 Thread Eugene Kirpichov
Thanks all. The first PR is out for review: https://github.com/apache/beam/pull/3443 Next work (watching for new files) is in progress, based on https://github.com/apache/beam/pull/3360 On Tue, Jun 27, 2017 at 11:22 AM Kenneth Knowles wrote: > +1 > > This is a really nice doc and plan. > > On Tu

Re: Proposal and plan: new TextIO features based on SDF

2017-06-27 Thread Kenneth Knowles
+1 This is a really nice doc and plan. On Tue, Jun 27, 2017 at 1:49 AM, Aljoscha Krettek wrote: > +1 > > This sounds very good and there is a clear implementation path! > > > On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré wrote: > > > > Fair enough ;) > > > > Let me review the different Jira

Re: Proposal and plan: new TextIO features based on SDF

2017-06-27 Thread Aljoscha Krettek
+1 This sounds very good and there is a clear implementation path! > On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré wrote: > > Fair enough ;) > > Let me review the different Jira and provide some feedback. > > Regards > JB > > On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov > wrote: >>

Re: Proposal and plan: new TextIO features based on SDF

2017-06-24 Thread Jean-Baptiste Onofré
Fair enough ;) Let me review the different Jira and provide some feedback. Regards JB On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov wrote: >Hi JB, >I haven't yet thought about how this work can be parallelized. For now >I'd >like to just get feedback on the approach :) >But glad that you'

Re: Proposal and plan: new TextIO features based on SDF

2017-06-24 Thread Eugene Kirpichov
Hi JB, I haven't yet thought about how this work can be parallelized. For now I'd like to just get feedback on the approach :) But glad that you're willing to help out - let's discuss this too a bit later! On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré wrote: > Thanks Eugene > > I will pi

Re: Proposal and plan: new TextIO features based on SDF

2017-06-24 Thread Jean-Baptiste Onofré
Thanks Eugene I will pick up some. Regards JB On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov wrote: >Filed JIRAs for the proposed features and linked with the doc: >https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support >reading a PCollection of filenames >https://issues.apa

Re: Proposal and plan: new TextIO features based on SDF

2017-06-24 Thread Eugene Kirpichov
Filed JIRAs for the proposed features and linked with the doc: https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support reading a PCollection of filenames https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support watching for new files https://issues.apache.org/jira/browse/

Proposal and plan: new TextIO features based on SDF

2017-06-23 Thread Eugene Kirpichov
Hi all, I've written up a proposal for incrementally delivering a bunch of useful new features in TextIO based on Splittable DoFn. It's applicable to other file-based connectors, TextIO is just one good example. Let me know what you think! https://s.apache.org/textio-sdf Copy of abstract: Users