Thanks Ben, the Watch transform (https://github.com/apache/beam/pull/3565,
in review) is implemented in a way forward-compatible with your ideas,
though I didn't go all the way and implement them - I left a couple of
TODOs.
In other good news - I have a PR in review for incrementally reading new
f
Regarding changing the coder -- keep in mind that there may be persisted
state somewhere, so we can't just change the coder once this is used.
If the processing of scanning for modified and new files reported the
last-modified-time, could we use that and have the SDF report KV with the last-modifi
Yes, you still need SDF to do the root expansion. However it means that the
state storage is now distributed.
Garbage collection might be trickier with Distinct.
On Tue, Jul 11, 2017 at 10:19 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:
> Yes, I thought of this, but:
> - The disti
Yes, I thought of this, but:
- The distinct transform needs to apply per input (probably easy)
- You still need an SDF to run the set expansion repeatedly
- It's not clear when to terminate the repeated expansion in this
implementation
On Tue, Jul 11, 2017 at 10:14 PM Reuven Lax
wrote:
> As a th
As a thought experiment: could this be done by expanding the set into a
PCollection and running it through a Distinct (in the global window,
trigger every element) transform?
On Tue, Jul 11, 2017 at 9:48 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:
> In the current version, the tra
In the current version, the transform is intended to watch a set that is
continuously growing; do you mean a GCS bucket that eventually contains
more files than can fit in a state tag?
I agree that this will eventually become an issue; I can see a couple of
solutions:
- I suspect many such sets ar
BTW - I am worried about SDF storing everything in a single tag for watch.
The problem is that streaming pipeline can run "forever." So someone
watching a GCS bucket "forever" will eventually crash due to the value
getting too large. Is there any reasonable way to garbage collect this
state?
On Tu
First PR has been submitted - enjoy TextIO.readAll() which reads a
PCollection of filenames!
I've started working on the SDF-based Watch transform
http://s.apache.org/beam-watch-transform, and after that will be able to
implement the incremental features in TextIO.
On Tue, Jun 27, 2017 at 1:55 PM
Thanks all. The first PR is out for review:
https://github.com/apache/beam/pull/3443
Next work (watching for new files) is in progress, based on
https://github.com/apache/beam/pull/3360
On Tue, Jun 27, 2017 at 11:22 AM Kenneth Knowles
wrote:
> +1
>
> This is a really nice doc and plan.
>
> On Tu
+1
This is a really nice doc and plan.
On Tue, Jun 27, 2017 at 1:49 AM, Aljoscha Krettek
wrote:
> +1
>
> This sounds very good and there is a clear implementation path!
>
> > On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré wrote:
> >
> > Fair enough ;)
> >
> > Let me review the different Jira
+1
This sounds very good and there is a clear implementation path!
> On 24. Jun 2017, at 20:55, Jean-Baptiste Onofré wrote:
>
> Fair enough ;)
>
> Let me review the different Jira and provide some feedback.
>
> Regards
> JB
>
> On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov
> wrote:
>>
Fair enough ;)
Let me review the different Jira and provide some feedback.
Regards
JB
On Jun 24, 2017, 20:54, at 20:54, Eugene Kirpichov
wrote:
>Hi JB,
>I haven't yet thought about how this work can be parallelized. For now
>I'd
>like to just get feedback on the approach :)
>But glad that you'
Hi JB,
I haven't yet thought about how this work can be parallelized. For now I'd
like to just get feedback on the approach :)
But glad that you're willing to help out - let's discuss this too a bit
later!
On Sat, Jun 24, 2017 at 11:51 AM Jean-Baptiste Onofré
wrote:
> Thanks Eugene
>
> I will pi
Thanks Eugene
I will pick up some.
Regards
JB
On Jun 24, 2017, 20:00, at 20:00, Eugene Kirpichov
wrote:
>Filed JIRAs for the proposed features and linked with the doc:
>https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support
>reading a PCollection of filenames
>https://issues.apa
Filed JIRAs for the proposed features and linked with the doc:
https://issues.apache.org/jira/browse/BEAM-2511 TextIO should support
reading a PCollection of filenames
https://issues.apache.org/jira/browse/BEAM-2512 TextIO should support
watching for new files
https://issues.apache.org/jira/browse/
Hi all,
I've written up a proposal for incrementally delivering a bunch of useful
new features in TextIO based on Splittable DoFn. It's applicable to other
file-based connectors, TextIO is just one good example. Let me know what
you think!
https://s.apache.org/textio-sdf
Copy of abstract:
Users
16 matches
Mail list logo