Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-09-27 Thread Jungtaek Lim
Bump to see whether anyone is interested in or concerned about this.
On Tue, Aug 25, 2020 at 4:56 PM Jungtaek Lim wrote:
> Bump this again.
>
> On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> Bump again.
>>
>> Unlike file stream sink which has lots of limitations an

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-25 Thread Jungtaek Lim
Bump this again.
On Tue, Aug 18, 2020 at 12:11 PM Jungtaek Lim wrote:
> Bump again.
>
> Unlike file stream sink which has lots of limitations and many of us have been suggesting alternatives, file stream source is the only way if end users want to read the data from files. No alternative unl

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-08-17 Thread Jungtaek Lim
Bump again. Unlike the file stream sink, which has lots of limitations and for which many of us have been suggesting alternatives, the file stream source is the only way for end users to read data from files. There is no alternative unless they introduce another ETL & storage (probably Kafka). On Fri, Jul 31, 2020 a

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread Jungtaek Lim
Hi German, option 1 isn't about "deleting" the old files, as your input directory may be accessed by multiple queries. Kafka centralizes the maintenance of input data, so it can apply retention without problems. Option 1 is more about "hiding" the old files from being read, so that end users "may

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread German Schiavon
Hi Jungtaek, I have a question: aren't both approaches compatible? As I see it, it would be interesting to have a retention period to delete old files and/or the possibility of indicating an offset (timestamp). It would be very "similar" to how we do it with Kafka. WDYT? On Thu, 30 Jul

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread Jungtaek Lim
(I'd like to keep the discussion thread focused on the specific topic - let's start separate discussion threads for other topics.) Thanks for the input. I'd like to emphasize that the point under discussion is the "latestFirst" option - the rationale starts from the growing metadata log issue

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread vikram agrawal
If we compare the file stream source with other streaming sources such as Kafka, the current behavior is indeed incomplete. Starting the stream from a custom offset or a particular point in time is something that is missing. Typically, file stream sources don't have auto-deletion of older data/files.
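
To make the "start from a particular point in time" idea concrete, here is a minimal sketch in plain Python. It is not Spark's implementation and does not use any Spark API; `FileEntry`, `select_files`, `start_ts_ms`, and `latest_first` are hypothetical names chosen only for illustration. It assumes each listed file carries a modification timestamp, hides files older than the requested start point, and orders the rest either oldest first or latest first.

```python
from typing import List, NamedTuple


class FileEntry(NamedTuple):
    """Hypothetical record for one file found in the input directory."""
    path: str
    mtime_ms: int  # modification time in epoch milliseconds


def select_files(files: List[FileEntry], start_ts_ms: int,
                 latest_first: bool = False) -> List[FileEntry]:
    """Keep files modified at or after start_ts_ms, ordered by modification time.

    Older files are only hidden from the result; nothing is deleted.
    """
    visible = [f for f in files if f.mtime_ms >= start_ts_ms]
    return sorted(visible, key=lambda f: f.mtime_ms, reverse=latest_first)


listing = [
    FileEntry("a.json", 1000),
    FileEntry("b.json", 2000),
    FileEntry("c.json", 3000),
]

# Oldest-first processing starting from timestamp 2000.
print([f.path for f in select_files(listing, 2000)])
# Latest-first processing over the same visible set.
print([f.path for f in select_files(listing, 2000, latest_first=True)])
```

Because old files are filtered rather than removed, other queries reading the same directory are unaffected, which is the distinction from Kafka-style retention.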

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-29 Thread Jungtaek Lim
Bump: is there any interest in this topic?
On Mon, Jul 20, 2020 at 6:21 AM Jungtaek Lim wrote:
> (Just to add rationalization, you can refer the original mail thread on dev@ list to see efforts on addressing problems in file stream source / sink - https://lists.apache.org/thread.html/r1cd5

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-19 Thread Jungtaek Lim
(Just to add the rationale: you can refer to the original mail thread on the dev@ list to see the efforts on addressing problems in the file stream source / sink - https://lists.apache.org/thread.html/r1cd548be1cbae91c67e5254adc0404a99a23930f8a6fde810b987285%40%3Cdev.spark.apache.org%3E ) On Mon, Jul 20, 2020

[DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-19 Thread Jungtaek Lim
Hi devs, as I have been going through the various issues on metadata log growth, I found that it's an issue not only for the sink but also for the source. Unlike the sink metadata log, whose entries should be available to readers, the source metadata log is only for the streaming query starting from the chec