[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-12-17 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-566543290
 
 
   @HeartSaVioR 
   Would you mind elaborating on your answer? IMHO it's not clear which one 
(or both?) you are OK with.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-12-10 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-564409487
 
 
   @uncleGen 
   Hi, do you plan to go ahead with your idea? I have been thinking about this 
issue, and your idea seems to be a realistic solution which doesn't introduce 
too many changes. We may eventually want a solution that deals with most 
cases, but for now your idea alone would already be a great improvement.
   
   Otherwise, would you mind if I pick up your idea if you're not planning to 
work on it?





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-12-06 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-562557622
 
 
   @tdas @zsxwing @jose-torres @gaborgsomogyi Kind reminder.





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-11-27 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558976957
 
 
   > IMHO, the core problem is the compact metadata log grows bigger and 
bigger, and it is a time-consuming work to compact the metadata log, because it 
will read old compact log file and then write to new compact log file.
   
   I agree with you that the problem is that the compact metadata log almost 
always just grows, though the time spent building the metadata log is only one 
of several major issues. The other major issue is that the cost of reading the 
metadata log won't decrease unless we optimize the file format or get rid of 
entries, as this patch proposes.
   
   One thing we have to consider is that when the `compact` phase happens, 
Spark is able to get rid of some existing entries - that's the feature this 
patch leverages. That requires a full read and rewrite of the entries in each 
compact phase, and that's why we can't simply add two compact files together.
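
   To make the cost concrete, here is a minimal sketch of that read-and-rewrite 
cycle. It is a toy in-memory model, not Spark's implementation: 
`is_compaction_batch`, `write_batch`, the dict-based `log`, and the `keep` 
predicate are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List


def is_compaction_batch(batch_id: int, interval: int) -> bool:
    # Assumption for this toy model: every `interval`-th batch is a compact batch.
    return (batch_id + 1) % interval == 0


def write_batch(
    log: Dict[int, List[str]],
    batch_id: int,
    new_entries: List[str],
    interval: int,
    keep: Callable[[str], bool] = lambda entry: True,
) -> int:
    """Store the entries of `batch_id`; return how many old entries were re-read."""
    if not is_compaction_batch(batch_id, interval):
        # A normal batch only writes its own (delta) entries.
        log[batch_id] = list(new_entries)
        return 0
    # A compact batch re-reads the previous compact batch (if any) and every
    # delta batch written after it ...
    previous = range(max(batch_id - interval, 0), batch_id)
    old_entries = [entry for b in previous for entry in log[b]]
    # ... and rewrites everything. In this model the rewrite is the only
    # point where entries are dropped, so the full read cannot be skipped.
    log[batch_id] = [e for e in old_entries + list(new_entries) if keep(e)]
    return len(old_entries)


if __name__ == "__main__":
    log: Dict[int, List[str]] = {}
    for batch_id in range(9):
        reread = write_batch(log, batch_id, [f"part-{batch_id}"], interval=3)
        if is_compaction_batch(batch_id, 3):
            # batch id, entries re-read, entries in the new compact batch
            print(batch_id, reread, len(log[batch_id]))
```

   With the default `keep`, the number of entries re-read grows with every 
compact batch; passing a predicate that rejects old entries is what bounds it.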
   
   It looks like `CompactibleFileStreamLog` was introduced to avoid the "small 
files problem". It seems possible to tweak it a bit so that it maintains a 
"ranged delta" instead, which might be closer to what you proposed. That would 
no longer be a "snapshot", but in most cases the entries are not removed, so it 
also makes sense to me. I expect the logic to be more complicated than the 
current one, but that might be acceptable given how badly the issue has been 
affecting end users.





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-11-25 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-558433816
 
 
   Maybe we can differentiate two major cases:
   
   1) the downstream query that reads the output directory is also Spark (and 
leverages the metadata)
   
   In this case, technically we can never delete any entries in the metadata 
if we want to ensure the downstream query produces the same result across 
multiple runs (unless inputs are added in real time). 
   
   We know that's only the ideal - if the streaming query runs for a long time 
and writes a gigantic number/size of files, we would want to get rid of some of 
them to gain speed and save storage, fully understanding that we are throwing 
out some inputs, which will affect the result of the query.
   
   Assume we decide to get rid of some output files. How do we do it safely? 
The only safe way is to remove them from the metadata first, and then delete 
the actual files. (The downstream query relies on the metadata to get the list 
of files, so if we don't remove them from the metadata first, the downstream 
query will try to read files which no longer exist, and fail - depending on 
the option.) 
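
   The ordering can be sketched as follows. This assumes a toy metadata file 
holding one output path per line, which is not the format Spark's file sink 
actually uses; `read_listing` and `drop_outputs` are hypothetical helpers.

```python
import os
import tempfile
from typing import Iterable, List


def read_listing(metadata_path: str) -> List[str]:
    """Return the output paths currently visible to a downstream reader."""
    with open(metadata_path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


def drop_outputs(metadata_path: str, to_drop: Iterable[str]) -> None:
    doomed = set(to_drop)
    kept = [p for p in read_listing(metadata_path) if p not in doomed]

    # Step 1: publish the shrunken listing. Writing to a temp file and
    # renaming it means readers see either the old or the new listing.
    directory = os.path.dirname(os.path.abspath(metadata_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("".join(p + "\n" for p in kept))
    os.replace(tmp_path, metadata_path)

    # Step 2: only now delete the data files. No listing references them
    # any more, so a reader that starts after step 1 never looks for them.
    for path in doomed:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # already gone, e.g. a previous attempt crashed mid-way
```

   If the process dies between the two steps, the worst case is orphaned 
files on storage; with the reverse order, the worst case is a listing that 
points at files which no longer exist.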
   
   That means the running streaming query should deal with the deletion, as we 
don't have any official offline tool to modify the metadata, and it is hard to 
decide "how" to let the streaming query know which files to delete. That's 
why I simply picked "retention", which is a generally accepted approach 
(Kafka also applies a retention policy by default).
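
   A retention check of this kind could be as simple as the sketch below. The 
`SinkEntry` type, its `modified_ms` field, and `split_by_retention` are 
hypothetical names for illustration, not the option or classes this patch adds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SinkEntry:
    path: str
    modified_ms: int  # assumed: last-modified time in epoch milliseconds


def split_by_retention(
    entries: List[SinkEntry], now_ms: int, retention_ms: int
) -> Tuple[List[SinkEntry], List[SinkEntry]]:
    """Split entries into (kept, expired) using a single age cutoff."""
    cutoff = now_ms - retention_ms
    kept = [e for e in entries if e.modified_ms >= cutoff]
    expired = [e for e in entries if e.modified_ms < cutoff]
    return kept, expired
```

   The query needs no per-file instructions from outside: the only input is 
the configured retention, and the expired list is derived from the entries 
themselves.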
   
   2) we never let Spark read the output directory - we let other frameworks 
read the directory
   
   In this case we don't need to build metadata - though this means end users 
will need to deal with an "at-least-once" guarantee. Given the file sink 
doesn't overwrite files, it may leave corrupted records in partial output as 
well. If that's acceptable, we may be able to add an option to "disable" 
metadata, though some comments raised concerns about doing so: 
https://github.com/apache/spark/pull/24128#issuecomment-474109068
   
   So I guess there aren't many options here, and I guess I picked the viable 
one, but I'd really appreciate more ideas!





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-11-21 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-557398360
 
 
   SPARK-29995 was just filed; it describes the same issue as SPARK-24295.





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-09-24 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-534717837
 
 
   retest this, please





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-09-24 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-534461165
 
 
   retest this, please





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-09-08 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-529248401
 
 
   Ping.





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-08-20 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-523201466
 
 
   @tdas @zsxwing @jose-torres @gaborgsomogyi Kind reminder.





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-04-30 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-488107523
 
 
   Ping again, now that Spark+AI Summit 2019 in SF has ended.





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-04-06 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-480482791
 
 
   Kind reminder.





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-03-24 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-476025148
 
 
   Could I kindly ask for a review of the new approach? It would not be 
intrusive unless end users configure the retention badly.





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-03-19 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-474225477
 
 
   retest this, please





[GitHub] [spark] HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

2019-03-18 Thread GitBox
HeartSaVioR commented on issue #24128: [SPARK-27188][SS] FileStreamSink: 
provide a new option to have retention on output files
URL: https://github.com/apache/spark/pull/24128#issuecomment-474153829
 
 
   Rebased to the new approach: applying retention.

