[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651484627


   (IMHO this PR might still be a good chance to establish proper versioning - 
version 2 can then serve as an interim format with versioning best practices in 
place, and we can get to version 3, with such major features included, without 
versioning headaches.)
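To make the versioning idea above concrete, here is a minimal, hypothetical sketch (class and method names are mine, not Spark's) of writing a version header ahead of the serialized entries so that a reader can dispatch on the version before touching the payload:

```java
import java.io.*;

public class VersionedLogSketch {
    static final int CURRENT_VERSION = 3;

    // Write the version first, then the entry count and entries.
    static byte[] serialize(String[] entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(CURRENT_VERSION);
            out.writeInt(entries.length);
            for (String e : entries) out.writeUTF(e);
        }
        return bytes.toByteArray();
    }

    // Read the version up front and reject formats newer than we understand.
    static String[] deserialize(byte[] data) throws IOException {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            if (version > CURRENT_VERSION) {
                throw new IOException("Unsupported log version: " + version);
            }
            String[] entries = new String[in.readInt()];
            for (int i = 0; i < entries.length; i++) entries[i] = in.readUTF();
            return entries;
        }
    }
}
```

An older reader hitting a newer file fails fast with a clear error instead of misreading the payload, which is the main point of putting the version first.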



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-29 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578


   @zsxwing 
   Thanks a lot for your detailed comment!
   
   I think addressing all of these would require redesigning the metadata log 
as well as the file stream source itself, which wasn't actually a goal here. As 
I commented previously, building a holistic solution was out of scope: I 
already indicated it would take considerable time, and someone might claim it's 
reinventing the wheel (I know these functionalities are extracted from 
alternatives).
   
   That said, shall we do the following?
   
   1. Review and merge the other PRs which don't incur a metadata version 
upgrade, **before proceeding to 2**.
   
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
   
   2. File an issue to address these requirements altogether, with a PMC member 
as shepherd. (It would be amazing if you could volunteer.) Once someone 
volunteers and commits to moving things forward, I'll try to persuade my 
employer to let me take on the work.
   
   If option 2 isn't desirable, I'd at least ask to make the file source/sink 
metadata log class pluggable (I'll craft a PR soon if that's the way to go), so 
that someone can move forward with a restricted use case instead of designing 
for full generality.
   
   WDYT?



[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-18 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-646362234


   Just to confirm that I really did consider the idea @uncleGen proposed (at 
the time), I'll elaborate on why I didn't proceed to an actual implementation.
   
   While the file stream source/sink effectively never deletes entries, both 
definitely need "retention", so deleting entries had to be considered. Once we 
read the previous file and modify it, that defeats the benefit of the idea (no 
read and no rewrite of the previous compact batch).
   
   There's a way to get both benefits: split the compact batch file per 
timestamp (the unit - day, hour, etc. - defines the granularity of retention) 
and delete compact batch files based on the retention. It's a bit complicated 
to implement, though, and the size of each compact batch file would be out of 
our control.
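A rough sketch of that alternative (hypothetical names, not Spark code): each entry is bucketed by a time unit, and a whole bucket file becomes deletable once it falls outside the retention window, so no previous compact batch ever needs to be read or rewritten:

```java
import java.util.concurrent.TimeUnit;

public class RetentionBucketSketch {
    // Bucket granularity, e.g. one compact file per hour; this also fixes
    // the granularity of retention.
    static final long BUCKET_MILLIS = TimeUnit.HOURS.toMillis(1);

    // All entries sharing a bucket id land in the same compact file.
    static long bucketId(long entryTimestampMillis) {
        return entryTimestampMillis / BUCKET_MILLIS;
    }

    // A whole bucket file can be deleted once the bucket's end falls
    // outside the retention window - no read-modify-write needed.
    static boolean expired(long bucketId, long nowMillis, long retentionMillis) {
        long bucketEnd = (bucketId + 1) * BUCKET_MILLIS;
        return nowMillis - bucketEnd > retentionMillis;
    }
}
```

The trade-off mentioned above is visible here: the bucket size is driven entirely by how many entries happen to arrive within the time unit, so it cannot be bounded.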
   
   The idea may still be valid, but making it work nicely would have required 
thoughtful design.
   
   Compared to that idea, this patch is a no-brainer: it inherits the 
shortcomings of the current compact behavior but does the job much faster 
(~10x). Based on their instinct about the output volume per day and a sensible 
retention value, end users can roughly control the size of the compact batch 
file.
   (It still won't work if the volume is gigantic or a very high retention is 
needed, but that's when we'd propose trying out alternatives. Currently it 
simply doesn't work, or leaves end users struggling even with normal 
workloads.)
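As a back-of-envelope illustration of that "rough control" (all numbers below are assumptions for the example, not figures from the PR): with 100,000 output files per day, ~200 bytes per serialized entry, and a 7-day retention, the compact batch holds about 700,000 entries, i.e. on the order of 140 MB uncompressed:

```java
public class CompactSizeEstimate {
    // Estimate the uncompressed size of a compact batch from the daily
    // output volume, the retention, and the average serialized entry size.
    static long estimateBytes(long filesPerDay, long retentionDays,
                              long bytesPerEntry) {
        return filesPerDay * retentionDays * bytesPerEntry;
    }
}
```

Users who know their daily volume can run this arithmetic backwards to pick a retention that keeps the compact batch at a size they can live with.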
   
   Even with this patch and retention, the memory issue may still exist, as 
this patch doesn't reduce memory usage during compaction. The current file 
stream source/sink has to materialize all entries in memory during compaction, 
and maybe during get as well. The problem is more specific to the file stream 
sink, as its entries are much bigger, and even with retention a compact batch 
would hold a bunch of entries. Addressing the issue on get is hard (because 
it's also coupled with callers' logic), but addressing it on compaction would 
be relatively easier, and would help the file stream sink avoid OOME during the 
compact phase. That's the next item to do.
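A hedged sketch of the "easier" compaction direction (hypothetical helper, not the actual Spark classes): stream entries from the previous batches straight into the new compact file one at a time, applying the retention filter on the fly, so the full entry set never sits in memory at once:

```java
import java.io.*;
import java.util.function.Predicate;

public class StreamingCompactSketch {
    // Copy surviving entries from a previous batch directly to the compact
    // output without collecting them into an in-memory list first. The
    // input is assumed to start with an entry count, followed by the
    // entries; returns how many entries survived the retention filter.
    static int compact(DataInputStream in, DataOutputStream out,
                       Predicate<String> retained) throws IOException {
        int n = in.readInt();
        int kept = 0;
        for (int i = 0; i < n; i++) {
            String entry = in.readUTF();   // only one entry in memory here
            if (retained.test(entry)) {
                out.writeUTF(entry);
                kept++;
            }
        }
        return kept;
    }
}
```

Peak memory is then bounded by a single entry rather than the whole compact batch, which is exactly the OOME relief described above for the sink's compact phase.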






[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log

2020-06-14 Thread GitBox


HeartSaVioR edited a comment on pull request #27694:
URL: https://github.com/apache/spark/pull/27694#issuecomment-643878976


   I’m sorry, but version 4 doesn’t leverage UnsafeRow (version 2 did). Please 
read the description carefully.
   
   As I commented earlier, there are still lots of possible improvements to the 
metadata, but I don’t want to pursue them unless we commit to a dedicated 
review effort. This is low-hanging fruit which brings a massive improvement.
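For readers following along, the technique the PR title describes - serializing entries through a compressed binary stream instead of per-line JSON - can be sketched roughly as below. This is a hypothetical illustration, not the PR's code; GZIP is used only because it ships with the JDK, whereas the PR wraps the streams with LZ4:

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedSerdeSketch {
    // Serde of entries via Data(Input/Output)Stream layered over a
    // compressed stream: an entry count followed by length-prefixed strings.
    static byte[] write(String[] entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out =
                 new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(entries.length);
            for (String e : entries) out.writeUTF(e);
        }
        return bytes.toByteArray();
    }

    static String[] read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(data)))) {
            String[] entries = new String[in.readInt()];
            for (int i = 0; i < entries.length; i++) entries[i] = in.readUTF();
            return entries;
        }
    }
}
```

The win comes from avoiding per-entry JSON encode/decode and from compressing the whole stream at once, at the cost of the file no longer being human-readable.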


