[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651484627

(IMHO this might still be a good chance to leverage this PR to establish proper versioning - so that version 2 can serve as an interim with versioning best practices, and we get version 3 with such major features included without versioning headaches.)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-651479578

@zsxwing Thanks a lot for your detailed comment! I think addressing all of these would require me to redesign the metadata log as well as the file stream source itself, which wasn't actually a goal. As I commented previously, building a holistic solution was not a goal, because I already indicated it would take considerable time, and someone might claim it's reinventing the wheel (I know these functionalities are extracted from alternatives). That said, shall we do the following?

1. Review and merge the other PRs which don't incur a metadata version upgrade, **before proceeding to 2**:
   * #28904
   * #28422
   * #28363
   * #27649
   * #27620
2. File an issue to address these requirements altogether, with one of the PMC members acting as shepherd. (It would be amazing if you could volunteer.)

Once someone volunteers and commits to moving things forward, I'll try to persuade my employer to allow me to take on the work. If going with 2 isn't desirable, I'd ask to at least make the file source/sink metadata log class pluggable (I'll craft a PR soon if that's the way to go), so that someone can move forward with a restricted use case instead of designing for full generality. WDYT?
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-646362234

Just to make sure it's clear that I really did consider the idea @uncleGen proposed (at that time), I'll elaborate on why I didn't go through with an actual implementation.

While entries are effectively NEVER deleted on the file stream source/sink today, both definitely need "retention", so deleting entries had to be considered. Once we read the previous file and modify it, that defeats the benefit of the idea (no read and no rewrite of the previous compact batch). There is a way to get both benefits: split the compact batch file per timestamp (the unit would be a day, an hour, etc., and the unit defines the granularity of retention) and delete compact batch files based on the retention. But it's fairly complicated to implement, and the size of each compact batch file would be out of our control. The idea may still be valid, but to make it work nicely it would need to be designed thoughtfully.

Compared to that idea, this patch is a no-brainer: it inherits the shortcomings of the current compact behavior but does the job much faster (~10x). Based on end users' instinct about the volume of outputs per day and a good retention value, they can roughly control the size of the compact batch file. (It still won't work if the volume is gigantic or they need a very long retention, but that's when we propose trying out alternatives. Currently it simply doesn't work, or makes end users struggle, even with a normal workload.)

Even with this patch and retention, a memory issue may still exist, as this patch doesn't help reduce memory usage during compact. The current file stream source/sink has to materialize all entries into memory during compact, and maybe also during get. The problem is more specific to the file stream sink, as its entries are much bigger, and even with retention a compact batch would hold a bunch of entries.

Addressing the issue on get is hard (because it's also coupled with callers' logic), but addressing the issue on compact would be relatively easier, and would help the file stream sink avoid OOME during the compact phase. That's the next item to do.
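The retention idea described above - dropping entries older than a cutoff during compaction instead of carrying every entry forward, and doing so in a streaming fashion to avoid materializing everything in memory - can be sketched roughly as follows. This is an illustrative Python sketch, not Spark's actual Scala implementation; the `(path, modification_time)` entry shape and the function name are assumptions for illustration, and the real metadata log stores richer entries read from batch files.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical entry shape: (file path, modification timestamp in ms).
Entry = Tuple[str, int]

def compact_with_retention(entries: Iterable[Entry],
                           now_ms: int,
                           retention_ms: int) -> Iterator[Entry]:
    """Yield only entries newer than the retention cutoff.

    Generator-based, so the compact phase can stream entries through
    instead of materializing all of them in memory at once - the
    memory concern raised in the comment above.
    """
    cutoff = now_ms - retention_ms
    for path, mod_time in entries:
        if mod_time >= cutoff:
            yield (path, mod_time)

entries = [("/out/a", 1_000), ("/out/b", 5_000), ("/out/c", 9_000)]
kept = list(compact_with_retention(entries, now_ms=10_000, retention_ms=6_000))
# With a cutoff of 4_000, only /out/b and /out/c survive the compact.
```

The streaming shape is the point: a compact that filters while copying never needs more than one entry in memory, whereas the current behavior described above holds the full entry set.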
[GitHub] [spark] HeartSaVioR edited a comment on pull request #27694: [SPARK-30946][SS] Serde entry via DataInputStream/DataOutputStream with LZ4 compression on FileStream(Source/Sink)Log
HeartSaVioR edited a comment on pull request #27694: URL: https://github.com/apache/spark/pull/27694#issuecomment-643878976

I'm sorry, but version 4 doesn't leverage UnsafeRow. (Version 2 did.) Please read the description carefully. As I commented earlier, there are still lots of possible improvements in the metadata, but I don't want to pursue them unless we can promise dedicated review effort. This one is low-hanging fruit which brings a massive improvement.
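The serde approach named in the PR title - writing each metadata entry as length-prefixed binary fields through a DataOutputStream-style writer over a compressed stream, rather than as JSON lines - can be sketched as follows. This is an illustrative Python sketch, not the actual Scala code: it uses `gzip` from the standard library as a stand-in for LZ4 (which needs a third-party binding), and the entry fields (`path`, `size`, `action`) are assumptions, not the real SinkFileStatus schema.

```python
import gzip
import io
import struct

def write_entry(out, path: str, size: int, action: str) -> None:
    """Write one entry as length-prefixed binary fields, mimicking
    Java's DataOutputStream.writeUTF / writeLong."""
    p = path.encode("utf-8")
    out.write(struct.pack(">H", len(p)))   # 2-byte big-endian length, like writeUTF
    out.write(p)
    out.write(struct.pack(">q", size))     # 8-byte signed long, like writeLong
    a = action.encode("utf-8")
    out.write(struct.pack(">H", len(a)))
    out.write(a)

def read_entry(inp):
    """Inverse of write_entry; returns (path, size, action)."""
    (n,) = struct.unpack(">H", inp.read(2))
    path = inp.read(n).decode("utf-8")
    (size,) = struct.unpack(">q", inp.read(8))
    (m,) = struct.unpack(">H", inp.read(2))
    action = inp.read(m).decode("utf-8")
    return path, size, action

# Round-trip one entry through an in-memory compressed stream.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    write_entry(gz, "/out/part-00000", 1024, "add")

buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as gz:
    entry = read_entry(gz)
```

The gain over JSON lines comes from skipping per-entry text encoding/parsing and from block compression over the whole batch file; the sketch only shows the framing, not the claimed ~10x speedup.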