[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Alex Behm has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 5: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc File be/src/exec/hdfs-text-scanner.cc: Line 297: boundary_column_.Clear(); > I tried it and it doesn't quite work (without clearing the boundary field) Ahh you are right, I missed that one memcpy(). Thanks for investigating, I agree the current solution is least confusing. -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc File be/src/exec/hdfs-text-scanner.cc: Line 297: boundary_column_.Clear(); > Can we use the existing CopyBoundaryField() function for consistency? Other I tried it and it doesn't quite work (without clearing the boundary field) This is a slightly different case. CopyBoundaryField() copies the contents of the field followed by the boundary column to memory allocated from the row batch. In this case, after FillColumns(), field_locations is pointing at the data in the boundary column. We can trick CopyBoundaryField() by clearing the boundary column right before CopyBoundaryField(), but it's such a weird thing to do. (It looks weird because we just cleared the boundary column, why are we trying to copy out of it?) Another thing we can do, is check that the field is pointing to a different memory location than the boundary column, but I also think that's weird. I suggest we leave it as is (for me, the way it is now is the least confusing way). What do you think? -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Alex Behm has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc File be/src/exec/hdfs-text-scanner.cc: Line 297: boundary_column_.Clear(); Can we use the existing CopyBoundaryField() function for consistency? Otherwise, a future reader might be confused why we are not using CopyBoundaryField() here. I believe we could call that after delimited_text_parser_->FillColumns(), instead of coping before calling FillColumns(). -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Tim Armstrong has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 5: Code-Review+1 Thanks for your patience, this is a lot easier to understand. Will let Alex +2 once he's had a look. -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/7639/5//COMMIT_MSG Commit Message: PS5, Line 7: MPALA Oops. -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has uploaded a new patch set (#5). Change subject: MPALA-5776: Write partial tuple to the correct mempool .. MPALA-5776: Write partial tuple to the correct mempool In the text scanner, we were writing the partial tuple variable length data to data_buffer_pool_ mempool which caused strange behavior, such as incorrect results. If we are scanning compressed data, the pool gets attached to the row batch at the end of a GetNext() call and gets freed before the next GetNext() call. This is wrong because we expect the data in the partial tuple to survive between the GetNext() calls. If we are scanning non compressed data, data_buffer_pool_ never gets cleared and grows over time until the scanner finishes reading the scan range. We fix the problem by writing the varlen partial tuple data to boundary_pool_, which is where the constant length partial tuple data is written. We also make sure that boundary pool does not hold any tuple data of returned batches by always deep copying it to output batches. Testing: - Ran some tests locally on ASAN build. - Updated test_scanners_fuzz.py to make slightly more significant changes to the data files. This change was helpful for finding issues while developing this patch. Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 --- M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-scanner.h M tests/query_test/test_scanners_fuzz.py 3 files changed, 65 insertions(+), 51 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/5 -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Hello Alex Behm, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7639 to look at the new patch set (#5). Change subject: MPALA-5776: Write partial tuple to the correct mempool .. MPALA-5776: Write partial tuple to the correct mempool In the text scanner, we were writing the partial tuple variable length data to data_buffer_pool_ mempool which caused strange behavior, such as incorrect results. If we are scanning compressed data, the pool gets attached to the row batch at the end of a GetNext() call and gets freed before the next GetNext() call. This is wrong because we expect the data in the partial tuple to survive between the GetNext() calls. If we are scanning non compressed data, data_buffer_pool_ never gets cleared and grows over time until the scanner finishes reading the scan range. We fix the problem by writing the varlen partial tuple data to boundary_pool_, which is where the constant length partial tuple data is written. We also make sure that boundary pool does not hold any tuple data of returned batches by always deep copying it to output batches. Testing: - Ran some tests locally on ASAN build. - Updated test_scanners_fuzz.py to make slightly more significant changes to the data files. This change was helpful for finding issues while developing this patch. Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 --- M be/src/exec/hdfs-text-scanner.cc M be/src/exec/hdfs-text-scanner.h M tests/query_test/test_scanners_fuzz.py 3 files changed, 65 insertions(+), 51 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/5 -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc File be/src/exec/hdfs-text-scanner.cc: Line 169: row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), false); > I don't think that data from the boundary pool is ever used outside of this Actually, after running some tests with ASAN, it turns out that it's not ok to clear the boundary pool. The boundary col contains some var len data that is used after Close() is called. Also, what do you mean we should be transferring the contents of boundary pool below instead of clearing it? Do you mean in the else clause? If row batch is null, where would would we transfer the contents to? -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 2: (12 comments) http://gerrit.cloudera.org:8080/#/c/7639/2//COMMIT_MSG Commit Message: Line 7: MPALA-5776: Write partial tuple to the correct mempool > Missing I. Done http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc File be/src/exec/hdfs-text-scanner.cc: Line 169: row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), false); > Does this even make sense? are there any cases where it's valid for the row I don't think that data from the boundary pool is ever used outside of this class, so it makes sense to clear it. I analyzed the code, and it doesn't look to me like data in boundary pool is ever used, but how can we be 100% sure? Line 269: bool eosr = true; > Remove. This not needed/used. It's used on line 277. Are you suggesting to remove it from the signature of FillByteBuffer as well? Line 299: boundary_column_.Clear(); > What's the point of clearing the boundary column here? I don't think the sc Removed this. PS2, Line 765: batches > I'm not sure that I understand the reference to batches. Do they mean block Yeah, "batches" doesn't make sense here. Changed it to blocks. Line 806: partial_tuple_->DeepCopy(tuple_, *scan_node_->tuple_desc(), pool); > Can you add a column clarifying what the intent is. E.g. "Copy over all mat Done Line 808: boundary_row_.Reset(); > Another idea: Add a separate function for copying the partial tuple into th I added a new mem pool, so boundary row and columns are not affected any more. Alex, do you guys think it's still worth creating a separate function? PS2, Line 810: emptied > cleared? The comment makes me thing that all the memory has been freed, but Done Line 814: partial_tuple_ = Tuple::Create(tuple_byte_size_, boundary_pool_.get()); > Is it possible to just initialize partial_tuple_ when it's needed? I.e. bef Done Line 896: int num_fields, bool copy_strings) { > Can you remove the 'copy_strings' parameter? It's not used in this function Done http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.h File be/src/exec/hdfs-text-scanner.h: Line 188: /// the boundary pool. > Hah! If only the code had matched the comment... totally agree Line 194: /// Mem pool for boundary_row_ and boundary_column_. > Needs updating since it also backs partial_tuple_. I made changes in the next patch so that this comment is now true. -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Taras Bobrovytsky Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Alex Behm has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc File be/src/exec/hdfs-text-scanner.cc: Line 269: bool eosr = true; Remove. This not needed/used. Line 299: boundary_column_.Clear(); > This is subtle because the contents of boundary_column_ are needed until Wr What's the point of clearing the boundary column here? I don't think the scanner is going to touch it after this point. Line 808: boundary_row_.Reset(); > It might be clearer if we factored out this reset logic for boundary_ and p Another idea: Add a separate function for copying the partial tuple into the output batch. All this logic can go inside there and we can document/DCHECK pre- and post-conditions. Line 896: int num_fields, bool copy_strings) { Can you remove the 'copy_strings' parameter? It's not used in this function and not copying seems too complicated to reason about. -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Tim Armstrong has posted comments on this change. Change subject: MPALA-5776: Write partial tuple to the correct mempool .. Patch Set 2: (10 comments) This is very subtle. I think the solution works but maybe there are some ways we can make it clearer why it works. http://gerrit.cloudera.org:8080/#/c/7639/2//COMMIT_MSG Commit Message: Line 7: MPALA-5776: Write partial tuple to the correct mempool Missing I. http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc File be/src/exec/hdfs-text-scanner.cc: Line 169: row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), false); Does this even make sense? are there any cases where it's valid for the row batch to reference data in the boundary pool directly? It seems like either this is unnecessary or we should be transferring the contents of boundary_pool_ below instead of clearing it. Line 299: boundary_column_.Clear(); This is subtle because the contents of boundary_column_ are needed until WriteFields() below, I think. Maybe it would be clearer to document this or even move it below to the point where the data is no longer needed. PS2, Line 765: batches I'm not sure that I understand the reference to batches. Do they mean blocks? Line 806: partial_tuple_->DeepCopy(tuple_, *scan_node_->tuple_desc(), pool); Can you add a column clarifying what the intent is. E.g. "Copy over all materialized slots in the partial tuple". Line 808: boundary_row_.Reset(); It might be clearer if we factored out this reset logic for boundary_ and partial_tuple_ into a separate function and documented the pre-conditions for it. I.e. the row has been fully materialized and all memory has been copied into the RowBatch pool (I think that's the precondition but not sure). PS2, Line 810: emptied cleared? The comment makes me thing that all the memory has been freed, but Clear() just enables recycling of the chunks. Line 814: partial_tuple_ = Tuple::Create(tuple_byte_size_, boundary_pool_.get()); Is it possible to just initialize partial_tuple_ when it's needed? I.e. before we call InitTuple(template_tuple_, partial_tuple_). Then maybe we can get rid of partial_tuple_empty_ and represent that as partition_tuple_ == nullptr. http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.h File be/src/exec/hdfs-text-scanner.h: Line 188: /// the boundary pool. Hah! If only the code had matched the comment... Line 194: /// Mem pool for boundary_row_ and boundary_column_. Needs updating since it also backs partial_tuple_. -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has uploaded a new patch set (#2). Change subject: MPALA-5776: Write partial tuple to the correct mempool .. MPALA-5776: Write partial tuple to the correct mempool In the text scanner, we were writing the partial tuple variable length data to data_buffer_pool_ mempool which caused strange behavior, such as incorrect results. If we are scanning compressed data, the pool gets attached to the row batch at the end of a GetNext() call and gets freed before the next GetNext() call. This is wrong because we expect the data in the partial tuple to survive between the GetNext() calls. If we are scanning non compressed data, data_buffer_pool_ never gets cleared and grows over time until the scanner finishes reading the scan range. We fix the problem by writing the varlen partial tuple data to boundary_pool_, which is where the constant length partial tuple data is written. Testing: - Ran some tests locally on ASAN build. - No new tests were added, because it is difficult to construct test cases due to the issue being non-deterministic. Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 --- M be/src/exec/hdfs-text-scanner.cc 1 file changed, 13 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2 -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has uploaded a new patch set (#2). Change subject: MPALA-5776: Write partial tuple to the correct mempool .. MPALA-5776: Write partial tuple to the correct mempool In the text scanner, we were writing the partial tuple variable length data to data_buffer_pool_ mempool which caused strange behavior, such as incorrect results. If we are scanning compressed data, the pool gets attached to the row batch at the end of a GetNext() call and gets freed before the next GetNext() call. This is wrong because we expect the data in the partial tuple to survive between the GetNext() calls. If we are scanning non compressed data, data_buffer_pool_ never gets cleared and grows over time until the scanner finishes reading the scan range. We fix the problem by writing the varlen partial tuple data to boundary_pool_, which is where the constant length partial tuple data is written. Testing: - Ran some tests locally on ASAN build. - No new tests were added, because it is difficult to construct test cases due to the issue being non-deterministic. Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 --- M be/src/exec/hdfs-text-scanner.cc 1 file changed, 13 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2 -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has uploaded a new patch set (#2). Change subject: MPALA-5776: Write partial tuple to the correct mempool .. MPALA-5776: Write partial tuple to the correct mempool In the text scanner, we were writing the partial tuple variable length data to data_buffer_pool_ mempool which caused strange behavior, such as incorrect results. If we are scanning compressed data, the pool gets attached to the row batch at the end of a GetNext() call and gets freed before the next GetNext() call. This is wrong because we expect the data in the partial tuple to survive between the GetNext() calls. If we are scanning non compressed data, data_buffer_pool_ never gets cleared and grows over time until the scanner finishes reading the scan range. We fix the problem by writing the varlen partial tuple data to boundary_pool_, which is where the constant length partial tuple data is written. Testing: - Ran some tests locally on ASAN build. - No new tests were added, because it is difficult to construct test cases due to the issue being non-deterministic. Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 --- M be/src/exec/hdfs-text-scanner.cc 1 file changed, 13 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2 -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has uploaded a new patch set (#2). Change subject: MPALA-5776: Write partial tuple to the correct mempool .. MPALA-5776: Write partial tuple to the correct mempool In the text scanner, we were writing the partial tuple variable length data to data_buffer_pool_ mempool which caused strange behavior, such as incorrect results. If we are scanning compressed data, the pool gets attached to the row batch at the end of a GetNext() call and gets freed before the next GetNext() call. This is wrong because we expect the data in the partial tuple to survive between the GetNext() calls. If we are scanning non compressed data, data_buffer_pool_ never gets cleared and grows over time until the scanner finishes reading the scan range. We fix the problem by writing the varlen partial tuple data to boundary_pool_, which is where the constant length partial tuple data is written. Testing: - Ran some tests locally on ASAN build. - No new tests were added, because it is difficult to construct test cases due to the issue being non-deterministic. Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 --- M be/src/exec/hdfs-text-scanner.cc 1 file changed, 14 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2 -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras BobrovytskyGerrit-Reviewer: Alex Behm Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool
Taras Bobrovytsky has uploaded a new change for review. http://gerrit.cloudera.org:8080/7639 Change subject: MPALA-5776: Write partial tuple to the correct mempool .. MPALA-5776: Write partial tuple to the correct mempool In the text scanner, we were writing the partial tuple variable length data to data_buffer_pool_ mempool which caused strange behavior, such as incorrect results. If we are scanning compressed data, the pool gets attached to the row batch at the end of a GetNext() call and gets freed before the next GetNext() call. This is wrong because we expect the data in the partial tuple to survive between the GetNext() calls. If we are scanning non compressed data, data_buffer_pool_ never gets cleared and grows over time until the scanner finishes reading the scan range. We fix the problem by writing the varlen partial tuple data to boundary_pool_, which is where the constant length partial tuple data is written. Testing: - Ran some tests locally on ASAN build. - No new tests were added, because it is difficult to construct test cases due to the issue being non-deterministic. Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 --- M be/src/exec/hdfs-text-scanner.cc 1 file changed, 11 insertions(+), 3 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/1 -- To view, visit http://gerrit.cloudera.org:8080/7639 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky