[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-17 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 5: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 297: boundary_column_.Clear();
> I tried it and it doesn't quite work (without clearing the boundary field)
Ahh you are right, I missed that one memcpy(). Thanks for investigating, I 
agree the current solution is least confusing.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-17 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 297: boundary_column_.Clear();
> Can we use the existing CopyBoundaryField() function for consistency? Other
I tried it and it doesn't quite work (without clearing the boundary field)

This is a slightly different case. CopyBoundaryField() copies the contents of 
the field followed by the boundary column to memory allocated from the row 
batch.

In this case, after FillColumns(), field_locations is pointing at the data in 
the boundary column. We can trick CopyBoundaryField() by clearing the boundary 
column right before CopyBoundaryField(), but it's such a weird thing to do. (It 
looks weird because we just cleared the boundary column, why are we trying to 
copy out of it?)

Another thing we can do, is check that the field is pointing to a different 
memory location than the boundary column, but I also think that's weird.

I suggest we leave it as is (for me, the way it is now is the least confusing 
way). What do you think?


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-17 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 297: boundary_column_.Clear();
Can we use the existing CopyBoundaryField() function for consistency? 
Otherwise, a future reader might be confused why we are not using 
CopyBoundaryField() here.

I believe we could call that after delimited_text_parser_->FillColumns(), 
instead of coping before calling FillColumns().


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-17 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 5: Code-Review+1

Thanks for your patience, this is a lot easier to understand. Will let Alex +2 
once he's had a look.

-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-17 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/5//COMMIT_MSG
Commit Message:

PS5, Line 7: MPALA
Oops.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-16 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new patch set (#5).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.

Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files. This change was helpful for finding
  issues while developing this patch.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 65 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-16 Thread Taras Bobrovytsky (Code Review)
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7639

to look at the new patch set (#5).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.

Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files. This change was helpful for finding
  issues while developing this patch.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 65 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-11 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 169: row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), 
false);
> I don't think that data from the boundary pool is ever used outside of this
Actually, after running some tests with ASAN, it turns out that it's not ok to 
clear the boundary pool. The boundary col contains some var len data that is 
used after Close() is called.

Also, what do you mean we should be transferring the contents of boundary pool 
below instead of clearing it? Do you mean in the else clause? If row batch is 
null, where would would we transfer the contents to?


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-11 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 2:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7639/2//COMMIT_MSG
Commit Message:

Line 7: MPALA-5776: Write partial tuple to the correct mempool
> Missing I.
Done


http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 169: row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), 
false);
> Does this even make sense? are there any cases where it's valid for the row
I don't think that data from the boundary pool is ever used outside of this 
class, so it makes sense to clear it.

I analyzed the code, and it doesn't look to me like data in boundary pool is 
ever used, but how can we be 100% sure?


Line 269: bool eosr = true;
> Remove. This not needed/used.
It's used on line 277. Are you suggesting to remove it from the signature of 
FillByteBuffer as well?


Line 299: boundary_column_.Clear();
> What's the point of clearing the boundary column here? I don't think the sc
Removed this.


PS2, Line 765: batches
> I'm not sure that I understand the reference to batches. Do they mean block
Yeah, "batches" doesn't make sense here. Changed it to blocks.


Line 806:   partial_tuple_->DeepCopy(tuple_, *scan_node_->tuple_desc(), 
pool);
> Can you add a column clarifying what the intent is. E.g. "Copy over all mat
Done


Line 808:   boundary_row_.Reset();
> Another idea: Add a separate function for copying the partial tuple into th
I added a new mem pool, so boundary row and columns are not affected any more. 
Alex, do you guys think it's still worth creating a separate function?


PS2, Line 810: emptied
> cleared? The comment makes me thing that all the memory has been freed, but
Done


Line 814:   partial_tuple_ = Tuple::Create(tuple_byte_size_, 
boundary_pool_.get());
> Is it possible to just initialize partial_tuple_ when it's needed? I.e. bef
Done


Line 896: int num_fields, bool copy_strings) {
> Can you remove the 'copy_strings' parameter? It's not used in this function
Done


http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 188:   /// the boundary pool.
> Hah! If only the code had matched the comment...
totally agree


Line 194:   /// Mem pool for boundary_row_ and boundary_column_.
> Needs updating since it also backs partial_tuple_.
I made changes in the next patch so that this comment is now true.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Taras Bobrovytsky 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-11 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 269: bool eosr = true;
Remove. This not needed/used.


Line 299: boundary_column_.Clear();
> This is subtle because the contents of boundary_column_ are needed until Wr
What's the point of clearing the boundary column here? I don't think the 
scanner is going to touch it after this point.


Line 808:   boundary_row_.Reset();
> It might be clearer if we factored out this reset logic for boundary_ and p
Another idea: Add a separate function for copying the partial tuple into the 
output batch. All this logic can go inside there and we can document/DCHECK 
pre- and post-conditions.


Line 896: int num_fields, bool copy_strings) {
Can you remove the 'copy_strings' parameter? It's not used in this function and 
not copying seems too complicated to reason about.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..


Patch Set 2:

(10 comments)

This is very subtle. I think the solution works but maybe there are some ways 
we can make it clearer why it works.

http://gerrit.cloudera.org:8080/#/c/7639/2//COMMIT_MSG
Commit Message:

Line 7: MPALA-5776: Write partial tuple to the correct mempool
Missing I.


http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 169: row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), 
false);
Does this even make sense? are there any cases where it's valid for the row 
batch to reference data in the boundary pool directly?

It seems like either this is unnecessary or we should be transferring the 
contents of boundary_pool_ below instead of clearing it.


Line 299: boundary_column_.Clear();
This is subtle because the contents of boundary_column_ are needed until 
WriteFields() below, I think. Maybe it would be clearer to document this or 
even move it below to the point where the data is no longer needed.


PS2, Line 765: batches
I'm not sure that I understand the reference to batches. Do they mean blocks?


Line 806:   partial_tuple_->DeepCopy(tuple_, *scan_node_->tuple_desc(), 
pool);
Can you add a column clarifying what the intent is. E.g. "Copy over all 
materialized slots in the partial tuple".


Line 808:   boundary_row_.Reset();
It might be clearer if we factored out this reset logic for boundary_ and 
partial_tuple_ into a separate function and documented the pre-conditions for 
it. I.e. the row has been fully materialized and all memory has been copied 
into the RowBatch pool (I think that's the precondition but not sure).


PS2, Line 810: emptied
cleared? The comment makes me thing that all the memory has been freed, but 
Clear() just enables recycling of the chunks.


Line 814:   partial_tuple_ = Tuple::Create(tuple_byte_size_, 
boundary_pool_.get());
Is it possible to just initialize partial_tuple_ when it's needed? I.e. before 
we call InitTuple(template_tuple_, partial_tuple_). Then maybe we can get rid 
of partial_tuple_empty_ and represent that as partition_tuple_ == nullptr.


http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 188:   /// the boundary pool.
Hah! If only the code had matched the comment...


Line 194:   /// Mem pool for boundary_row_ and boundary_column_.
Needs updating since it also backs partial_tuple_.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-09 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 13 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-09 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 13 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-09 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 13 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-09 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 14 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

2017-08-09 Thread Taras Bobrovytsky (Code Review)
Taras Bobrovytsky has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7639

Change subject: MPALA-5776: Write partial tuple to the correct mempool
..

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 11 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky