[GitHub] drill pull request #1057: DRILL-5993 Append Row Method For VectorContainer (...
Github user ilooner commented on a diff in the pull request: https://github.com/apache/drill/pull/1057#discussion_r153958856 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/record/TestVectorContainer.java --- @@ -124,4 +132,52 @@ public void testContainerMerge() { leftIndirect.clear(); right.clear(); } + + @Test + public void testAppendRow() + { +MaterializedField colA = MaterializedField.create("colA", Types.required(TypeProtos.MinorType.INT)); +MaterializedField colB = MaterializedField.create("colB", Types.required(TypeProtos.MinorType.INT)); --- End diff -- Added more interesting types. Currently RowSet classes don't support the Map data type. Paul asked me to look into adding support for this a while ago for DRILL-5870 . I'll update the test framework to support that in the next PR. ---
[GitHub] drill pull request #1057: DRILL-5993 Append Row Method For VectorContainer (...
Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1057#discussion_r153934729 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java --- @@ -353,6 +353,23 @@ public int getRecordCount() { public boolean hasRecordCount() { return recordCount != -1; } + /** + * This works with non-hyper {@link VectorContainer}s which have no selection vectors. + * Appends a row taken from a source {@link VectorContainer} to this {@link VectorContainer}. + * @param srcContainer The {@link VectorContainer} to copy a row from. + * @param srcIndex The index of the row to copy from the source {@link VectorContainer}. + */ + public void appendRow(VectorContainer srcContainer, int srcIndex) { +for (int vectorIndex = 0; vectorIndex < wrappers.size(); vectorIndex++) { + ValueVector destVector = wrappers.get(vectorIndex).getValueVector(); + ValueVector srcVector = srcContainer.wrappers.get(vectorIndex).getValueVector(); + + destVector.copyEntry(recordCount, srcVector, srcIndex); +} + +recordCount++; --- End diff -- The immediate need for appendRow() is to distribute rows from a single incoming batch into multiple other batches (for the Hash Join internal partitioning), based on the hash value of the key columns at each row. This would not work well with the second suggestion (vectorizing - column 1, column 2, etc.) ---
[GitHub] drill pull request #1057: DRILL-5993 Append Row Method For VectorContainer (...
Github user Ben-Zvi commented on a diff in the pull request: https://github.com/apache/drill/pull/1057#discussion_r153935071 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/record/TestVectorContainer.java --- @@ -124,4 +132,52 @@ public void testContainerMerge() { leftIndirect.clear(); right.clear(); } + + @Test + public void testAppendRow() + { +MaterializedField colA = MaterializedField.create("colA", Types.required(TypeProtos.MinorType.INT)); +MaterializedField colB = MaterializedField.create("colB", Types.required(TypeProtos.MinorType.INT)); --- End diff -- Maybe add some "interesting" datatypes ? Testing integers only may miss some issue. ---
[GitHub] drill pull request #1057: DRILL-5993 Append Row Method For VectorContainer (...
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/1057#discussion_r153926390 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java --- @@ -353,6 +353,23 @@ public int getRecordCount() { public boolean hasRecordCount() { return recordCount != -1; } + /** + * This works with non-hyper {@link VectorContainer}s which have no selection vectors. + * Appends a row taken from a source {@link VectorContainer} to this {@link VectorContainer}. + * @param srcContainer The {@link VectorContainer} to copy a row from. + * @param srcIndex The index of the row to copy from the source {@link VectorContainer}. + */ + public void appendRow(VectorContainer srcContainer, int srcIndex) { +for (int vectorIndex = 0; vectorIndex < wrappers.size(); vectorIndex++) { + ValueVector destVector = wrappers.get(vectorIndex).getValueVector(); + ValueVector srcVector = srcContainer.wrappers.get(vectorIndex).getValueVector(); + + destVector.copyEntry(recordCount, srcVector, srcIndex); +} + +recordCount++; --- End diff -- This is OK for a row-by-row copy. But, you'll get better performance if you optimize for the entire batch. Because you have no SV4, the source and dest batches are the same so the vectors can be preloaded into an array of vectors to avoid the vector wrapper lookup per column. Plus, if the code is written per batch, you can go a step further: vectorize the operation. Copy all values for column 1, then all for column 2, and so on. (In this case, you only get each vector once, so sticking with the wrappers is fine.) By vectorizing, you may get the vectorized cache-locality benefit that Drill promises from its operations. Worth a try to see if you get any speed-up. ---
[GitHub] drill pull request #1057: DRILL-5993 Append Row Method For VectorContainer (...
GitHub user ilooner opened a pull request: https://github.com/apache/drill/pull/1057 DRILL-5993 Append Row Method For VectorContainer (WIP) ## Motivation HashJoin requires a method that can take a row from a VectorContainer and append it to a destination VectorContainer. This is a WIP and this PR is mainly opened to improve my understanding. ## Implementation This is an initial implementation that works with simple VectorContainers that are not hyper batches and do not have selection vectors. It is also assumed that the user called **SchemaUtil.coerceContainer** on the source VectorContainer before using the newly added **appendRow** method. ## Questions - Do we have to worry about selection vectors in the source container? - Do we have to think about support hyper batches in the destination container? You can merge this pull request into a Git repository by running: $ git pull https://github.com/ilooner/drill DRILL-5993 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1057.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1057 commit ee43d6808562a1ff60c17fa7622b8358b63c7276 Author: Timothy FarkasDate: 2017-11-29T20:38:41Z - Initial Implementation of append row for a vector container ---