[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 19: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 19 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 29 Sep 2018 11:59:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test * Add tests in test_alloc_fail.py to check handling of out of memory Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Reviewed-on: http://gerrit.cloudera.org:8080/10950 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-init.test M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-update.test M testdata/workloads/functional-query/queries/QueryTest/exprs.test 11 files changed, 658 insertions(+), 1 deletion(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 20 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 19: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 19 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 29 Sep 2018 08:15:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 19: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3249/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 19 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 29 Sep 2018 08:15:45 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 18: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 18 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 29 Sep 2018 00:31:14 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 18: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/870/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 18 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 23:48:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Hello Greg Rahn, Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10950 to look at the new patch set (#18). Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test * Add tests in test_alloc_fail.py to check handling of out of memory Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-init.test M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-update.test M testdata/workloads/functional-query/queries/QueryTest/exprs.test 11 files changed, 658 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/18 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 18 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 17: Looks like a clang-tidy failure: 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:142:5: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:144:5: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:218:3: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:243:3: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:277:3: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:285:3: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:326:3: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:327:3: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] 12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:367:9: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 17 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 19:01:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 17: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3243/ -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 17 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 15:11:53 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 17: Will this warning fail the GVO? /home/ubuntu/Impala/be/src/exprs/string-functions.cc:142:5: warning: unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter] -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 17 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 15:05:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 17: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3243/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 17 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 11:30:07 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 17: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 17 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 11:30:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 16: Code-Review+2 LGTM! Thank you Quanlong for your great work, well done! -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 16 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 11:28:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 16: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/863/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 16 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 11:14:10 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 16: (6 comments) Thanks for your patience too. Adjusted the patch based on your comments. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: static const bool kNeedFree = false; > That memory should be tracked indirectly - results_pool_ is a FreePool, whi Oh, yes. I thought it's the MemPool in udf.cc. Then I have no concerns :) http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@72 PS15, Line 72: if (originalSize >= newSize) return originalPtr; > Put conditional on a single line. Done http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@143 PS15, Line 143: } else { // multiple selected items, return an array string > Can't we use RETURN_NULL_IF_OOM in some of these places? Yes, missing these. Done http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@267 PS15, Line 267: VERFLO > Nit: "Expected" Done http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@336 PS15, Line 336: ctx->SetError(msg.c_str()); > line too long (102 > 90) Done http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@389 PS15, Line 389: ++i; > Maybe put ++i on the next line? formatting looks weird. Done -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 16 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 28 Sep 2018 10:42:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Hello Greg Rahn, Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10950 to look at the new patch set (#16). Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test * Add tests in test_alloc_fail.py to check handling of out of memory Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-init.test M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-update.test M testdata/workloads/functional-query/queries/QueryTest/exprs.test 11 files changed, 657 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/16 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 16 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 15: Code-Review+1 Just skimmed through what have changed since the last time I reviewed. LGTM once the remainder comments are resolved. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 27 Sep 2018 15:37:09 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 15: Code-Review+1 (4 comments) Thanks for your patience and addressing the comments. LGTM once a few small things are fix. I'll give Zoltan a chance to take another look since he's in a different time zone (CET). http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@72 PS15, Line 72: if (originalSize >= newSize) Put conditional on a single line. http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@143 PS15, Line 143: RETURN_IF_OOM(v.Accept(writer), StringVal::null()) Can't we use RETURN_NULL_IF_OOM in some of these places? http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@267 PS15, Line 267: Expect Nit: "Expected" http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@389 PS15, Line 389: case ' ': ++i; Maybe put ++i on the next line? formatting looks weird. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 26 Sep 2018 20:23:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 15: (2 comments) Responses to responses. Need to look at code changes still. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: static const bool kNeedFree = false; > When dealing with allocation failures, I realized that memory allocated by That memory should be tracked indirectly - results_pool_ is a FreePool, which is backed by a MemPool, which tracks memory. There is an alternative implementation of FreePool in udf.cc that is only used in the UDF SDK - when IMPALA_UDF_SDK_BUILD. That doesn't track memory but is not used in an actual impala daemon. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139 PS14, Line 139: ing is at the root, so we conv > OK. That's really strict. Done Yeah I agree it's a bit pedantic but it does make the code more readable overall. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 26 Sep 2018 20:10:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@336 PS15, Line 336: " $$ should only be placed at start", AnyValUtil::ToString(path_str)); line too long (102 > 90) -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 24 Sep 2018 16:34:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 15: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/756/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 24 Sep 2018 15:41:29 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 15: (24 comments) Thanks for your comments, Tim! Significant changes to the last patch: * handle memory allocation failures * replace std::vector with array in RapidJson so its memory are allocated by FuntionContext The memory tracking still needs to discuss. http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG@27 PS14, Line 27: * Add e2e tests in exprs.test > Can we add a couple of tests to tests/custom_cluster/test_alloc_fail.py to Good point! Found that RapidJSON cannot handle failure of memory allocation. I have to throw exceptions in JsonUdfAllocator::Malloc to stop it. https://github.com/Tencent/rapidjson/issues/753#issuecomment-250058390 https://github.com/Tencent/rapidjson/issues/1269#issuecomment-392674252 http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc@8856 PS14, Line 8856: TestIsNull("get_json_object('{\"a\": 1}', '$.b')", TYPE_STRING); > Can you add some tests for: Done http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: static const bool kNeedFree = false; > Done When dealing with allocation failures, I realized that memory allocated by FunctionContextImpl::results_pool_ (used by StringVal(FunctionContext* context, int len)) is not tracked. Looks like it's intentional. Do you know some background for this? Though I think RapidJson won't consume too much memory than the original JSON strings, we still need to track the memory consumption of the UDF, especially when we introduce cache for the parsed JSON paths in future. http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@57 PS13, Line 57: void* Malloc(size_t size) { > I think we need to copy over the previous contents of 'originalPtr'. I look Done http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@133 PS13, Line 133: > Can we use sb.GetSize() instead of strlen()? It seems unnecessary to walk t Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@109 PS14, Line 109: > These functions can all be static I think. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@121 PS14, Line 121: > Use StringVal::CopyFrom() instead - it handles the edge case when memory al Done. Good point! http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@136 PS14, Line 136: if (v.IsNull()) return StringVal::null(); > Use StringVal::CopyFrom() instead Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@138 PS14, Line 138: // RapidJSON will quote the strings. It's incompatible with Hive's behavior when > SelectByKey, SelectByIndex, ExpandArrays and ProcessWildcard key could bene Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139 PS14, Line 139: ing is at the root, so we conv > Our style says use pass by pointer for mutable arguments, and (generally) t OK. That's really strict. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@142 PS14, Line 142: } > Why not use i for the index? here and below. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@162 PS14, Line 162: queue->PushBack(item[key_ptr], *allocator); > Use braces here Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@172 PS14, Line 172: SizeType old_items = queue->Size(); > Use braces here Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@187 PS14, Line 187: to& v : ( > nit: "Encountered" . Here and in a couple of other places like line 205. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@219 PS14, Line 219: RETURN_IF_OOM(ExtractValues(queue, allocator), -1); > Can't this be "return path_idx + 1". We don't use path_idx after returning. Sure. Done http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@224 PS14, Line 224: /// path_str. Return next unprocessed index in path_str. Return -1 for errors. > This is future work, but it would be nice in future to separate out the par Sure. Created IMPALA-7610 for this improvement.
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Hello Greg Rahn, Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10950 to look at the new patch set (#15). Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test * Add tests in test_alloc_fail.py to check handling of out of memory Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-init.test M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-update.test M testdata/workloads/functional-query/queries/QueryTest/exprs.test 11 files changed, 657 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/15 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 15 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 14: (23 comments) Based on the list discussion it sounds like we want to move forward with this. http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG@27 PS14, Line 27: * Add e2e tests in exprs.test Can we add a couple of tests to tests/custom_cluster/test_alloc_fail.py to check handling of out of memory. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc@8856 PS14, Line 8856: TestIsNull("get_json_object('{illegal_json}', '$.a')", TYPE_STRING); Can you add some tests for: * trying to extract from just a JSON literal. E.g. numbers or strings * trying to extract from an empty list or object http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@57 PS13, Line 57: void* Realloc(void* originalPtr, size_t originalSize, size_t newSize) { I think we need to copy over the previous contents of 'originalPtr'. I looked at the RapidJSON Realloc() implementations here and they both seem to do that: https://github.com/Tencent/rapidjson/blob/7e68aa0a21b7800ec98133cb106e49bd6536e25c/include/rapidjson/allocators.h http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@133 PS13, Line 133: strlen Can we use sb.GetSize() instead of strlen()? It seems unnecessary to walk the string to find the null terminator. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@109 PS14, Line 109: StringVal ToStringVal(FunctionContext* ctx, JsonUdfAllocator* allocator, These functions can all be static I think. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@121 PS14, Line 121: memcpy(res.ptr, v->GetString(), v->GetStringLength()); Use StringVal::CopyFrom() instead - it handles the edge case when memory allocation fails. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@136 PS14, Line 136: return result; Use StringVal::CopyFrom() instead http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@138 PS14, Line 138: SelectByKey, SelectByIndex, ExpandArrays and ProcessWildcard key could benefit from short function comments explaining how they transform 'queue'. E.g. for SelectByKey something like: // Extract all the values for 'key' where objects in 'queue' contain that queue. // Replace the contents of queue with the values found. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139 PS14, Line 139: (vector& queue, Our style says use pass by pointer for mutable arguments, and (generally) to put mutable arguments last: https://google.github.io/styleguide/cppguide.html#Reference_Arguments Exception: we sometimes put mutable arguments before immutable for context data structures like "FunctionContext*" http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@142 PS14, Line 142: for (int k = 0; k < old_items; ++k) { Why not use i for the index? here and below. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@162 PS14, Line 162: for (JsonUdfValue* it = queue[k]->Begin(); it != queue[k]->End(); ++it) Use braces here http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@172 PS14, Line 172: for (auto it = queue[k]->MemberBegin(); it != queue[k]->MemberEnd(); ++it) Use braces here http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@187 PS14, Line 187: Encounter nit: "Encountered" . Here and in a couple of other places like line 205. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@219 PS14, Line 219: return ++path_idx; // path_idx points at ']' Can't this be "return path_idx + 1". We don't use path_idx after returning. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@224 PS14, Line 224: int ProcessNumberIndex(FunctionContext* ctx, vector& queue, This is future work, but it would be nice in future to separate out the parsing of the path from the processing of the JSON. That way for constant path strings we could save re-parsing the path. The benefit probably isn't that huge but might be significant. http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@275 PS14, Line 275: if (UNLIKELY(json_str.is_null || json_str.len == 0)) { nit: could fit on
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/623/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 14 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sun, 09 Sep 2018 00:09:46 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 14: Add more tests for coverage and add support for the wildcard in value selection (e.g. $.*). -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 14 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 08 Sep 2018 23:36:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Hello Greg Rahn, Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10950 to look at the new patch set (#14). Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/exprs.test 9 files changed, 598 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/14 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 14 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514 PS13, Line 514: get_json_object > I'm not opposed to having both functions, Hive & ANSI, but I wanted to rais Sure, Greg. I'll send out an email to the dev list for discussion. I'm in favor of implementing both functions, so we're compatible to Hive and also provide more powerful syntax in ANSI. I think most of the existing codes can be shared with ANSI json functions. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 08 Sep 2018 23:34:37 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Greg Rahn has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514 PS13, Line 514: get_json_object > Ah, haven't considered the ANSI standards before. My initial motivation is I'm not opposed to having both functions, Hive & ANSI, but I wanted to raise awareness to see if folks felt it was necessary and having knowledge that the ANSI JSON functions are likely to also be implemented and that hopefully it could be done so in a way that makes the most sense in terms of sharing code, etc. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 06 Sep 2018 04:03:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514 PS13, Line 514: get_json_object > I'd prefer to make Hive compatibility be explicity brought up before decidi Ah, haven't considered the ANSI standards before. My initial motivation is to replace the Hive UDF get_json_object written in Java since it's out of control in memory management. There's a slight difference between the Hive get_json_object() and ANSI json_value(). json_value only returns scalar values, while get_json_object can return JSON string for a selected JSON object. Maybe the ANSI json_query() function is more similar. In Hulu, most of our Impala users are already using the get_json_object function. Some upper systems may also use this function. Since it's already a Hive convention, I think it worths to keep this name. My principle is "start from the user". Rename the function may cause trouble for them. However, the ANSI standards have more flexible syntax. What about creating a new JIRA for implementing the "returning " and "value on error" clauses? -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 06 Sep 2018 00:14:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Greg Rahn has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514 PS13, Line 514: get_json_object > I just want to note that ANSI SQL also has a function called json_object() I'd prefer to make Hive compatibility be explicity brought up before deciding on implementing it hence me favoring ANSI standards first. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 05 Sep 2018 23:41:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514 PS13, Line 514: get_json_object > Is it worth doing both? I think portability from Hive is also important (no I just want to note that ANSI SQL also has a function called json_object() with quite different functionality (it creates a JSON object from a list of key-value pairs). So, it can be confusing for users if we want to be portable with both Hive and ANSI SQL. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 05 Sep 2018 22:09:49 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514 PS13, Line 514: get_json_object > I also left a comment on the Jira, but I'd recommend we name and implement Is it worth doing both? I think portability from Hive is also important (not just portability from other ANSI sql dbs) -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 05 Sep 2018 19:32:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Greg Rahn has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514 PS13, Line 514: get_json_object I also left a comment on the Jira, but I'd recommend we name and implement the ANSI SQL function json_value(). It will be more portable and is the ANSI SQL standard and appears to be very similar. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Greg Rahn Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 05 Sep 2018 17:10:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: Yeah, we should have full test coverage on this code, however that doesn't guarantee that our implementation is correct. So maybe it'd be also beneficial to get some inspiration from the test cases of Apache-licensed (or some other permissive license) json path libraries. There are quite a few of them on GitHub. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 05 Sep 2018 16:36:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: > Patch Set 13: > > I think adding more tests for the json path will be more practical. Will add > test cases we currently support by picking them up in > https://github.com/mysql/mysql-server/blob/8.0/unittest/gunit/json_path-t.cc#L468 Given that code is GPL we can't copy those test cases. Probably better to do our own analysis of coverage and make sure we are confident without relying on GPL code, even if it's "only tests" -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 05 Sep 2018 15:20:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: I think adding more tests for the json path will be more practical. Will add test cases we currently support by picking them up in https://github.com/mysql/mysql-server/blob/8.0/unittest/gunit/json_path-t.cc#L468 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 05 Sep 2018 12:59:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: > We can also add things to the random query generator which validates them > against postgres but I don't think there's a close enough postgres function > to validate against: > https://github.com/apache/impala/blob/master/tests/comparison/funcs.py#L574 Yes, postgres has a quite different syntax for JSON: https://www.postgresql.org/docs/9.6/static/functions-json.html It'd be better if we can test against MySQL, since it's output is more robust and reasonable than Hive's. I found there's a template file about MySQL: https://github.com/apache/impala/blob/master/fe/src/test/resources/mysql-hive-site.xml.template Looks like we used to depend on MySQL in the past? Is it hard to enable it back? Or maybe we can just pull a MySQL docker image and use it. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 28 Aug 2018 04:51:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: I didn't do anything that sophisticated - just wrote a python test that flipped bytes in HDFS files: https://github.com/apache/impala/blob/master/tests/query_test/test_scanners_fuzz.py#L35 Taras did some randomised testing of our decimal implementation here: https://github.com/apache/impala/blob/master/tests/query_test/test_decimal_fuzz.py We can also add things to the random query generator which validates them against postgres but I don't think there's a close enough postgres function to validate against: https://github.com/apache/impala/blob/master/tests/comparison/funcs.py#L574 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 28 Aug 2018 04:29:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: I don't feel really strongly, but this seems like the perfect example of something we should consider fuzz-testing -- lots of hand-written parsing code with user supplied input. Tim, I seem to recall you did some fuzz testing in the past. Do we have some library that would make this easier? Or should we consider just a simple randomized test case that could produce JSON paths? (since I think we believe that rapidjson itself is safe since it's been fuzzed by others already) -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 28 Aug 2018 03:26:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: Code-Review+1 Thanks Quanlong for your great work! I give it +1 for now to let others review it if they have time. If there will be no activity in the following days, I'll +2 it. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 27 Aug 2018 14:10:26 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/487/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 27 Aug 2018 13:17:44 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 13: (2 comments) http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@129 PS11, Line 129: > It's still in PS12. Please also add a test case for the statement above. Thank you, Zoltan! I almost introduced a bug... I removed the judgment in L117 where it should be kept. Added test coverage for all of these. http://gerrit.cloudera.org:8080/#/c/10950/12/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/12/be/src/exprs/string-functions.cc@235 PS12, Line 235: static const int INITIAL_QUEUE_CAPACITY = 64; > static const int would be better. Done -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 27 Aug 2018 12:46:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Hello Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10950 to look at the new patch set (#13). Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/exprs.test 9 files changed, 505 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/13 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 11: (3 comments) Thanks for applying the changes. Found 2 smaller issues. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8816 PS11, Line 8816: TestError("get_json_object('[1,2]', '$[-2]')"); : TestError("get_json_object('[1,2]', '$[0.1]')"); : TestError("get_json_object('[1,2]', '$[a]')"); : : TestError("get_json_object('{\"key\": {\"a\": 1}}', '$..a')"); : TestError("get_json_object('{\"key\": \"value\"}', '$$')"); : TestError("get_json_object('{\"a\": 1}', '$a')"); : TestError("get_json_object('{\"a\": 1}', 'a')"); : TestError("get_json_object('[1,2,3]', '$[**]')"); > Sure. I think we should align with MySQL in corner cases, since Hive's impl Yeah, that's my impression too. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@129 PS11, Line 129: if (it->IsNull()) continue; > Done. Nice find! It's still in PS12. Please also add a test case for the statement above. http://gerrit.cloudera.org:8080/#/c/10950/12/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/12/be/src/exprs/string-functions.cc@235 PS12, Line 235: static const int would be better. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 11 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 27 Aug 2018 10:37:52 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/484/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 25 Aug 2018 09:44:33 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 12: (12 comments) Refactor based on comments and add tests on the error messages. Please have a look when you got a chance. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8804 PS11, Line 8804: "get_json_object('{\"a\":-1, \"b\":-2, \"c\":-3}', '$.b')", "-2"); > Hive's output is different here, but I think this behavior is more reasonab Nice find! http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8816 PS11, Line 8816: "get_json_object('[{\"key\": \"v1\"}, {\"key\": \"v2\"}]', '$[*].key')", : "[\"v1\",\"v2\"]"); : TestStringValue( : "get_json_object('[{\"key\": [1,2,3]}, {\"key\": [4,5]}]', '$[*].key')", : "[[1,2,3],[4,5]]"); : TestStringValue( : "get_json_object('[{\"key\": [1,2,3]}, {\"key\": [4,5]}]', '$[*].key[*]')", : "[1,2,3,4,5]"); : TestStringValue("get_json_object('[[0,1,2], [3,4, > Just a note here: Sure. I think we should align with MySQL in corner cases, since Hive's implementation seems not be strictly reviewed. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@49 PS11, Line 49: > We usually use DCHECK(false) for this in Impala. Done http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@129 PS11, Line 129: > I don't think we need this, because hive also keeps the null values and I t Done. Nice find! http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@176 PS11, Line 176: "Encounter '$1' in p > Please add test case for wildcard + spaces. Done http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@177 PS11, Line 177: d > nit: add a space after the colon Done http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@179 PS11, Line 179: > Please use static_cast() to print it as a character instead of a numb Nice find. I should test the error messages too. Refined the tests in expr-test.cc http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@254 PS11, Line 254: or > nit: might be better to use a named constant here, like INITIAL_QUEUE_CAPAC Done http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@266 PS11, Line 266: nt i = 1; i < path_str.len; ) { > if it means an invalid selector then we should set an error. Done http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@273 PS11, Line 273: > should be 'path_str' Oops! Done http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@303 PS11, Line 303: it yet > should be 'path_str' Done. http://gerrit.cloudera.org:8080/#/c/10950/10/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/10/common/function-registry/impala_functions.py@514 PS10, Line 514: > flake8: E501 line too long (98 > 90 characters) Done -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 25 Aug 2018 09:04:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Hello Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10950 to look at the new patch set (#12). Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/exprs.test 9 files changed, 496 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/12 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 12 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 11: (11 comments) http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc File be/src/exprs/expr-test.cc: http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8804 PS11, Line 8804: "[[1,2,3],[4,5]]"); Hive's output is different here, but I think this behavior is more reasonable, so I filed HIVE-20448. https://issues.apache.org/jira/browse/HIVE-20448 http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8816 PS11, Line 8816: TestError("get_json_object('[1,2]', '$[-2]')"); : TestError("get_json_object('[1,2]', '$[0.1]')"); : TestError("get_json_object('[1,2]', '$[a]')"); : : TestError("get_json_object('{\"key\": {\"a\": 1}}', '$..a')"); : TestError("get_json_object('{\"key\": \"value\"}', '$$')"); : TestError("get_json_object('{\"a\": 1}', '$a')"); : TestError("get_json_object('{\"a\": 1}', 'a')"); : TestError("get_json_object('[1,2,3]', '$[**]')"); Just a note here: Regarding to the json path we are more strict than Hive, but I think it is a good thing. Hive's behavior is quite random on wrong json paths, it either returns all values from the json, or returns null. Since the json path is a constant expression that we want to evaluate on all json data, it is quite important to raise an error when it is not valid. MySQL also does that for its JSON_EXTRACT() function. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@49 PS11, Line 49: RAPIDJSON_ASSERT(false) We usually use DCHECK(false) for this in Impala. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@129 PS11, Line 129: if (it->IsNull()) continue; I don't think we need this, because hive also keeps the null values and I think it is the desired behavior. hive> select get_json_object('[0, 1, null, 2, 3]', '$[*]'); [0,1,null,2,3] MySQL's JSON_EXTRACT() and online jsonpath evaluators like https://jsonpath.curiousconcept.com/ also preserve nulls. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@176 PS11, Line 176: if (path[path_idx] != ' ') Please add test case for wildcard + spaces. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@177 PS11, Line 177: " nit: add a space after the colon http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@179 PS11, Line 179: path[path_idx] Please use static_cast() to print it as a character instead of a number. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@254 PS11, Line 254: 64 nit: might be better to use a named constant here, like INITIAL_QUEUE_CAPACITY or something like that. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@266 PS11, Line 266: if (i == path_str.len) return StringVal::null(); if it means an invalid selector then we should set an error. http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@273 PS11, Line 273: json_str should be 'path_str' http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@303 PS11, Line 303: json_str should be 'path_str' -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 11 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Thu, 23 Aug 2018 13:53:44 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/448/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 22 Aug 2018 14:10:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Hello Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10950 to look at the new patch set (#11). Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/exprs.test 9 files changed, 432 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/11 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 11 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 10: (5 comments) http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: JsonUdfAllocator(FunctionContext* ctx): ctx_(ctx) {} > I'm ok since this patch is for 3.x. We're still using the 2.x branch in pro Done http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@96 PS9, Line 96: in() > yeah, 'pos_' is also a good name for it. Done http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@194 PS9, Line 194: > Sure. Will add test coverage for this. Done http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@195 PS9, Line 195: ocess numb > Good point! I'll add more error messages. Done http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h@53 PS9, Line 53: FindEndOfIdentifi > Oh I see, I still think it's not a good name for either functions unless it Done -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 22 Aug 2018 13:38:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/10/common/function-registry/impala_functions.py File common/function-registry/impala_functions.py: http://gerrit.cloudera.org:8080/#/c/10950/10/common/function-registry/impala_functions.py@514 PS10, Line 514: b flake8: E501 line too long (98 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Wed, 22 Aug 2018 13:37:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Hello Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10950 to look at the new patch set (#10). Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first perform '$[*]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/exprs.test 9 files changed, 431 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/10 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 9: (3 comments) Thanks for your response! Looking forward for the next patch set! http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@96 PS9, Line 96: src_ > Done. Good name. I think 'pos_ ' is shorter. yeah, 'pos_' is also a good name for it. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@166 PS9, Line 166: values.erase > I think I should make the var name more meaningful. I'd like to rename 'val I was just thinking of huge json docs with big arrays, but might still not worth the effort. I'm OK with it as it is. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h@53 PS9, Line 53: AdvanceIdentifier > Sure. This function is a slightly modified version of https://github.com/ap Oh I see, I still think it's not a good name for either functions unless it is the widely used term in parsers. I am not a native speaker, but advancing something suggests me 'modifying something', just like std::advance() does: https://en.cppreference.com/w/cpp/iterator/advance -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Tue, 21 Aug 2018 16:30:41 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 9: (12 comments) Zoltan, thank you for taking a look! Since the RapidJSON has been upgraded to 1.1.0, I'll update this patch soon. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@67 PS9, Line 67: thread_local FunctionContext* UdfJsonAllocator::ctx_ = nullptr; > nit: please insert empty line after class definition Done. Remove the static thread_local variable by refactoring the Malloc & Free logics as discussed before. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@69 PS9, Line 69: RapidJSON required the input strings be end of '\0'. > I don't think it's grammatically correct. Maybe: Done http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@70 PS9, Line 70: tailing > nit: trailing Done http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@96 PS9, Line 96: src_ > nit: read_position_ would be a better name I think Done. Good name. I think 'pos_ ' is shorter. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@166 PS9, Line 166: values.erase > Maybe it'd be better to have two vectors and juggling the values between th I think I should make the var name more meaningful. I'd like to rename 'values' to 'queue' since it's the queue in BFS. 'push_back' and 'erase' act more like a queue. On the other hand, the depth of the json selector is usually small, so such kind of functions won't be called too many times in parsing a json string. Refactoring these with two vectors might not have significant benifits. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@194 PS9, Line 194: (path[path_idx] != ' ') > I am not sure that there is a test for it. Either for the true and false ca Sure. Will add test coverage for this. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@195 PS9, Line 195: return -1; > Maybe taking the function context and set some error would be helpful for t Good point! I'll add more error messages. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@199 PS9, Line 199: return NULL > This function returns -1, the return NULL statement is in GetJsonObjectImpl Sorry, this comment is stale. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@217 PS9, Line 217: return -1; > Error could be set in function context. We could also use 'path_idx' in the Sure http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@251 PS9, Line 251: vector values; > The C++ standard doesn't specify the initial capacity of std::vector, but i OK. But the queue is usually short, the benefits might be limit. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@267 PS9, Line 267: (end == nullptr) > Set error in function context? Sure http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h@53 PS9, Line 53: AdvanceIdentifier > Maybe 'FindIdentifierEnd' or something like that would be a better name. Sure. This function is a slightly modified version of https://github.com/apache/impala/blob/2e63752858d71cc745534367a686980e060a8180/be/src/gutil/strings/util.cc#L743 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Mon, 20 Aug 2018 00:53:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 9: (12 comments) Thanks for taking this on! It is a very useful functionality IMO. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@67 PS9, Line 67: thread_local FunctionContext* UdfJsonAllocator::ctx_ = nullptr; nit: please insert empty line after class definition http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@69 PS9, Line 69: RapidJSON required the input strings be end of '\0'. I don't think it's grammatically correct. Maybe: "RapidJSON requires input strings that end with a trailing '\0'" http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@70 PS9, Line 70: tailing nit: trailing http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@96 PS9, Line 96: src_ nit: read_position_ would be a better name I think http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@166 PS9, Line 166: values.erase Maybe it'd be better to have two vectors and juggling the values between them instead of this erasing and shifting. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@194 PS9, Line 194: (path[path_idx] != ' ') I am not sure that there is a test for it. Either for the true and false cases. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@195 PS9, Line 195: return -1; Maybe taking the function context and set some error would be helpful for the user. Since there is a problem with the 'path_str' we can consider it as an error IMO. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@199 PS9, Line 199: return NULL This function returns -1, the return NULL statement is in GetJsonObjectImpl(), so this comment here is a bit confusing. Maybe it could also set this error in the function context. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@217 PS9, Line 217: return -1; Error could be set in function context. We could also use 'path_idx' in the error message. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@251 PS9, Line 251: vector values; The C++ standard doesn't specify the initial capacity of std::vector, but it is likely to be zero. Calling values.reserve() with some starting value (maybe 64) could be beneficial. http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@267 PS9, Line 267: (end == nullptr) Set error in function context? http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h@53 PS9, Line 53: AdvanceIdentifier Maybe 'FindIdentifierEnd' or something like that would be a better name. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Fri, 17 Aug 2018 14:21:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: return ctx_->Allocate(size); > Personally I would vote to upgrade (or just patch) rapidjson even if it wil I'm ok since this patch is for 3.x. We're still using the 2.x branch in production... Patch for upgrade rapidjson to v1.1.0 is ready for review: https://gerrit.cloudera.org/#/c/11124/ -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Zoltan Borok-Nagy Gerrit-Comment-Date: Sat, 04 Aug 2018 14:59:33 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: return ctx_->Allocate(size); > This refactor is simple but I hit a bug in rapidjson-0.11. In GenericDocume Personally I would vote to upgrade (or just patch) rapidjson even if it will require some API usage changes elsewhere in the code, unless those changes turn out to be really complicated. But, since you mentioned on the mailing list that this patch is very important to you, maybe we can use the threadlocal approach temporarily and leave a TODO to remove it when rapidjson is upgraded? -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 31 Jul 2018 15:27:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: return ctx_->Allocate(size); > Cool! That's a better fit. This refactor is simple but I hit a bug in rapidjson-0.11. In GenericDocument::ParseStream, the GenericReader does not use the allocator of GenericDocument, which means we lost the FunctionContext reference here. A null pointer error was thrown in GenericReader's code path. template GenericDocument& ParseStream(Stream& stream) { ValueType::SetNull(); // Remove existing root if exist GenericReader reader;<-- here's the cause One way to work around this is declaring the ctx_ variable in UdfJsonAllocator static and thread_local. We need thread_local again... This rapidjson bug is fixed in https://github.com/Tencent/rapidjson/commit/87770399a18e190411170a4c23334fa81b5f7f34 I think the current implementation is acceptable until we upgrade the rapidjson version. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 31 Jul 2018 14:45:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 9: (2 comments) Thanks for your cool idea! Just be able to come back to this work. Will update this patch soon. http://gerrit.cloudera.org:8080/#/c/10950/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10950/9//COMMIT_MSG@11 PS9, Line 11: old RapidJson version, we cannot get detailed errors if parse fails. > Should we consider upgrading rapidjson? It might need some efforts as Matthew mentioned in the commend of IMPALA-2014: https://issues.apache.org/jira/browse/IMPALA-2014?focusedCommentId=15913404=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15913404 Anyway, I create a JIRA for this: IMPALA-7364 http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: return ctx_->Allocate(size); > Yeah exactly, the memory is recycled in bulk by Impala's runtime at some po Cool! That's a better fit. -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 31 Jul 2018 11:21:07 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10950 ) Change subject: IMPALA-376: add built-in functions for parsing JSON .. Patch Set 9: (1 comment) http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc File be/src/exprs/string-functions.cc: http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51 PS9, Line 51: return ctx_->Allocate(size); > The temporary result allocation mechanism (used by StringVal(FunctionContex That would also avoid the necessity for the thread_local, because there's no requirement to free each allocation individually, right? -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Attila Jeges Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 24 Jul 2018 00:17:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON
Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/10950 Change subject: IMPALA-376: add built-in functions for parsing JSON .. IMPALA-376: add built-in functions for parsing JSON This patch implement the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. Due to the constrain of old RapidJson version, we cannot get detailed errors if parse fails. One of the complexity of this patch is about memory management. In order to track the memory used in RapidJson, we have to wrap FunctionContext into an allocator. However, RapidJson requires the Free function be static. We can only make the FunctionContext pointer in Allocator thread_local. This is safe since each UDF thread allocates and frees its own memory. The only drawback is that thread_local can’t be cross-compiled to IR. This is acceptable since the parameters are complex strings, we won't gain signifcant performance benefit from JIT compilation. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f --- M be/src/exprs/CMakeLists.txt M be/src/exprs/expr-test.cc M be/src/exprs/string-functions-ir.cc A be/src/exprs/string-functions.cc M be/src/exprs/string-functions.h M be/src/util/string-util.cc M be/src/util/string-util.h M common/function-registry/impala_functions.py M testdata/workloads/functional-query/queries/QueryTest/exprs.test 9 files changed, 411 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/9 -- To view, visit http://gerrit.cloudera.org:8080/10950 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Gerrit-Change-Number: 10950 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang