[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 19: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 19
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 29 Sep 2018 11:59:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Reviewed-on: http://gerrit.cloudera.org:8080/10950
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-init.test
M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-update.test
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
11 files changed, 658 insertions(+), 1 deletion(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 19: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 19
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 29 Sep 2018 08:15:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 19:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3249/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 19
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 29 Sep 2018 08:15:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 18: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 29 Sep 2018 00:31:14 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 18:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/870/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 23:48:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Quanlong Huang (Code Review)
Hello Greg Rahn, Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, 
Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/10950

to look at the new patch set (#18).

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-init.test
M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-update.test
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
11 files changed, 658 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/18
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 17:

Looks like a clang-tidy failure:


12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:142:5: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]
12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:144:5: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]
12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:218:3: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]
12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:243:3: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]
12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:277:3: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]
12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:285:3: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]
12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:326:3: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]
12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:327:3: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]
12:13:38 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:367:9: warning: 
unused exception parameter 'e' [clang-diagnostic-unused-exception-parameter]


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 19:01:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 17: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/3243/


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 15:11:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 17:

Will this warning fail the GVO?

 /home/ubuntu/Impala/be/src/exprs/string-functions.cc:142:5: warning: unused 
exception parameter 'e' [clang-diagnostic-unused-exception-parameter]


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 15:05:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 17:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3243/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 11:30:07 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 17: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 11:30:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 16: Code-Review+2

LGTM! Thank you Quanlong for your great work, well done!


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 11:28:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 16:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/863/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 11:14:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 16:

(6 comments)

Thanks for your patience too. Adjusted the patch based on your comments.

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51:   static const bool kNeedFree = false;
> That memory should be tracked indirectly - results_pool_ is a FreePool, whi
Oh, yes. I thought it's the MemPool in udf.cc. Then I have no concerns :)


http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@72
PS15, Line 72: if (originalSize >= newSize) return originalPtr;
> Put conditional on a single line.
Done


http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@143
PS15, Line 143:   } else {  // multiple selected items, return an array string
> Can't we use RETURN_NULL_IF_OOM in some of these places?
Yes, missing these. Done


http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@267
PS15, Line 267: VERFLO
> Nit: "Expected"
Done


http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@336
PS15, Line 336: ctx->SetError(msg.c_str());
> line too long (102 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@389
PS15, Line 389: ++i;
> Maybe put ++i on the next line? formatting looks weird.
Done



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 28 Sep 2018 10:42:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-28 Thread Quanlong Huang (Code Review)
Hello Greg Rahn, Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, 
Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/10950

to look at the new patch set (#16).

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-init.test
M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-update.test
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
11 files changed, 657 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/16
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-27 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 15: Code-Review+1

Just skimmed through what have changed since the last time I reviewed.
LGTM once the remainder comments are resolved.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 27 Sep 2018 15:37:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 15: Code-Review+1

(4 comments)

Thanks for your patience and addressing the comments. LGTM once a few small 
things are fix. I'll give Zoltan a chance to take another look since he's in a 
different time zone (CET).

http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@72
PS15, Line 72: if (originalSize >= newSize)
Put conditional on a single line.


http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@143
PS15, Line 143: RETURN_IF_OOM(v.Accept(writer), StringVal::null())
Can't we use RETURN_NULL_IF_OOM in some of these places?


http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@267
PS15, Line 267: Expect
Nit: "Expected"


http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@389
PS15, Line 389:   case ' ': ++i;
Maybe put ++i on the next line? formatting looks weird.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 26 Sep 2018 20:23:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 15:

(2 comments)

Responses to responses. Need to look at code changes still.

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51:   static const bool kNeedFree = false;
> When dealing with allocation failures, I realized that memory allocated by
That memory should be tracked indirectly - results_pool_ is a FreePool, which 
is backed by a MemPool, which tracks memory.

There is an alternative implementation of FreePool in udf.cc that is only used 
in the UDF SDK - when IMPALA_UDF_SDK_BUILD. That doesn't track memory but is 
not used in an actual impala daemon.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139
PS14, Line 139: ing is at the root, so we conv
> OK. That's really strict. Done
Yeah I agree it's a bit pedantic but it does make the code more readable 
overall.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 26 Sep 2018 20:10:08 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 15:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/15/be/src/exprs/string-functions.cc@336
PS15, Line 336: " $$ should only be placed at 
start", AnyValUtil::ToString(path_str));
line too long (102 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 24 Sep 2018 16:34:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 15:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/756/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 24 Sep 2018 15:41:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-24 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 15:

(24 comments)

Thanks for your comments, Tim! Significant changes to the last patch:

* handle memory allocation failures
* replace std::vector with array in RapidJson so its memory are allocated by 
FuntionContext

The memory tracking still needs to discuss.

http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG@27
PS14, Line 27: * Add e2e tests in exprs.test
> Can we add a couple of tests to tests/custom_cluster/test_alloc_fail.py to
Good point! Found that RapidJSON cannot handle failure of memory allocation. I 
have to throw exceptions in JsonUdfAllocator::Malloc to stop it.

https://github.com/Tencent/rapidjson/issues/753#issuecomment-250058390
https://github.com/Tencent/rapidjson/issues/1269#issuecomment-392674252


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc@8856
PS14, Line 8856:   TestIsNull("get_json_object('{\"a\": 1}', '$.b')", 
TYPE_STRING);
> Can you add some tests for:
Done


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51:   static const bool kNeedFree = false;
> Done
When dealing with allocation failures, I realized that memory allocated by 
FunctionContextImpl::results_pool_ (used by StringVal(FunctionContext* context, 
int len)) is not tracked. Looks like it's intentional. Do you know some 
background for this?

Though I think RapidJson won't consume too much memory than the original JSON 
strings, we still need to track the memory consumption of the UDF, especially 
when we introduce cache for the parsed JSON paths in future.


http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@57
PS13, Line 57:   void* Malloc(size_t size) {
> I think we need to copy over the previous contents of 'originalPtr'. I look
Done


http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@133
PS13, Line 133:
> Can we use sb.GetSize() instead of strlen()? It seems unnecessary to walk t
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@109
PS14, Line 109:
> These functions can all be static I think.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@121
PS14, Line 121:
> Use StringVal::CopyFrom() instead - it handles the edge case when memory al
Done. Good point!


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@136
PS14, Line 136: if (v.IsNull()) return StringVal::null();
> Use StringVal::CopyFrom() instead
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@138
PS14, Line 138:   // RapidJSON will quote the strings. It's incompatible 
with Hive's behavior when
> SelectByKey, SelectByIndex, ExpandArrays and ProcessWildcard key could bene
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139
PS14, Line 139: ing is at the root, so we conv
> Our style says use pass by pointer for mutable arguments, and (generally) t
OK. That's really strict. Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@142
PS14, Line 142: }
> Why not use i for the index? here and below.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@162
PS14, Line 162: queue->PushBack(item[key_ptr], *allocator);
> Use braces here
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@172
PS14, Line 172:   SizeType old_items = queue->Size();
> Use braces here
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@187
PS14, Line 187: to& v : (
> nit: "Encountered" . Here and in a couple of other places like line 205.
Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@219
PS14, Line 219:   RETURN_IF_OOM(ExtractValues(queue, allocator), -1);
> Can't this be "return path_idx + 1". We don't use path_idx after returning.
Sure. Done


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@224
PS14, Line 224: /// path_str. Return next unprocessed index in path_str. Return 
-1 for errors.
> This is future work, but it would be nice in future to separate out the par
Sure. Created IMPALA-7610 for this improvement.



[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-24 Thread Quanlong Huang (Code Review)
Hello Greg Rahn, Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, 
Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/10950

to look at the new patch set (#15).

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-init.test
M testdata/workloads/functional-query/queries/QueryTest/alloc-fail-update.test
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
11 files changed, 657 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/15
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-19 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 14:

(23 comments)

Based on the list discussion it sounds like we want to move forward with this.

http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10950/14//COMMIT_MSG@27
PS14, Line 27: * Add e2e tests in exprs.test
Can we add a couple of tests to tests/custom_cluster/test_alloc_fail.py to 
check handling of out of memory.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/expr-test.cc@8856
PS14, Line 8856:   TestIsNull("get_json_object('{illegal_json}', '$.a')", 
TYPE_STRING);
Can you add some tests for:

 * trying to extract from just a JSON literal. E.g. numbers or strings
 * trying to extract from an empty list or object


http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@57
PS13, Line 57:   void* Realloc(void* originalPtr, size_t originalSize, size_t 
newSize) {
I think we need to copy over the previous contents of 'originalPtr'. I looked 
at the RapidJSON Realloc() implementations here and they both seem to do that: 
https://github.com/Tencent/rapidjson/blob/7e68aa0a21b7800ec98133cb106e49bd6536e25c/include/rapidjson/allocators.h


http://gerrit.cloudera.org:8080/#/c/10950/13/be/src/exprs/string-functions.cc@133
PS13, Line 133: strlen
Can we use sb.GetSize() instead of strlen()? It seems unnecessary to walk the 
string to find the null terminator.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@109
PS14, Line 109: StringVal ToStringVal(FunctionContext* ctx, JsonUdfAllocator* 
allocator,
These functions can all be static I think.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@121
PS14, Line 121:   memcpy(res.ptr, v->GetString(), v->GetStringLength());
Use StringVal::CopyFrom() instead - it handles the edge case when memory 
allocation fails.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@136
PS14, Line 136:   return result;
Use StringVal::CopyFrom() instead


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@138
PS14, Line 138:
SelectByKey, SelectByIndex, ExpandArrays and ProcessWildcard key could benefit 
from short function comments explaining how they transform 'queue'. E.g. for 
SelectByKey something like:

  // Extract all the values for 'key' where objects in 'queue' contain that 
queue.
  // Replace the contents of queue with the values found.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@139
PS14, Line 139: (vector& queue,
Our style says use pass by pointer for mutable arguments, and (generally) to 
put mutable arguments last: 
https://google.github.io/styleguide/cppguide.html#Reference_Arguments

Exception: we sometimes put mutable arguments before immutable for context data 
structures like "FunctionContext*"


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@142
PS14, Line 142:   for (int k = 0; k < old_items; ++k) {
Why not use i for the index? here and below.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@162
PS14, Line 162: for (JsonUdfValue* it = queue[k]->Begin(); it != 
queue[k]->End(); ++it)
Use braces here


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@172
PS14, Line 172: for (auto it = queue[k]->MemberBegin(); it != 
queue[k]->MemberEnd(); ++it)
Use braces here


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@187
PS14, Line 187: Encounter
nit: "Encountered" . Here and in a couple of other places like line 205.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@219
PS14, Line 219:   return ++path_idx;  // path_idx points at ']'
Can't this be "return path_idx + 1". We don't use path_idx after returning.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@224
PS14, Line 224: int ProcessNumberIndex(FunctionContext* ctx, 
vector& queue,
This is future work, but it would be nice in future to separate out the parsing 
of the path from the processing of the JSON. That way for constant path strings 
we could save re-parsing the path. The benefit probably isn't that huge but 
might be significant.


http://gerrit.cloudera.org:8080/#/c/10950/14/be/src/exprs/string-functions.cc@275
PS14, Line 275:   if (UNLIKELY(json_str.is_null || json_str.len == 0)) {
nit: could fit on 

[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/623/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sun, 09 Sep 2018 00:09:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-08 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 14:

Add more tests for coverage and add support for the wildcard in value selection 
(e.g. $.*).


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 08 Sep 2018 23:36:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-08 Thread Quanlong Huang (Code Review)
Hello Greg Rahn, Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, 
Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/10950

to look at the new patch set (#14).

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
9 files changed, 598 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/14
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-08 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514
PS13, Line 514: get_json_object
> I'm not opposed to having both functions, Hive & ANSI, but I wanted to rais
Sure, Greg. I'll send out an email to the dev list for discussion. I'm in favor 
of implementing both functions, so we're compatible to Hive and also provide 
more powerful syntax in ANSI. I think most of the existing codes can be shared 
with ANSI json functions.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 08 Sep 2018 23:34:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Greg Rahn (Code Review)
Greg Rahn has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514
PS13, Line 514: get_json_object
> Ah, haven't considered the ANSI standards before. My initial motivation is
I'm not opposed to having both functions, Hive & ANSI, but I wanted to raise 
awareness to see if folks felt it was necessary and having knowledge that the 
ANSI JSON functions are likely to also be implemented and that hopefully it 
could be done so in a way that makes the most sense in terms of sharing code, 
etc.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 06 Sep 2018 04:03:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514
PS13, Line 514: get_json_object
> I'd prefer to make Hive compatibility be explicity brought up before decidi
Ah, haven't considered the ANSI standards before. My initial motivation is to 
replace the Hive UDF get_json_object written in Java since it's out of control 
in memory management.

There's a slight difference between the Hive get_json_object() and ANSI 
json_value(). json_value only returns scalar values, while get_json_object can 
return JSON string for a selected JSON object. Maybe the ANSI json_query() 
function is more similar.

In Hulu, most of our Impala users are already using the get_json_object 
function. Some upper systems may also use this function. Since it's already a 
Hive convention, I think it worths to keep this name.

My principle is "start from the user". Rename the function may cause trouble 
for them. However, the ANSI standards have more flexible syntax. What about 
creating a new JIRA for implementing the "returning " and "value on 
error" clauses?



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 06 Sep 2018 00:14:05 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Greg Rahn (Code Review)
Greg Rahn has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514
PS13, Line 514: get_json_object
> I just want to note that ANSI SQL also has a function called json_object()
I'd prefer to make Hive compatibility be explicity brought up before deciding 
on implementing it hence me favoring ANSI standards first.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 05 Sep 2018 23:41:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514
PS13, Line 514: get_json_object
> Is it worth doing both? I think portability from Hive is also important (no
I just want to note that ANSI SQL also has a function called json_object() with 
quite different functionality (it creates a JSON object from a list of 
key-value pairs).

So, it can be confusing for users if we want to be portable with both Hive and 
ANSI SQL.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 05 Sep 2018 22:09:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514
PS13, Line 514: get_json_object
> I also left a comment on the Jira, but I'd recommend we name and implement
Is it worth doing both? I think portability from Hive is also important (not 
just portability from other ANSI sql dbs)



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 05 Sep 2018 19:32:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Greg Rahn (Code Review)
Greg Rahn has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/13/common/function-registry/impala_functions.py@514
PS13, Line 514: get_json_object
I also left a comment on the Jira, but I'd recommend we name and implement the 
ANSI SQL function json_value(). It will be more portable and is the ANSI SQL 
standard and appears to be very similar.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Greg Rahn 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 05 Sep 2018 17:10:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

Yeah, we should have full test coverage on this code, however that doesn't 
guarantee that our implementation is correct.

So maybe it'd be also beneficial to get some inspiration from the test cases of 
Apache-licensed (or some other permissive license) json path libraries. There 
are quite a few of them on GitHub.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 05 Sep 2018 16:36:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

> Patch Set 13:
>
> I think adding more tests for the json path will be more practical. Will add 
> test cases we currently support by picking them up in 
> https://github.com/mysql/mysql-server/blob/8.0/unittest/gunit/json_path-t.cc#L468

Given that code is GPL we can't copy those test cases. Probably better to do 
our own analysis of coverage and make sure we are confident without relying on 
GPL code, even if it's "only tests"


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 05 Sep 2018 15:20:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-09-05 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

I think adding more tests for the json path will be more practical. Will add 
test cases we currently support by picking them up in 
https://github.com/mysql/mysql-server/blob/8.0/unittest/gunit/json_path-t.cc#L468


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 05 Sep 2018 12:59:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-27 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

> We can also add things to the random query generator which validates them 
> against postgres but I don't think there's a close enough postgres function 
> to validate against: 
> https://github.com/apache/impala/blob/master/tests/comparison/funcs.py#L574

Yes, postgres has a quite different syntax for JSON: 
https://www.postgresql.org/docs/9.6/static/functions-json.html
It'd be better if we can test against MySQL, since it's output is more robust 
and reasonable than Hive's. I found there's a template file about MySQL: 
https://github.com/apache/impala/blob/master/fe/src/test/resources/mysql-hive-site.xml.template
Looks like we used to depend on MySQL in the past? Is it hard to enable it 
back? Or maybe we can just pull a MySQL docker image and use it.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 28 Aug 2018 04:51:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-27 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

I didn't do anything that sophisticated - just wrote a python test that flipped 
bytes in HDFS files: 
https://github.com/apache/impala/blob/master/tests/query_test/test_scanners_fuzz.py#L35

Taras did some randomised testing of our decimal implementation here: 
https://github.com/apache/impala/blob/master/tests/query_test/test_decimal_fuzz.py

We can also add things to the random query generator which validates them 
against postgres but I don't think there's a close enough postgres function to 
validate against: 
https://github.com/apache/impala/blob/master/tests/comparison/funcs.py#L574


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 28 Aug 2018 04:29:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-27 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

I don't feel really strongly, but this seems like the perfect example of 
something we should consider fuzz-testing -- lots of hand-written parsing code 
with user supplied input.

Tim, I seem to recall you did some fuzz testing in the past. Do we have some 
library that would make this easier? Or should we consider just a simple 
randomized test case that could produce JSON paths? (since I think we believe 
that rapidjson itself is safe since it's been fuzzed by others already)


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 28 Aug 2018 03:26:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-27 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13: Code-Review+1

Thanks Quanlong for your great work!
I give it +1 for now to let others review it if they have time.
If there will be no activity in the following days, I'll +2 it.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 27 Aug 2018 14:10:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/487/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 27 Aug 2018 13:17:44 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-27 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 13:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@129
PS11, Line 129:
> It's still in PS12. Please also add a test case for the statement above.
Thank you, Zoltan! I almost introduced a bug... I removed the judgment in L117 
where it should be kept. Added test coverage for all of these.


http://gerrit.cloudera.org:8080/#/c/10950/12/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/12/be/src/exprs/string-functions.cc@235
PS12, Line 235: static const int INITIAL_QUEUE_CAPACITY = 64;
> static const int would be better.
Done



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 27 Aug 2018 12:46:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-27 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala 
Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/10950

to look at the new patch set (#13).

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
9 files changed, 505 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/13
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-27 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 11:

(3 comments)

Thanks for applying the changes. Found 2 smaller issues.

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8816
PS11, Line 8816:   TestError("get_json_object('[1,2]', '$[-2]')");
   :   TestError("get_json_object('[1,2]', '$[0.1]')");
   :   TestError("get_json_object('[1,2]', '$[a]')");
   :
   :   TestError("get_json_object('{\"key\": {\"a\": 1}}', 
'$..a')");
   :   TestError("get_json_object('{\"key\": \"value\"}', '$$')");
   :   TestError("get_json_object('{\"a\": 1}', '$a')");
   :   TestError("get_json_object('{\"a\": 1}', 'a')");
   :   TestError("get_json_object('[1,2,3]', '$[**]')");
> Sure. I think we should align with MySQL in corner cases, since Hive's impl
Yeah, that's my impression too.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@129
PS11, Line 129: if (it->IsNull()) continue;
> Done. Nice find!
It's still in PS12. Please also add a test case for the statement above.


http://gerrit.cloudera.org:8080/#/c/10950/12/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/12/be/src/exprs/string-functions.cc@235
PS12, Line 235:
static const int would be better.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 27 Aug 2018 10:37:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/484/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 25 Aug 2018 09:44:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-25 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 12:

(12 comments)

Refactor based on comments and add tests on the error messages. Please have a 
look when you got a chance.

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8804
PS11, Line 8804:   "get_json_object('{\"a\":-1, \"b\":-2, \"c\":-3}', 
'$.b')", "-2");
> Hive's output is different here, but I think this behavior is more reasonab
Nice find!


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8816
PS11, Line 8816:   "get_json_object('[{\"key\": \"v1\"}, {\"key\": 
\"v2\"}]', '$[*].key')",
   :   "[\"v1\",\"v2\"]");
   :   TestStringValue(
   :   "get_json_object('[{\"key\": [1,2,3]}, {\"key\": 
[4,5]}]', '$[*].key')",
   :   "[[1,2,3],[4,5]]");
   :   TestStringValue(
   :   "get_json_object('[{\"key\": [1,2,3]}, {\"key\": 
[4,5]}]', '$[*].key[*]')",
   :   "[1,2,3,4,5]");
   :   TestStringValue("get_json_object('[[0,1,2], [3,4,
> Just a note here:
Sure. I think we should align with MySQL in corner cases, since Hive's 
implementation seems not be strictly reviewed.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@49
PS11, Line 49:
> We usually use DCHECK(false) for this in Impala.
Done


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@129
PS11, Line 129:
> I don't think we need this, because hive also keeps the null values and I t
Done. Nice find!


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@176
PS11, Line 176:   "Encounter '$1' in p
> Please add test case for wildcard + spaces.
Done


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@177
PS11, Line 177: d
> nit: add a space after the colon
Done


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@179
PS11, Line 179:
> Please use static_cast() to print it as a character instead of a numb
Nice find. I should test the error messages too. Refined the tests in 
expr-test.cc


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@254
PS11, Line 254: or
> nit: might be better to use a named constant here, like INITIAL_QUEUE_CAPAC
Done


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@266
PS11, Line 266: nt i = 1; i < path_str.len; ) {
> if it means an invalid selector then we should set an error.
Done


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@273
PS11, Line 273:
> should be 'path_str'
Oops! Done


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@303
PS11, Line 303:  it yet
> should be 'path_str'
Done.


http://gerrit.cloudera.org:8080/#/c/10950/10/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/10/common/function-registry/impala_functions.py@514
PS10, Line 514:
> flake8: E501 line too long (98 > 90 characters)
Done



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 25 Aug 2018 09:04:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-25 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala 
Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/10950

to look at the new patch set (#12).

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
9 files changed, 496 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/12
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-23 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 11:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc
File be/src/exprs/expr-test.cc:

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8804
PS11, Line 8804:   "[[1,2,3],[4,5]]");
Hive's output is different here, but I think this behavior is more reasonable, 
so I filed HIVE-20448.

https://issues.apache.org/jira/browse/HIVE-20448


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/expr-test.cc@8816
PS11, Line 8816:   TestError("get_json_object('[1,2]', '$[-2]')");
   :   TestError("get_json_object('[1,2]', '$[0.1]')");
   :   TestError("get_json_object('[1,2]', '$[a]')");
   :
   :   TestError("get_json_object('{\"key\": {\"a\": 1}}', 
'$..a')");
   :   TestError("get_json_object('{\"key\": \"value\"}', '$$')");
   :   TestError("get_json_object('{\"a\": 1}', '$a')");
   :   TestError("get_json_object('{\"a\": 1}', 'a')");
   :   TestError("get_json_object('[1,2,3]', '$[**]')");
Just a note here:
Regarding to the json path we are more strict than Hive, but I think it is a 
good thing. Hive's behavior is quite random on wrong json paths, it either 
returns all values from the json, or returns null.

Since the json path is a constant expression that we want to evaluate on all 
json data, it is quite important to raise an error when it is not valid. MySQL 
also does that for its JSON_EXTRACT() function.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@49
PS11, Line 49: RAPIDJSON_ASSERT(false)
We usually use DCHECK(false) for this in Impala.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@129
PS11, Line 129: if (it->IsNull()) continue;
I don't think we need this, because hive also keeps the null values and I think 
it is the desired behavior.

 hive> select get_json_object('[0, 1, null, 2, 3]', '$[*]');
 [0,1,null,2,3]

MySQL's JSON_EXTRACT() and online jsonpath evaluators like 
https://jsonpath.curiousconcept.com/ also preserve nulls.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@176
PS11, Line 176: if (path[path_idx] != ' ')
Please add test case for wildcard + spaces.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@177
PS11, Line 177: "
nit: add a space after the colon


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@179
PS11, Line 179: path[path_idx]
Please use static_cast() to print it as a character instead of a number.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@254
PS11, Line 254: 64
nit: might be better to use a named constant here, like INITIAL_QUEUE_CAPACITY 
or something like that.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@266
PS11, Line 266: if (i == path_str.len) return StringVal::null();
if it means an invalid selector then we should set an error.


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@273
PS11, Line 273: json_str
should be 'path_str'


http://gerrit.cloudera.org:8080/#/c/10950/11/be/src/exprs/string-functions.cc@303
PS11, Line 303: json_str
should be 'path_str'



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 23 Aug 2018 13:53:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/448/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 22 Aug 2018 14:10:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-22 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong, Impala 
Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/10950

to look at the new patch set (#11).

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
9 files changed, 432 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/11
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-22 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 10:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51:   JsonUdfAllocator(FunctionContext* ctx): ctx_(ctx) {}
> I'm ok since this patch is for 3.x. We're still using the 2.x branch in pro
Done


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@96
PS9, Line 96: in()
> yeah, 'pos_' is also a good name for it.
Done


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@194
PS9, Line 194:
> Sure. Will add test coverage for this.
Done


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@195
PS9, Line 195: ocess numb
> Good point! I'll add more error messages.
Done


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h@53
PS9, Line 53: FindEndOfIdentifi
> Oh I see, I still think it's not a good name for either functions unless it
Done



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 22 Aug 2018 13:38:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/10/common/function-registry/impala_functions.py
File common/function-registry/impala_functions.py:

http://gerrit.cloudera.org:8080/#/c/10950/10/common/function-registry/impala_functions.py@514
PS10, Line 514: b
flake8: E501 line too long (98 > 90 characters)



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 22 Aug 2018 13:37:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-22 Thread Quanlong Huang (Code Review)
Hello Zoltan Borok-Nagy, Attila Jeges, Todd Lipcon, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/10950

to look at the new patch set (#10).

Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
9 files changed, 431 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/10
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-21 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 9:

(3 comments)

Thanks for your response! Looking forward for the next patch set!

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@96
PS9, Line 96: src_
> Done. Good name. I think 'pos_ ' is shorter.
yeah, 'pos_' is also a good name for it.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@166
PS9, Line 166: values.erase
> I think I should make the var name more meaningful. I'd like to rename 'val
I was just thinking of huge json docs with big arrays, but might still not 
worth the effort.

I'm OK with it as it is.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h@53
PS9, Line 53: AdvanceIdentifier
> Sure. This function is a slightly modified version of https://github.com/ap
Oh I see, I still think it's not a good name for either functions unless it is 
the widely used term in parsers.

I am not a native speaker, but advancing something suggests me 'modifying 
something', just like std::advance() does: 
https://en.cppreference.com/w/cpp/iterator/advance



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 21 Aug 2018 16:30:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-19 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 9:

(12 comments)

Zoltan, thank you for taking a look! Since the RapidJSON has been upgraded to 
1.1.0, I'll update this patch soon.

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@67
PS9, Line 67: thread_local FunctionContext* UdfJsonAllocator::ctx_ = nullptr;
> nit: please insert empty line after class definition
Done. Remove the static thread_local variable by refactoring the Malloc & Free 
logics as discussed before.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@69
PS9, Line 69: RapidJSON required the input strings be end of '\0'.
> I don't think it's grammatically correct. Maybe:
Done


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@70
PS9, Line 70: tailing
> nit: trailing
Done


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@96
PS9, Line 96: src_
> nit: read_position_ would be a better name I think
Done. Good name. I think 'pos_ ' is shorter.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@166
PS9, Line 166: values.erase
> Maybe it'd be better to have two vectors and juggling the values between th
I think I should make the var name more meaningful. I'd like to rename 'values' 
to 'queue' since it's the queue in BFS. 'push_back' and 'erase' act more like a 
queue. On the other hand, the depth of the json selector is usually small, so 
such kind of functions won't be called too many times in parsing a json string. 
Refactoring these with two vectors might not have significant benifits.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@194
PS9, Line 194: (path[path_idx] != ' ')
> I am not sure that there is a test for it. Either for the true and false ca
Sure. Will add test coverage for this.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@195
PS9, Line 195: return -1;
> Maybe taking the function context and set some error would be helpful for t
Good point! I'll add more error messages.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@199
PS9, Line 199: return NULL
> This function returns -1, the return NULL statement is in GetJsonObjectImpl
Sorry, this comment is stale.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@217
PS9, Line 217: return -1;
> Error could be set in function context. We could also use 'path_idx' in the
Sure


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@251
PS9, Line 251:   vector values;
> The C++ standard doesn't specify the initial capacity of std::vector, but i
OK. But the queue is usually short, the benefits might be limit.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@267
PS9, Line 267: (end == nullptr)
> Set error in function context?
Sure


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h@53
PS9, Line 53: AdvanceIdentifier
> Maybe 'FindIdentifierEnd' or something like that would be a better name.
Sure. This function is a slightly modified version of 
https://github.com/apache/impala/blob/2e63752858d71cc745534367a686980e060a8180/be/src/gutil/strings/util.cc#L743



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 20 Aug 2018 00:53:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-17 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 9:

(12 comments)

Thanks for taking this on! It is a very useful functionality IMO.

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@67
PS9, Line 67: thread_local FunctionContext* UdfJsonAllocator::ctx_ = nullptr;
nit: please insert empty line after class definition


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@69
PS9, Line 69: RapidJSON required the input strings be end of '\0'.
I don't think it's grammatically correct. Maybe:
"RapidJSON requires input strings that end with a trailing '\0'"


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@70
PS9, Line 70: tailing
nit: trailing


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@96
PS9, Line 96: src_
nit: read_position_ would be a better name I think


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@166
PS9, Line 166: values.erase
Maybe it'd be better to have two vectors and juggling the values between them 
instead of this erasing and shifting.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@194
PS9, Line 194: (path[path_idx] != ' ')
I am not sure that there is a test for it. Either for the true and false cases.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@195
PS9, Line 195: return -1;
Maybe taking the function context and set some error would be helpful for the 
user.

Since there is a problem with the 'path_str' we can consider it as an error IMO.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@199
PS9, Line 199: return NULL
This function returns -1, the return NULL statement is in GetJsonObjectImpl(), 
so this comment here is a bit confusing.

Maybe it could also set this error in the function context.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@217
PS9, Line 217: return -1;
Error could be set in function context. We could also use 'path_idx' in the 
error message.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@251
PS9, Line 251:   vector values;
The C++ standard doesn't specify the initial capacity of std::vector, but it is 
likely to be zero.

Calling values.reserve() with some starting value (maybe 64) could be 
beneficial.


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@267
PS9, Line 267: (end == nullptr)
Set error in function context?


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/util/string-util.h@53
PS9, Line 53: AdvanceIdentifier
Maybe 'FindIdentifierEnd' or something like that would be a better name.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 17 Aug 2018 14:21:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-08-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51: return ctx_->Allocate(size);
> Personally I would vote to upgrade (or just patch) rapidjson even if it wil
I'm ok since this patch is for 3.x. We're still using the 2.x branch in 
production...
Patch for upgrade rapidjson to v1.1.0 is ready for review: 
https://gerrit.cloudera.org/#/c/11124/



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sat, 04 Aug 2018 14:59:33 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-07-31 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51: return ctx_->Allocate(size);
> This refactor is simple but I hit a bug in rapidjson-0.11. In GenericDocume
Personally I would vote to upgrade (or just patch) rapidjson even if it will 
require some API usage changes elsewhere in the code, unless those changes turn 
out to be really complicated. But, since you mentioned on the mailing list that 
this patch is very important to you, maybe we can use the threadlocal approach 
temporarily and leave a TODO to remove it when rapidjson is upgraded?



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 31 Jul 2018 15:27:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-07-31 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51: return ctx_->Allocate(size);
> Cool! That's a better fit.
This refactor is simple but I hit a bug in rapidjson-0.11. In 
GenericDocument::ParseStream, the GenericReader does not use the allocator of 
GenericDocument, which means we lost the FunctionContext reference here. A null 
pointer error was thrown in GenericReader's code path.

template 
GenericDocument& ParseStream(Stream& stream) {
ValueType::SetNull(); // Remove existing root if exist
GenericReader reader;<-- here's the 
cause

One way to work around this is declaring the ctx_ variable in UdfJsonAllocator 
static and thread_local. We need thread_local again...

This rapidjson bug is fixed in 
https://github.com/Tencent/rapidjson/commit/87770399a18e190411170a4c23334fa81b5f7f34

I think the current implementation is acceptable until we upgrade the rapidjson 
version.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 31 Jul 2018 14:45:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-07-31 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 9:

(2 comments)

Thanks for your cool idea! Just be able to come back to this work. Will update 
this patch soon.

http://gerrit.cloudera.org:8080/#/c/10950/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10950/9//COMMIT_MSG@11
PS9, Line 11: old RapidJson version, we cannot get detailed errors if parse 
fails.
> Should we consider upgrading rapidjson?
It might need some efforts as Matthew mentioned in the commend of IMPALA-2014: 
https://issues.apache.org/jira/browse/IMPALA-2014?focusedCommentId=15913404=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15913404

Anyway, I create a JIRA for this: IMPALA-7364


http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51: return ctx_->Allocate(size);
> Yeah exactly, the memory is recycled in bulk by Impala's runtime at some po
Cool! That's a better fit.



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 31 Jul 2018 11:21:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-07-23 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10950 )

Change subject: IMPALA-376: add built-in functions for parsing JSON
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc
File be/src/exprs/string-functions.cc:

http://gerrit.cloudera.org:8080/#/c/10950/9/be/src/exprs/string-functions.cc@51
PS9, Line 51: return ctx_->Allocate(size);
> The temporary result allocation mechanism (used by StringVal(FunctionContex
That would also avoid the necessity for the thread_local, because there's no 
requirement to free each allocation individually, right?



--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Attila Jeges 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 24 Jul 2018 00:17:12 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-376: add built-in functions for parsing JSON

2018-07-21 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/10950


Change subject: IMPALA-376: add built-in functions for parsing JSON
..

IMPALA-376: add built-in functions for parsing JSON

This patch implement the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. Due to the constrain of
old RapidJson version, we cannot get detailed errors if parse fails.

One of the complexity of this patch is about memory management. In
order to track the memory used in RapidJson, we have to wrap
FunctionContext into an allocator. However, RapidJson requires the
Free function be static. We can only make the FunctionContext pointer
in Allocator thread_local. This is safe since each UDF thread
allocates and frees its own memory. The only drawback is that
thread_local can’t be cross-compiled to IR. This is acceptable since
the parameters are complex strings, we won't gain signifcant
performance benefit from JIT compilation.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
---
M be/src/exprs/CMakeLists.txt
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
A be/src/exprs/string-functions.cc
M be/src/exprs/string-functions.h
M be/src/util/string-util.cc
M be/src/util/string-util.h
M common/function-registry/impala_functions.py
M testdata/workloads/functional-query/queries/QueryTest/exprs.test
9 files changed, 411 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/50/10950/9
--
To view, visit http://gerrit.cloudera.org:8080/10950
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Gerrit-Change-Number: 10950
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang