[GitHub] [druid] xvrl opened a new issue #9912: coordinator restart during index task shutdown may cause successful task to be marked failed

2020-05-21 Thread GitBox


xvrl opened a new issue #9912:
URL: https://github.com/apache/druid/issues/9912


   ### Affected Version
   0.17.0
   
   ### Description
   
   We got a little unlucky and hit an edge case in task handling during coordinator restart.
   
   If a task takes a little too long to shut down, there is a window during which the task status cannot be queried after it succeeds.
   
   A coordinator/overlord in the process of starting might discover the task in ZooKeeper and try to query its status. The status call will fail, and the coordinator will decide to kill a task that is about to succeed.
   
   While the task is still shutting down, the worker receives the kill signal from the supervisor and closes the input stream of the task process, triggering an unclean shutdown with a non-zero exit code. This in turn causes the worker to believe the task failed, despite the task logs showing success.
   
   In our case the chance of hitting the problem was exacerbated by `druid.server.http.unannouncePropagationDelay` being set to 5 seconds, which increased the time between the chat handler no longer being reachable and the process exiting.
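   The race window can be illustrated with a toy timeline model (a minimal sketch for illustration only, not Druid code; all names and timestamps here are invented):

   ```python
   from dataclasses import dataclass

   @dataclass
   class TaskTimeline:
       """Key instants of the task's shutdown, in seconds."""
       success_at: float     # task logs SUCCESS
       unannounce_at: float  # chat handler becomes unreachable
       exit_at: float        # JVM exits (unannounce + propagation delay)

   def poll_status(t: TaskTimeline, now: float) -> str:
       """What a status query against the task's chat handler sees."""
       if now < t.unannounce_at:
           return 'SUCCESS' if now >= t.success_at else 'RUNNING'
       if now < t.exit_at:
           # The bug window: the handler is already unannounced but the
           # process has not exited yet, so the call fails (400 Bad Request).
           return 'ERROR_400'
       return 'EXITED'

   def supervisor_decision(t: TaskTimeline, poll_time: float) -> str:
       # Mirrors the supervisor log "failed to return status, killing task".
       return 'KILL' if poll_status(t, poll_time) == 'ERROR_400' else 'LEAVE'

   # A 5-second unannouncePropagationDelay makes the window five seconds wide.
   timeline = TaskTimeline(success_at=21.5, unannounce_at=21.6, exit_at=26.6)
   ```

   A poll landing anywhere inside [unannounce_at, exit_at) gets the 400 and triggers a kill, even though the task already succeeded.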
   
   Timeline from the logs below:
   
   ```
   coordinator    May 20th 2020, 17:54:19.431  Curator-PathChildrenCache-1  INFO  Worker[druid-middle-manager-8.druid-middle-manager.druid.svc.cluster.local:8091] wrote RUNNING status for task [index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd] on [TaskLocation{host='druid-middle-manager-8.druid-middle-manager.druid.svc.cluster.local', port=8100, tlsPort=-1}]
   coordinator    May 20th 2020, 17:54:19.592  LeaderSelector[/druid/prod/overlord/_OVERLORD]  INFO  Adding task[index_kafka_telemetry_prod_caf1fb5dc7be8a4_bknnbodd] to activeTasks

   task           2020-05-20T17:54:21.520Z  Task completed with status: {\n  \"id\" : \"index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd\",\n  \"status\" : \"SUCCESS\",\n  \"duration\" : 3840683 [...]
   task           2020-05-20T17:54:21.527Z  Stopping lifecycle [module] stage [SERVER]
   task           2020-05-20T17:54:21.527Z  Sleeping 5000 ms for unannouncement to propagate.

   coordinator    May 20th 2020, 17:54:25.328  IndexTaskClient-mydata-0  WARN  Exception while sending request
   org.apache.druid.java.util.common.IAE: Received 400 Bad Request with body: {"error":"Can't find chatHandler for handler[index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd]"}
       at org.apache.druid.indexing.common.IndexTaskClient.submitRequest(IndexTaskClient.java:356)
       at org.apache.druid.indexing.common.IndexTaskClient.submitRequestWithEmptyContent(IndexTaskClient.java:220)
       at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient.getStatus(SeekableStreamIndexTaskClient.java:172)
       at org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient.lambda$getStatusAsync$9(SeekableStreamIndexTaskClient.java:373)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)

   coordinator    May 20th 2020, 17:54:25.767  KafkaSupervisor-mydata  INFO  Shutdown [index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd] because: [Task [index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd] failed to return status, killing task]

   middlemanager  May 20th 2020, 17:54:25.769  qtp1008934993-97  INFO  Shutdown [index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd] because: [shut down request via HTTP endpoint]
   middlemanager  May 20th 2020, 17:54:25.769  qtp1008934993-97  INFO  Closing output stream to task[index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd].

   task           2020-05-20T17:54:25.769Z  Triggering JVM shutdown.
   task           2020-05-20T17:54:25.770Z  Running shutdown hook
   task           2020-05-20T17:54:25.770Z  Lifecycle [module] already stopped and stop was called. Silently skipping

   coordinator    May 20th 2020, 17:54:25.778  KafkaSupervisor-mydata  INFO  Removing task[index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd] from activeTasks
   coordinator    May 20th 2020, 17:54:25.778  KafkaSupervisor-mydata  INFO  Sent shutdown message to worker: druid-middle-manager-8.druid-middle-manager.druid.svc.cluster.local:8091, status 200 OK, response: {"task":"index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd"}
   coordinator    May 20th 2020, 17:54:25.778  KafkaSupervisor-mydata  INFO  Removing task[index_kafka_mydata_caf1fb5dc7be8a4_bknnbodd] from TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_mydata', dataSource='mydata', interval=2020-05-20T16:45:00.000Z/2020-05-20T17:00:00.000Z, version='2020-05-20T16:45:00.403Z', priority=75, revoked=false}]

   coordinator    May 20th 2020, 17:54:25.784
   ```

[GitHub] [druid] 2bethere commented on a change in pull request #9879: Querying doc refresh tutorial

2020-05-21 Thread GitBox


2bethere commented on a change in pull request #9879:
URL: https://github.com/apache/druid/pull/9879#discussion_r429039250



##
File path: docs/querying/querying.md
##
@@ -44,7 +49,7 @@ The Content-Type/Accept Headers can also take 
'application/x-jackson-smile'.
 curl -X POST ':/druid/v2/?pretty' -H 
'Content-Type:application/json' -H 'Accept:application/x-jackson-smile' -d 
@
 ```

Review comment:
   Maybe add a sentence here:
   "If you are running quick start, you should replace : 
with localhost: 8082"
   
   Sometimes I find it frustrating when reading tutorials and can't find the default parameters.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[GitHub] [druid] cloventt edited a comment on issue #9911: Sorting on the value of an injective lookup returns incorrect order

2020-05-21 Thread GitBox


cloventt edited a comment on issue #9911:
URL: https://github.com/apache/druid/issues/9911#issuecomment-632469480


   This is roughly the query we are running:
   
   ```json
   {
     "queryType": "groupBy",
     "dataSource": {
       "type": "table",
       "name": "test-datasource"
     },
     "intervals": {
       "type": "LegacySegmentSpec",
       "intervals": [
         "2020-04-25T00:00:00.000Z/2020-04-26T00:00:00.000Z"
       ]
     },
     "descending": false,
     "virtualColumns": [],
     "granularity": "DAY",
     "dimensions": [
       {
         "type": "lookup",
         "dimension": "actualDimension",
         "outputName": "injectiveLookupDimension",
         "name": "test_lookup"
       }
     ],
     "aggregations": [
       {
         "type": "longSum",
         "name": "numericalValue",
         "fieldName": "numericalValue"
       }
     ],
     "postAggregations": [],
     "limitSpec": {
       "type": "default",
       "limit": 5,
       "columns": [
         {
           "dimension": "injectiveLookupDimension",
           "direction": "descending"
         }
       ]
     },
     "limit": 2147483647
   }
   ```
   We expect the results of this to be ordered by the value of 
`injectiveLookupDimension`, but instead they are ordered by `actualDimension`.
   
   The registered lookup is configured to be `injective`.









[GitHub] [druid] cloventt opened a new issue #9911: Sorting on the value of an injective lookup returns incorrect order

2020-05-21 Thread GitBox


cloventt opened a new issue #9911:
URL: https://github.com/apache/druid/issues/9911


   ### Affected Version
   
   0.17.0
   
   ### Description
   
   We have been using injective lookups to perform a one-to-one mapping between 
an ID and some model data. However, when we construct a query to order results 
by the mapped value, the results are returned ordered by the original dimension 
value.
   
   For example, if injective lookups perform this mapping:
   
   ```
   // key -> value
   1 -> C
   2 -> B
   3 -> A
   ```
   
   Then when making a query ordering by `value`, we would expect to see:
   ```
   A
   B
   C
   ```
   But what we actually see is:
   ```
   C
   B
   A
   ```
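   The gap between the expected and observed ordering can be reproduced in a few lines (a hypothetical stand-in for the lookup, not Druid code):

   ```python
   # The example mapping from above.
   lookup = {1: 'C', 2: 'B', 3: 'A'}

   def order_by_key(keys):
       # Observed behaviour: sort on the underlying key, then map.
       return [lookup[k] for k in sorted(keys)]

   def order_by_value(keys):
       # Expected behaviour: sort on the mapped value itself.
       return sorted(lookup[k] for k in keys)
   ```

   `order_by_key([1, 2, 3])` yields `['C', 'B', 'A']`, while `order_by_value([1, 2, 3])` yields `['A', 'B', 'C']`.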
   The results are ordered by the key, not the value. This issue does not seem to occur for non-injective lookups. However, when we switch to a non-injective lookup, we see a performance hit and a considerable increase in heap usage and GC time on the historical while the query is running.
   
   Given how injective lookups are processed, this might be expected behaviour. However, as far as I can tell, this behaviour is not mentioned in the documentation.






[GitHub] [druid] maytasm opened a new pull request #9910: Update doc on tmp dir (java.io.tmpdir) best practice

2020-05-21 Thread GitBox


maytasm opened a new pull request #9910:
URL: https://github.com/apache/druid/pull/9910


   Update doc on tmp dir (java.io.tmpdir) best practice
   
   ### Description
   
   We have had an issue where our /mnt/tmp is an NFS mount. This can cause the intermediate persists of our streaming ingestion tasks to be extremely slow. This updates the docs to caution other users against setting the tmp dir (aka java.io.tmpdir) to an NFS mount, and notes the importance of having good read and write speed for this directory.
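   As a rough way to sanity-check a candidate tmp dir's write speed before pointing java.io.tmpdir at it (a hedged sketch; the helper name and sizes are arbitrary, not a Druid utility):

   ```python
   import os
   import tempfile
   import time

   def write_throughput_mb_s(directory: str, size_mb: int = 8) -> float:
       """Rough sequential-write throughput of a directory, in MB/s."""
       payload = b'\0' * (1024 * 1024)
       fd, path = tempfile.mkstemp(dir=directory)
       try:
           start = time.perf_counter()
           with os.fdopen(fd, 'wb') as f:
               for _ in range(size_mb):
                   f.write(payload)
               f.flush()
               os.fsync(f.fileno())  # force data to the backing store
           return size_mb / (time.perf_counter() - start)
       finally:
           os.remove(path)
   ```

   Comparing the number for a local disk against the NFS mount makes the slowdown visible before an ingestion task hits it.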
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   






[GitHub] [druid] yuanlihan commented on issue #9904: Fail to compact overlapping segments

2020-05-21 Thread GitBox


yuanlihan commented on issue #9904:
URL: https://github.com/apache/druid/issues/9904#issuecomment-632448083


   Hi @jihoonson 
   > This log means, you got two segments of overlapping intervals and same 
version. This is never allowed since Druid doesn't know what segment to read in 
this case. Druid should never create such segments of overlapping intervals but 
same version. Can you check why and how they were created? 
   
   The two partially overlapping segments with the same version are created by the compaction task, as described in step 3 of the reproduction steps above. According to the compaction task's log, the two partially overlapping input segments (which have different versions) are assigned to two different `indexSpec`s for compaction, producing a `DAY`-granularity segment and an `HOUR`-granularity segment with the same version. I pasted the compaction task's log [here](https://gist.github.com/yuanlihan/a3f54ca6eb6ba7508fcfa653065c79eb) FYI. You can also easily reproduce this bug locally by following the steps listed above.
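   The invalid end state (overlapping intervals with the same version) can be checked for mechanically. This is a simplified sketch with intervals as numeric (start, end) pairs, not Druid's actual segment model:

   ```python
   from itertools import combinations

   def overlaps(a, b):
       # Half-open intervals (start, end) overlap if each starts
       # before the other ends.
       return a[0] < b[1] and b[0] < a[1]

   def find_conflicts(segments):
       """Return pairs of segments with overlapping intervals and the
       same version, the state Druid never allows."""
       return [
           (s1, s2)
           for s1, s2 in combinations(segments, 2)
           if overlaps(s1['interval'], s2['interval'])
           and s1['version'] == s2['version']
       ]
   ```

   Feeding it a DAY segment and an HOUR segment inside that day with the same version flags exactly the situation from the compaction log.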
   
   
   






[GitHub] [druid] averma111 opened a new issue #9909: Need to fix Twistlock vulnerabilities for Google Cloud Kubernetes Engine

2020-05-21 Thread GitBox


averma111 opened a new issue #9909:
URL: https://github.com/apache/druid/issues/9909


   Druid Version: 0.18.1.
   
   There are 5 high-priority vulnerabilities which have popped up. They are given below:
   
   Jackson
   Guava
   Netty
   log4j
   logging-log4j
   
   I am trying to build Druid from source after replacing the affected versions in the pom.xml.
   
   I urgently need help with how to build Druid from source code, i.e. the exact commands and steps.
   
   Any help is appreciated.
   
   Thanks,
   Ashish 






[GitHub] [druid] vogievetsky commented on pull request #9887: Re-order and document format detection in web console

2020-05-21 Thread GitBox


vogievetsky commented on pull request #9887:
URL: https://github.com/apache/druid/pull/9887#issuecomment-632396654


   Oh I guess you saw a binary file that happened to have more than 3 tabs or 
commas...






[GitHub] [druid] vogievetsky merged pull request #9887: Re-order and document format detection in web console

2020-05-21 Thread GitBox


vogievetsky merged pull request #9887:
URL: https://github.com/apache/druid/pull/9887


   






[druid] branch master updated: Re-order and document format detection in web console (#9887)

2020-05-21 Thread vogievetsky
This is an automated email from the ASF dual-hosted git repository.

vogievetsky pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git


The following commit(s) were added to refs/heads/master by this push:
 new 132a1c9  Re-order and document format detection in web console (#9887)
132a1c9 is described below

commit 132a1c9fe770209792973f52b8aee6f36f06aa3d
Author: Joseph Glanville 
AuthorDate: Fri May 22 06:29:39 2020 +0700

Re-order and document format detection in web console (#9887)

Motivation for this change is to not inadvertently identify binary
formats that contain uncompressed string data as TSV or CSV.

Moving detection of magic byte headers before heuristics should be more
robust in general.
---
 web-console/src/utils/ingestion-spec.tsx | 32 +++-
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/web-console/src/utils/ingestion-spec.tsx b/web-console/src/utils/ingestion-spec.tsx
index 0bc3ff3..9a2d700 100644
--- a/web-console/src/utils/ingestion-spec.tsx
+++ b/web-console/src/utils/ingestion-spec.tsx
@@ -2676,29 +2676,35 @@ function guessInputFormat(sampleData: string[]): InputFormat {
   if (sampleDatum) {
     sampleDatum = String(sampleDatum); // Really ensure it is a string
 
-    if (sampleDatum.startsWith('{') && sampleDatum.endsWith('}')) {
-      return inputFormatFromType('json');
-    }
-
-    if (sampleDatum.split('\t').length > 3) {
-      return inputFormatFromType('tsv', !/\t\d+\t/.test(sampleDatum));
-    }
-
-    if (sampleDatum.split(',').length > 3) {
-      return inputFormatFromType('csv', !/,\d+,/.test(sampleDatum));
-    }
+    // First check for magic byte sequences as they rarely yield false positives
 
+    // Parquet 4 byte magic header: https://github.com/apache/parquet-format#file-format
     if (sampleDatum.startsWith('PAR1')) {
       return inputFormatFromType('parquet');
     }
-
+    // ORC 3 byte magic header: https://orc.apache.org/specification/ORCv1/
     if (sampleDatum.startsWith('ORC')) {
       return inputFormatFromType('orc');
     }
-
+    // Avro OCF 4 byte magic header: https://avro.apache.org/docs/current/spec.html#Object+Container+Files
     if (sampleDatum.startsWith('Obj1')) {
       return inputFormatFromType('avro_ocf');
     }
+
+    // After checking for magic byte sequences perform heuristics to deduce string formats
+
+    // If the string starts and ends with curly braces assume JSON
+    if (sampleDatum.startsWith('{') && sampleDatum.endsWith('}')) {
+      return inputFormatFromType('json');
+    }
+    // Contains more than 3 tabs assume TSV
+    if (sampleDatum.split('\t').length > 3) {
+      return inputFormatFromType('tsv', !/\t\d+\t/.test(sampleDatum));
+    }
+    // Contains more than 3 commas assume CSV
+    if (sampleDatum.split(',').length > 3) {
+      return inputFormatFromType('csv', !/,\d+,/.test(sampleDatum));
+    }
   }
 
   return inputFormatFromType('regex');
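As an illustration, the reordered detection logic paraphrased in Python (a sketch that returns only the format name and omits the findColumnsFromHeader guess computed by the TypeScript's second argument):

```python
def guess_input_format(sample: str) -> str:
    # Magic byte headers first, since they rarely yield false positives.
    if sample.startswith('PAR1'):       # Parquet 4-byte magic
        return 'parquet'
    if sample.startswith('ORC'):        # ORC 3-byte magic
        return 'orc'
    if sample.startswith('Obj\x01'):    # Avro OCF magic: 'Obj' + byte 0x01
        return 'avro_ocf'
    # Then heuristics for string formats.
    if sample.startswith('{') and sample.endswith('}'):
        return 'json'
    if sample.count('\t') >= 3:         # same as split('\t').length > 3
        return 'tsv'
    if sample.count(',') >= 3:
        return 'csv'
    return 'regex'
```

With this ordering, a binary file whose header matches a magic sequence can no longer be mistaken for TSV or CSV just because it happens to contain tabs or commas.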





[GitHub] [druid] clintropolis merged pull request #9897: Fix web console query view crashing on simple query

2020-05-21 Thread GitBox


clintropolis merged pull request #9897:
URL: https://github.com/apache/druid/pull/9897


   






[druid] branch master updated (b91d500 -> 63baa29)

2020-05-21 Thread cwylie

cwylie pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git.


from b91d500  add some details to the build doc (#9885)
 add 63baa29  Fix web console query view crashing on simple query (#9897)

No new revisions were added by this update.

Summary of changes:
 licenses.yaml  |  2 +-
 web-console/package-lock.json  |  6 +-
 web-console/package.json   |  2 +-
 .../__snapshots__/query-view.spec.tsx.snap | 82 ++
 .../number-menu-items/number-menu-items.spec.tsx   |  8 +--
 .../string-menu-items/string-menu-items.spec.tsx   |  8 +--
 .../time-menu-items/time-menu-items.spec.tsx   |  8 +--
 .../query-view/column-tree/column-tree.spec.tsx|  6 +-
 .../query-view/query-output/query-output.spec.tsx  |  8 +--
 .../src/views/query-view/query-view.spec.tsx   |  5 ++
 web-console/src/views/query-view/query-view.tsx| 29 
 11 files changed, 120 insertions(+), 44 deletions(-)





[GitHub] [druid] vogievetsky commented on pull request #9887: Re-order and document format detection in web console

2020-05-21 Thread GitBox


vogievetsky commented on pull request #9887:
URL: https://github.com/apache/druid/pull/9887#issuecomment-632308074


   Love it. Makes total sense. Out of curiosity, was this change motivated by something you saw IRL that tripped up the current system?






[druid] branch master updated (ca5e66d -> b91d500)

2020-05-21 Thread ccaominh

ccaominh pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git.


from ca5e66d  Fixed the Copyright year of Druid (#9859)
 add b91d500  add some details to the build doc (#9885)

No new revisions were added by this update.

Summary of changes:
 dev/intellij-setup.md | 16 +++-
 docs/development/build.md |  3 +++
 website/.spelling |  4 +++-
 3 files changed, 21 insertions(+), 2 deletions(-)





[GitHub] [druid] ccaominh merged pull request #9885: add some details to the build doc

2020-05-21 Thread GitBox


ccaominh merged pull request #9885:
URL: https://github.com/apache/druid/pull/9885


   






[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428846660



##
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##
@@ -7300,6 +7301,74 @@ public void testRegexpExtract() throws Exception
 );
   }
 
+  @Test
+  public void testRegexpExtractFilterViaNotNullCheck() throws Exception
+  {
+// Cannot vectorize due to extractionFn in dimension spec.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_EXTRACT(dim1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' 
|| dim1, '^Z2') IS NOT NULL",
+ImmutableList.of(
+Druids.newTimeseriesQueryBuilder()
+  .dataSource(CalciteTests.DATASOURCE1)
+  .intervals(querySegmentSpec(Filtration.eternity()))
+  .granularity(Granularities.ALL)
+  .virtualColumns(
+  expressionVirtualColumn("v0", 
"regexp_extract(concat('Z',\"dim1\"),'^Z2')", ValueType.STRING)
+  )
+  .filters(
+  or(
+  not(selector("dim1", null, new 
RegexDimExtractionFn("^1", 0, true, null))),
+  not(selector("v0", null, null))
+  )
+  )
+  .aggregators(new CountAggregatorFactory("a0"))
+  .context(TIMESERIES_CONTEXT_DEFAULT)
+  .build()
+),
+ImmutableList.of(
+new Object[]{3L}
+)
+);
+  }
+
+  @Test
+  public void testRegexpLikeFilter() throws Exception
+  {
+// Cannot vectorize due to usage of regex filter.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_LIKE(dim1, '^1') OR REGEXP_LIKE('Z' || dim1, '^Z2')",

Review comment:
   Sounds good to me!
   
   > I could see having a systematic way to test for this (a query generator, 
maybe) but I don't think adding one just for this function would make sense.
   
   I've been working on the beginnings of a query generator to match all the 
different rows in `druid.foo` table - I was kinda sneakily hoping there was 
already a systematic way we add tests for all these different conditions and I 
just hadn't seen it yet :)








[GitHub] [druid] gianm commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

2020-05-21 Thread GitBox


gianm commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r428840570



##
File path: 
processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-// Different from index.getCapabilities because, in a way, 
IncrementalIndex's string-typed dimensions
-// are always potentially multi-valued at query time. (Missing / null 
values for a row can potentially be
-// represented by an empty array; see 
StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-//
-// We don't want to represent this as having-multiple-values in 
index.getCapabilities, because that's used
-// at index-persisting time to determine if we need a multi-value column 
or not. However, that means we
-// need to tweak the capabilities here in the StorageAdapter (a query-time 
construct), so at query time
-// they appear multi-valued.
-
-final ColumnCapabilities capabilitiesFromIndex = 
index.getCapabilities(column);
-final IncrementalIndex.DimensionDesc dimensionDesc = 
index.getDimension(column);
-if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == 
ValueType.STRING) {
-  final ColumnCapabilitiesImpl retVal = 
ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-  retVal.setHasMultipleValues(true);
-  return retVal;
-} else {
-  return capabilitiesFromIndex;
-}
+// snapshot the current state
+return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), 
false);

Review comment:
   Yes, that's what I was thinking: HMV could be set at any time, including 
between cap-fetch and selector-creation.
   
   By the way, it looks like StringDimensionIndexer's handling of 
`hasMultipleValues` is not thread-safe (it's updated by the indexing thread and 
read by querying threads without locking). It'd be good to fix that too.








[GitHub] [druid] gianm commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


gianm commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428831347



##
File path: 
processing/src/main/java/org/apache/druid/query/expression/RegexpLikeExprMacro.java
##
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.expression;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.math.expr.Expr;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExprMacroTable;
+import org.apache.druid.math.expr.ExprType;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+public class RegexpLikeExprMacro implements ExprMacroTable.ExprMacro
+{
+  private static final String FN_NAME = "regexp_like";
+
+  @Override
+  public String name()
+  {
+return FN_NAME;
+  }
+
+  @Override
+  public Expr apply(final List args)
+  {
+if (args.size() < 2 || args.size() > 3) {

Review comment:
   Good catch. I'll fix it.








[GitHub] [druid] gianm commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


gianm commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428827872



##
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##
@@ -7300,6 +7301,74 @@ public void testRegexpExtract() throws Exception
 );
   }
 
+  @Test
+  public void testRegexpExtractFilterViaNotNullCheck() throws Exception
+  {
+// Cannot vectorize due to extractionFn in dimension spec.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_EXTRACT(dim1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' 
|| dim1, '^Z2') IS NOT NULL",
+ImmutableList.of(
+Druids.newTimeseriesQueryBuilder()
+  .dataSource(CalciteTests.DATASOURCE1)
+  .intervals(querySegmentSpec(Filtration.eternity()))
+  .granularity(Granularities.ALL)
+  .virtualColumns(
+  expressionVirtualColumn("v0", 
"regexp_extract(concat('Z',\"dim1\"),'^Z2')", ValueType.STRING)
+  )
+  .filters(
+  or(
+  not(selector("dim1", null, new 
RegexDimExtractionFn("^1", 0, true, null))),
+  not(selector("v0", null, null))
+  )
+  )
+  .aggregators(new CountAggregatorFactory("a0"))
+  .context(TIMESERIES_CONTEXT_DEFAULT)
+  .build()
+),
+ImmutableList.of(
+new Object[]{3L}
+)
+);
+  }
+
+  @Test
+  public void testRegexpLikeFilter() throws Exception
+  {
+// Cannot vectorize due to usage of regex filter.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_LIKE(dim1, '^1') OR REGEXP_LIKE('Z' || dim1, '^Z2')",

Review comment:
   > a multi-value column
   
   There aren't tests for multi-value columns for this specific expression, nor 
for other functions that aren't multi-value-aware. (There is a separate system 
that handles mapping over non-multi-value-aware functions.) I could see having 
a systematic way to test for this (a query generator, maybe) but I don't think 
adding one just for this function would make sense.
   
   > a numeric column
   
   There aren't tests for a numeric column. It should be a validation error but 
we don't currently have very comprehensive tests for the validator. I could add 
one in an ad-hoc way right now, and that would make sense. But I was actually 
hoping to add more systematic tests in a future patch, so could do it then too.
   
   > matching against null
   
   I added a test for matching against null to ExpressionsTest.
   
   > The docs say that "The pattern must match starting at the beginning of 
expr" but it looks like the regex pattern you are passing in is asking that it 
start at the beginning of the string via ^ in the pattern string. Can I use a $ 
in my regex to ask that it matches the end of the expr?
   
   Wow! After looking into your comment, I realized that I totally screwed up 
the docs here. I actually have a test in this patch for what happens when you 
skip the `^`, and it _does_ match substrings that are in the middle of the 
string, and the test expects that to happen, all in contradiction of what the 
docs say. I'll fix the docs. Thanks for calling this to my attention.
   
   And yes, you can use `$` to ask that it match the end. There are tests for 
this too.
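   For readers following along, the behavior described here corresponds to `java.util.regex` with `Matcher.find()`-style semantics (an assumption based on the tests discussed above, not a reading of the Druid source): an unanchored pattern may match anywhere in the string, while `^` and `$` pin it to the start or end. A minimal illustration — the helper name `regexpLike` is ours, not Druid's:

   ```java
   import java.util.regex.Pattern;

   public class RegexAnchors
   {
     // find() looks for the pattern anywhere in the input, so any anchoring
     // must come from the pattern itself.
     static boolean regexpLike(final String expr, final String pattern)
     {
       return Pattern.compile(pattern).matcher(expr).find();
     }

     public static void main(String[] args)
     {
       System.out.println(regexpLike("druid", "rui"));   // unanchored: matches mid-string
       System.out.println(regexpLike("druid", "^rui"));  // ^ forces the match to start at the beginning
       System.out.println(regexpLike("druid", "uid$"));  // $ forces the match to reach the end
     }
   }
   ```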








[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428809485



##
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##
@@ -7300,6 +7301,74 @@ public void testRegexpExtract() throws Exception
 );
   }
 
+  @Test
+  public void testRegexpExtractFilterViaNotNullCheck() throws Exception
+  {
+// Cannot vectorize due to extractionFn in dimension spec.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_EXTRACT(dim1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' 
|| dim1, '^Z2') IS NOT NULL",
+ImmutableList.of(
+Druids.newTimeseriesQueryBuilder()
+  .dataSource(CalciteTests.DATASOURCE1)
+  .intervals(querySegmentSpec(Filtration.eternity()))
+  .granularity(Granularities.ALL)
+  .virtualColumns(
+  expressionVirtualColumn("v0", 
"regexp_extract(concat('Z',\"dim1\"),'^Z2')", ValueType.STRING)
+  )
+  .filters(
+  or(
+  not(selector("dim1", null, new 
RegexDimExtractionFn("^1", 0, true, null))),
+  not(selector("v0", null, null))
+  )
+  )
+  .aggregators(new CountAggregatorFactory("a0"))
+  .context(TIMESERIES_CONTEXT_DEFAULT)
+  .build()
+),
+ImmutableList.of(
+new Object[]{3L}
+)
+);
+  }
+
+  @Test
+  public void testRegexpLikeFilter() throws Exception
+  {
+// Cannot vectorize due to usage of regex filter.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_LIKE(dim1, '^1') OR REGEXP_LIKE('Z' || dim1, '^Z2')",

Review comment:
   The docs say that "The pattern must match starting at the beginning of 
`expr`" but it looks like the regex pattern you are passing in is asking that 
it start at the beginning of the string via `^` in the pattern string. Can I 
use a `$` in my regex to ask that it matches the end of the expr?








[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428808032



##
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##
@@ -7300,6 +7301,74 @@ public void testRegexpExtract() throws Exception
 );
   }
 
+  @Test
+  public void testRegexpExtractFilterViaNotNullCheck() throws Exception
+  {
+// Cannot vectorize due to extractionFn in dimension spec.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_EXTRACT(dim1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' 
|| dim1, '^Z2') IS NOT NULL",
+ImmutableList.of(
+Druids.newTimeseriesQueryBuilder()
+  .dataSource(CalciteTests.DATASOURCE1)
+  .intervals(querySegmentSpec(Filtration.eternity()))
+  .granularity(Granularities.ALL)
+  .virtualColumns(
+  expressionVirtualColumn("v0", 
"regexp_extract(concat('Z',\"dim1\"),'^Z2')", ValueType.STRING)
+  )
+  .filters(
+  or(
+  not(selector("dim1", null, new 
RegexDimExtractionFn("^1", 0, true, null))),
+  not(selector("v0", null, null))
+  )
+  )
+  .aggregators(new CountAggregatorFactory("a0"))
+  .context(TIMESERIES_CONTEXT_DEFAULT)
+  .build()
+),
+ImmutableList.of(
+new Object[]{3L}
+)
+);
+  }
+
+  @Test
+  public void testRegexpLikeFilter() throws Exception
+  {
+// Cannot vectorize due to usage of regex filter.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_LIKE(dim1, '^1') OR REGEXP_LIKE('Z' || dim1, '^Z2')",

Review comment:
   Do we have tests that check how the function performs against
   * a multi-value column
   * a numeric column
   * matching against null

##
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##
@@ -7300,6 +7301,74 @@ public void testRegexpExtract() throws Exception
 );
   }
 
+  @Test
+  public void testRegexpExtractFilterViaNotNullCheck() throws Exception
+  {
+// Cannot vectorize due to extractionFn in dimension spec.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_EXTRACT(dim1, '^1') IS NOT NULL OR REGEXP_EXTRACT('Z' 
|| dim1, '^Z2') IS NOT NULL",
+ImmutableList.of(
+Druids.newTimeseriesQueryBuilder()
+  .dataSource(CalciteTests.DATASOURCE1)
+  .intervals(querySegmentSpec(Filtration.eternity()))
+  .granularity(Granularities.ALL)
+  .virtualColumns(
+  expressionVirtualColumn("v0", 
"regexp_extract(concat('Z',\"dim1\"),'^Z2')", ValueType.STRING)
+  )
+  .filters(
+  or(
+  not(selector("dim1", null, new 
RegexDimExtractionFn("^1", 0, true, null))),
+  not(selector("v0", null, null))
+  )
+  )
+  .aggregators(new CountAggregatorFactory("a0"))
+  .context(TIMESERIES_CONTEXT_DEFAULT)
+  .build()
+),
+ImmutableList.of(
+new Object[]{3L}
+)
+);
+  }
+
+  @Test
+  public void testRegexpLikeFilter() throws Exception
+  {
+// Cannot vectorize due to usage of regex filter.
+cannotVectorize();
+
+testQuery(
+"SELECT COUNT(*)\n"
++ "FROM foo\n"
++ "WHERE REGEXP_LIKE(dim1, '^1') OR REGEXP_LIKE('Z' || dim1, '^Z2')",

Review comment:
   The docs say that "The pattern must match starting at the beginning of 
`expr`" but it looks like the regex pattern you are passing in is asking that 
it start at the beginning of the string. Can I use a `$` in my regex to ask 
that it matches the end of the expr?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[GitHub] [druid] suneet-s commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


suneet-s commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428792725



##
File path: 
processing/src/main/java/org/apache/druid/query/expression/RegexpLikeExprMacro.java
##
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.expression;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.java.util.common.IAE;
+import org.apache.druid.java.util.common.StringUtils;
+import org.apache.druid.math.expr.Expr;
+import org.apache.druid.math.expr.ExprEval;
+import org.apache.druid.math.expr.ExprMacroTable;
+import org.apache.druid.math.expr.ExprType;
+
+import javax.annotation.Nonnull;
+import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+public class RegexpLikeExprMacro implements ExprMacroTable.ExprMacro
+{
+  private static final String FN_NAME = "regexp_like";
+
+  @Override
+  public String name()
+  {
+return FN_NAME;
+  }
+
+  @Override
+  public Expr apply(final List args)
+  {
+if (args.size() < 2 || args.size() > 3) {

Review comment:
   what is the 3rd argument for? I only see the first 2 being used in this 
expr
   ```suggestion
   if (args.size() != 2) {
   ```











[GitHub] [druid] gianm commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


gianm commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428782343



##
File path: 
sql/src/main/java/org/apache/druid/sql/calcite/expression/OperatorConversions.java
##
@@ -291,15 +294,15 @@ public OperatorBuilder operandTypes(final 
SqlTypeFamily... operandTypes)
   return this;
 }
 
-public OperatorBuilder operandTypeInference(final SqlOperandTypeInference 
operandTypeInference)
+public OperatorBuilder requiredOperands(final int requiredOperands)
 {
-  this.operandTypeInference = operandTypeInference;
+  this.requiredOperands = requiredOperands;
   return this;
 }
 
-public OperatorBuilder requiredOperands(final int requiredOperands)
+public OperatorBuilder literalOperands(final int... literalOperands)

Review comment:
   By the way, I also renamed `returnType` to `returnTypeNonNull` to make 
it clearer.








[GitHub] [druid] gianm commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


gianm commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428782087



##
File path: 
sql/src/main/java/org/apache/druid/sql/calcite/expression/OperatorConversions.java
##
@@ -430,36 +434,64 @@ public void inferOperandTypes(
 
   /**
* Operand type checker that is used in 'simple' situations: there are a 
particular number of operands, with
-   * particular types, some of which may be optional or nullable.
+   * particular types, some of which may be optional or nullable, and some of 
which may be required to be literals.
*/
   private static class DefaultOperandTypeChecker implements 
SqlOperandTypeChecker
   {
 private final List operandTypes;
 private final int requiredOperands;
 private final IntSet nullableOperands;
+private final IntSet literalOperands;
 
 DefaultOperandTypeChecker(
 final List operandTypes,
 final int requiredOperands,
-final IntSet nullableOperands
+final IntSet nullableOperands,
+@Nullable final int[] literalOperands
 )
 {
   Preconditions.checkArgument(requiredOperands <= operandTypes.size() && 
requiredOperands >= 0);
   this.operandTypes = Preconditions.checkNotNull(operandTypes, 
"operandTypes");
   this.requiredOperands = requiredOperands;
   this.nullableOperands = Preconditions.checkNotNull(nullableOperands, 
"nullableOperands");
+
+  if (literalOperands == null) {
+this.literalOperands = IntSets.EMPTY_SET;
+  } else {
+this.literalOperands = new IntArraySet();
+Arrays.stream(literalOperands).forEach(this.literalOperands::add);
+  }
 }
 
 @Override
 public boolean checkOperandTypes(SqlCallBinding callBinding, boolean 
throwOnFailure)
 {
-  if (operandTypes.size() != callBinding.getOperandCount()) {
-// Just like FamilyOperandTypeChecker: assume this is an inapplicable 
sub-rule of a composite rule; don't throw
-return false;
-  }
-
   for (int i = 0; i < callBinding.operands().size(); i++) {
 final SqlNode operand = callBinding.operands().get(i);
+
+if (literalOperands.contains(i)) {
+  // Verify that 'operand' is a literal.
+  if (!SqlUtil.isLiteral(operand)) {
+return throwOrReturn(
+throwOnFailure,
+callBinding,
+cb -> cb.getValidator()
+.newValidationError(
+operand,
+
Static.RESOURCE.argumentMustBeLiteral(callBinding.getOperator().getName())
+)
+);
+  }
+
+  if (!nullableOperands.contains(i) && SqlUtil.isNullLiteral(operand, 
true)) {

Review comment:
   Sure, I think this makes sense. I changed it.








[GitHub] [druid] gianm commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


gianm commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428780412



##
File path: 
sql/src/main/java/org/apache/druid/sql/calcite/expression/OperatorConversions.java
##
@@ -291,15 +294,15 @@ public OperatorBuilder operandTypes(final 
SqlTypeFamily... operandTypes)
   return this;
 }
 
-public OperatorBuilder operandTypeInference(final SqlOperandTypeInference 
operandTypeInference)
+public OperatorBuilder requiredOperands(final int requiredOperands)
 {
-  this.operandTypeInference = operandTypeInference;
+  this.requiredOperands = requiredOperands;
   return this;
 }
 
-public OperatorBuilder requiredOperands(final int requiredOperands)
+public OperatorBuilder literalOperands(final int... literalOperands)

Review comment:
   I added javadocs to all these methods.








[GitHub] [druid] jihoonson commented on pull request #9905: Fix compact partially overlapping segments

2020-05-21 Thread GitBox


jihoonson commented on pull request #9905:
URL: https://github.com/apache/druid/pull/9905#issuecomment-632190086


   @yuanlihan thanks for the comment. I think now I understand your use case. 
   
   > As far as I know Druid will correctly handle this in `DruidInputSource`. 
Or do I misunderstand something?
   
   Oh, now I see what you are doing here. But I believe this issue is already handled properly by the timeline. I left a comment on https://github.com/apache/druid/issues/9904.






[GitHub] [druid] jihoonson commented on issue #9904: Fail to compact overlapping segments

2020-05-21 Thread GitBox


jihoonson commented on issue #9904:
URL: https://github.com/apache/druid/issues/9904#issuecomment-632189936


   Hi @yuanlihan, I believe the timeline is already handling the overlapping 
segments properly. Looking at the logs above, the issue seems like this:
   
   ```
   org.apache.druid.java.util.common.UOE: Cannot add overlapping segments 
[2020-05-15T05:00:00.000Z/2020-05-15T06:00:00.000Z and 
2020-05-15T00:00:00.000Z/2020-05-16T00:00:00.000Z] with the same version 
[2020-05-17T12:14:26.722Z]
   ```
   
   This log means you have two segments with overlapping intervals and the _same_ version. This is never allowed, since Druid wouldn't know which segment to read in that case. Druid should never create segments with overlapping intervals but the same version. Can you check why and how they were created? You may need to check which task created those segments in the task logs and overlord logs.
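   As a hedged illustration of the invariant being described — not Druid's actual timeline code; intervals are simplified here to hour-of-day longs — two segments conflict exactly when their intervals overlap *and* they carry the same version:

   ```java
   // Sketch of the invariant from the log above: the hour segment 05:00-06:00
   // sits inside the day segment 00:00-24:00, so sharing a version is illegal.
   class SegmentCheck
   {
     // Half-open intervals [start, end) overlap iff each starts before the other ends.
     static boolean overlaps(long aStart, long aEnd, long bStart, long bEnd)
     {
       return aStart < bEnd && bStart < aEnd;
     }

     static boolean conflicts(long aStart, long aEnd, String aVersion,
                              long bStart, long bEnd, String bVersion)
     {
       return overlaps(aStart, aEnd, bStart, bEnd) && aVersion.equals(bVersion);
     }

     public static void main(String[] args)
     {
       // Same version + overlapping intervals: the disallowed case from the log.
       System.out.println(conflicts(5, 6, "v1", 0, 24, "v1"));
       // Different versions are fine: the newer version shadows the older one.
       System.out.println(conflicts(5, 6, "v1", 0, 24, "v2"));
     }
   }
   ```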






[GitHub] [druid] gianm commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


gianm commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428758597



##
File path: docs/querying/sql.md
##
@@ -322,17 +322,18 @@ String functions accept strings, and return a type 
appropriate to the function.
 |`LOWER(expr)`|Returns expr in all lowercase.|
 |`PARSE_LONG(string[, radix])`|Parses a string into a long (BIGINT) with the 
given radix, or 10 (decimal) if a radix is not provided.|
 |`POSITION(needle IN haystack [FROM fromIndex])`|Returns the index of needle 
within haystack, with indexes starting from 1. The search will begin at 
fromIndex, or 1 if fromIndex is not specified. If the needle is not found, 
returns 0.|
-|`REGEXP_EXTRACT(expr, pattern, [index])`|Apply regular expression pattern and 
extract a capture group, or null if there is no match. If index is unspecified 
or zero, returns the substring that matched the pattern.|
+|`REGEXP_EXTRACT(expr, pattern, [index])`|Apply regular expression `pattern` 
to `expr` and extract a capture group, or `NULL` if there is no match. If index 
is unspecified or zero, returns the substring that matched the pattern. The 
pattern must match starting at the beginning of `expr`, but does not need to 
match the entire string. Note: when `druid.generic.useDefaultValueForNull = 
true`, it is not possible to differentiate an empty-string match from a 
non-match (both will return `NULL`).|
+|`REGEXP_LIKE(expr, pattern)`|Returns whether `expr` matches regular 
expression `pattern`. The pattern must match starting at the beginning of 
`expr`, but does not need to match the entire string. Especially useful in 
WHERE clauses.|

Review comment:
   Added both notes.








[GitHub] [druid] gianm commented on a change in pull request #9893: Add REGEXP_LIKE, fix bugs in REGEXP_EXTRACT.

2020-05-21 Thread GitBox


gianm commented on a change in pull request #9893:
URL: https://github.com/apache/druid/pull/9893#discussion_r428757383



##
File path: 
sql/src/main/java/org/apache/druid/sql/calcite/expression/OperatorConversions.java
##
@@ -291,15 +294,15 @@ public OperatorBuilder operandTypes(final 
SqlTypeFamily... operandTypes)
   return this;
 }
 
-public OperatorBuilder operandTypeInference(final SqlOperandTypeInference 
operandTypeInference)

Review comment:
   Ah, yeah, I removed it because it wasn't being used; I suppose we could 
add it back if needed in the future.








[GitHub] [druid] vvararu opened a new issue #9908: GroupBy returns multiple records for the same group after intermediate flush to disk when querying real-time with a multi-value dimension involved

2020-05-21 Thread GitBox


vvararu opened a new issue #9908:
URL: https://github.com/apache/druid/issues/9908


   ### Affected Version
   
   Druid 0.17.0
   
   ### Description
   
    Dependencies:
   - Kafka 2.12_1.0.1. Topic with 1 partition, no replication.
   
    Steps to reproduce:
   - Start a Kafka supervisor for realtime ingestion with the following spec:
   ```
   {
 "type": "kafka",
 "dataSchema": {
   "dataSource": "test",
   "parser": {
 "type": "string",
 "parseSpec": {
   "format": "json",
   "timestampSpec": {
 "column": "timestamp",
 "format": "auto"
   },
   "dimensionsSpec": {
 "dimensions": [],
 "dimensionExclusions": [
   "test"
 ],
 "spatialDimensions": []
   }
 }
   },
   "metricsSpec": [
 {
   "type": "count",
   "name": "events"
 }
   ],
   "granularitySpec": {
 "type": "uniform",
 "segmentGranularity": "hour",
 "queryGranularity": "hour"
   }
 },
 "tuningConfig": {
   "type": "kafka",
   "maxRowsInMemory": 4,
   "maxRowsPerSegment": 500,
   "intermediatePersistPeriod": "PT1M",
   "maxPendingPersists": 0,
   "handoffConditionTimeout": 0,
   "resetOffsetAutomatically": true,
   "chatRetries": 20,
   "httpTimeout": "PT60S",
   "shutdownTimeout": "PT180S"
 },
 "ioConfig": {
   "topic": "test",
   "consumerProperties": {
 "bootstrap.servers": "bootstrap-server:9092",
 "group.id": "test",
 "auto.offset.reset": "latest",
 "max.partition.fetch.bytes": 400
   },
   "replicas": "1",
   "taskCount": "1",
   "taskDuration": "PT3600S",
   "startDelay": "PT5S",
   "period": "PT30S",
   "useEarliestOffset": false,
   "completionTimeout": "PT30M"
 }
   }
   ```
   - Push the following messages to Kafka for real-time ingestion, sending them in bunches of 1,000 (or 10,000; both work) messages each, with a small pause between bunches. I've used a timestamp from the future, but any timestamp should work.
   ```
   {"multiValueDimension":["1"],"zoneId":"1","timestamp":2524608061000}
   {"zoneId":"1","timestamp":2524608061000} // observe the lack of the 
multiValueDimension
   {"zoneId":"2","timestamp":2524608061000} // observe the lack of the 
multiValueDimension
   ```
   - Do the following groupBy query
   ```
   {
"queryType": "groupBy",
"dataSource": "test",
"intervals": "2050-01-01T00:00:00.000Z/2050-01-01T01:00:00.000Z",
"dimensions": ["multiValueDimension","zoneId"],
"aggregations": [
{
"type": "longSum",
"name": "events",
"fieldName": "events"
}
],
"limit" : 2147483647,
"granularity": "all"
   }
   ```
   - At the beginning the results are correct...
   ```
   [
 {
   "version": "v1",
   "timestamp": "2050-01-01T00:00:00.000Z",
   "event": {
 "zoneId": "1",
 "events": 5000
   }
 },
 {
   "version": "v1",
   "timestamp": "2050-01-01T00:00:00.000Z",
   "event": {
 "zoneId": "2",
 "events": 4990
   }
 },
 {
   "version": "v1",
   "timestamp": "2050-01-01T00:00:00.000Z",
   "event": {
 "multiValueDimension": "1",
 "zoneId": "1",
 "events": 16000
   }
 }
   ]
   ```
   - ... but after the next intermediate persistence flush to disk...
   ```
   2020-05-21T13:58:23,854 INFO 
[[index_kafka_test_f4d3b0f570f9224_jhlpfjga]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed 
in-memory data for 
segment[test_2050-01-01T00:00:00.000Z_2050-01-01T01:00:00.000Z_2020-05-21T13:39:14.308Z]
 spill[2] to disk in [7] ms (3 rows).
   2020-05-21T13:58:23,857 INFO 
[[index_kafka_test_f4d3b0f570f9224_jhlpfjga]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Flushed 
in-memory data with commit metadata 
[AppenderatorDriverMetadata{segments={index_kafka_test_f4d3b0f570f9224_0=[SegmentWithState{segmentIdentifier=test_2050-01-01T00:00:00.000Z_2050-01-01T01:00:00.000Z_2020-05-21T13:39:14.308Z,
 state=APPENDING}]}, 
lastSegmentIds={index_kafka_test_f4d3b0f570f9224_0=test_2050-01-01T00:00:00.000Z_2050-01-01T01:00:00.000Z_2020-05-21T13:39:14.308Z},
 callerMetadata={nextPartitions=SeekableStreamEndSequenceNumbers{stream='test', 
partitionSequenceNumberMap={0=58000] for segments: 
test_2050-01-01T00:00:00.000Z_2050-01-01T01:00:00.000Z_2020-05-21T13:39:14.308Z
   ```
   - ... the same groupBy starts to return multiple records for the same groups...
   ```
   [
 {
   "version": "v1",
   "timestamp": "2050-01-01T00:00:00.000Z",
   "event": {
 "zoneId": "1",
 "events": 7000
  

[GitHub] [druid] liujianhuanzz opened a new pull request #9907: fix docs error in hadoop-based part

2020-05-21 Thread GitBox


liujianhuanzz opened a new pull request #9907:
URL: https://github.com/apache/druid/pull/9907


   ### Description
   
   There are several errors in the documentation:
   
   * the type of the tuningConfig's `indexSpecForIntermediatePersists` is missing
   * the table rows for the tuningConfig's `logParseExceptions` and `maxParseExceptions` are formatted incorrectly
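For reference, the fields this PR's docs fix touches might look like the following snippet (values here are illustrative only; consult the Hadoop-based ingestion tuningConfig documentation for the authoritative field list and defaults):

```json
"tuningConfig": {
  "type": "hadoop",
  "indexSpecForIntermediatePersists": {
    "dimensionCompression": "uncompressed",
    "metricCompression": "none"
  },
  "logParseExceptions": true,
  "maxParseExceptions": 100
}
```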



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org



[GitHub] [druid] yuanlihan commented on issue #9889: Add new configuration for kill task to skip segments that has been last modified within the config value time

2020-05-21 Thread GitBox


yuanlihan commented on issue #9889:
URL: https://github.com/apache/druid/issues/9889#issuecomment-631935055


   @maytasm thanks for the explanation.
   It makes sense to me. This configuration will go a long way toward reducing the impact of operator mistakes.






[GitHub] [druid] maytasm commented on issue #9889: Add new configuration for kill task to skip segments that has been last modified within the config value time

2020-05-21 Thread GitBox


maytasm commented on issue #9889:
URL: https://github.com/apache/druid/issues/9889#issuecomment-631919882


   @yuanlihan 
   I am thinking that a segment will have to satisfy both conditions to be eligible for the kill task. The new `druid.coordinator.kill.fromLastModified` offers additional safety to prevent accidentally killing segments. If users do not want `druid.coordinator.kill.fromLastModified` and prefer the existing behavior, they can simply set `druid.coordinator.kill.fromLastModified` to 0. Basically, this ignores `druid.coordinator.kill.fromLastModified` and only requires a segment to satisfy `druid.coordinator.kill.durationToRetain`.
   The benefit of `druid.coordinator.kill.fromLastModified` is for old segments that might have been dropped accidentally by incorrectly configured rules (rules are usually set for intervals spanning many months). For example, you might have a dropForever rule before a load rule (instead of the other way around). Those segments may be older than `druid.coordinator.kill.durationToRetain` but will not be killed, thanks to the `druid.coordinator.kill.fromLastModified` safety net.
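   The two-condition check described above can be sketched as follows (a minimal illustration under stated assumptions, not Druid's actual implementation; the function name and signature are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def eligible_for_kill(segment_interval_end: datetime,
                      last_modified: datetime,
                      duration_to_retain: timedelta,
                      from_last_modified: timedelta,
                      now: datetime) -> bool:
    """Hypothetical sketch: a segment is only kill-eligible if it satisfies
    BOTH conditions -- its interval is older than durationToRetain AND its
    metadata was not modified within fromLastModified."""
    old_enough = segment_interval_end < now - duration_to_retain
    not_recently_modified = last_modified < now - from_last_modified
    return old_enough and not_recently_modified
```

   Setting `from_last_modified` to zero reduces the check to the existing `durationToRetain`-only behavior, since any past modification time then passes the second condition.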






[GitHub] [druid] xvrl commented on pull request #9902: update kafka client version to 2.5.0

2020-05-21 Thread GitBox


xvrl commented on pull request #9902:
URL: https://github.com/apache/druid/pull/9902#issuecomment-631902458


   @clintropolis yep, I realized the rabbit hole was a little deeper. I think I've fixed it now.


