John,

I realized I had to make a modification for your query to work, so I updated the GitHub project.

select count(1) from view_mydata where srcday = '2016-02-05' and contains(domain_name, '\\.com$');

will work now (just redeploy the jars).
I will also try to make this form work:

select count(1) from view_mydata where srcday = '2016-02-05' and contains(domain_name, '\.com$');

I will keep you posted on the new version.

2016-02-09 19:22 GMT+01:00 Nicolas Paris <[email protected]>:

> John,
>
> About the escape, I will explore that question.
> About your query, you may try this pattern:
> select count(1) from view_mydata where srcday = '2016-02-05' and
> contains(domain_name, '.*\\.com$');
>
> 2016-02-09 17:19 GMT+01:00 John Omernik <[email protected]>:
>
>> I copied both files and it appears to work, but after some testing, I am
>> getting inconsistent results, see below. I ran three queries. First, a LIKE
>> looking for domain names that end in .com (domain_name like '%.com'), which
>> returned a count of 9.8 million. Then I tried contains with '\.com$',
>> which means "ends in dot com"; this failed (this goes to my earlier comments
>> about really wishing we did not do double escaping as normal... for users,
>> double escaping is NOT normal, thus doing that to meet Java's issues is
>> hard... not sure how to handle it, it may be a tough issue, but it really
>> seems like something worth exploring).
>>
>> I then did contains(domain_name, '\\.com$'). This took quite a bit longer
>> and returned 0, so I am not really sure how the function is working at
>> this point. Thoughts?
>>
>> John
>>
>> > select count(1) from view_mydata where srcday = '2016-02-05' and
>> domain_name like '%.com';
>> +----------+
>> | EXPR$0   |
>> +----------+
>> | 9810609  |
>> +----------+
>> 1 row selected (123.673 seconds)
>>
>> > select count(1) from view_mydata where srcday = '2016-02-05' and
>> contains(domain_name, '\.com$');
>> Error: SYSTEM ERROR: ExpressionParsingException: Expression has syntax error!
>> line 1:79:mismatched input '<EOF>' expecting CParen
>>
>> Fragment 1:13
>>
>> [Error Id: 8e46bed4-f9ba-444f-a3aa-2f57db5ae34f on node3:31010]
>> (state=,code=0)
>>
>> > select count(1) from view_mydata where srcday = '2016-02-05' and
>> contains(domain_name, '\\.com$');
>> +---------+
>> | EXPR$0  |
>> +---------+
>> | 0       |
>> +---------+
>> 1 row selected (201.391 seconds)
>>
>> On Tue, Feb 9, 2016 at 9:34 AM, Nicolas Paris <[email protected]> wrote:
>>
>> > Hi John,
>> >
>> > There are actually two jars to put in the folder (generated in /target).
>> > Have you restarted Drill afterwards?
>> >
>> > 2016-02-09 16:20 GMT+01:00 John Omernik <[email protected]>:
>> >
>> > > Nicolas, not really sure what's happening here. It compiled fine, but when
>> > > I run it I get this error. The jar is distributed to my bits, I validated
>> > > that... it's in the DRILL_HOME/jars/3rdparty folder on every bit... do I
>> > > need to do something more than that?
>> > >
>> > > select count(1) from view_myview where srcday = '2016-02-05' and
>> > > contains(domain_name, 'com');
>> > > Error: SYSTEM ERROR: IllegalArgumentException: resource
>> > > /org/apache/drill/contrib/function/SimpleContains.java relative to
>> > > org.apache.drill.contrib.function.SimpleContains not found.
>> > >
>> > > Fragment 1:44
>> > >
>> > > [Error Id: 30c11047-9896-4e16-a207-e3cce79c9db5 on node1:31010]
>> > >
>> > > (java.lang.IllegalArgumentException) resource
>> > > /org/apache/drill/contrib/function/SimpleContains.java relative to
>> > > org.apache.drill.contrib.function.SimpleContains not found.
>> > >     com.google.common.base.Preconditions.checkArgument():119
>> > >     com.google.common.io.Resources.getResource():203
>> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.get():127
>> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.checkInit():99
>> > >     org.apache.drill.exec.expr.fn.FunctionInitializer.getMethod():81
>> > >     org.apache.drill.exec.expr.fn.DrillFuncHolder.meth():94
>> > >     org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder.setupBody():50
>> > >     org.apache.drill.exec.expr.fn.DrillSimpleFuncHolder.renderEnd():80
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitFunctionHolderExpression():203
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitFunctionHolderExpression():1078
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitFunctionHolderExpression():816
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitFunctionHolderExpression():796
>> > >     org.apache.drill.common.expression.FunctionHolderExpression.accept():47
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitBooleanAnd():690
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitBooleanOperator():172
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitBooleanOperator():1092
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitBooleanOperator():836
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitBooleanOperator():796
>> > >     org.apache.drill.common.expression.BooleanOperator.accept():36
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitReturnValueExpression():551
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$EvalVisitor.visitUnknown():344
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$ConstantFilter.visitUnknown():1328
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitUnknown():1027
>> > >     org.apache.drill.exec.expr.EvaluationVisitor$CSEFilter.visitUnknown():796
>> > >     org.apache.drill.exec.physical.impl.filter.ReturnValueExpression.accept():56
>> > >     org.apache.drill.exec.expr.EvaluationVisitor.addExpr():105
>> > >     org.apache.drill.exec.expr.ClassGenerator.addExpr():227
>> > >     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.generateSV2Filterer():187
>> > >     org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.setupNewSchema():109
>> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():78
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> > >     org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> > >     org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
>> > >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():132
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():162
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():119
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():109
>> > >     org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.buildSchema():100
>> > >     org.apache.drill.exec.record.AbstractRecordBatch.next():142
>> > >     org.apache.drill.exec.physical.impl.BaseRootExec.next():104
>> > >     org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext():93
>> > >     org.apache.drill.exec.physical.impl.BaseRootExec.next():94
>> > >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():256
>> > >     org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():250
>> > >     java.security.AccessController.doPrivileged():-2
>> > >     javax.security.auth.Subject.doAs():415
>> > >     org.apache.hadoop.security.UserGroupInformation.doAs():1595
>> > >     org.apache.drill.exec.work.fragment.FragmentExecutor.run():250
>> > >     org.apache.drill.common.SelfCleaningRunnable.run():38
>> > >     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>> > >     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>> > >     java.lang.Thread.run():745 (state=,code=0)
>> > >
>> > > On Fri, Feb 5, 2016 at 2:39 AM, Nicolas Paris <[email protected]> wrote:
>> > >
>> > > > John,
>> > > >
>> > > > Sorry for that, this already works as expected.
>> > > > Give it a try, it is so easy to deploy.
>> > > >
>> > > > SELECT first_name FROM cp.`employee.json` WHERE contains(first_name,'\w+')
>> > > > LIMIT 5;
>> > > > first_name |
>> > > > -----------|
>> > > > Sheri      |
>> > > > Derrick    |
>> > > > Michael    |
>> > > > Maya       |
>> > > > Roberta    |
>> > > >
>> > > > 2016-02-04 20:41 GMT+01:00 John Omernik <[email protected]>:
>> > > >
>> > > > > Ya, do you see where I am coming from here? Let's let the users submit
>> > > > > regex in the pure form if possible, and code the nuances of Java regex
>> > > > > behind the scenes. I think it would be a great way to make Drill very
>> > > > > accessible and desirable.
>> > > > > I think what happened in Hive is that the regex commands started out
>> > > > > with users having to escape, and now there are just too many things
>> > > > > using the escaped regex, so the project doesn't want to adjust.
>> > > > >
>> > > > > On Thu, Feb 4, 2016 at 1:38 PM, Nicolas Paris <[email protected]> wrote:
>> > > > >
>> > > > > > You mean:
>> > > > > > userRegex=>javaRegex
>> > > > > > "\d" => "\\d"
>> > > > > > "\w" => "\\w"
>> > > > > > "\n" => "\n"
>> > > > > > I can do that thanks to regex, I guess.
>> > > > > > I will give it a try.
>> > > > > >
>> > > > > > 2016-02-04 19:37 GMT+01:00 John Omernik <[email protected]>:
>> > > > > >
>> > > > > > > So my question on the double escape: is there no way to handle that so
>> > > > > > > the user can use single-escaped regex? I know many folks who use big
>> > > > > > > data platforms to test large complex regexes for things like security
>> > > > > > > appliances, and having to convert the regex seems like a lot of work if
>> > > > > > > you consider that every user has to do that. If there was a way to do it
>> > > > > > > in Drill, that would save countless people hours and save many mistakes.
>> > > > > > >
>> > > > > > > On Thu, Feb 4, 2016 at 12:03 PM, Nicolas Paris <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > John, Jason,
>> > > > > > > >
>> > > > > > > > 2016-02-04 18:47 GMT+01:00 John Omernik <[email protected]>:
>> > > > > > > >
>> > > > > > > > > I'd be curious how you are implementing the regex... using Java's
>> > > > > > > > > regex libraries? etc.
>> > > > > > > > >
>> > > > > > > > Yeah, I use java.util.regex.
>> > > > > > > >
>> > > > > > > > > I know one thing with Hive that always bothered me was the need to
>> > > > > > > > > double escape things.
>> > > > > > > > >
>> > > > > > > > > '\d\d\d\d-\d\d-\d\d' needed to be '\\d\\d\\d\\d-\\d\\d-\\d\\d'. If we
>> > > > > > > > > can avoid that it would be AWESOME.
>> > > > > > > > >
>> > > > > > > > My guess is this comes from Java's way of handling strings. All
>> > > > > > > > languages I have used need to double escape.
>> > > > > > > >
>> > > > > > > > > On Thu, Feb 4, 2016 at 11:37 AM, Jason Altekruse <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > The code is here: https://github.com/parisni/drill-simple-contains
>> > > > > > > > It's disturbing how simple it is...
>> > > > > > > >
>> > > > > > > > > > I think you should actually just put the function in Drill itself.
>> > > > > > > > > > System native functions are implemented in the same interface as
>> > > > > > > > > > UDFs, because our mechanism for evaluating them is very efficient
>> > > > > > > > > > (we code generate code blocks by linking together the bodies of the
>> > > > > > > > > > individual functions to evaluate a complete expression).
>> > > > > > > > >
>> > > > > > > > Well, the folder tree is quite impressive (https://github.com/apache/drill).
>> > > > > > > >
>> > > > > > > > What folder is supposed to be "Drill itself"?
>> > > > > > > >
>> > > > > > > > > > You can open a JIRA, marking it a feature request. You can open a
>> > > > > > > > > > pull request against the Apache GitHub repo, making sure you follow
>> > > > > > > > > > the standard format for your commit message, prefixing it with the
>> > > > > > > > > > JIRA number in the format:
>> > > > > > > > > > Example:
>> > > > > > > > > > DRILL-XXXX: Feature description
>> > > > > > > > > >
>> > > > > > > > > > This will automatically link the PR to your JIRA.
>> > > > > > > > >
>> > > > > > > > Ok, I will try. Thanks a lot.
>> > > > > > > >
>> > > > > > > > > > - Jason
>> > > > > > > > > >
>> > > > > > > > > > On Thu, Feb 4, 2016 at 8:44 AM, Nicolas Paris <[email protected]> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Jason, I have it working.
>> > > > > > > > > > >
>> > > > > > > > > > > Just tell me how to proceed with the PR.
>> > > > > > > > > > > 1. Where do I put my Maven project? Which folder in my Drill
>> > > > > > > > > > > GitHub fork?
>> > > > > > > > > > > 2. Do I need a JIRA? How do I proceed?
>> > > > > > > > > > >
>> > > > > > > > > > > For now, I have only published it on my GitHub account as a
>> > > > > > > > > > > separate project.
>> > > > > > > > > > >
>> > > > > > > > > > > Thanks
>> > > > > > > > > > >
>> > > > > > > > > > > 2016-02-04 16:52 GMT+01:00 Jason Altekruse <[email protected]>:
>> > > > > > > > > > >
>> > > > > > > > > > > > Awesome, thanks!
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Thu, Feb 4, 2016 at 7:44 AM, Nicolas Paris <[email protected]> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Well, I am creating a UDF.
>> > > > > > > > > > > > > Good exercise.
>> > > > > > > > > > > > > I hope to have a PR soon.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > 2016-02-04 16:37 GMT+01:00 Jason Altekruse <[email protected]>:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > I didn't realize that we were lacking this functionality. As the
>> > > > > > > > > > > > > > repeated_contains operator handles wildcards, it makes sense to
>> > > > > > > > > > > > > > add such a function to Drill.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > It should be simple to implement; would someone like to open a
>> > > > > > > > > > > > > > JIRA and submit a PR for this?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > - Jason
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Tue, Feb 2, 2016 at 8:56 AM, John Omernik <[email protected]> wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > I would like to see something like this as well, even if it's an
>> > > > > > > > > > > > > > > included UDF like REGEX(field, pattern) using Java's library for
>> > > > > > > > > > > > > > > regex like Hive does. That would be EXTREMELY helpful.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Tue, Feb 2, 2016 at 6:55 AM, Nicolas Paris <[email protected]> wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > ANSI SQL doesn't define a regex operator.
>> > > > > > > > > > > > > > > > > Neither does Drill.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Drill has SQL function extensions like "REPEATED_CONTAINS" that
>> > > > > > > > > > > > > > > > appear to handle regex. Could a regex operator be provided as a
>> > > > > > > > > > > > > > > > new SQL extension?
>> > > > > > > > > > > > > > > > I guess I could create my own functions in Java, right? Maybe
>> > > > > > > > > > > > > > > > push it to GitHub then?
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Isn't the 'LIKE' operator enough?
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Sadly not, I am looking for complex pattern matching.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > Miura, Masahide
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > -----Original Message-----
>> > > > > > > > > > > > > > > > > From: Nicolas Paris [mailto:[email protected]]
>> > > > > > > > > > > > > > > > > Sent: Tuesday, February 02, 2016 9:04 PM
>> > > > > > > > > > > > > > > > > To: [email protected]
>> > > > > > > > > > > > > > > > > Subject: REGEX search Operator
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Hello,
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I can't find any reference in the documentation about a regex
>> > > > > > > > > > > > > > > > > operator.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I would like to be able to query this way:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > SELECT *
>> > > > > > > > > > > > > > > > > FROM xxx
>> > > > > > > > > > > > > > > > > WHERE text_field regexOperator 'regex_pattern';
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Thanks for helping,
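
For anyone following the escaping discussion in this thread, here is a minimal Java sketch of the two java.util.regex behaviours that were mentioned. It is not the code of the contains() UDF (the thread only says that it uses java.util.regex); the class name RegexEscapeSketch and the methods containsMatch and fullMatch are hypothetical names chosen for illustration. It shows that the Java source literal "\\.com$" is the six-character regex \.com$, and that Matcher.find() looks for the pattern anywhere in the input while Matcher.matches() requires the pattern to cover the whole input. That difference may be one reason a pattern such as '.*\\.com$' behaves differently from '\\.com$', but that is an assumption, not something confirmed in the thread.

```java
import java.util.regex.Pattern;

public class RegexEscapeSketch {

    // Hypothetical helper: true if the regex occurs anywhere in the input.
    public static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Hypothetical helper: true only if the regex covers the entire input.
    public static boolean fullMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // In Java source, the literal "\\.com$" holds the regex \.com$
        // (backslash, dot, c, o, m, dollar): the double backslash is
        // consumed by the Java compiler, not by the regex engine.
        String endsInDotCom = "\\.com$";
        System.out.println(endsInDotCom.length());                       // 6

        // find() succeeds because the suffix ".com" is present.
        System.out.println(containsMatch("example.com", endsInDotCom));  // true

        // matches() fails because the pattern does not cover "example".
        System.out.println(fullMatch("example.com", endsInDotCom));      // false

        // Prefixing ".*" lets matches() cover the whole input.
        System.out.println(fullMatch("example.com", ".*\\.com$"));       // true

        // The escaped dot is literal, so "examplecom" is not a hit.
        System.out.println(containsMatch("examplecom", endsInDotCom));   // false
    }
}
```

The double backslash in this sketch is purely a Java source-code matter. How a SQL string literal such as '\.com$' reaches the function depends on Drill's own parser, which this sketch does not model.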
