Chanh wants to return user_ids that don't have any record with a url containing "sell". Without a subquery/join, the filter can only be applied per record, without knowing anything about the rest of that user_id's records.
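One way to express that whole-user condition without NOT IN (and without the cross-join plan it can trigger) is a left anti join. A minimal sketch against Chanh's sample data, assuming a local Spark session (the master/app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("no-sell-users").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "lao.com/buy"),
  (2, "bao.com/sell"),
  (2, "cao.com/market"),
  (1, "lao.com/sell"),
  (3, "vui.com/sell")
).toDF("user_id", "url")

// user_ids that have at least one url containing "sell"
val sellers = df.filter($"url".contains("sell")).select("user_id").distinct()

// left anti join keeps only rows whose user_id has NO matching row in `sellers`
val noSell = df.join(sellers, Seq("user_id"), "left_anti").select("user_id").distinct()

noSell.show()
// For this sample every user_id has some "sell" url, so the result is empty --
// unlike the per-row "not like '%sell%'" filter, which would keep rows for users 1 and 2.
```

The anti join avoids materializing the "sellers" set on the driver and lets Spark plan it as a distributed join rather than a NOT IN subquery.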
Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp
StartApp <http://www.startapp.com/>

From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le <giaosu...@gmail.com>; user @spark <user@spark.apache.org>
Subject: Re: How to query a query with not contain, not start_with, not end_with condition effective?

Not sure if I misunderstand your question, but what's wrong with doing it this way?

scala> spark.version
res6: String = 2.0.2

scala> val df = Seq((1, "lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala> df.registerTempTable("data")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select user_id from data where url not like '%sell%'").show
+-------+
|user_id|
+-------+
|      2|
+-------+

Yong

________________________________
From: Chanh Le <giaosu...@gmail.com>
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with condition effective?

Hi everyone,

I am working on a dataset like this:

user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_ids whose urls do not contain "sell". That means I need to query all user_ids that do contain "sell", put them into a set, and then run another query to find all user_ids not in that set:

SELECT user_id FROM data
WHERE user_id NOT IN (
  SELECT user_id FROM data WHERE url LIKE '%sell%'
);

My data is about 20 million records and it's growing. When I tried this in Zeppelin I had to set spark.sql.crossJoin.enabled = true. Then I ran the query, the driver hit extremely high CPU usage, and the process got stuck, so I had to kill it. I am running in client mode, submitting to a Mesos cluster.

I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.

Any advice for this situation? Thank you in advance!

Regards,
Chanh
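The NOT IN subquery above can also be eliminated entirely with a single-pass conditional aggregation per user_id, which needs one shuffle and no join at all. A sketch, assuming the same sample data registered as a temp view (createOrReplaceTempView is the non-deprecated form available in Spark 2.x):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("no-sell-agg").getOrCreate()
import spark.implicits._

Seq(
  (1, "lao.com/buy"),
  (2, "bao.com/sell"),
  (2, "cao.com/market"),
  (1, "lao.com/sell"),
  (3, "vui.com/sell")
).toDF("user_id", "url").createOrReplaceTempView("data")

// A user_id qualifies only if its MAX over the "has sell" flag is 0,
// i.e. none of its urls contains "sell".
val noSell = spark.sql("""
  SELECT user_id
  FROM data
  GROUP BY user_id
  HAVING MAX(CASE WHEN url LIKE '%sell%' THEN 1 ELSE 0 END) = 0
""")

noSell.show()
// Empty for this sample, since every user_id has at least one "sell" url.
```

Because this is a plain GROUP BY, it should not require spark.sql.crossJoin.enabled and scales with a normal aggregation shuffle over the 20M rows.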