Chanh wants to return user_ids that don't have any record with a url containing "sell". Without a subquery/join, the filter can only be applied per record, without knowing anything about the rest of that user_id's records.
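One way to express that whole-user condition without NOT IN (and without the cross-join plan it can trigger) is a left anti join. A minimal sketch against Chanh's sample data, assuming a local Spark session (the master/app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("no-sell-users").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "lao.com/buy"),
  (2, "bao.com/sell"),
  (2, "cao.com/market"),
  (1, "lao.com/sell"),
  (3, "vui.com/sell")
).toDF("user_id", "url")

// user_ids that have at least one url containing "sell"
val sellers = df.filter($"url".contains("sell")).select("user_id").distinct()

// left anti join keeps only rows whose user_id has NO matching row in `sellers`
val noSell = df.join(sellers, Seq("user_id"), "left_anti").select("user_id").distinct()

noSell.show()
// For this sample every user_id has some "sell" url, so the result is empty --
// unlike the per-row "not like '%sell%'" filter, which would keep rows for users 1 and 2.
```

The anti join avoids materializing the "sellers" set on the driver and lets Spark plan it as a distributed join rather than a NOT IN subquery.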
Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp
StartApp <http://www.startapp.com/>

From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le <giaosu...@gmail.com>; user @spark <user@spark.apache.org>
Subject: Re: How to query a query with not contain, not start_with, not end_with condition effective?

Not sure if I misunderstand your question, but what's wrong with doing it this way?

scala> spark.version
res6: String = 2.0.2

scala> val df = Seq((1, "lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala> df.registerTempTable("data")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select user_id from data where url not like '%sell%'").show
+-------+
|user_id|
+-------+
|      2|
+-------+

Yong

________________________________
From: Chanh Le <giaosu...@gmail.com>
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with condition effective?

Hi everyone,

I am working on a dataset like this:

user_id  url
1        lao.com/buy
2        bao.com/sell
2        cao.com/market
1        lao.com/sell
3        vui.com/sell

I have to find all user_ids whose urls do not contain "sell". That means I need to query all user_ids that do contain "sell", put them into a set, and then run another query to find all user_ids not in that set:

SELECT user_id FROM data
WHERE user_id NOT IN (
  SELECT user_id FROM data WHERE url LIKE '%sell%'
);

My data is about 20 million records and it's growing. When I tried this in Zeppelin I had to set spark.sql.crossJoin.enabled = true. Then I ran the query, the driver hit extremely high CPU usage, and the process got stuck, so I had to kill it. I am running in client mode, submitting to a Mesos cluster.

I am using Spark 2.0.2 and my data is stored in HDFS in Parquet format.

Any advice for this situation? Thank you in advance!

Regards,
Chanh
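The NOT IN subquery above can also be eliminated entirely with a single-pass conditional aggregation per user_id, which needs one shuffle and no join at all. A sketch, assuming the same sample data registered as a temp view (createOrReplaceTempView is the non-deprecated form available in Spark 2.x):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("no-sell-agg").getOrCreate()
import spark.implicits._

Seq(
  (1, "lao.com/buy"),
  (2, "bao.com/sell"),
  (2, "cao.com/market"),
  (1, "lao.com/sell"),
  (3, "vui.com/sell")
).toDF("user_id", "url").createOrReplaceTempView("data")

// A user_id qualifies only if its MAX over the "has sell" flag is 0,
// i.e. none of its urls contains "sell".
val noSell = spark.sql("""
  SELECT user_id
  FROM data
  GROUP BY user_id
  HAVING MAX(CASE WHEN url LIKE '%sell%' THEN 1 ELSE 0 END) = 0
""")

noSell.show()
// Empty for this sample, since every user_id has at least one "sell" url.
```

Because this is a plain GROUP BY, it should not require spark.sql.crossJoin.enabled and scales with a normal aggregation shuffle over the 20M rows.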