Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
Thank you YZ,
Now I understand why it causes high CPU usage on driver side.

Thank you Ayan,
> First thing i would do is to add distinct, both inner and outer queries

I believe that would reduce number of record to join.

Regards,
Chanh

Hi everyone,

I am working on a dataset like this
user_id url 
1lao.com/buy 
2bao.com/sell 
2cao.com/market 
1lao.com/sell 
3vui.com/sell 

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id 
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true
Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.
I am running at client mode that submit to a Mesos cluster.

I am using Spark 2.0.2 and my data store in HDFS with parquet format.

Any advices for me in this situation?

Thank you in advance!.

Regards,
Chanh





> On Feb 22, 2017, at 8:52 AM, Yong Zhang  wrote:
> 
> If you read the source code of SparkStrategies
> 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106
>  
> 
> 
> If there is no joining keys, Join implementations are chosen with the 
> following precedence:
> BroadcastNestedLoopJoin: if one side of the join could be broadcasted
> CartesianProduct: for Inner join
> BroadcastNestedLoopJoin
> 
> So your case will use BroadcastNestedLoopJoin, as there is no joining keys.
> 
> In this case, if there are lots of userId where url not like '%sell%', then 
> Spark has to retrieve them back to Driver (to be broadcast), that explains 
> why the high CPU usage on the driver side. 
> 
> So if there are lots of userId where url not like '%sell%', then you can just 
> try left semi join, which Spark will use SortMerge join in this case, I guess.
> 
> Yong
> 
> From: Yong Zhang >
> Sent: Tuesday, February 21, 2017 1:17 PM
> To: Sidney Feiner; Chanh Le; user @spark
> Subject: Re: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Sorry, didn't pay attention to the originally requirement.
> 
> Did you try the left outer join, or left semi join?
> 
> What is the explain plan when you use "not in"? Is it leading to a 
> broadcastNestedLoopJoin?
> 
> spark.sql("select user_id from data where user_id not in (select user_id from 
> data where url like '%sell%')").explain(true)
> 
> Yong
> 
> 
> From: Sidney Feiner  >
> Sent: Tuesday, February 21, 2017 10:46 AM
> To: Yong Zhang; Chanh Le; user @spark
> Subject: RE: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Chanh wants to return user_id's that don't have any record with a url 
> containing "sell". Without a subquery/join, it can only filter per record 
> without knowing about the rest of the user_id's record
>  
> Sidney Feiner   /  SW Developer
> M: +972.528197720  /  Skype: sidney.feiner.startapp
>  
>  
>  
> From: Yong Zhang [mailto:java8...@hotmail.com ] 
> Sent: Tuesday, February 21, 2017 4:10 PM
> To: Chanh Le >; user @spark 
> >
> Subject: Re: How to query a query with not contain, not start_with, not 
> end_with condition effective?
>  
> Not sure if I misunderstand your question, but what's wrong doing it this way?
>  
> scala> spark.version
> res6: String = 2.0.2
> scala> val df = Seq((1,"lao.com/sell "), (2, 
> "lao.com/buy ")).toDF("user_id", "url")
> df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]
>  
> scala> df.registerTempTable("data")
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
>  
> scala> spark.sql("select user_id from data where url not like '%sell%'").show
> +---+
> |user_id|
> +---+
> |  2|
> +---+
>  
> Yong
>  
> From: Chanh Le >
> Sent: Tuesday, February 21, 2017 4:56 AM
> To: user @spark
> Subject: How to query a query with not contain, not start_with, not end_with 
> condition effective?
>  
> Hi everyone, 
>  
> I am working on a dataset like this
> user_id url 
> 

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
If you read the source code of SparkStrategies


https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L106


If there is no joining keys, Join implementations are chosen with the following 
precedence:

  *   BroadcastNestedLoopJoin: if one side of the join could be broadcasted
  *   CartesianProduct: for Inner join
  *   BroadcastNestedLoopJoin


So your case will use BroadcastNestedLoopJoin, as there is no joining keys.


In this case, if there are lots of userId where url not like '%sell%', then 
Spark has to retrieve them back to Driver (to be broadcast), that explains why 
the high CPU usage on the driver side.

So if there are lots of userId where url not like '%sell%', then you can just 
try left semi join, which Spark will use SortMerge join in this case, I guess.


Yong


From: Yong Zhang 
Sent: Tuesday, February 21, 2017 1:17 PM
To: Sidney Feiner; Chanh Le; user @spark
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?


Sorry, didn't pay attention to the originally requirement.


Did you try the left outer join, or left semi join?

What is the explain plan when you use "not in"? Is it leading to a 
broadcastNestedLoopJoin?


spark.sql("select user_id from data where user_id not in (select user_id from 
data where url like '%sell%')").explain(true)


Yong



From: Sidney Feiner 
Sent: Tuesday, February 21, 2017 10:46 AM
To: Yong Zhang; Chanh Le; user @spark
Subject: RE: How to query a query with not contain, not start_with, not 
end_with condition effective?


Chanh wants to return user_id's that don't have any record with a url 
containing "sell". Without a subquery/join, it can only filter per record 
without knowing about the rest of the user_id's record



Sidney Feiner   /  SW Developer

M: +972.528197720  /  Skype: sidney.feiner.startapp



[StartApp]



From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le ; user @spark 
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?



Not sure if I misunderstand your question, but what's wrong doing it this way?



scala> spark.version

res6: String = 2.0.2

scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")

df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]



scala> df.registerTempTable("data")

warning: there was one deprecation warning; re-run with -deprecation for details



scala> spark.sql("select user_id from data where url not like '%sell%'").show

+---+

|user_id|

+---+

|  2|

+---+



Yong





From: Chanh Le >
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?



Hi everyone,



I am working on a dataset like this
user_id url
1  lao.com/buy
2  bao.com/sell
2  cao.com/market
1   lao.com/sell
3  vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true

Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.

I am running at client mode that submit to a Mesos cluster.



I am using Spark 2.0.2 and my data store in HDFS with parquet format.



Any advices for me in this situation?



Thank you in advance!.



Regards,

Chanh


Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
Sorry, didn't pay attention to the originally requirement.


Did you try the left outer join, or left semi join?

What is the explain plan when you use "not in"? Is it leading to a 
broadcastNestedLoopJoin?


spark.sql("select user_id from data where user_id not in (select user_id from 
data where url like '%sell%')").explain(true)


Yong



From: Sidney Feiner 
Sent: Tuesday, February 21, 2017 10:46 AM
To: Yong Zhang; Chanh Le; user @spark
Subject: RE: How to query a query with not contain, not start_with, not 
end_with condition effective?


Chanh wants to return user_id's that don't have any record with a url 
containing "sell". Without a subquery/join, it can only filter per record 
without knowing about the rest of the user_id's record



Sidney Feiner   /  SW Developer

M: +972.528197720  /  Skype: sidney.feiner.startapp



[StartApp]



From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le ; user @spark 
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?



Not sure if I misunderstand your question, but what's wrong doing it this way?



scala> spark.version

res6: String = 2.0.2

scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")

df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]



scala> df.registerTempTable("data")

warning: there was one deprecation warning; re-run with -deprecation for details



scala> spark.sql("select user_id from data where url not like '%sell%'").show

+---+

|user_id|

+---+

|  2|

+---+



Yong





From: Chanh Le >
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?



Hi everyone,



I am working on a dataset like this
user_id url
1  lao.com/buy
2  bao.com/sell
2  cao.com/market
1   lao.com/sell
3  vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true

Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.

I am running at client mode that submit to a Mesos cluster.



I am using Spark 2.0.2 and my data store in HDFS with parquet format.



Any advices for me in this situation?



Thank you in advance!.



Regards,

Chanh


RE: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Sidney Feiner
Chanh wants to return user_id's that don't have any record with a url 
containing "sell". Without a subquery/join, it can only filter per record 
without knowing about the rest of the user_id's record

Sidney Feiner   /  SW Developer
M: +972.528197720  /  Skype: sidney.feiner.startapp

[StartApp]

From: Yong Zhang [mailto:java8...@hotmail.com]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le ; user @spark 
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?


Not sure if I misunderstand your question, but what's wrong doing it this way?


scala> spark.version
res6: String = 2.0.2
scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala> df.registerTempTable("data")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select user_id from data where url not like '%sell%'").show
+---+
|user_id|
+---+
|  2|
+---+


Yong


From: Chanh Le >
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?

Hi everyone,

I am working on a dataset like this
user_id url
1  lao.com/buy
2  bao.com/sell
2  cao.com/market
1   lao.com/sell
3  vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like '%sell%';

My data is about 20 million records and it's growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true
Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.
I am running at client mode that submit to a Mesos cluster.

I am using Spark 2.0.2 and my data store in HDFS with parquet format.

Any advices for me in this situation?

Thank you in advance!.

Regards,
Chanh


Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
Not sure if I misunderstand your question, but what's wrong doing it this way?


scala> spark.version
res6: String = 2.0.2
scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")
df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]

scala> df.registerTempTable("data")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select user_id from data where url not like '%sell%'").show
+---+
|user_id|
+---+
|  2|
+---+


Yong



From: Chanh Le 
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?

Hi everyone,

I am working on a dataset like this
user_id url
1  lao.com/buy
2  bao.com/sell
2  cao.com/market
1   lao.com/sell
3  vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true
Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.
I am running at client mode that submit to a Mesos cluster.

I am using Spark 2.0.2 and my data store in HDFS with parquet format.

Any advices for me in this situation?

Thank you in advance!.

Regards,
Chanh


Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread ayan guha
First thing i would do is to add distinct, both inner and outer queries
On Tue, 21 Feb 2017 at 8:56 pm, Chanh Le  wrote:

> Hi everyone,
>
> I am working on a dataset like this
> *user_id url *
> 1  lao.com/buy
> 2  bao.com/sell
> 2  cao.com/market
> 1   lao.com/sell
> 3  vui.com/sell
>
> I have to find all *user_id* with *url* not contain *sell*. Which means I
> need to query all *user_id* contains *sell* and put it into a set then do
> another query to find all *user_id* not in that set.
>
>
>
> *SELECT user_id FROM dataWHERE user_id not in ( SELECT user_id FROM data
> WHERE url like ‘%sell%’;*
> My data is about *20 million records and it’s growing*. When I tried in
> zeppelin I need to *set spark.sql.crossJoin.enabled = true*
> Then I ran the query and the driver got extremely high CPU percentage and
> the process get stuck and I need to kill it.
> I am running at client mode that submit to a Mesos cluster.
>
> I am using* Spark 2.0.2* and my data store in *HDFS* with *parquet format*
> .
>
> Any advices for me in this situation?
>
> Thank you in advance!.
>
> Regards,
> Chanh
>
-- 
Best Regards,
Ayan Guha


Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
I tried a new way by using JOIN

select user_id from data a
left join (select user_id from data where url like ‘%sell%') b
on a.user_id = b.user_id
where b.user_id is NULL

It’s faster and seem that Spark rather optimize for JOIN than sub query.


Regards,
Chanh


> On Feb 21, 2017, at 4:56 PM, Chanh Le  wrote:
> 
> Hi everyone,
> 
> I am working on a dataset like this
> user_id url 
> 1  lao.com/buy 
> 2  bao.com/sell 
> 2  cao.com/market 
> 1  lao.com/sell 
> 3  vui.com/sell 
> 
> I have to find all user_id with url not contain sell. Which means I need to 
> query all user_id contains sell and put it into a set then do another query 
> to find all user_id not in that set.
> SELECT user_id 
> FROM data
> WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;
> 
> My data is about 20 million records and it’s growing. When I tried in 
> zeppelin I need to set spark.sql.crossJoin.enabled = true
> Then I ran the query and the driver got extremely high CPU percentage and the 
> process get stuck and I need to kill it.
> I am running at client mode that submit to a Mesos cluster.
> 
> I am using Spark 2.0.2 and my data store in HDFS with parquet format.
> 
> Any advices for me in this situation?
> 
> Thank you in advance!.
> 
> Regards,
> Chanh



How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Chanh Le
Hi everyone,

I am working on a dataset like this
user_id url 
1lao.com/buy
2bao.com/sell
2cao.com/market
1lao.com/sell
3vui.com/sell

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id 
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true
Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.
I am running at client mode that submit to a Mesos cluster.

I am using Spark 2.0.2 and my data store in HDFS with parquet format.

Any advices for me in this situation?

Thank you in advance!.

Regards,
Chanh