Re: about broadcast join of base table in spark sql

2017-07-02 Thread Yong Zhang
Then you need to tell us the spark version, and post the execution plan here, 
so we can help you better.


Yong



From: Paley Louie <paley2...@gmail.com>
Sent: Sunday, July 2, 2017 12:36 AM
To: Yong Zhang
Cc: Bryan Jeffrey; d...@spark.org; user@spark.apache.org
Subject: Re: about broadcast join of base table in spark sql

Thank you for your reply, I have tried to add broadcast hint to the base table, 
but it just cannot be broadcast out.
On Jun 30, 2017, at 9:13 PM, Yong Zhang 
<java8...@hotmail.com<mailto:java8...@hotmail.com>> wrote:

Or since you already use the DataFrame API, instead of SQL, you can add the 
broadcast function to force it.

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)

Yong
functions - Apache 
Spark<https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
spark.apache.org<http://spark.apache.org/>
Computes the numeric value of the first character of the string column, and 
returns the result as a int column.






From: Bryan Jeffrey <bryan.jeff...@gmail.com<mailto:bryan.jeff...@gmail.com>>
Sent: Friday, June 30, 2017 6:57 AM
To: d...@spark.org<mailto:d...@spark.org>; 
user@spark.apache.org<mailto:user@spark.apache.org>; paleyl
Subject: Re: about broadcast join of base table in spark sql

Hello.

If you want to allow broadcast join with larger broadcasts you can set 
spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the 
plan to allow join despite 'A' being larger than the default threshold.

Get Outlook for Android<https://aka.ms/ghei36>



From: paleyl
Sent: Wednesday, June 28, 10:42 PM
Subject: about broadcast join of base table in spark sql
To: d...@spark.org<mailto:d...@spark.org>, 
user@spark.apache.org<mailto:user@spark.apache.org>


Hi All,


Recently I meet a problem in broadcast join: I want to left join table A and B, 
A is the smaller one and the left table, so I wrote

A = A.join(B,A("key1") === B("key2"),"left")

but I found that A is not broadcast out, as the shuffle size is still very 
large.

I guess this is a designed mechanism in spark, so could anyone please tell me 
why it is designed like this? I am just very curious.


Best,


Paley



Re: about broadcast join of base table in spark sql

2017-07-02 Thread paleyl
Thank you for your reply, but when I remove the left join option(like A =
A.join(B,A("key1") === B("key2"))), it can be broadcast out. there is no
reason spark cannot get table size when left join option is chosen on.


On Sun, Jul 2, 2017 at 1:55 PM, Xiaoye Sun <sunxiaoy...@gmail.com> wrote:

> you may need to check if spark can get the size of your table. If spark
> cannot get the table size, it won't do broadcast.
>
> On Sat, Jul 1, 2017 at 11:37 PM Paley Louie <paley2...@gmail.com> wrote:
>
>> Thank you for your reply, I have tried to add broadcast hint to the base
>> table, but it just cannot be broadcast out.
>>
>> On Jun 30, 2017, at 9:13 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>
>> Or since you already use the DataFrame API, instead of SQL, you can add
>> the broadcast function to force it.
>>
>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/
>> spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)
>>
>> Yong
>> functions - Apache Spark
>> <https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
>> spark.apache.org
>> Computes the numeric value of the first character of the string column,
>> and returns the result as a int column.
>>
>>
>>
>>
>> ------
>> *From:* Bryan Jeffrey <bryan.jeff...@gmail.com>
>> *Sent:* Friday, June 30, 2017 6:57 AM
>> *To:* d...@spark.org; user@spark.apache.org; paleyl
>> *Subject:* Re: about broadcast join of base table in spark sql
>>
>> Hello.
>>
>> If you want to allow broadcast join with larger broadcasts you can set
>> spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause
>> the plan to allow join despite 'A' being larger than the default threshold.
>>
>>
>> Get Outlook for Android <https://aka.ms/ghei36>
>>
>>
>>
>> From: paleyl
>> Sent: Wednesday, June 28, 10:42 PM
>> Subject: about broadcast join of base table in spark sql
>> To: d...@spark.org, user@spark.apache.org
>>
>>
>> Hi All,
>>
>>
>> Recently I meet a problem in broadcast join: I want to left join table A
>> and B, A is the smaller one and the left table, so I wrote
>>
>> A = A.join(B,A("key1") === B("key2"),"left")
>>
>> but I found that A is not broadcast out, as the shuffle size is still
>> very large.
>>
>> I guess this is a designed mechanism in spark, so could anyone please
>> tell me why it is designed like this? I am just very curious.
>>
>>
>> Best,
>>
>>
>> Paley
>>
>>
>>


-- 
Peili Lv

Department of Pattern Recognition and Intelligent System
School of Automation
Northwestern Polytechnical University
NPU Chang'an Campus
Building of Automation #130
http://paley.mydiscussion.net/


Re: about broadcast join of base table in spark sql

2017-07-01 Thread Xiaoye Sun
you may need to check if spark can get the size of your table. If spark
cannot get the table size, it won't do broadcast.

On Sat, Jul 1, 2017 at 11:37 PM Paley Louie <paley2...@gmail.com> wrote:

> Thank you for your reply, I have tried to add broadcast hint to the base
> table, but it just cannot be broadcast out.
>
> On Jun 30, 2017, at 9:13 PM, Yong Zhang <java8...@hotmail.com> wrote:
>
> Or since you already use the DataFrame API, instead of SQL, you can add
> the broadcast function to force it.
>
>
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)
>
> Yong
> functions - Apache Spark
> <https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
> spark.apache.org
> Computes the numeric value of the first character of the string column,
> and returns the result as a int column.
>
>
>
>
> --
> *From:* Bryan Jeffrey <bryan.jeff...@gmail.com>
> *Sent:* Friday, June 30, 2017 6:57 AM
> *To:* d...@spark.org; user@spark.apache.org; paleyl
> *Subject:* Re: about broadcast join of base table in spark sql
>
> Hello.
>
> If you want to allow broadcast join with larger broadcasts you can set
> spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the
> plan to allow join despite 'A' being larger than the default threshold.
>
> Get Outlook for Android <https://aka.ms/ghei36>
>
>
>
> From: paleyl
> Sent: Wednesday, June 28, 10:42 PM
> Subject: about broadcast join of base table in spark sql
> To: d...@spark.org, user@spark.apache.org
>
>
> Hi All,
>
>
> Recently I meet a problem in broadcast join: I want to left join table A
> and B, A is the smaller one and the left table, so I wrote
>
> A = A.join(B,A("key1") === B("key2"),"left")
>
> but I found that A is not broadcast out, as the shuffle size is still very
> large.
>
> I guess this is a designed mechanism in spark, so could anyone please tell
> me why it is designed like this? I am just very curious.
>
>
> Best,
>
>
> Paley
>
>
>


Re: about broadcast join of base table in spark sql

2017-07-01 Thread Paley Louie
Thank you for your reply, I have tried to add broadcast hint to the base table, 
but it just cannot be broadcast out.
> On Jun 30, 2017, at 9:13 PM, Yong Zhang <java8...@hotmail.com> wrote:
> 
> Or since you already use the DataFrame API, instead of SQL, you can add the 
> broadcast function to force it.
> 
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)
>  
> <https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
> 
> Yong
> functions - Apache Spark 
> <https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
> spark.apache.org <http://spark.apache.org/>
> Computes the numeric value of the first character of the string column, and 
> returns the result as a int column.
> 
> 
> 
> 
> From: Bryan Jeffrey <bryan.jeff...@gmail.com>
> Sent: Friday, June 30, 2017 6:57 AM
> To: d...@spark.org; user@spark.apache.org; paleyl
> Subject: Re: about broadcast join of base table in spark sql
>  
> Hello. 
> 
> If you want to allow broadcast join with larger broadcasts you can set 
> spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the 
> plan to allow join despite 'A' being larger than the default threshold. 
> 
> Get Outlook for Android <https://aka.ms/ghei36>
> 
> 
> From: paleyl
> Sent: Wednesday, June 28, 10:42 PM
> Subject: about broadcast join of base table in spark sql
> To: d...@spark.org, user@spark.apache.org
> 
> 
> Hi All,
> 
> 
> Recently I meet a problem in broadcast join: I want to left join table A and 
> B, A is the smaller one and the left table, so I wrote 
> 
> A = A.join(B,A("key1") === B("key2"),"left")
> 
> but I found that A is not broadcast out, as the shuffle size is still very 
> large.
> 
> I guess this is a designed mechanism in spark, so could anyone please tell me 
> why it is designed like this? I am just very curious.
> 
> 
> Best,
> 
> 
> Paley 



Re: about broadcast join of base table in spark sql

2017-07-01 Thread Paley Louie
Thank you for your reply, I have tried to set parameter 
spark.sql.autoBroadcastJoinThreshold to high enough value, however it does not 
work,  I think broadcast of base table is disabled in spark.


> On Jun 30, 2017, at 6:57 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:
> 
> Hello. 
> 
> If you want to allow broadcast join with larger broadcasts you can set 
> spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the 
> plan to allow join despite 'A' being larger than the default threshold. 
> 
> Get Outlook for Android <https://aka.ms/ghei36>
> 
> 
> From: paleyl
> Sent: Wednesday, June 28, 10:42 PM
> Subject: about broadcast join of base table in spark sql
> To: d...@spark.org, user@spark.apache.org
> 
> 
> Hi All,
> 
> 
> Recently I meet a problem in broadcast join: I want to left join table A and 
> B, A is the smaller one and the left table, so I wrote 
> 
> A = A.join(B,A("key1") === B("key2"),"left")
> 
> but I found that A is not broadcast out, as the shuffle size is still very 
> large.
> 
> I guess this is a designed mechanism in spark, so could anyone please tell me 
> why it is designed like this? I am just very curious.
> 
> 
> Best,
> 
> 
> Paley 
> 
> 
> 



Re: about broadcast join of base table in spark sql

2017-06-30 Thread Yong Zhang
Or since you already use the DataFrame API, instead of SQL, you can add the 
broadcast function to force it.


https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)


Yong

functions - Apache 
Spark<https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame)>
spark.apache.org
Computes the numeric value of the first character of the string column, and 
returns the result as a int column.






From: Bryan Jeffrey <bryan.jeff...@gmail.com>
Sent: Friday, June 30, 2017 6:57 AM
To: d...@spark.org; user@spark.apache.org; paleyl
Subject: Re: about broadcast join of base table in spark sql

Hello.

If you want to allow broadcast join with larger broadcasts you can set 
spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the 
plan to allow join despite 'A' being larger than the default threshold.

Get Outlook for Android<https://aka.ms/ghei36>



From: paleyl
Sent: Wednesday, June 28, 10:42 PM
Subject: about broadcast join of base table in spark sql
To: d...@spark.org, user@spark.apache.org


Hi All,


Recently I meet a problem in broadcast join: I want to left join table A and B, 
A is the smaller one and the left table, so I wrote

A = A.join(B,A("key1") === B("key2"),"left")

but I found that A is not broadcast out, as the shuffle size is still very 
large.

I guess this is a designed mechanism in spark, so could anyone please tell me 
why it is designed like this? I am just very curious.


Best,


Paley





Re: about broadcast join of base table in spark sql

2017-06-30 Thread Bryan Jeffrey
Hello. 




If you want to allow broadcast join with larger broadcasts you can set 
spark.sql.autoBroadcastJoinThreshold to a higher value. This will cause the 
plan to allow join despite 'A' being larger than the default threshold. 




Get Outlook for Android







From: paleyl


Sent: Wednesday, June 28, 10:42 PM


Subject: about broadcast join of base table in spark sql


To: d...@spark.org, user@spark.apache.org






Hi All,






Recently I meet a problem in broadcast join: I want to left join table A and B, 
A is the smaller one and the left table, so I wrote 




A = A.join(B,A("key1") === B("key2"),"left")




but I found that A is not broadcast out, as the shuffle size is still very 
large.




I guess this is a designed mechanism in spark, so could anyone please tell me 
why it is designed like this? I am just very curious.






Best,






Paley 










Fwd: about broadcast join of base table in spark sql

2017-06-30 Thread paleyl
Hi All,

Recently I meet a problem in broadcast join: I want to left join table A
and B, A is the smaller one and the left table, so I wrote
A = A.join(B,A("key1") === B("key2"),"left")
but I found that A is not broadcast out, as the shuffle size is still very
large.
I guess this is a designed mechanism in spark, so could anyone please tell
me why it is designed like this? I am just very curious.

Best,

Paley


about broadcast join of base table in spark sql

2017-06-28 Thread paleyl
Hi All,

Recently I meet a problem in broadcast join: I want to left join table A
and B, A is the smaller one and the left table, so I wrote
A = A.join(B,A("key1") === B("key2"),"left")
but I found that A is not broadcast out, as the shuffle size is still very
large.
I guess this is a designed mechanism in spark, so could anyone please tell
me why it is designed like this? I am just very curious.

Best,

Paley