RE: Broadcasting a parquet file using spark and python

2015-12-07 Thread Shuai Zheng
Hi Michael,

Thanks for the feedback.

I am using version 1.5.2 now.

Can you tell me how to force a broadcast join? I don't want the engine to
decide the execution path of the join on its own; I want a hint or parameter
that forces a broadcast join (because I also have some inner-join cases where
I want to use a broadcast join).
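
For later readers of the thread: Spark 1.5 did ship an explicit hint on the
Scala/Java side, org.apache.spark.sql.functions.broadcast, and PySpark gained
the same function in 1.6. A minimal sketch of the hint (the paths, join key,
and app name below are made-up assumptions):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import broadcast  # PySpark 1.6+; Scala/Java since 1.5

    sc = SparkContext(appName="broadcast-hint-sketch")
    sqlContext = SQLContext(sc)

    # Illustrative inputs; the paths and the join key are assumptions.
    big = sqlContext.read.parquet("hdfs:///data/big")
    small = sqlContext.read.parquet("hdfs:///data/small")

    # The hint asks the planner to broadcast `small` regardless of
    # spark.sql.autoBroadcastJoinThreshold, for inner and outer joins alike.
    joined = big.join(broadcast(small), big["key"] == small["key"], "left_outer")
    joined.explain()  # the plan should show a broadcast join operator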

 

Or is there any ticket or roadmap for this feature?

Regards,

Shuai


Re: Broadcasting a parquet file using spark and python

2015-12-05 Thread Michael Armbrust
I believe we started supporting broadcast outer joins in Spark 1.5.  Which
version are you using?
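
One way to check which strategy the planner picked, whatever the version: look
at the physical plan. A small sketch (assumes Spark 1.4+ for sqlContext.read
and the sqlContext from the PySpark shell; paths and the join key are made up):

    big = sqlContext.read.parquet("hdfs:///data/big")
    small = sqlContext.read.parquet("hdfs:///data/small")

    joined = big.join(small, big["key"] == small["key"], "left_outer")
    # On 1.5+ a broadcast plan shows BroadcastHashOuterJoin; otherwise a
    # shuffle-based outer join operator appears instead.
    joined.explain()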



RE: Broadcasting a parquet file using spark and python

2015-12-04 Thread Shuai Zheng
Hi all,

Sorry to re-open this thread.

I have a similar issue: one big parquet file left-outer-joined against quite a
few smaller parquet files. The run is extremely slow and sometimes even OOMs
(with 300M …). I have two questions here:

1. If I use an outer join, will Spark SQL automatically use a broadcast hash
join?

2. If not: the latest documentation
(http://spark.apache.org/docs/latest/sql-programming-guide.html) says:
 


    spark.sql.autoBroadcastJoinThreshold  (default: 10485760, i.e. 10 MB)
    Configures the maximum size in bytes for a table that will be broadcast to
    all worker nodes when performing a join. By setting this value to -1,
    broadcasting can be disabled. Note that currently statistics are only
    supported for Hive Metastore tables where the command
    ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.

 

How can I run this command (ANALYZE TABLE) from Java? I know I can code the
join myself (create a broadcast val and implement the lookup by hand), but
that will make the code super ugly.
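
For reference, the statistics command is plain SQL, so it can be issued from
any language binding; in Java it is the same sqlContext.sql(...) call on a
HiveContext. A hedged PySpark sketch with a made-up table and app name:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="analyze-table-sketch")
    sqlContext = HiveContext(sc)  # statistics require a Hive metastore table

    # `small_table` is illustrative and must already exist in the metastore.
    sqlContext.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")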

 

I hope we can have either an API or a hint to force the broadcast hash join
(instead of relying on the rather opaque autoBroadcastJoinThreshold
parameter). Do we have any ticket or roadmap for this feature?

Regards,

Shuai

 


Re: Broadcasting a parquet file using spark and python

2015-04-01 Thread Michael Armbrust
You will need to create a Hive parquet table that points to the data and run
"ANALYZE TABLE tableName COMPUTE STATISTICS noscan" so that we have statistics
on the size.


Re: Broadcasting a parquet file using spark and python

2015-03-31 Thread Jitesh chandra Mishra
Hi Michael,

Thanks for your response. I am running 1.2.1.

Is there any workaround to achieve the same with 1.2.1?

Thanks,
Jitesh
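
One workaround that predates planner support is the "broadcast val" approach
Shuai mentions elsewhere in the thread: broadcast the small table yourself and
do the lookup by hand. A rough sketch against the 1.2-era API (paths and field
names are made up):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="manual-broadcast-sketch")
    sqlContext = SQLContext(sc)

    # 1.2-era parquet reader; paths and field names are illustrative.
    small = sqlContext.parquetFile("hdfs:///data/small")
    big = sqlContext.parquetFile("hdfs:///data/big")

    # Ship the small side to every executor as a plain dict ...
    lookup = sc.broadcast(dict(small.map(lambda r: (r.key, r.value)).collect()))

    # ... then do a map-side left outer join by hand over the big side
    # (dict.get returns None for unmatched keys, i.e. the outer rows).
    joined = big.map(lambda r: (r.key, r.payload, lookup.value.get(r.key)))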


Re: Broadcasting a parquet file using spark and python

2015-03-31 Thread Michael Armbrust
In Spark 1.3 I would expect this to happen automatically when the parquet
table is small (< 10 MB, configurable with
spark.sql.autoBroadcastJoinThreshold). If you are running 1.3 and not seeing
this, can you show the code you are using to create the table?
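
If the table is only slightly over the default, the threshold itself can also
be raised. A one-line sketch (value in bytes; assumes the sqlContext from the
PySpark shell on 1.3+):

    # Allow tables up to 50 MB to be broadcast (the default is 10 MB).
    sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=52428800")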

On Tue, Mar 31, 2015 at 3:25 AM, jitesh129 <jitesh...@gmail.com> wrote:

 How can we implement a BroadcastHashJoin for Spark with Python?

 My SparkSQL inner joins are taking a lot of time since they are performed as
 a ShuffledHashJoin.

 The tables on which the join is performed are stored as parquet files.

 Please help.

 Thanks and regards,
 Jitesh


