Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Those results look very good for the larger workloads (100MB and 1GB). Were
you also able to run experiments for smaller amounts of data? For instance
broadcasting a single variable to the entire cluster? In the paper you
state that HDFS-based mechanisms performed well only for small amounts of
data. Do you have an estimate of the crossover point below which the
HDFS-based mechanism becomes more favorable than the BitTorrent-like one? I also read
that the minimum size transmitted using a broadcast variable is 4MB. Maybe
I should look for a different way of sharing this constant?

Use case: I am looking for the most efficient way to perform a
transformation involving a constant (of which the value is determined at
runtime) for a large input file.

Scala example:
// The actual value (2 in this case) would be the result of a different
// function, computed at runtime.
val constant1 = sc.broadcast(2)
val result = input.map(x => x + constant1.value)
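
For comparison, here is a minimal sketch of two ways a runtime-determined constant could reach the executors: plain closure capture and an explicit broadcast variable. It assumes a local SparkContext with master "local[2]"; the names ConstantSharing and computeConstant are hypothetical stand-ins, and nothing here measures which option is cheaper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConstantSharing {
  // Hypothetical stand-in for the function that produces the constant at runtime.
  def computeConstant(): Int = 2

  def run(): (Array[Int], Array[Int]) = {
    // Assumption: a local two-thread master, only for illustration.
    val conf = new SparkConf().setAppName("constant-sharing").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val input = sc.parallelize(Seq(1, 2, 3))
      val constant = computeConstant()

      // Option 1: closure capture; the Int is serialized along with each task.
      val viaClosure = input.map(x => x + constant).collect()

      // Option 2: explicit broadcast variable, read through .value inside the task.
      val constantBc = sc.broadcast(constant)
      val viaBroadcast = input.map(x => x + constantBc.value).collect()

      (viaClosure, viaBroadcast)
    } finally {
      sc.stop()
    }
  }
}
```

Both options compute the same result; for a single small value, closure capture avoids creating a broadcast variable at all, which may be worth trying first.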

On 11 March 2015 at 21:13, Mosharaf Chowdhury 
wrote:

> The current broadcast algorithm in Spark approximates the one described
> in Section 5 of this paper.
> It is expected to scale sub-linearly; i.e., O(log N), where N is the
> number of machines in your cluster.
> We evaluated up to 100 machines, and it does follow O(log N) scaling.
>
> --
> Mosharaf Chowdhury
> http://www.mosharaf.com/
>
> On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen 
> wrote:
>
>> Thanks Mosharaf, for the quick response! Can you maybe give me some
>> pointers to an explanation of this strategy? Or elaborate a bit more on it?
>> Which parts are involved in which way? Where are the time penalties and how
>> scalable is this implementation?
>>
>> Thanks again,
>>
>> Tom
>>
>> On 11 March 2015 at 16:01, Mosharaf Chowdhury 
>> wrote:
>>
>>> Hi Tom,
>>>
>>> That's an outdated document from 4/5 years ago.
>>>
>>> Spark currently uses a BitTorrent-like mechanism that's been tuned for
>>> datacenter environments.
>>>
>>> Mosharaf
>>> --
>>> From: Tom 
>>> Sent: 3/11/2015 4:58 PM
>>> To: user@spark.apache.org
>>> Subject: Which strategy is used for broadcast variables?
>>>
>>> In "Performance and Scalability of Broadcast in Spark" by Mosharaf
>>> Chowdhury
>>> I read that Spark uses HDFS for its broadcast variables. This seems
>>> highly
>>> inefficient. In the same paper alternatives are proposed, among which
>>> "BitTorrent Broadcast (BTB)". While studying "Learning Spark," page 105,
>>> second paragraph about Broadcast Variables, I read "The value is sent to
>>> each node only once, using an efficient, BitTorrent-like communication
>>> mechanism."
>>>
>>> - Is the book talking about the proposed BTB from the paper?
>>>
>>> - Is this currently the default?
>>>
>>> - If not, what is?
>>>
>>> Thanks,
>>>
>>> Tom
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Which-strategy-is-used-for-broadcast-variables-tp22004.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
The current broadcast algorithm in Spark approximates the one described in
Section 5 of this paper.
It is expected to scale sub-linearly; i.e., O(log N), where N is the number
of machines in your cluster.
We evaluated up to 100 machines, and it does follow O(log N) scaling.
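
To make the O(log N) claim concrete, here is a small sketch of an idealized doubling model. This is an assumption for illustration, not Spark's actual protocol: in each round, every machine that already holds the data sends it to one machine that does not. The names BroadcastRounds and roundsToReach are hypothetical.

```scala
object BroadcastRounds {
  // Idealized model (an assumption, not Spark's actual protocol): in each round,
  // every machine that already holds the data sends it to one machine that does not,
  // so the number of holders doubles per round.
  def roundsToReach(machines: Int): Int = {
    require(machines >= 1, "need at least one machine")
    var holders = 1L // the sender holds the data initially
    var rounds = 0
    while (holders < machines) {
      holders *= 2
      rounds += 1
    }
    rounds
  }

  def main(args: Array[String]): Unit = {
    for (n <- Seq(10, 100, 1000)) {
      println(s"N=$n -> ${roundsToReach(n)} rounds")
    }
  }
}
```

Under this model, 10 machines need 4 rounds, 100 need 7, and 1000 need 10: a hundredfold increase in cluster size adds only 6 rounds, which is the kind of growth O(log N) describes.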

--
Mosharaf Chowdhury
http://www.mosharaf.com/

On Wed, Mar 11, 2015 at 3:11 PM, Tom Hubregtsen 
wrote:

> Thanks Mosharaf, for the quick response! Can you maybe give me some
> pointers to an explanation of this strategy? Or elaborate a bit more on it?
> Which parts are involved in which way? Where are the time penalties and how
> scalable is this implementation?
>
> Thanks again,
>
> Tom
>
> On 11 March 2015 at 16:01, Mosharaf Chowdhury 
> wrote:
>
>> Hi Tom,
>>
>> That's an outdated document from 4/5 years ago.
>>
>> Spark currently uses a BitTorrent-like mechanism that's been tuned for
>> datacenter environments.
>>
>> Mosharaf
>> --
>> From: Tom 
>> Sent: 3/11/2015 4:58 PM
>> To: user@spark.apache.org
>> Subject: Which strategy is used for broadcast variables?
>>
>> In "Performance and Scalability of Broadcast in Spark" by Mosharaf
>> Chowdhury
>> I read that Spark uses HDFS for its broadcast variables. This seems highly
>> inefficient. In the same paper alternatives are proposed, among which
>> "BitTorrent Broadcast (BTB)". While studying "Learning Spark," page 105,
>> second paragraph about Broadcast Variables, I read "The value is sent to
>> each node only once, using an efficient, BitTorrent-like communication
>> mechanism."
>>
>> - Is the book talking about the proposed BTB from the paper?
>>
>> - Is this currently the default?
>>
>> - If not, what is?
>>
>> Thanks,
>>
>> Tom
>>
>>
>>
>>
>>
>


Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
Thanks Mosharaf, for the quick response! Can you maybe give me some
pointers to an explanation of this strategy? Or elaborate a bit more on it?
Which parts are involved in which way? Where are the time penalties and how
scalable is this implementation?

Thanks again,

Tom

On 11 March 2015 at 16:01, Mosharaf Chowdhury 
wrote:

> Hi Tom,
>
> That's an outdated document from 4/5 years ago.
>
> Spark currently uses a BitTorrent-like mechanism that's been tuned for
> datacenter environments.
>
> Mosharaf
> --
> From: Tom 
> Sent: 3/11/2015 4:58 PM
> To: user@spark.apache.org
> Subject: Which strategy is used for broadcast variables?
>
> In "Performance and Scalability of Broadcast in Spark" by Mosharaf
> Chowdhury
> I read that Spark uses HDFS for its broadcast variables. This seems highly
> inefficient. In the same paper alternatives are proposed, among which
> "BitTorrent Broadcast (BTB)". While studying "Learning Spark," page 105,
> second paragraph about Broadcast Variables, I read "The value is sent to
> each node only once, using an efficient, BitTorrent-like communication
> mechanism."
>
> - Is the book talking about the proposed BTB from the paper?
>
> - Is this currently the default?
>
> - If not, what is?
>
> Thanks,
>
> Tom
>
>
>
>
>


RE: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
Hi Tom,

That's an outdated document from 4/5 years ago. 

Spark currently uses a BitTorrent-like mechanism that's been tuned for
datacenter environments. 

Mosharaf

-----Original Message-----
From: "Tom" 
Sent: 3/11/2015 4:58 PM
To: "user@spark.apache.org" 
Subject: Which strategy is used for broadcast variables?

In "Performance and Scalability of Broadcast in Spark" by Mosharaf Chowdhury
I read that Spark uses HDFS for its broadcast variables. This seems highly
inefficient. In the same paper alternatives are proposed, among which
"BitTorrent Broadcast (BTB)". While studying "Learning Spark," page 105,
second paragraph about Broadcast Variables, I read "The value is sent to
each node only once, using an efficient, BitTorrent-like communication
mechanism." 

- Is the book talking about the proposed BTB from the paper? 

- Is this currently the default? 

- If not, what is?

Thanks,

Tom


