This is the code creating the treemap:
object CaseInsensitiveOrdered extends Ordering[String] {
  def compare(x: String, y: String): Int = x.compareToIgnoreCase(y)
}
TreeMap[String, JobTitle](dico.toArray:_*)(CaseInsensitiveOrdered)
This is the map that is broadcast.
BTW, if I remove the ordering I get coherent results (close to the 3M);
with the ordering I fall to 20K.
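
To narrow this down outside the cluster, here is a minimal sketch that builds the same kind of map and round-trips it through plain Java serialization, then compares lookups before and after. It is only an approximation: it assumes the broadcast value goes through Java serialization, which may not match the serializer actually configured in the job. JobTitle, the sample entries in dico, and the names RoundTripCheck and javaRoundTrip are hypothetical stand-ins, not the real code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}
import scala.collection.immutable.TreeMap

// Hypothetical stand-in: the real JobTitle class is not shown in this thread.
case class JobTitle(label: String)

object CaseInsensitiveOrdered extends Ordering[String] {
  def compare(x: String, y: String): Int = x.compareToIgnoreCase(y)
}

object RoundTripCheck {
  // Serialize a value to bytes and read it back with plain Java serialization.
  // Assumption: this only approximates what a non-local broadcast does.
  def javaRoundTrip[T](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // Hypothetical sample data standing in for the real dico.
  val dico: Seq[(String, JobTitle)] = Seq(
    "Engineer" -> JobTitle("Engineer"),
    "Data Scientist" -> JobTitle("Data Scientist")
  )

  val original: TreeMap[String, JobTitle] =
    TreeMap[String, JobTitle](dico.toArray: _*)(CaseInsensitiveOrdered)

  val copy: TreeMap[String, JobTitle] = javaRoundTrip(original)

  // Keys in a different case than the stored ones, plus one absent key.
  val probes: Seq[String] = Seq("engineer", "DATA SCIENTIST", "unknown")

  // Probes whose lookup result differs between the original and the copy.
  val mismatches: Seq[String] =
    probes.filter(key => original.get(key) != copy.get(key))

  def main(args: Array[String]): Unit =
    println(s"lookups that differ after the round trip: $mismatches")
}
```

If mismatches comes back non-empty with the real map and serializer, the ordering is being lost or changed in transit; if it stays empty, the problem is more likely in the broadcast path itself.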
2013/11/19 Sriram Ramachandrasekaran <[email protected]>
> aah, yes. I missed that. I looked into the code. Both TreeBroadcast and
> HttpBroadcast don't do send or write respectively.. Will wait for other
> inputs.
>
>
> On Tue, Nov 19, 2013 at 10:40 PM, Eugen Cepoi <[email protected]> wrote:
>
>> Yes, sure, for usual tests it is fine, but the broadcast is only done if we
>> are not in local mode (at least it seems so).
>>
>> In SparkContext we have
>>   def broadcast[T](value: T) =
>>     env.broadcastManager.newBroadcast[T](value, isLocal)
>> isLocal is computed from the master name ("local" or "local[...").
>> Now if we look into HttpBroadcast we see
>>   if (!isLocal) {
>>     HttpBroadcast.write(id, value_)
>>   }
>>
>> The broadcast is not done in local mode. I guess this is an optimization in
>> case we run multiple threads sharing the same broadcast variable. But
>> perhaps I am missing something?
>>
>>
>> 2013/11/19 Sriram Ramachandrasekaran <[email protected]>
>>
>>> Try local[m], where m is the number of workers. For tests, local[2]
>>> should be ideal. This is generally the way to write tests for Spark
>>> code.
>>>
>>>
>>> On Tue, Nov 19, 2013 at 10:03 PM, Eugen Cepoi <[email protected]> wrote:
>>>
>>>> Maybe a bug with HttpBroadcast or maybe my fault but can't find where :)
>>>>
>>>> The problem:
>>>> At the beginning, a job computes a TreeMap (String to some object) with
>>>> a custom ordering (a dummy lowercase comparison); this TreeMap is
>>>> broadcast. Then I use this map to do some matching against an input RDD
>>>> (excluding entries that don't exist in the map).
>>>> What happens? In local mode (where the broadcast is not used), or when
>>>> passing the whole TreeMap without broadcast, I get more than 3M matches;
>>>> after broadcast it falls to 20K.
>>>>
>>>> Replacing HttpBroadcastFactory with TreeBroadcastFactory solves the
>>>> problem (I obtain expected results). I am trying to implement a test case
>>>> to reproduce it, but it is quite tricky in that case...
>>>>
>>>> BTW is there a way to reproduce the broadcast mechanism in local (I see
>>>> that the SparkEnv instance is shared as static, so I guess there is no easy
>>>> way)?
>>>>
>>>> Thanks,
>>>> Eugen
>>>>
>>>
>>>
>>>
>>> --
>>> It's just about how deep your longing is!
>>>
>>
>>