Let's say I follow the approach below and end up with a pair RDD too huge to
fit on one machine ... how do I run foreach on this RDD?

On 7 April 2015 at 04:25, Jeetendra Gangele <gangele...@gmail.com> wrote:

>
>
> On 7 April 2015 at 04:03, Dean Wampler <deanwamp...@gmail.com> wrote:
>
>>
>> On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele <gangele...@gmail.com>
>> wrote:
>>
>>> Thanks a lot. That means Spark does not support nested RDDs?
>>> If I pass the JavaSparkContext, that also won't work. I mean passing the
>>> SparkContext is not possible, since it's not serializable.
>>>
>>> That's right. RDDs don't nest, and SparkContexts aren't serializable.
>>
>>
>>> I have a requirement where I get a JavaRDD<VendorRecord> matchRdd
>>> and I need to return the potential matches for each record from HBase. So
>>> for each field of VendorRecord I have to do the following:
>>>
>>> 1. query HBase to get the list of potential records as an RDD
>>> 2. run logistic regression on the RDD returned from step 1 and each element
>>> of the passed matchRdd.
>>>
>>> If I understand you correctly, each VendorRecord could correspond to
>> 0-to-N records in HBase, which you need to fetch, true?
>>
>  Yes, that's correct: each VendorRecord corresponds to 0 to N matches.
>
>
>> If so, you could use the RDD flatMap method, which takes a function that
>> accepts each record and returns a sequence of 0-to-N new records of
>> some other type, like your HBase records. However, running an HBase query
>> for each VendorRecord could be expensive. If you can turn this into a range
>> query or something like that, it would help. I haven't used HBase much, so
>> I don't have good advice on optimizing this, if necessary.
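
A minimal sketch of that flatMap approach (Spark 1.x Java API, to match the
code below; CandidateRecord and fetchCandidates are hypothetical stand-ins for
the HBase record type and lookup helper, not existing code):

    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // For each vendor record, emit its 0-to-N potential matches from HBase.
    // The helper must not touch the SparkContext, which never serializes
    // into a closure.
    JavaRDD<CandidateRecord> candidates = matchRdd.flatMap(
        new FlatMapFunction<VendorRecord, CandidateRecord>() {
          @Override
          public Iterable<CandidateRecord> call(VendorRecord v) throws Exception {
            List<CandidateRecord> matches = CompanyMatcherHelper.fetchCandidates(v);
            return matches;  // flatMap flattens the per-record lists
          }
        });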
>>
>> Alternatively, can you do some sort of join on the VendorRecord RDD and
>> an RDD of query results from HBase?
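
A rough sketch of the join alternative (hbaseById would come from a bulk read
of the HBase table, e.g. via newAPIHadoopRDD with TableInputFormat, keyed the
same way; the String key and CandidateRecord type are assumptions):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    // Key the vendor records by company id so they can be joined.
    JavaPairRDD<String, VendorRecord> vendorsById = matchRdd.mapToPair(
        new PairFunction<VendorRecord, String, VendorRecord>() {
          @Override
          public Tuple2<String, VendorRecord> call(VendorRecord v) {
            return new Tuple2<String, VendorRecord>(
                String.valueOf(v.getCompanyId()), v);
          }
        });

    // join pairs each vendor record with every HBase candidate sharing its key.
    JavaPairRDD<String, Tuple2<VendorRecord, CandidateRecord>> joined =
        vendorsById.join(hbaseById);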
>>
>  A join will give too big a result: the query returns around 10,000 records
> for each input record, and I have 2 million records to process, so the
> joined RDD would be huge (2 million × 10,000).
>
>>
>> For #2, it sounds like you need flatMap to return records that combine
>> the input VendorRecords and fields pulled from HBase.
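
For example (a sketch, reusing the hypothetical fetchCandidates helper from
above), emit (vendor, candidate) pairs so the logistic-regression step can
score each candidate against the vendor record that produced it:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;
    import scala.Tuple2;

    JavaRDD<Tuple2<VendorRecord, CandidateRecord>> pairs = matchRdd.flatMap(
        new FlatMapFunction<VendorRecord, Tuple2<VendorRecord, CandidateRecord>>() {
          @Override
          public Iterable<Tuple2<VendorRecord, CandidateRecord>> call(VendorRecord v)
              throws Exception {
            // Combine the input record with each field set pulled from HBase.
            List<Tuple2<VendorRecord, CandidateRecord>> out =
                new ArrayList<Tuple2<VendorRecord, CandidateRecord>>();
            for (CandidateRecord c : CompanyMatcherHelper.fetchCandidates(v)) {
              out.add(new Tuple2<VendorRecord, CandidateRecord>(v, c));
            }
            return out;
          }
        });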
>>
>> Whatever you can do to make this work like "table scans" and joins will
>> probably be most efficient.
>>
>> dean
>>
>>>
>>>
>>>
>>> On 7 April 2015 at 03:33, Dean Wampler <deanwamp...@gmail.com> wrote:
>>>
>>>> The "log" instance won't be serializable, because it will have a file
>>>> handle to write to. Try defining another static method outside
>>>> matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
>>>> might not be serializable either, but you didn't provide it. If it holds a
>>>> database connection, same problem.
>>>>
>>>> You can't suppress the warning because it's actually an error. The
>>>> VoidFunction can't be serialized to send it over the cluster's network.
>>>>
>>>> dean
>>>>
>>>> Dean Wampler, Ph.D.
>>>> Author: Programming Scala, 2nd Edition
>>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>>> Typesafe <http://typesafe.com>
>>>> @deanwampler <http://twitter.com/deanwampler>
>>>> http://polyglotprogramming.com
>>>>
>>>> On Mon, Apr 6, 2015 at 4:30 PM, Jeetendra Gangele <gangele...@gmail.com
>>>> > wrote:
>>>>
>>>>> In this code I am getting a Task not serializable exception in foreach:
>>>>>
>>>>> @SuppressWarnings("serial")
>>>>> public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
>>>>>     final JavaSparkContext jsc) throws IOException {
>>>>>   log.info("Company matcher started");
>>>>>   // final JavaSparkContext jsc = getSparkContext();
>>>>>   matchRdd.foreachAsync(new VoidFunction<VendorRecord>() {
>>>>>     @Override
>>>>>     public void call(VendorRecord t) throws Exception {
>>>>>       if (t != null) {
>>>>>         try {
>>>>>           // Referencing jsc pulls the SparkContext into the closure.
>>>>>           CompanyMatcherHelper.UpdateMatchedRecord(jsc, t);
>>>>>         } catch (Exception e) {
>>>>>           log.error("ERROR while running Matcher for company "
>>>>>               + t.getCompanyId(), e);
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   });
>>>>> }
>>>>>
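
Putting Dean's two suggestions together, a sketch of the fix: drop the
JavaSparkContext from the closure entirely and route logging through a static
method, so nothing non-serializable is captured (the jsc-free
UpdateMatchedRecord variant and logMatcherError are assumptions, not existing
code; slf4j is assumed for the logger):

    // No JavaSparkContext parameter is used inside the closure.
    public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd)
        throws IOException {
      matchRdd.foreachAsync(new VoidFunction<VendorRecord>() {
        @Override
        public void call(VendorRecord t) throws Exception {
          if (t != null) {
            try {
              // Hypothetical variant that talks to HBase directly,
              // without needing the SparkContext.
              CompanyMatcherHelper.UpdateMatchedRecord(t);
            } catch (Exception e) {
              logMatcherError("ERROR while running Matcher for company "
                  + t.getCompanyId(), e);
            }
          }
        }
      });
    }

    // Static method, as Dean suggests: the closure references only this
    // method, never a logger instance, so nothing extra is serialized.
    static void logMatcherError(String msg, Exception e) {
      org.slf4j.LoggerFactory.getLogger("matcher").error(msg, e);
    }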
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>

