Let's say I follow the approach below and end up with a pair RDD so large that it cannot fit on one machine. What happens when I run foreach on this RDD?
On 7 April 2015 at 04:25, Jeetendra Gangele <gangele...@gmail.com> wrote:
>
> On 7 April 2015 at 04:03, Dean Wampler <deanwamp...@gmail.com> wrote:
>>
>> On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>>>
>>> Thanks a lot. That means Spark does not support nested RDDs?
>>> If I pass the JavaSparkContext, that also won't work. I mean, passing the
>>> SparkContext is not possible since it's not serializable.
>>
>> That's right. RDDs don't nest, and SparkContexts aren't serializable.
>>
>>> I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd,
>>> and I need to return the potential matches for each record from HBase. So
>>> for each field of VendorRecord I have to do the following:
>>>
>>> 1. Query HBase to get the list of potential records in an RDD.
>>> 2. Run logistic regression on the RDD returned from step 1 and each
>>> element of the passed matchRdd.
>>
>> If I understand you correctly, each VendorRecord could correspond to
>> 0-to-N records in HBase, which you need to fetch, true?
>
> Yes, that's correct. Each VendorRecord corresponds to 0-to-N matches.
>
>> If so, you could use the RDD flatMap method, which takes a function that
>> accepts each record, then returns a sequence of 0-to-N new records of
>> some other type, like your HBase records. However, running an HBase query
>> for each VendorRecord could be expensive. If you can turn this into a range
>> query or something like that, it would help. I haven't used HBase much, so
>> I don't have good advice on optimizing this, if necessary.
>>
>> Alternatively, can you do some sort of join on the VendorRecord RDD and
>> an RDD of query results from HBase?
>
> A join will give too big a result. The RDD of query results returns around
> 10,000 rows for each record, and I have 2 million records to process, so the
> result will be huge: 2M * 10,000 is a big number.
>
>> For #2, it sounds like you need flatMap to return records that combine
>> the input VendorRecords and fields pulled from HBase.
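Spark's flatMap has the same 0-to-N contract Dean describes: each input record may expand into zero, one, or many output records, and the results are flattened into a single collection. A minimal sketch of that shape using plain JDK streams (Vendor and fetchCandidates are hypothetical stand-ins for VendorRecord and the per-record HBase lookup; the real Spark call would be matchRdd.flatMap(...) with the same function body):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FlatMapSketch {
    // Stand-in for one input record; a real VendorRecord carries more fields.
    record Vendor(String name, int candidateCount) {}

    // Hypothetical lookup: returns 0-to-N candidate keys for one vendor,
    // the way an HBase query would return 0-to-N potential matches.
    static List<String> fetchCandidates(Vendor v) {
        return IntStream.range(0, v.candidateCount())
                .mapToObj(i -> v.name() + "#" + i)
                .collect(Collectors.toList());
    }

    // flatMap flattens the per-record candidate lists into one flat
    // collection -- the same shape matchRdd.flatMap(...) produces in Spark.
    static List<String> expand(List<Vendor> vendors) {
        return vendors.stream()
                .flatMap(v -> fetchCandidates(v).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Vendor> input = Arrays.asList(
                new Vendor("acme", 2),
                new Vendor("globex", 0),   // 0 matches: contributes nothing
                new Vendor("initech", 1));
        System.out.println(expand(input)); // [acme#0, acme#1, initech#0]
    }
}
```

The 0-match case is why flatMap fits here better than map: a vendor with no HBase candidates simply disappears from the output instead of producing an empty placeholder.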
>> Whatever you can do to make this work like "table scans" and joins will
>> probably be most efficient.
>>
>> dean
>>
>>> On 7 April 2015 at 03:33, Dean Wampler <deanwamp...@gmail.com> wrote:
>>>
>>>> The "log" instance won't be serializable, because it will have a file
>>>> handle to write to. Try defining another static method outside
>>>> matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
>>>> might not be serializable either, but you didn't provide it. If it holds a
>>>> database connection, same problem.
>>>>
>>>> You can't suppress the warning because it's actually an error. The
>>>> VoidFunction can't be serialized to send it over the cluster's network.
>>>>
>>>> dean
>>>>
>>>> Dean Wampler, Ph.D.
>>>> Author: Programming Scala, 2nd Edition
>>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>>> Typesafe <http://typesafe.com>
>>>> @deanwampler <http://twitter.com/deanwampler>
>>>> http://polyglotprogramming.com
>>>>
>>>> On Mon, Apr 6, 2015 at 4:30 PM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>>>>
>>>>> In this code I am getting a "task not serializable" exception in the foreach:
>>>>>
>>>>> @SuppressWarnings("serial")
>>>>> public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
>>>>>     final JavaSparkContext jsc) throws IOException {
>>>>>   log.info("Company matcher started");
>>>>>   // final JavaSparkContext jsc = getSparkContext();
>>>>>   matchRdd.foreachAsync(new VoidFunction<VendorRecord>() {
>>>>>     @Override
>>>>>     public void call(VendorRecord t) throws Exception {
>>>>>       if (t != null) {
>>>>>         try {
>>>>>           CompanyMatcherHelper.UpdateMatchedRecord(jsc, t);
>>>>>         } catch (Exception e) {
>>>>>           log.error("ERROR while running Matcher for company " +
>>>>>               t.getCompanyId(), e);
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   });
>>>>> }
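The failure mode Dean points out can be reproduced with plain JDK serialization (Spark's closure serializer adds checks on top, but the underlying rule is the same): a function that captures a non-serializable object, like the JavaSparkContext or a logger's file handle, cannot be written over the wire, while one whose work is routed through a static method carries no such reference. A hedged sketch (Handle, capturing, and doWork are illustrative names, not Spark API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureSerialization {
    // Stand-in for the non-serializable things the VoidFunction captured:
    // the JavaSparkContext, or the logger's file handle.
    static class Handle { }  // deliberately NOT Serializable

    // What the original code does: the function drags the captured
    // Handle along with it, so serializing the function fails.
    static Serializable capturing(final Handle h) {
        return (Runnable & Serializable) () -> System.out.println(h);
    }

    // Dean's suggestion, roughly: put the work in a static method so the
    // serialized function holds no reference to the outer objects.
    static void doWork(String companyId) { /* log.error would go here */ }

    static Serializable nonCapturing() {
        return (Runnable & Serializable) () -> doWork("c-1");
    }

    // True if the object survives Java serialization, false if it fails
    // with NotSerializableException (Spark's "task not serializable").
    static boolean serializes(Serializable s) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(s);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(capturing(new Handle()))); // false
        System.out.println(serializes(nonCapturing()));          // true
    }
}
```

Note this only fixes the closure itself; the jsc parameter still cannot be used inside the function at all, since work shipped to executors must not reference a SparkContext, serializable or not.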