Re: task not serialize
Let's say I follow this approach and I get a pair RDD with a huge size, one that cannot fit on one machine. What happens when I run foreach on this RDD?
Re: task not serialize
Foreach() runs in parallel across the cluster, like map, flatMap, etc. You'll only run into problems if you call collect(), which brings the entire RDD into memory in the driver program.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com
Re: task not serialize
I am thinking to follow the approach below (in my case HBase also returns the same object which I will get in the RDD).

1. First run flatMapToPair:

    JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData = matchRdd.flatMapToPair(
        new PairFlatMapFunction<VendorRecord, VendorRecord, VendorRecord>() {
            @Override
            public Iterable<Tuple2<VendorRecord, VendorRecord>> call(VendorRecord t) throws Exception {
                List<Tuple2<VendorRecord, VendorRecord>> pairs =
                    new LinkedList<Tuple2<VendorRecord, VendorRecord>>();
                MatcherKeys matchkeys = CompanyMatcherHelper.getBlockinkeys(t);
                List<VendorRecord> Matchedrecords = ckdao.getMatchingRecordsWithscan(matchkeys);
                for (int i = 0; i < Matchedrecords.size(); i++) {
                    pairs.add(new Tuple2<VendorRecord, VendorRecord>(t, Matchedrecords.get(i)));
                }
                return pairs;
            }
        }).groupByKey(200);

Question: will it store the returned RDD on one node, or does it only bring it when I run the second step? In groupByKey, if I increase the partition number, will it increase performance?

2. Then apply mapPartitions on this RDD and do logistic regression there. My issue is my logistic regression function take
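For reference, groupByKey is a transformation, so nothing is materialized until an action runs, and the grouped RDD stays partitioned across the cluster by key rather than landing on one node. Its shape can be sketched with plain Java collections (no Spark; the string pairs below are hypothetical stand-ins for the VendorRecord pairs above):

```java
import java.util.*;

// Plain-Java sketch (no Spark) of what flatMapToPair(...).groupByKey() produces:
// every (vendor, match) pair is grouped so each vendor maps to all its matches.
public class GroupByKeySketch {
    public static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] p : pairs) {
            // p[0] is the key (the input record), p[1] one candidate match
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
            new String[]{"v1", "h1"},
            new String[]{"v1", "h2"},
            new String[]{"v3", "h9"});
        System.out.println(groupByKey(pairs));
        // {v1=[h1, h2], v3=[h9]}
    }
}
```

In Spark the partition count passed to groupByKey(200) only controls how this map is sliced across executors; the grouping itself is the same.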
task not serialize
In this code, in the foreach, I am getting a task not serializable exception:

    @SuppressWarnings("serial")
    public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, final JavaSparkContext jsc) throws IOException {
        log.info("Company matcher started");
        // final JavaSparkContext jsc = getSparkContext();
        matchRdd.foreachAsync(new VoidFunction<VendorRecord>() {
            @Override
            public void call(VendorRecord t) throws Exception {
                if (t != null) {
                    try {
                        CompanyMatcherHelper.UpdateMatchedRecord(jsc, t);
                    } catch (Exception e) {
                        log.error("ERROR while running Matcher for company " + t.getCompanyId(), e);
                    }
                }
            }
        });
    }
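For illustration, this failure can be reproduced with plain Java serialization, no Spark required (FakeContext and the method names below are hypothetical stand-ins): Spark java-serializes the anonymous VoidFunction to ship it to executors, and the captured jsc reference is not Serializable, so serialization itself throws before the task ever runs.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Plain-Java sketch (no Spark) of why the foreachAsync closure above fails.
public class ClosureCaptureDemo {

    // Stand-in for JavaSparkContext: holds live resources, not Serializable.
    public static class FakeContext { }

    public interface VoidFunction<T> extends Serializable {
        void call(T t) throws Exception;
    }

    // Like matchAndMerge: the anonymous class captures ctx as a hidden field.
    public static VoidFunction<String> capturingFunction() {
        final FakeContext ctx = new FakeContext();
        return new VoidFunction<String>() {
            public void call(String t) { System.out.println(ctx + ": " + t); }
        };
    }

    // Captures nothing non-serializable, so it could be shipped to executors.
    public static VoidFunction<String> cleanFunction() {
        return new VoidFunction<String>() {
            public void call(String t) { System.out.println(t); }
        };
    }

    public static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (Exception e) {
            return false; // NotSerializableException lands here
        }
    }

    public static void main(String[] args) {
        System.out.println("capturing: " + serializes(capturingFunction())); // false
        System.out.println("clean:     " + serializes(cleanFunction()));     // true
    }
}
```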
Re: task not serialize
Thanks a lot. That means Spark does not support nested RDDs? If I pass the JavaSparkContext that also won't work; I mean passing the SparkContext is not possible since it's not serializable.

I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd and I need to return the potential matches for each record from HBase. So for each field of VendorRecord I have to do the following:
1. Query HBase to get the list of potential records as an RDD.
2. Run logistic regression on the RDD returned from step 1 and each element of the passed matchRdd.
Re: task not serialize
On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com wrote:

> Thanks a lot. That means Spark does not support nested RDDs? If I pass the JavaSparkContext that also won't work; I mean passing the SparkContext is not possible since it's not serializable.

That's right. RDDs don't nest and SparkContexts aren't serializable.

> I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd and I need to return the potential matches for each record from HBase. So for each field of VendorRecord I have to do the following: 1. Query HBase to get the list of potential records as an RDD. 2. Run logistic regression on the RDD returned from step 1 and each element of the passed matchRdd.

If I understand you correctly, each VendorRecord could correspond to 0-to-N records in HBase, which you need to fetch, true?

If so, you could use the RDD flatMap method, which takes a function that accepts each record, then returns a sequence of 0-to-N new records of some other type, like your HBase records. However, running an HBase query for each VendorRecord could be expensive. If you can turn this into a range query or something like that, it would help. I haven't used HBase much, so I don't have good advice on optimizing this, if necessary.

Alternatively, can you do some sort of join on the VendorRecord RDD and an RDD of query results from HBase?

For #2, it sounds like you need flatMap to return records that combine the input VendorRecords and fields pulled from HBase. Whatever you can do to make this work like table scans and joins will probably be most efficient.

dean
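The 0-to-N shape described here is the same one flatMap has on Java streams, which makes for a runnable, Spark-free sketch (the candidate map below is hypothetical, standing in for the per-record HBase lookup):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java sketch of RDD.flatMap's shape: each input record expands
// into 0..N output records, and empty matches simply disappear.
public class FlatMapSketch {
    // Hypothetical lookup: vendor id -> 0..N candidate match ids.
    static final Map<String, List<String>> CANDIDATES = Map.of(
        "v1", List.of("h1", "h2"),
        "v2", List.of(),          // no matches: flatMap emits nothing for v2
        "v3", List.of("h9"));

    public static List<String> expand(List<String> vendorIds) {
        return vendorIds.stream()
            // one input record -> 0..N output records, like RDD.flatMap
            .flatMap(id -> CANDIDATES.getOrDefault(id, List.of()).stream()
                .map(h -> id + ":" + h))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("v1", "v2", "v3")));
        // [v1:h1, v1:h2, v3:h9]
    }
}
```

In Spark the same expansion runs partition by partition across the cluster instead of over a local list.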
Re: task not serialize
On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote:

> If I understand you correctly, each VendorRecord could correspond to 0-to-N records in HBase, which you need to fetch, true?

Yes, that's correct: each VendorRecord corresponds to 0 to N matches.

> Alternatively, can you do some sort of join on the VendorRecord RDD and an RDD of query results from HBase?

A join will give too big a result. The RDD of query results returns around 1 for each record, and I have 2 million to process, so it will be huge to have this. 2 M * 1 is a big number.
Re: task not serialize
The log instance won't be serializable, because it will have a file handle to write to. Try defining another static method outside matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper might not be serializable either, but you didn't provide it. If it holds a database connection, same problem.

You can't suppress the warning because it's actually an error. The VoidFunction can't be serialized to send it over the cluster's network.

dean
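The static-method suggestion can be sketched in plain Java (no Spark or log4j; FileLogger and the function names are hypothetical stand-ins): static members are never part of an instance's serialized graph, so a function whose logging goes only through a static method stays serializable, while one holding the logger as an instance field does not.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Plain-Java sketch of routing logging through a static method so the
// serialized function carries no non-serializable logger field.
public class StaticLogDemo {

    // Stand-in for a logger that holds a file handle: not Serializable.
    public static class FileLogger {
        void error(String msg) { System.err.println(msg); }
    }

    static final FileLogger LOG = new FileLogger(); // static: never serialized

    static void logError(String msg) { LOG.error(msg); } // the static wrapper

    // Bad: an instance field drags the logger into the serialized graph.
    public static class BadFn implements Serializable {
        final FileLogger log = new FileLogger();
        public void call(String t) { log.error("failed: " + t); }
    }

    // Good: only a static reference; nothing non-serializable is captured.
    public static class GoodFn implements Serializable {
        public void call(String t) { logError("failed: " + t); }
    }

    public static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (Exception e) {
            return false; // NotSerializableException lands here
        }
    }

    public static void main(String[] args) {
        System.out.println("BadFn serializes:  " + serializes(new BadFn()));  // false
        System.out.println("GoodFn serializes: " + serializes(new GoodFn())); // true
    }
}
```

The same reasoning applies to CompanyMatcherHelper: as long as the worker-side function reaches it only through static calls and the class holds no captured non-serializable state, the closure can ship.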