Re: task not serialize

2015-04-07 Thread Jeetendra Gangele
Let's say I follow the approach below and end up with a pair RDD so large that
it cannot fit on one machine. What happens if I run foreach on this RDD?

On 7 April 2015 at 04:25, Jeetendra Gangele gangele...@gmail.com wrote:



 On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote:


 On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Thanks a lot. That means Spark does not support nested RDDs?
 Passing the JavaSparkContext also won't work; passing a
 SparkContext is not possible since it's not serializable.

 That's right. RDDs don't nest, and SparkContexts aren't serializable.


 I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd
 and need to return the potential matches for each record from HBase. So
 for each field of VendorRecord I have to do the following:

 1. Query HBase to get the list of potential records in an RDD.
 2. Run logistic regression on the RDD returned from step 1 and each element
 of the passed matchRdd.

 If I understand you correctly, each VendorRecord could correspond to
 0-to-N records in HBase, which you need to fetch, true?

  Yes, that's correct; each VendorRecord corresponds to 0 to N matches.


 If so, you could use the RDD flatMap method, which takes a function
 that accepts each record and returns a sequence of 0-to-N new records of
 some other type, like your HBase records. However, running an HBase query
 for each VendorRecord could be expensive. If you can turn this into a range
 query or something like that, it would help. I haven't used HBase much, so
 I don't have good advice on optimizing this, if necessary.
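The 0-to-N contract of flatMap can be sketched with plain Java streams; the candidates lookup below is a hypothetical stand-in for the HBase fetch:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapSketch {
    // Hypothetical stand-in for the HBase lookup: each input record
    // yields zero or more candidate match records.
    static List<String> candidates(String record) {
        return record.equals("v2")
                ? Collections.emptyList()
                : Arrays.asList(record + "-a", record + "-b");
    }

    public static void main(String[] args) {
        // Same contract as RDD.flatMap: one input, 0-to-N outputs, flattened.
        List<String> out = Stream.of("v1", "v2", "v3")
                .flatMap(r -> candidates(r).stream())
                .collect(Collectors.toList());
        System.out.println(out); // [v1-a, v1-b, v3-a, v3-b]
    }
}
```

Note how "v2", which has no candidates, simply contributes nothing to the output; that is why flatMap fits a lookup that may return empty results.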

 Alternatively, can you do some sort of join on the VendorRecord RDD and
 an RDD of query results from HBase?

  A join will give too big a result: the RDD of query results is returning around
 1 for each record, and I have 2 million to process, so it will be huge
 to have this. 2 M * 1 is a big number.


 For #2, it sounds like you need flatMap to return records that combine
 the input VendorRecords and fields pulled from HBase.

 Whatever you can do to make this work like table scans and joins will
 probably be most efficient.

 dean




 On 7 April 2015 at 03:33, Dean Wampler deanwamp...@gmail.com wrote:

 The log instance won't be serializable, because it will have a file
 handle to write to. Try defining another static method outside
 matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
 might not be serializable either, but you didn't provide it. If it holds a
 database connection, same problem.

 You can't suppress the warning because it's actually an error. The
 VoidFunction can't be serialized to send it over the cluster's network.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Mon, Apr 6, 2015 at 4:30 PM, Jeetendra Gangele gangele...@gmail.com
  wrote:

 In this code, I am getting a Task not serializable exception in foreach.


 @SuppressWarnings("serial")
 public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
     final JavaSparkContext jsc) throws IOException {
   log.info("Company matcher started");
   // final JavaSparkContext jsc = getSparkContext();
   matchRdd.foreachAsync(new VoidFunction<VendorRecord>() {
     @Override
     public void call(VendorRecord t) throws Exception {
       if (t != null) {
         try {
           CompanyMatcherHelper.UpdateMatchedRecord(jsc, t);
         } catch (Exception e) {
           log.error("ERROR while running Matcher for company " + t.getCompanyId(), e);
         }
       }
     }
   });
 }
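The failure mode can be reproduced with plain Java serialization; Context below is a hypothetical stand-in for the JavaSparkContext (or the logger) captured by the anonymous function:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationSketch {
    // Hypothetical stand-in for a non-serializable resource such as
    // a JavaSparkContext or a logger holding a file handle.
    static class Context { }

    // Mirrors the anonymous VoidFunction: the class itself is
    // Serializable, but it drags in a non-serializable field.
    static class CapturingTask implements Serializable {
        final Context ctx = new Context();
    }

    static boolean serializes(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;  // what Spark surfaces as "Task not serializable"
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new CapturingTask())); // false
    }
}
```

Spark must ship the function over the network to the executors, so everything it references, directly or via captured variables, has to survive exactly this kind of serialization.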
-


Re: task not serialize

2015-04-07 Thread Dean Wampler
Foreach() runs in parallel across the cluster, like map, flatMap, etc.
You'll only run into problems if you call collect(), which brings the
entire RDD into memory in the driver program.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Tue, Apr 7, 2015 at 3:50 AM, Jeetendra Gangele gangele...@gmail.com
wrote:

 Let's say I follow the approach below and end up with a pair RDD so large that
 it cannot fit on one machine. What happens if I run foreach on this RDD?

 -



Re: task not serialize

2015-04-07 Thread Jeetendra Gangele
I am thinking of following the approach below (in my class, HBase also returns
the same object type that I get in the RDD).
1. First run flatMapToPair:

JavaPairRDD<VendorRecord, Iterable<VendorRecord>> pairvendorData =
    matchRdd.flatMapToPair(new PairFlatMapFunction<VendorRecord,
        VendorRecord, VendorRecord>() {

      @Override
      public Iterable<Tuple2<VendorRecord, VendorRecord>> call(
          VendorRecord t) throws Exception {
        List<Tuple2<VendorRecord, VendorRecord>> pairs =
            new LinkedList<Tuple2<VendorRecord, VendorRecord>>();
        MatcherKeys matchkeys = CompanyMatcherHelper.getBlockinkeys(t);
        List<VendorRecord> matchedRecords =
            ckdao.getMatchingRecordsWithscan(matchkeys);
        for (int i = 0; i < matchedRecords.size(); i++) {
          pairs.add(new Tuple2<VendorRecord, VendorRecord>(t, matchedRecords.get(i)));
        }
        return pairs;
      }
    }).groupByKey(200);
Question: will the returned RDD be stored on one node, or is it only brought in
when I run the second step? In groupByKey, will increasing the partition number
improve performance?
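The logical shape this pipeline produces, every candidate gathered under its input record, can be sketched with in-memory Java collections (plain strings stand in for VendorRecords here):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByKeySketch {
    // Gathers all candidates per record, like groupByKey on a pair RDD.
    // In Spark the resulting groups stay spread across partitions.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        return pairs.stream()
                .collect(Collectors.groupingBy(p -> p[0],
                        Collectors.mapping(p -> p[1], Collectors.toList())));
    }

    public static void main(String[] args) {
        // (record, candidate) pairs as produced by the flatMapToPair step.
        List<String[]> pairs = Arrays.asList(
                new String[]{"v1", "m1"},
                new String[]{"v1", "m2"},
                new String[]{"v2", "m3"});
        System.out.println(groupByKey(pairs).get("v1")); // [m1, m2]
    }
}
```

Unlike this in-memory sketch, the Spark version stays lazy: nothing is computed until an action runs, and the grouped data remains partitioned across the cluster rather than collected on one node, though groupByKey does shuffle every pair.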

 2. Then apply mapPartitions on this RDD and do logistic regression there. My
issue is that my logistic regression function take

On 7 April 2015 at 18:38, Dean Wampler deanwamp...@gmail.com wrote:

 Foreach() runs in parallel across the cluster, like map, flatMap, etc.
 You'll only run into problems if you call collect(), which brings the
 entire RDD into memory in the driver program.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com


task not serialize

2015-04-06 Thread Jeetendra Gangele
In this code, I am getting a Task not serializable exception in foreach.


@SuppressWarnings("serial")
public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
    final JavaSparkContext jsc) throws IOException {
  log.info("Company matcher started");
  // final JavaSparkContext jsc = getSparkContext();
  matchRdd.foreachAsync(new VoidFunction<VendorRecord>() {
    @Override
    public void call(VendorRecord t) throws Exception {
      if (t != null) {
        try {
          CompanyMatcherHelper.UpdateMatchedRecord(jsc, t);
        } catch (Exception e) {
          log.error("ERROR while running Matcher for company " + t.getCompanyId(), e);
        }
      }
    }
  });
}


Re: task not serialize

2015-04-06 Thread Jeetendra Gangele
Thanks a lot. That means Spark does not support nested RDDs?
Passing the JavaSparkContext also won't work; passing a
SparkContext is not possible since it's not serializable.

I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd and
need to return the potential matches for each record from HBase. So for
each field of VendorRecord I have to do the following:

1. Query HBase to get the list of potential records in an RDD.
2. Run logistic regression on the RDD returned from step 1 and each element of
the passed matchRdd.




On 7 April 2015 at 03:33, Dean Wampler deanwamp...@gmail.com wrote:

 The log instance won't be serializable, because it will have a file
 handle to write to. Try defining another static method outside
 matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
 might not be serializable either, but you didn't provide it. If it holds a
 database connection, same problem.

 You can't suppress the warning because it's actually an error. The
 VoidFunction can't be serialized to send it over the cluster's network.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com






Re: task not serialize

2015-04-06 Thread Dean Wampler
On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com
wrote:

 Thanks a lot. That means Spark does not support nested RDDs?
 Passing the JavaSparkContext also won't work; passing a
 SparkContext is not possible since it's not serializable.

 That's right. RDDs don't nest, and SparkContexts aren't serializable.


 I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd and
 need to return the potential matches for each record from HBase. So for
 each field of VendorRecord I have to do the following:

 1. Query HBase to get the list of potential records in an RDD.
 2. Run logistic regression on the RDD returned from step 1 and each element of
 the passed matchRdd.

If I understand you correctly, each VendorRecord could correspond to
0-to-N records in HBase, which you need to fetch, true?

If so, you could use the RDD flatMap method, which takes a function that
accepts each record and returns a sequence of 0-to-N new records of some
other type, like your HBase records. However, running an HBase query for
each VendorRecord could be expensive. If you can turn this into a range
query or something like that, it would help. I haven't used HBase much, so
I don't have good advice on optimizing this, if necessary.

Alternatively, can you do some sort of join on the VendorRecord RDD and an
RDD of query results from HBase?

For #2, it sounds like you need flatMap to return records that combine the
input VendorRecords and fields pulled from HBase.

Whatever you can do to make this work like table scans and joins will
probably be most efficient.

dean





Re: task not serialize

2015-04-06 Thread Jeetendra Gangele
On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote:


 On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Thanks a lot. That means Spark does not support nested RDDs?
 Passing the JavaSparkContext also won't work; passing a
 SparkContext is not possible since it's not serializable.

 That's right. RDDs don't nest, and SparkContexts aren't serializable.


 I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd and
 need to return the potential matches for each record from HBase. So for
 each field of VendorRecord I have to do the following:

 1. Query HBase to get the list of potential records in an RDD.
 2. Run logistic regression on the RDD returned from step 1 and each element of
 the passed matchRdd.

 If I understand you correctly, each VendorRecord could correspond to
 0-to-N records in HBase, which you need to fetch, true?

 Yes, that's correct; each VendorRecord corresponds to 0 to N matches.


 If so, you could use the RDD flatMap method, which takes a function
 that accepts each record and returns a sequence of 0-to-N new records of
 some other type, like your HBase records. However, running an HBase query
 for each VendorRecord could be expensive. If you can turn this into a range
 query or something like that, it would help. I haven't used HBase much, so
 I don't have good advice on optimizing this, if necessary.

 Alternatively, can you do some sort of join on the VendorRecord RDD and an
 RDD of query results from HBase?

 A join will give too big a result: the RDD of query results is returning around
1 for each record, and I have 2 million to process, so it will be huge
to have this. 2 M * 1 is a big number.


 For #2, it sounds like you need flatMap to return records that combine the
 input VendorRecords and fields pulled from HBase.

 Whatever you can do to make this work like table scans and joins will
 probably be most efficient.

 dean





Re: task not serialize

2015-04-06 Thread Dean Wampler
The log instance won't be serializable, because it will have a file
handle to write to. Try defining another static method outside
matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper
might not be serializable either, but you didn't provide it. If it holds a
database connection, same problem.

You can't suppress the warning because it's actually an error. The
VoidFunction can't be serialized to send it over the cluster's network.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Mon, Apr 6, 2015 at 4:30 PM, Jeetendra Gangele gangele...@gmail.com
wrote:

 In this code, I am getting a Task not serializable exception in foreach.


 @SuppressWarnings("serial")
 public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd,
     final JavaSparkContext jsc) throws IOException {
   log.info("Company matcher started");
   // final JavaSparkContext jsc = getSparkContext();
   matchRdd.foreachAsync(new VoidFunction<VendorRecord>() {
     @Override
     public void call(VendorRecord t) throws Exception {
       if (t != null) {
         try {
           CompanyMatcherHelper.UpdateMatchedRecord(jsc, t);
         } catch (Exception e) {
           log.error("ERROR while running Matcher for company " + t.getCompanyId(), e);
         }
       }
     }
   });
 }
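The suggested fix, keeping non-serializable resources out of the shipped function and recreating them on the worker, can be sketched in plain Java with a transient field; Context is again a hypothetical stand-in for the HBase connection or logger:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class TransientFixSketch {
    // Hypothetical stand-in for a non-serializable resource.
    static class Context { }

    // The function no longer captures the resource: the transient field
    // is skipped during serialization and lazily recreated on each worker.
    static class SafeTask implements Serializable {
        private transient Context ctx;

        void call(String record) {
            if (ctx == null) {
                ctx = new Context();  // opened worker-side, never shipped
            }
            // ... use ctx to look up and update the record ...
        }
    }

    static int serializedSize(Serializable o) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        SafeTask task = new SafeTask();
        task.call("v1");  // resource now exists locally
        System.out.println(serializedSize(task) > 0); // serializes cleanly
    }
}
```

For per-record HBase updates, another common pattern in Spark is foreachPartition, which lets one connection be opened per partition instead of per record, amortizing the setup cost the same way.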