In option 1, class HTable cannot be serialized; as the stack trace shows, it is its org.apache.hadoop.conf.Configuration field that fails. You also need to check your self-defined function getUserActions and make sure it is a member of a class that implements the Serializable interface. The safest pattern is to create the Configuration and HTable inside mapPartitions (as you already do) and to keep getUserActions out of any non-serializable enclosing class.
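A minimal sketch of what that can look like, assuming Spark with the HBase 0.94/0.98 client API; the helper object UserActions and its placeholder body are hypothetical stand-ins for your own getUserActions, and the RDD element types are guesses:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.spark.rdd.RDD

    // Hypothetical helper: a top-level object keeps getUserActions from
    // dragging a non-serializable enclosing instance into the task closure.
    object UserActions {
      def getUserActions(table: HTable, userId: String,
                         timeStart: Long, timeStop: Long): Seq[String] = {
        // ... scan userId's rowkey range between timeStart and timeStop ...
        Seq.empty  // placeholder body
      }

      // input, timeStart and timeStop are the values from your post.
      def userActionsByUser(input: RDD[(String, String)],
                            timeStart: Long,
                            timeStop: Long): RDD[(String, Seq[String])] =
        input
          .map { case (deviceId, uid) => uid }
          .distinct()
          .sortBy(x => x)
          .mapPartitions { iterator =>
            // Created on the executor, once per partition, so neither the
            // Configuration nor the HTable is ever serialized.
            val conf = HBaseConfiguration.create()
            val table = new HTable(conf, "actions")
            iterator.map { userId =>
              (userId, getUserActions(table, userId, timeStart, timeStop))
            }
          }
    }

Note the table is deliberately not closed inside mapPartitions here; see the remark after the quoted message below.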
Sent from my iPad

> On Dec 12, 2014, at 4:35 PM, yangliuyu <yangli...@163.com> wrote:
>
> The scenario is using an HTable instance to scan multiple rowkey ranges in
> Spark tasks, which looks like below:
>
> Option 1:
>
> val users = input
>   .map { case (deviceId, uid) => uid }
>   .distinct()
>   .sortBy(x => x)
>   .mapPartitions { iterator =>
>     val conf = HBaseConfiguration.create()
>     val table = new HTable(conf, "actions")
>     val result = iterator.map { userId =>
>       (userId, getUserActions(table, userId, timeStart, timeStop))
>     }
>     table.close()
>     result
>   }
>
> But got the exception:
>
> org.apache.spark.SparkException: Task not serializable
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:1264)
>     at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:597)
>     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
>     ...
> Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
>     at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
>     at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>     at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>
> The reason for not using sc.newAPIHadoopRDD is that it only supports one
> scan at a time:
>
> val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>   classOf[org.apache.hadoop.hbase.client.Result])
>
> And when using MultiTableInputFormat, it is not possible for the driver to
> put all rowkeys into the HBaseConfiguration.
>
> Option 2:
>
> sc.newAPIHadoopRDD(conf, classOf[MultiTableInputFormat],
>   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
>   classOf[org.apache.hadoop.hbase.client.Result])
>
> It may be possible to divide all rowkey ranges into several parts and then
> use option 2, but I prefer option 1. So is there any solution for option 1?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Serialization-issue-when-using-HBase-with-Spark-tp20655.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
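One more note on the quoted code: iterator.map is lazy, so option 1's table.close() runs before any row has actually been read, and the sketch above leaves the table open for the same reason. A sketch of one way to close it safely, assuming Spark 1.2+ where TaskContext.get() and addTaskCompletionListener are available, is a task-completion hook inside mapPartitions:

    import org.apache.spark.TaskContext

      .mapPartitions { iterator =>
        val conf = HBaseConfiguration.create()
        val table = new HTable(conf, "actions")
        // Close only after the whole partition has been consumed,
        // not before the lazy iterator has produced anything.
        TaskContext.get().addTaskCompletionListener { _ => table.close() }
        iterator.map { userId =>
          (userId, UserActions.getUserActions(table, userId, timeStart, timeStop))
        }
      }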