Hi,
I remember seeing a similar performance problem with Apache Shark last year
when compared to Hive, though that was in a company-specific port of the
code. Unfortunately I no longer have access to that code. The problem then
was reflection-based class creation in the critical path of reading each
record. Just make sure the code flow for each parse() doesn't do something
like that.

I would also check whether lines like
"getFieldObjectInspector().asInstanceOf[StringObjectInspector]" come from
the Hive code path as well; otherwise they look like they'll slow down the
parsing if they're run for each record.
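
Roughly, something like the following (untested sketch, reusing the column
properties and the Record class from your code below) would move the SerDe
setup and the ObjectInspector casts out of the per-record path by doing
them once per partition with mapPartitions:

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hive.ql.io.RCFileInputFormat
    import org.apache.hadoop.hive.serde2.columnar.{BytesRefArrayWritable, ColumnarSerDe}
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.{IntObjectInspector, StringObjectInspector}
    import org.apache.hadoop.io.LongWritable

    val records = sc.hadoopFile(rcFile,
        classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
        classOf[LongWritable],
        classOf[BytesRefArrayWritable])
      .mapPartitions { iter =>
        // Per-partition setup: runs once per block of records, not once per record.
        val serDe = new ColumnarSerDe()
        val tbl = new Properties()
        tbl.setProperty("serialization.format", "9")
        tbl.setProperty("columns", "time,id,name,application")
        tbl.setProperty("columns.types", "string:int:string:string")
        tbl.setProperty("serialization.null.format", "NULL")
        serDe.initialize(new Configuration(), tbl)

        val soi = serDe.getObjectInspector.asInstanceOf[StructObjectInspector]
        val fieldRefs = soi.getAllStructFieldRefs.asScala
        // Cache the field inspectors so the asInstanceOf casts are not repeated per record.
        val timeOI = fieldRefs(0).getFieldObjectInspector.asInstanceOf[StringObjectInspector]
        val idOI   = fieldRefs(1).getFieldObjectInspector.asInstanceOf[IntObjectInspector]
        val nameOI = fieldRefs(2).getFieldObjectInspector.asInstanceOf[StringObjectInspector]
        val appOI  = fieldRefs(3).getFieldObjectInspector.asInstanceOf[StringObjectInspector]

        // Per-record work is now just deserialize + field access.
        iter.map { case (_, braw) =>
          val row = serDe.deserialize(braw)
          new Record(
            timeOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(0))),
            idOI.get(soi.getStructFieldData(row, fieldRefs(1))),
            nameOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(2))),
            appOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(3))))
        }
      }

With that, the only work done for each record is the deserialize call and
the field lookups.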

Pramod

On Fri, Apr 17, 2015 at 12:25 PM, gle <glendaleon...@gmail.com> wrote:

> Hi,
>
> I'm new to Spark and am working on a proof of concept.  I'm using Spark
> 1.3.0 and running in local mode.
>
> I can read and parse an RCFile using Spark; however, the performance is
> not as good as I hoped. I'm testing with ~800k rows, and it takes about
> 30 minutes to process.
>
> Is there a better way to load and process an RCFile?  The only reference
> to RCFile in 'Learning Spark' is in the Spark SQL chapter.  Is Spark SQL
> the recommended way to work with RCFiles, meaning I should avoid using
> Spark core functionality for them?
>
> I'm using the following code to build an RDD[Record]:
>
>     val records: RDD[Record] = sc.hadoopFile(rcFile,
>         classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
>         classOf[LongWritable],
>         classOf[BytesRefArrayWritable])
>       .map(x => (x._1.get, parse(x._2)))
>       .map(pair => pair._2)
>
> The function parse is defined as:
>
>   def parse(braw: BytesRefArrayWritable): Record = {
>     val serDe = new ColumnarSerDe()
>     val tbl: Properties = new Properties()
>     tbl.setProperty("serialization.format", "9")
>     tbl.setProperty("columns", "time,id,name,application")
>     tbl.setProperty("columns.types", "string:int:string:string")
>     tbl.setProperty("serialization.null.format", "NULL")
>     serDe.initialize(new Configuration(), tbl)
>
>     val oi = serDe.getObjectInspector()
>     val soi: StructObjectInspector = oi.asInstanceOf[StructObjectInspector]
>     val fieldRefs: Buffer[_ <: StructField] = soi.getAllStructFieldRefs().asScala
>     val row = serDe.deserialize(braw)
>
>     val timeRec = soi.getStructFieldData(row, fieldRefs(0))
>     val idRec = soi.getStructFieldData(row, fieldRefs(1))
>     val nameRec = soi.getStructFieldData(row, fieldRefs(2))
>     val applicationRec = soi.getStructFieldData(row, fieldRefs(3))
>
>     val timeOI = fieldRefs(0).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
>     val time = timeOI.getPrimitiveJavaObject(timeRec)
>     val idOI = fieldRefs(1).getFieldObjectInspector().asInstanceOf[IntObjectInspector]
>     val id = idOI.get(idRec)
>     val nameOI = fieldRefs(2).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
>     val name = nameOI.getPrimitiveJavaObject(nameRec)
>     val appOI = fieldRefs(3).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
>     val app = appOI.getPrimitiveJavaObject(applicationRec)
>
>     new Record(time, id, name, app)
>   }
>
>
> Thanks in advance,
> Glenda
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Code-to-read-RCFiles-tp14934p22545.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
