Hi, I'm new to Spark and am working on a proof of concept. I'm using Spark 1.3.0 and running in local mode.
I can read and parse an RCFile using Spark, but the performance is not as good as I hoped: processing ~800k rows takes about 30 minutes. Is there a better way to load and process an RCFile? The only reference to RCFiles in 'Learning Spark' is in the Spark SQL chapter. Is Spark SQL the recommended way to read RCFiles, and should I avoid using Spark core functionality for them?

I'm using the following code to build an RDD[Record]:

    val records: RDD[Record] = sc
      .hadoopFile(rcFile,
        classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
        classOf[LongWritable],
        classOf[BytesRefArrayWritable])
      .map(x => (x._1.get, parse(x._2)))
      .map(pair => pair._2)

The parse function is defined as:

    def parse(braw: BytesRefArrayWritable): Record = {
      val serDe = new ColumnarSerDe()
      val tbl = new Properties()
      tbl.setProperty("serialization.format", "9")
      tbl.setProperty("columns", "time,id,name,application")
      tbl.setProperty("columns.types", "string:int:string:string")
      tbl.setProperty("serialization.null.format", "NULL")
      serDe.initialize(new Configuration(), tbl)

      val soi = serDe.getObjectInspector().asInstanceOf[StructObjectInspector]
      val fieldRefs: Buffer[_ <: StructField] = soi.getAllStructFieldRefs().asScala

      val row = serDe.deserialize(braw)
      val timeRec        = soi.getStructFieldData(row, fieldRefs(0))
      val idRec          = soi.getStructFieldData(row, fieldRefs(1))
      val nameRec        = soi.getStructFieldData(row, fieldRefs(2))
      val applicationRec = soi.getStructFieldData(row, fieldRefs(3))

      val timeOI = fieldRefs(0).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
      val time   = timeOI.getPrimitiveJavaObject(timeRec)
      val idOI   = fieldRefs(1).getFieldObjectInspector().asInstanceOf[IntObjectInspector]
      val id     = idOI.get(idRec)
      val nameOI = fieldRefs(2).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
      val name   = nameOI.getPrimitiveJavaObject(nameRec)
      val appOI  = fieldRefs(3).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
      val app    = appOI.getPrimitiveJavaObject(applicationRec)

      new Record(time, id, name, app)
    }

Thanks in advance,
Glenda
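P.S. One thing I suspect is costly: parse builds and initializes a new ColumnarSerDe for every row. A variant I'm considering would construct the SerDe and object inspectors once per partition via mapPartitions, so only the deserialize call runs per row. A sketch (untested; Record, rcFile, and the column properties are the same as in my code above):

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.ql.io.RCFileInputFormat
import org.apache.hadoop.hive.serde2.columnar.{BytesRefArrayWritable, ColumnarSerDe}
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.{IntObjectInspector, StringObjectInspector}
import org.apache.hadoop.io.LongWritable
import org.apache.spark.rdd.RDD

val records: RDD[Record] = sc
  .hadoopFile(rcFile,
    classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
    classOf[LongWritable],
    classOf[BytesRefArrayWritable])
  .mapPartitions { iter =>
    // Build and initialize the SerDe once per partition, not once per row.
    val serDe = new ColumnarSerDe()
    val tbl = new Properties()
    tbl.setProperty("serialization.format", "9")
    tbl.setProperty("columns", "time,id,name,application")
    tbl.setProperty("columns.types", "string:int:string:string")
    tbl.setProperty("serialization.null.format", "NULL")
    serDe.initialize(new Configuration(), tbl)

    // Resolve the object inspectors once as well.
    val soi       = serDe.getObjectInspector().asInstanceOf[StructObjectInspector]
    val fieldRefs = soi.getAllStructFieldRefs().asScala
    val timeOI = fieldRefs(0).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
    val idOI   = fieldRefs(1).getFieldObjectInspector().asInstanceOf[IntObjectInspector]
    val nameOI = fieldRefs(2).getFieldObjectInspector().asInstanceOf[StringObjectInspector]
    val appOI  = fieldRefs(3).getFieldObjectInspector().asInstanceOf[StringObjectInspector]

    // Per row, only deserialize and extract the four fields.
    iter.map { case (_, braw) =>
      val row = serDe.deserialize(braw)
      new Record(
        timeOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(0))),
        idOI.get(soi.getStructFieldData(row, fieldRefs(1))),
        nameOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(2))),
        appOI.getPrimitiveJavaObject(soi.getStructFieldData(row, fieldRefs(3))))
    }
  }
```

Does this look like the right direction, or is per-row SerDe construction not actually the bottleneck?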
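P.P.S. Since 'Learning Spark' points at Spark SQL for RCFiles: if the data were registered as a Hive table stored as RCFILE, I could presumably skip the manual SerDe work entirely with a HiveContext (available in Spark 1.3.0). A sketch, assuming a hypothetical table name "events" and an HDFS path of my choosing; the table DDL here is my own, not from any existing metastore:

```scala
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Hypothetical: declare an external table over the existing RCFile directory.
sqlContext.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (
    time string, id int, name string, application string)
  STORED AS RCFILE
  LOCATION '/data/events'
""")

// In Spark 1.3 this returns a DataFrame.
val df = sqlContext.sql("SELECT time, id, name, application FROM events")
df.take(5).foreach(println)
```

Is this the recommended route, and would it actually be faster than the hadoopFile approach for my ~800k rows?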