Re: Cleaning/transforming json before converting to SchemaRDD

2014-11-04 Thread Yin Huai
Hi Daniel,

Right now, you need to do the transformation manually. The feature you need is under development (https://issues.apache.org/jira/browse/SPARK-4190).

Thanks,
Yin

On Tue, Nov 4, 2014 at 2:44 AM, Gerard Maas wrote:
> You could transform the json to a case class instead of serializing
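[For reference, a minimal sketch of what the manual route could look like, assuming json4s for parsing (as in Daniel's snippet below) and Spark 1.1's sqlContext.jsonRDD; sc, sqlContext, inpath and outpath are taken from the original thread, while "badField" is purely a placeholder for whatever needs cleaning:

  import org.json4s._
  import org.json4s.jackson.JsonMethods._

  // Parse each line, drop the problematic field, and re-serialize to a JSON string.
  val cleaned = sc.textFile(inpath).map { line =>
    val json = parse(line)
    val fixed = json removeField { case (name, _) => name == "badField" }
    compact(render(fixed))
  }

  // Let Spark SQL infer the schema from the cleaned strings, then write parquet.
  val schemaRDD = sqlContext.jsonRDD(cleaned)
  schemaRDD.saveAsParquetFile(outpath)
]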

Re: Cleaning/transforming json before converting to SchemaRDD

2014-11-03 Thread Gerard Maas
You could transform the json to a case class instead of serializing it back to a String. The resulting RDD[MyCaseClass] is then directly usable as a SchemaRDD using the register function implicitly provided by 'import sqlContext.schemaRDD'. Then the rest of your pipeline will remain the same.

-kr,
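[A rough sketch of the case-class route Gerard describes, assuming json4s for extraction and Spark 1.x, where the implicit conversion from an RDD of case classes is usually imported as createSchemaRDD; the LogEntry fields are made up for illustration:

  import org.json4s._
  import org.json4s.jackson.JsonMethods._

  // Hypothetical shape of one log record; adjust to the real fields.
  case class LogEntry(timestamp: String, level: String, message: String)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD   // RDD[LogEntry] -> SchemaRDD implicitly

  val entries = sc.textFile(inpath).map { line =>
    // formats defined inside the closure so nothing non-serializable is captured
    implicit val formats = DefaultFormats
    parse(line).extract[LogEntry]     // clean/normalize fields here as needed
  }

  entries.registerTempTable("logs")   // queryable as a SchemaRDD
  entries.saveAsParquetFile(outpath)
]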

Cleaning/transforming json before converting to SchemaRDD

2014-11-03 Thread Daniel Mahler
I am trying to convert terabytes of json log files into parquet files, but I need to clean them a little first. I end up doing the following:

  val txt = sc.textFile(inpath).coalesce(800)
  val json = (for {
    line <- txt
    JObject(child) = parse(line)
    child2 = (for {
      JFiel
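[The snippet is cut off mid-expression; purely as a guess at the shape it was heading for, here is a hedged completion that keeps only the fields passing some predicate and re-serializes each record. keepField stands in for whatever filtering the real code does, and the resulting RDD[String] could then feed either of the approaches in the replies above:

  import org.json4s._
  import org.json4s.jackson.JsonMethods._

  def keepField(name: String): Boolean = true   // placeholder for the real cleaning rule

  val txt = sc.textFile(inpath).coalesce(800)
  val json = for {
    line <- txt
    JObject(child) = parse(line)                // assumes every line is a JSON object
    child2 = for {
      JField(name, value) <- child
      if keepField(name)
    } yield JField(name, value)
  } yield compact(render(JObject(child2)))

  sqlContext.jsonRDD(json).saveAsParquetFile(outpath)
]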