Hi,

Thanks for the response. I was looking for a Java solution. I will check the Scala and Python ones.

Regards,
Anand.C
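For the Java API, the df.na.fill suggested below is not a method on DataFrame itself but is reached through DataFrame.na(), which should be available from Spark 1.3.1 onward. A minimal sketch, assuming a DataFrame df with the three nullable Double columns "a", "b", "c" from the original schema; the fill values are illustrative:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.sql.DataFrame;

    // Fill nulls in every numeric column with a single value:
    DataFrame filledAll = df.na().fill(0.0);

    // Or fill per column via a map (values here are illustrative):
    Map<String, Object> fillValues = new HashMap<String, Object>();
    fillValues.put("a", 0.0);
    fillValues.put("b", 0.0);
    fillValues.put("c", 0.0);
    DataFrame filled = df.na().fill(fillValues);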
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Tuesday, May 19, 2015 6:17 PM
To: Chandra Mohan, Ananda Vel Murugan
Cc: ayan guha; user
Subject: Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

I believe you're looking for df.na.fill in Scala; in the pySpark module it is fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html).

From the docs:

    df4.fillna({'age': 50, 'name': 'unknown'}).show()

    age  height  name
    10   80      Alice
    5    null    Bob
    50   null    Tom
    50   null    unknown

On Mon, May 18, 2015 at 11:01 PM, Chandra Mohan, Ananda Vel Murugan <ananda.muru...@honeywell.com> wrote:

Hi,

Thanks for the response. But I could not see a fillna function in the DataFrame class. Is it available in some specific version of Spark SQL? This is what I have in my pom.xml:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>1.3.1</version>
    </dependency>

Regards,
Anand.C

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, May 18, 2015 5:19 PM
To: Chandra Mohan, Ananda Vel Murugan; user
Subject: Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

Hi,

Give a try with the dataFrame.fillna function to fill up the missing columns.

Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan <ananda.muru...@honeywell.com> wrote:

Hi,

I am using Spark SQL to read a CSV file and write it out as a Parquet file. I am building the schema using the following code:

    String schemaString = "a b c";
    List<StructField> fields = new ArrayList<StructField>();
    MetadataBuilder mb = new MetadataBuilder();
    mb.putBoolean("nullable", true);
    Metadata m = mb.build();
    for (String fieldName : schemaString.split(" ")) {
        fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
    }
    StructType schema = DataTypes.createStructType(fields);

Some of the rows in my input CSV do not contain three columns. After building my JavaRDD<Row>, I create the data frame as shown below using the RDD and the schema:

    DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);

Finally I try to save it as a Parquet file:

    darDataFrame.saveAsParquetFile("/home/anand/output.parquet")

I get this error when saving it as a Parquet file:

    java.lang.IndexOutOfBoundsException: Trying to write more fields than contained in row (3 > 2)

I understand the reason behind this error. Some of the rows in my Row RDD do not contain three elements, because some rows in my input CSV do not contain three columns. But while building the schema, I am specifying every field as nullable, so I believe it should not throw this error. Can anyone help me fix this error? Thank you.

Regards,
Anand.C

--
Best Regards,
Ayan Guha
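Note that marking a field nullable only allows a null value in that slot; each Row must still contain exactly as many entries as the schema has fields, which is why the write fails for short rows regardless of nullability. A minimal sketch of padding short rows with nulls while building the JavaRDD<Row>, assuming comma-separated lines; the variable lines and the parsing logic are illustrative:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;

    // lines is the JavaRDD<String> read from the CSV file.
    JavaRDD<Row> rowRDD = lines.map(new Function<String, Row>() {
        @Override
        public Row call(String line) {
            String[] parts = line.split(",");
            Object[] values = new Object[3];   // one slot per schema field
            for (int i = 0; i < 3; i++) {
                // Pad missing or empty columns with null so every Row
                // matches the three-field schema.
                values[i] = (i < parts.length && !parts[i].isEmpty())
                        ? (Object) Double.parseDouble(parts[i])
                        : null;
            }
            return RowFactory.create(values);
        }
    });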