RE: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

Thanks for the response. I was looking for a Java solution. I will check the
Scala and Python ones.

Regards,
Anand.C

From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Tuesday, May 19, 2015 6:17 PM
To: Chandra Mohan, Ananda Vel Murugan
Cc: ayan guha; user
Subject: Re: Spark sql error while writing Parquet file- Trying to write more 
fields than contained in row

I believe you're looking for df.na.fill in Scala; in the PySpark module it is fillna
(http://spark.apache.org/docs/latest/api/python/pyspark.sql.html)

from the docs:

df4.fillna({'age': 50, 'name': 'unknown'}).show()

age  height  name
10   80      Alice
5    null    Bob
50   null    Tom
50   null    unknown
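
Since a Java solution was asked for above, here is a minimal hedged sketch of the same call from Java. It assumes Spark SQL 1.3.x, where DataFrame.na() returns the DataFrameNaFunctions behind fillna/na.fill, and a hypothetical DataFrame named df holding the a, b, c columns from this thread; it is an illustration, not a confirmed fix:

import org.apache.spark.sql.DataFrame;

// Hedged sketch: "df" is assumed to be the DataFrame built from the CSV with
// nullable double columns a, b and c. fill(0.0) replaces nulls in numeric columns.
DataFrame cleaned = df.na().fill(0.0);
cleaned.show();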

On Mon, May 18, 2015 at 11:01 PM, Chandra Mohan, Ananda Vel Murugan 
<ananda.muru...@honeywell.com> wrote:
Hi,

Thanks for the response. But I could not see the fillna function in the DataFrame class.



Is it available in some specific version of Spark SQL? This is what I have in
my pom.xml:


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.3.1</version>
</dependency>

Regards,
Anand.C

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, May 18, 2015 5:19 PM
To: Chandra Mohan, Ananda Vel Murugan; user
Subject: Re: Spark sql error while writing Parquet file- Trying to write more 
fields than contained in row

Hi

Give the dataFrame.fillna function a try to fill up the missing columns

Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan 
<ananda.muru...@honeywell.com> wrote:
Hi,

I am using spark-sql to read a CSV file and write it as a Parquet file. I am
building the schema using the following code:

String schemaString = "a b c";
List<StructField> fields = new ArrayList<StructField>();
MetadataBuilder mb = new MetadataBuilder();
mb.putBoolean("nullable", true);
Metadata m = mb.build();
for (String fieldName : schemaString.split(" ")) {
    fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
}
StructType schema = DataTypes.createStructType(fields);

Some of the rows in my input CSV do not contain three columns. After building
my JavaRDD<Row>, I create the data frame as shown below using the RDD and the schema.

DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);

Finally, I try to save it as a Parquet file:

darDataFrame.saveAsParquetFile("/home/anand/output.parquet")

I get this error when saving it as a Parquet file:

java.lang.IndexOutOfBoundsException: Trying to write more fields than contained 
in row (3 > 2)

I understand the reason behind this error. Some of the rows in my Row RDD do not
contain three elements, because some rows in my input CSV do not contain three
columns. But while building the schema, I am specifying every field as nullable,
so I believe it should not throw this error. Can anyone help me fix this error?
Thank you.

Regards,
Anand.C





--
Best Regards,
Ayan Guha



Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-19 Thread Todd Nist
I believe you're looking for df.na.fill in Scala; in the PySpark module it is
fillna (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html)

from the docs:

df4.fillna({'age': 50, 'name': 'unknown'}).show()

age  height  name
10   80      Alice
5    null    Bob
50   null    Tom
50   null    unknown
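
A rough Java translation of that docs snippet, offered only as a hedged sketch: it assumes the java.util.Map overload of DataFrameNaFunctions.fill and a hypothetical DataFrame df4 with age, height and name columns, mirroring the PySpark example above:

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.DataFrame;

// Hedged sketch: per-column defaults, as in the pyspark docs example above.
Map<String, Object> defaults = new HashMap<String, Object>();
defaults.put("age", 50);            // null ages become 50
defaults.put("name", "unknown");    // null names become "unknown"
DataFrame filled = df4.na().fill(defaults);
filled.show();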


On Mon, May 18, 2015 at 11:01 PM, Chandra Mohan, Ananda Vel Murugan <
ananda.muru...@honeywell.com> wrote:

>  Hi,
>
>
>
> Thanks for the response. But I could not see the fillna function in the
> DataFrame class.
>
>
>
>
>
>
>
> Is it available in some specific version of Spark SQL? This is what I have
> in my pom.xml:
>
>
>
> 
>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql_2.10</artifactId>
>   <version>1.3.1</version>
> </dependency>
>
>
>
>
>
> Regards,
>
> Anand.C
>
>
>
> From: ayan guha [mailto:guha.a...@gmail.com]
> Sent: Monday, May 18, 2015 5:19 PM
> To: Chandra Mohan, Ananda Vel Murugan; user
> Subject: Re: Spark sql error while writing Parquet file- Trying to
> write more fields than contained in row
>
>
>
> Hi
>
>
>
> Give the dataFrame.fillna function a try to fill up the missing columns
>
>
>
> Best
>
> Ayan
>
>
>
> On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan <
> ananda.muru...@honeywell.com> wrote:
>
> Hi,
>
>
>
> I am using spark-sql to read a CSV file and write it as a Parquet file. I am
> building the schema using the following code:
>
>
>
> String schemaString = "a b c";
> List<StructField> fields = new ArrayList<StructField>();
> MetadataBuilder mb = new MetadataBuilder();
> mb.putBoolean("nullable", true);
> Metadata m = mb.build();
> for (String fieldName : schemaString.split(" ")) {
>     fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
> }
> StructType schema = DataTypes.createStructType(fields);
>
>
>
> Some of the rows in my input CSV do not contain three columns. After building
> my JavaRDD<Row>, I create the data frame as shown below using the RDD and the
> schema.
>
>
>
> DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>
>
>
> Finally I try to save it as Parquet file
>
>
>
> darDataFrame.saveAsParquetFile("/home/anand/output.parquet")
>
>
>
> I get this error when saving it as Parquet file
>
>
>
> java.lang.IndexOutOfBoundsException: Trying to write more fields than
> contained in row (3 > 2)
>
>
>
> I understand the reason behind this error. Some of the rows in my Row RDD do
> not contain three elements, because some rows in my input CSV do not contain
> three columns. But while building the schema, I am specifying every field as
> nullable, so I believe it should not throw this error. Can anyone help me fix
> this error? Thank you.
>
>
>
> Regards,
>
> Anand.C
>
>
>
>
>
>
>
>
>
> --
>
> Best Regards,
> Ayan Guha
>


RE: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

Thanks for the response. But I could not see the fillna function in the DataFrame class.



Is it available in some specific version of Spark SQL? This is what I have in
my pom.xml:


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.3.1</version>
</dependency>

Regards,
Anand.C

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, May 18, 2015 5:19 PM
To: Chandra Mohan, Ananda Vel Murugan; user
Subject: Re: Spark sql error while writing Parquet file- Trying to write more 
fields than contained in row

Hi

Give the dataFrame.fillna function a try to fill up the missing columns

Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan 
<ananda.muru...@honeywell.com> wrote:
Hi,

I am using spark-sql to read a CSV file and write it as a Parquet file. I am
building the schema using the following code:

String schemaString = "a b c";
List<StructField> fields = new ArrayList<StructField>();
MetadataBuilder mb = new MetadataBuilder();
mb.putBoolean("nullable", true);
Metadata m = mb.build();
for (String fieldName : schemaString.split(" ")) {
    fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
}
StructType schema = DataTypes.createStructType(fields);

Some of the rows in my input CSV do not contain three columns. After building
my JavaRDD<Row>, I create the data frame as shown below using the RDD and the schema.

DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);

Finally, I try to save it as a Parquet file:

darDataFrame.saveAsParquetFile("/home/anand/output.parquet")

I get this error when saving it as a Parquet file:

java.lang.IndexOutOfBoundsException: Trying to write more fields than contained 
in row (3 > 2)

I understand the reason behind this error. Some of the rows in my Row RDD do not
contain three elements, because some rows in my input CSV do not contain three
columns. But while building the schema, I am specifying every field as nullable,
so I believe it should not throw this error. Can anyone help me fix this error?
Thank you.

Regards,
Anand.C





--
Best Regards,
Ayan Guha


Re: Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread ayan guha
Hi

Give the dataFrame.fillna function a try to fill up the missing columns

Best
Ayan

On Mon, May 18, 2015 at 8:29 PM, Chandra Mohan, Ananda Vel Murugan <
ananda.muru...@honeywell.com> wrote:

>  Hi,
>
>
>
> I am using spark-sql to read a CSV file and write it as a Parquet file. I am
> building the schema using the following code:
>
>
>
> String schemaString = "a b c";
> List<StructField> fields = new ArrayList<StructField>();
> MetadataBuilder mb = new MetadataBuilder();
> mb.putBoolean("nullable", true);
> Metadata m = mb.build();
> for (String fieldName : schemaString.split(" ")) {
>     fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
> }
> StructType schema = DataTypes.createStructType(fields);
>
>
>
> Some of the rows in my input CSV do not contain three columns. After building
> my JavaRDD<Row>, I create the data frame as shown below using the RDD and the
> schema.
>
>
>
> DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);
>
>
>
> Finally I try to save it as Parquet file
>
>
>
> darDataFrame.saveAsParquetFile("/home/anand/output.parquet")
>
>
>
> I get this error when saving it as Parquet file
>
>
>
> java.lang.IndexOutOfBoundsException: Trying to write more fields than
> contained in row (3 > 2)
>
>
>
> I understand the reason behind this error. Some of the rows in my Row RDD do
> not contain three elements, because some rows in my input CSV do not contain
> three columns. But while building the schema, I am specifying every field as
> nullable, so I believe it should not throw this error. Can anyone help me fix
> this error? Thank you.
>
>
>
> Regards,
>
> Anand.C
>
>
>
>
>



-- 
Best Regards,
Ayan Guha


Spark sql error while writing Parquet file- Trying to write more fields than contained in row

2015-05-18 Thread Chandra Mohan, Ananda Vel Murugan
Hi,

I am using spark-sql to read a CSV file and write it as a Parquet file. I am
building the schema using the following code:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.MetadataBuilder;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

String schemaString = "a b c";
List<StructField> fields = new ArrayList<StructField>();
MetadataBuilder mb = new MetadataBuilder();
mb.putBoolean("nullable", true);
Metadata m = mb.build();
for (String fieldName : schemaString.split(" ")) {
    fields.add(new StructField(fieldName, DataTypes.DoubleType, true, m));
}
StructType schema = DataTypes.createStructType(fields);

Some of the rows in my input CSV do not contain three columns. After building
my JavaRDD<Row>, I create the data frame as shown below using the RDD and the schema.

DataFrame darDataFrame = sqlContext.createDataFrame(rowRDD, schema);

Finally, I try to save it as a Parquet file:

darDataFrame.saveAsParquetFile("/home/anand/output.parquet")

I get this error when saving it as a Parquet file:

java.lang.IndexOutOfBoundsException: Trying to write more fields than contained 
in row (3 > 2)

I understand the reason behind this error. Some of the rows in my Row RDD do not
contain three elements, because some rows in my input CSV do not contain three
columns. But while building the schema, I am specifying every field as nullable,
so I believe it should not throw this error. Can anyone help me fix this error?
Thank you.
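
One step worth spelling out: marking a field nullable only allows a null value in that position; each Row must still contain one value per schema field, which is why the writer reports 3 > 2 here. Below is a hedged sketch, not taken from the thread's replies, of one way to build the JavaRDD<Row> so that short CSV lines are padded with nulls; "lines" is a hypothetical JavaRDD<String> read from the CSV:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

// Hedged sketch: pad every CSV line out to the three schema fields, leaving
// missing values as null (allowed, since every field is declared nullable).
JavaRDD<Row> rowRDD = lines.map(new Function<String, Row>() {
    @Override
    public Row call(String line) {
        String[] parts = line.split(",");
        Object[] values = new Object[3];               // one slot per schema field
        for (int i = 0; i < values.length; i++) {
            values[i] = (i < parts.length && !parts[i].trim().isEmpty())
                    ? Double.valueOf(parts[i].trim())
                    : null;                            // missing column -> null
        }
        return RowFactory.create(values);
    }
});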

Regards,
Anand.C