Thanks

 

 

So what I did was:

 

scala> val df = sqlContext.read.format("com.databricks.spark.csv").
         option("inferSchema", "true").option("header", "true").
         load("/data/stg/table2")

df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]

 

 

scala> df.printSchema

root

|-- Invoice Number: string (nullable = true)

|-- Payment date: string (nullable = true)

|-- Net: string (nullable = true)

|-- VAT: string (nullable = true)

|-- Total: string (nullable = true)

 

 

So all the columns are Strings.

 

Then I tried to exclude null rows by filtering on all columns not being null, and to map the rest:

 

scala> val a = df.filter(col("Invoice Number").isNotNull and
         col("Payment date").isNotNull and col("Net").isNotNull and
         col("VAT").isNotNull and col("Total").isNotNull).
         map(x => (x.getString(1),
           x.getString(2).substring(1).replace(",", "").toDouble,
           x.getString(3).substring(1).replace(",", "").toDouble,
           x.getString(4).substring(1).replace(",", "").toDouble))

 

a: org.apache.spark.rdd.RDD[(String, Double, Double, Double)] = MapPartitionsRDD[176] at map at <console>:21

 

This still comes up with the "String index out of range" error, presumably because the empty CSV fields are read as empty strings rather than nulls (so isNotNull does not filter them out) and substring(1) on an empty string throws this exception:

 

16/02/20 11:50:51 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 18)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

My questions are:

 

1. In the map, map(x => (x.getString(1), ...)) -- can I replace x.getString(1) with the actual column name, say "Invoice Number", and so forth for the other columns as well?
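
What I have in mind is something like this (an untested sketch, assuming a Row field can be looked up by name with getAs):

val a = df.map(x => (
  x.getAs[String]("Payment date"),
  x.getAs[String]("Net").substring(1).replace(",", "").toDouble,
  x.getAs[String]("VAT").substring(1).replace(",", "").toDouble,
  x.getAs[String]("Total").substring(1).replace(",", "").toDouble))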

2. It sounds like it crashes because of these rows below, at the end of the file:

[421,02/10/2015,£1,187.50,£237.50,£1,425.00]   // example good one

[,,,,]   // bad one, empty

[Net income,,£182,531.25,£14,606.25,£197,137.50]

[,,,,]   // bad one, empty

[year 2014,,£113,500.00,£0.00,£113,500.00]

[Year 2015,,£69,031.25,£14,606.25,£83,637.50]

 

3. Also, to clarify, I want to drop those two empty lines -> [,,,,] if I can. Unfortunately the drop() call does not work:

a.drop()

<console>:24: error: value drop is not a member of 
org.apache.spark.rdd.RDD[(String, Double, Double, Double)]

              a.drop()

                ^
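
What I am really after is something like this, applied to the DataFrame before the map -- a rough, untested sketch, assuming the empty CSV fields arrive as empty strings rather than nulls:

import org.apache.spark.sql.functions.{col, length, trim}

// keep a row if at least one column has some content; this drops [,,,,]
val cleaned = df.filter(
  length(trim(col("Invoice Number"))) > 0 or
  length(trim(col("Payment date"))) > 0 or
  length(trim(col("Net"))) > 0 or
  length(trim(col("VAT"))) > 0 or
  length(trim(col("Total"))) > 0)

That should keep the summary rows such as [Net income,,...] while dropping the two [,,,,] rows.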

 

Thanks again,

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 


 

 

From: Michał Zieliński [mailto:zielinski.mich...@gmail.com] 
Sent: 20 February 2016 08:59
To: Mich Talebzadeh <m...@peridale.co.uk>
Cc: user @spark <user@spark.apache.org>
Subject: Re: Checking for null values when mapping

 

You can use filter and isNotNull on Column (https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Column) before the map.

 

On 20 February 2016 at 08:24, Mich Talebzadeh <m...@peridale.co.uk> wrote:

 

I have a DataFrame (DF) like the one below, reading a CSV file:

 

 

val df = HiveContext.read.format("com.databricks.spark.csv").
  option("inferSchema", "true").option("header", "true").
  load("/data/stg/table2")

 

val a = df.map(x => (x.getString(0), x.getString(1),
  x.getString(2).substring(1).replace(",", "").toDouble,
  x.getString(3).substring(1).replace(",", "").toDouble,
  x.getString(4).substring(1).replace(",", "").toDouble))

 

 

For most rows read from the CSV file the above mapping works fine. However, at the bottom of the CSV there are a couple of empty rows, as below:

 

[421,02/10/2015,£1,187.50,£237.50,£1,425.00]

[,,,,]

[Net income,,£182,531.25,£14,606.25,£197,137.50]

[,,,,]

[year 2014,,£113,500.00,£0.00,£113,500.00]

[Year 2015,,£69,031.25,£14,606.25,£83,637.50]

 

However, I get:

 

a.collect.foreach(println)

16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 161)

java.lang.StringIndexOutOfBoundsException: String index out of range: -1

 

I suspect the cause is the substring operation, say x.getString(2).substring(1), on empty values, which according to the web will throw this type of error.

 

 

The easiest solution seems to be to check whether x above is not null and do 
the substring operation. Can this be done without using a UDF?
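
For example, something along these lines (a rough sketch; parseAmount is a hypothetical helper name, not an existing API):

import org.apache.spark.sql.Row

// hypothetical helper: strip the leading currency symbol and the thousands
// separators, and fall back to 0.0 when the field is null or empty
def parseAmount(r: Row, i: Int): Double =
  if (r.isNullAt(i) || r.getString(i).isEmpty) 0.0
  else r.getString(i).substring(1).replace(",", "").toDouble

val a = df.map(x => (x.getString(0), x.getString(1),
  parseAmount(x, 2), parseAmount(x, 3), parseAmount(x, 4)))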

 

Thanks

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 


 

 

 
