Re: Treating NaN fields in Spark

2016-09-29 Thread Mich Talebzadeh
Thanks Michael.

I realised that just checking for Volume > 0 will do:

val rs = df2.filter($"Volume".cast("Integer") > 0)

On your point:

"Again why not remove the rows where the volume of trades is 0?"

Are you referring to the rows below?

scala> val rs = df2.filter($"Volume".cast("Integer") === 0).drop().show
+---------+------+----------+----+----+---+------+------+
|    Stock|Ticker| TradeDate|Open|High|Low| Close|Volume|
+---------+------+----------+----+----+---+------+------+
|Tesco PLC|  TSCO| 23-Dec-11|   -|   -|  -|391.00|     0|
|Tesco PLC|  TSCO| 26-Aug-11|   -|   -|  -|365.60|     0|
|Tesco PLC|  TSCO| 28-Apr-11|   -|   -|  -|403.55|     0|
|Tesco PLC|  TSCO| 21-Apr-11|   -|   -|  -|395.30|     0|
|Tesco PLC|  TSCO| 24-Dec-10|   -|   -|  -|439.00|     0|
+---------+------+----------+----+----+---+------+------+
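A side note on the .drop() there, sketched against the same df2: Dataset.drop
targets columns, not rows, and with no arguments it drops nothing, so the
.drop() above is a no-op. Removing the zero-volume rows is simply keeping the
complement of that filter:

// drop() is a column operation; row removal is just the complement filter
val rs = df2.filter($"Volume".cast("Integer") > 0)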

Cheers



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 September 2016 at 16:55, Michael Segel wrote:

Fwd: tod...@yahoo-inc.com is no longer with Yahoo! (was: Re: Treating NaN fields in Spark)

2016-09-29 Thread Michael Segel
Hi,
Hate to be a pain… but could someone remove this email address (see below) from
the Spark mailing list(s)?
It seems that ‘Elvis’ has left the building and forgot to change his mail
subscriptions…

Begin forwarded message:

From: Yahoo! No Reply (postmas...@yahoo-inc.com)
Subject: tod...@yahoo-inc.com is no longer with Yahoo! (was: Re: Treating NaN
fields in Spark)
Date: September 29, 2016 at 10:56:10 AM CDT
To: msegel_had...@hotmail.com


This is an automatically generated message.

tod...@yahoo-inc.com is no longer with Yahoo! Inc.

Your message will not be forwarded.

If you have a sales inquiry, please email yahoosa...@yahoo-inc.com and someone
will follow up with you shortly.

If you require assistance with a legal matter, please send a message to
legal-noti...@yahoo-inc.com

Thank you!



Re: Treating NaN fields in Spark

2016-09-29 Thread Michael Segel

On Sep 29, 2016, at 10:29 AM, Mich Talebzadeh (mich.talebza...@gmail.com) wrote:

Good points :) Would it take "-" as a negative number, -123456?

Yeah… you have to go down a level and start to remember that you’re dealing 
with a stream or buffer of bytes below any casting.

At this moment in time, this is what the code does:


  1.  The csv is imported into HDFS as is; no cleaning is done for rogue
columns at shell level.
  2.  The Spark program does the following filtration:
  3.  val rs = df2.filter($"Open" !== "-").filter($"Volume".cast("Integer") > 0)

So my first line of defence is to check for !== "-", which is a dash, commonly
used for "not available". The next filter is for the volume column > 0 (there
were trades on this stock), otherwise the calculation could skew the results.
Note that a single filter combining !== with AND (&&) will not work.


You can’t rely on the ‘-’ to represent NaN or NULL.

The issue is that you’re going from loose typing to stronger typing (String
to Double).
So pretty much any byte buffer could be interpreted as a String, but if the
String value is too long to be a Double, you will fail the NaN test. (Or it's a
NULL value/string.)
As to filtering… you would probably want to filter on volume being == 0. (It's
possible to actually have a negative volume.)
Or you could set the opening, low, and high to the close if the volume is 0,
regardless of the values in those columns.

Note: This would be a transformation of the data and should be done during 
ingestion so you’re doing it only once.
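A minimal sketch of that ingestion-time transformation, assuming the df2 and
column names from this thread: turn the dash into a real null and cast once.

import org.apache.spark.sql.functions.{when, lit}

// Replace the "-" placeholder with a proper null, then cast the column
// to double once, so every later step sees real nulls instead of dashes
val cleaned = df2.withColumn("Open",
  when($"Open" === "-", lit(null)).otherwise($"Open").cast("double"))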

Or you could just remove the rows, since no trades occurred, and then either
reflect it in your graph as gaps or let the graph interpolate them out.


scala> val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)
:40: error: value && is not a member of String
   val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)

Will throw an error.

But this equality === works!

scala> val rs = df2.filter($"Open" === "-" && $"Volume".cast("Integer") > 0)
rs: org.apache.spark.sql.Dataset[columns] = [Stock: string, Ticker: string ... 
6 more fields]
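The reason, for what it's worth: Scala treats any operator that ends in = but
does not start with = (and is not one of <=, >=, !=) as an assignment operator
with the lowest possible precedence. So !== binds looser than &&, the right
operand parses as ("-" && ...), and you get "value && is not a member of
String". === starts with =, so it keeps normal precedence. Parentheses fix it,
sketched against the same df2:

// Parentheses force the intended grouping around !==, which otherwise
// parses with assignment (lowest) precedence in Scala
val rs = df2.filter(($"Open" !== "-") && ($"Volume".cast("Integer") > 0))

// Spark 2.0 adds =!= for exactly this reason; it has normal precedence:
// val rs2 = df2.filter($"Open" =!= "-" && $"Volume".cast("Integer") > 0)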


Another alternative is to check whether the field is all digits:

 scala> def isAllPostiveNumber (price: String) = price forall Character.isDigit
isAllPostiveNumber: (price: String)Boolean

Not really a good idea. You’re walking through each byte in a stream and
checking to see if it's a digit. What if it's a NULL string? What do you set
the value to? This doesn’t scale well…

Again why not remove the rows where the volume of trades is 0?

Returns Boolean true or false, but it does not work; can someone tell me what
is wrong with the below?

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)
:1: error: not a legal formal parameter.
Note: Tuples cannot be directly destructured in method or function parameters.
  Either create a single parameter accepting the Tuple1,
  or consider a pattern matching anonymous function: `{ case (param1, 
param1) => ... }
val rs = df2.filter(isAllPostiveNumber("Open") => true)
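For what it's worth, one way to make that compile, sketched with a hypothetical
wrapper name: filter needs a Column, not a plain Boolean, so lift the function
into a UDF and pass the column itself rather than the string "Open":

import org.apache.spark.sql.functions.udf

// Lift the predicate into a UDF; guard against nulls so a null row
// simply fails the test instead of throwing
val isAllDigits = udf((price: String) => price != null && price.forall(Character.isDigit))

val rs = df2.filter(isAllDigits($"Open"))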


Thanks

Re: Treating NaN fields in Spark

2016-09-29 Thread Mich Talebzadeh
Good points :) Would it take "-" as a negative number, -123456?

At this moment in time, this is what the code does:


   1. The csv is imported into HDFS as is; no cleaning is done for rogue
   columns at shell level.
   2. The Spark program does the following filtration:
   3. val rs = df2.filter($"Open" !== "-").filter($"Volume".cast("Integer")
   > 0)

So my first line of defence is to check for !== "-", which is a dash,
commonly used for "not available". The next filter is for the volume column > 0
(there were trades on this stock), otherwise the calculation could skew the
results. Note that a single filter combining !== with AND (&&) will not work.

scala> val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)
:40: error: value && is not a member of String
   val rs = df2.filter($"Open" !== "-" && $"Volume".cast("Integer") > 0)

Will throw an error.
But this equality === works!

scala> val rs = df2.filter($"Open" === "-" && $"Volume".cast("Integer") >
0)
rs: org.apache.spark.sql.Dataset[columns] = [Stock: string, Ticker: string
... 6 more fields]


Another alternative is to check whether the field is all digits:

 scala> def isAllPostiveNumber (price: String) = price forall
Character.isDigit
isAllPostiveNumber: (price: String)Boolean
Returns Boolean true or false, but it does not work; can someone tell me what
is wrong with the below?

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)

scala> val rs = df2.filter(isAllPostiveNumber("Open") => true)
:1: error: not a legal formal parameter.
Note: Tuples cannot be directly destructured in method or function
parameters.
  Either create a single parameter accepting the Tuple1,
  or consider a pattern matching anonymous function: `{ case (param1,
param1) => ... }
val rs = df2.filter(isAllPostiveNumber("Open") => true)


Thanks


Dr Mich Talebzadeh



On 29 September 2016 at 13:45, Michael Segel wrote:

Re: Treating NaN fields in Spark

2016-09-29 Thread Peter Figliozzi
"isnan" ends up using a case class, subclass of UnaryExpression, called
"IsNaN" which evaluates each row of the column like this:

   - False if the value is Null
   - Check the "Expression.Type" (apparently a Spark thing, not a Scala
   thing.. still learning here)
   - DoubleType: cast to Double and retrieve .isNaN
   - FloatType: cast to Float and retrieve .isNaN
   - Casting done by value.asInstanceOf[T]

What's interesting is that the "inputTypes" for this class are only DoubleType
and FloatType.  Unfortunately, I haven't figured out how the code would
handle a String.  Maybe someone could tell us how these Expressions work?

In any case, we're not getting True back unless the value x, cast to a
Double, actually returns Double.NaN.  Strings cast to Double throw errors
(not Double.NaN), and the '-' character cast to Double returns 45 (!).
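That squares with the empty result in the original post: under SQL semantics
(unlike Scala's), casting a non-numeric string yields null rather than an
error, and IsNaN maps null to False, so the dash rows never match. A quick
check, assuming a Spark 2.x session named spark:

// SQL CAST of a non-numeric string gives NULL, not NaN, and isnan(NULL)
// is false, so isnan never fires on the "-" rows
spark.sql("SELECT CAST('-' AS DOUBLE) AS d, isnan(CAST('-' AS DOUBLE)) AS n").show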

On Thu, Sep 29, 2016 at 7:45 AM, Michael Segel wrote:


Re: Treating NaN fields in Spark

2016-09-29 Thread Michael Segel
Hi,

Just a few thoughts so take it for what its worth…

Databases have static schemas and will reject a row whose column value does not fit on insert.

In your case… you have one data set where you have a column which is supposed 
to be a number but you have it as a string.
You want to convert this to a double in your final data set.


It looks like your problem is that your original data set that you ingested
used a ‘-’ (dash) to represent missing data, rather than a NULL value.
In fact, looking at the rows… you seem to have a stock that didn’t trade for a
given day. (All have Volume as 0.) Why do you need this?  Wouldn’t you want to
represent this as null, or no row for a given date?
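A hedged sketch of that, reusing the thread's df2: since a SQL cast of a
non-numeric string yields null, a null check after the cast finds the dash
rows, and na.drop removes them.

// isNull after the cast catches "-" (and any genuinely null Opens);
// na.drop then removes those rows once the column is a real double
val rogue = df2.filter($"Open".cast("double").isNull)
val clean = df2.withColumn("Open", $"Open".cast("double")).na.drop(Seq("Open"))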

The reason your ‘-’ check failed with isnan() is that ‘-’ actually could be
represented as a number.

If you replaced the ‘-’ with a String that is wider than the width of a double
… the isnan should flag the row.

(I still need more coffee, so I could be wrong) ;-)

HTH

-Mike

On Sep 28, 2016, at 5:56 AM, Mich Talebzadeh (mich.talebza...@gmail.com) wrote:






Re: Treating NaN fields in Spark

2016-09-28 Thread Marco Mistroni
Hi Dr Mich,
  how about reading all the csv fields as strings and then applying a UDF, sort
of like this?

  import scala.util.control.Exception.allCatch

  def getDouble(doubleStr: String): Double =
    allCatch opt doubleStr.toDouble match {
      case Some(doubleNum) => doubleNum
      case _ => Double.NaN
    }
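Assuming the df2 from earlier in the thread, the helper can then be lifted
into a Spark UDF so the whole column converts in one pass (the OpenD column
name is just illustrative):

  import org.apache.spark.sql.functions.{udf, isnan}

  // Register the converter as a UDF and apply it column-wise; bad values
  // come back as Double.NaN, which isnan() will then catch
  val toDouble = udf(getDouble _)
  val df3 = df2.withColumn("OpenD", toDouble($"Open"))
  df3.filter(isnan($"OpenD")).show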


Out of curiosity, are you reading data from Yahoo Finance? If so, are you
downloading a whole .csv file?
I'm doing a similar thing, but I am instead using a library,
com.github.tototoshi.csv._, to read csv files as lists of strings, so I have
control over how to render each row. But presumably if you have over 1k rows'
worth of data, this solution will not assist.

hth
 marco




On Wed, Sep 28, 2016 at 3:44 PM, Peter Figliozzi wrote:


Re: Treating NaN fields in Spark

2016-09-28 Thread Peter Figliozzi
In Scala, x.isNaN returns true for Double.NaN, but false for any
character.  I guess the `isnan` function you are using works by ultimately
looking at x.isNaN.
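A quick REPL check of this, with the dash included (a Char widens to its code
point, so it never looks like NaN):

scala> Double.NaN.isNaN
res0: Boolean = true

scala> '-'.toDouble        // a Char widens to its code point
res1: Double = 45.0

scala> '-'.toDouble.isNaN
res2: Boolean = false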

On Wed, Sep 28, 2016 at 5:56 AM, Mich Talebzadeh wrote:


Treating NaN fields in Spark

2016-09-28 Thread Mich Talebzadeh
This is an issue in most databases. Specifically, if a field is NaN (NaN,
standing for "not a number", is a numeric data type value representing
an undefined or unrepresentable value, especially in floating-point
calculations).

There is a method called isnan() in Spark that is supposed to handle this
scenario. However, it does not return correct values! For example, I
defined column "Open" as String (it should be Float) and it has the
following 7 rogue entries out of 1272 rows in a csv:

df2.filter($"OPen" === "-").select(changeToDate("TradeDate").as("TradeDate"),
'Open, 'High, 'Low, 'Close, 'Volume).show

+----------+----+----+---+-----+------+
| TradeDate|Open|High|Low|Close|Volume|
+----------+----+----+---+-----+------+
|2011-12-23|   -|   -|  -|40.56|     0|
|2011-04-21|   -|   -|  -|45.85|     0|
|2010-12-30|   -|   -|  -|38.10|     0|
|2010-12-23|   -|   -|  -|38.36|     0|
|2008-04-30|   -|   -|  -|32.39|     0|
|2008-04-29|   -|   -|  -|33.05|     0|
|2008-04-28|   -|   -|  -|32.60|     0|
+----------+----+----+---+-----+------+

However, the following does not work!

 df2.filter(isnan($"Open")).show
+-----+------+---------+----+----+---+-----+------+
|Stock|Ticker|TradeDate|Open|High|Low|Close|Volume|
+-----+------+---------+----+----+---+-----+------+
+-----+------+---------+----+----+---+-----+------+

Any suggestions?

Thanks


Dr Mich Talebzadeh


