Hi,

While dealing with missing values in R and SparkR, I observed the
following. Please tell me whether I have this right.


Missing values in native R are represented by the logical constant NA.
SparkR DataFrames represent missing values with NULL. If you use
createDataFrame() to turn a local R data.frame into a distributed SparkR
DataFrame, SparkR will automatically convert NA to NULL.

However, if you create a SparkR DataFrame by reading data from a file with
read.df(), the columns may contain the string "NA" rather than R's logical
constant NA. The string "NA" is not automatically converted to NULL, so
functions such as dropna() will not treat it as missing.
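
To make the distinction concrete, here is a small sketch in plain local R
(no Spark involved). The temp file, its contents, and the variable names are
made up for illustration. Setting na.strings = "" is only a way to mimic a
reader that keeps "NA" as text, which is what I assume read.df is doing here.

```r
# Hypothetical two-column CSV with one "NA" field, written to a temp file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Ozone,Wind", "41,7.4", "NA,14.3", "28,14.9"), csv_path)

# Default read.csv: the text NA is parsed as a missing value.
as_missing <- read.csv(csv_path)

# With na.strings = "", the field is kept as the ordinary string "NA"
# and the whole column is read as character.
as_string <- read.csv(csv_path, na.strings = "", stringsAsFactors = FALSE)

is.na(as_missing$Ozone)  # only the second element is missing
is.na(as_string$Ozone)   # nothing is missing: "NA" is just text here

# Filtering on the string, as suggested in the thread, drops that row.
kept <- as_string[as_string$Ozone != "NA", ]
nrow(kept)
```

In the second case an NA-based function has nothing to act on, which looks
like the same situation as dropna() on the DataFrame from read.df().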

On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel <deborah.sie...@gmail.com>
wrote:

> Maybe not ideal, but since read.df infers every CSV column that contains
> "NA" as a string column, one could filter those rows out rather than using
> dropna().
>
> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
> head(filtered_aq)
>
> Perhaps it would be better to have an option for read.df to convert any
> "NA" it encounters into null types, like createDataFrame does for <NA>, and
> then one would be able to use dropna() etc.
>
>
>
> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Yes you are right.
>>
>> I think the problem is with reading CSV files: read.df does not treat
>> the NAs in the CSV file as missing values.
>>
>> So what would be a workable solution for dealing with NAs in CSV files?
>>
>>
>>
>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com>
>> wrote:
>>
>>> Hi Devesh,
>>>
>>> I'm not certain why that's happening, and it looks like it doesn't
>>> happen if you use createDataFrame directly:
>>> aq <- createDataFrame(sqlContext,airquality)
>>> head(dropna(aq,how="any"))
>>>
>>> If I had to guess: dropna(), I believe, drops null values. I suppose
>>> it's possible that createDataFrame converts R's <NA> values to null, so
>>> dropna() works with that. But perhaps read.df() does not convert NAs to
>>> null, as those are most likely interpreted as strings when they come in
>>> from the csv. Just a guess, can anyone confirm?
>>>
>>> Deb
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
>>> raj.deves...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have applied the following code to the airquality dataset available
>>>> in R, which has some missing values. I want to omit the rows that have NAs.
>>>>
>>>> library(SparkR)
>>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>
>>>> sc <- sparkR.init("local",sparkHome =
>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>
>>>> sqlContext <- sparkRSQL.init(sc)
>>>>
>>>> path<-"/Users/devesh/work/airquality/"
>>>>
>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>> header="true", inferSchema="true")
>>>>
>>>> head(dropna(aq,how="any"))
>>>>
>>>> I am getting the output as
>>>>
>>>>   Ozone Solar_R Wind Temp Month Day
>>>> 1    41     190  7.4   67     5   1
>>>> 2    36     118  8.0   72     5   2
>>>> 3    12     149 12.6   74     5   3
>>>> 4    18     313 11.5   62     5   4
>>>> 5    NA      NA 14.3   56     5   5
>>>> 6    28      NA 14.9   66     5   6
>>>>
>>>> The NAs still exist in the output. Am I missing something here?
>>>>
>>>> --
>>>> Warm regards,
>>>> Devesh.
>>>>
>>>
>>>
>>
>>
>> --
>> Warm regards,
>> Devesh.
>>
>
>


-- 
Warm regards,
Devesh.
