Hi,

While dealing with missing values in R and SparkR, I observed the following. Please tell me whether I am right or wrong.

Missing values in native R are represented by the logical constant NA. SparkR DataFrames represent missing values as NULL. If you use createDataFrame() to turn a local R data.frame into a distributed SparkR DataFrame, SparkR automatically converts NA to NULL. However, if you create a SparkR DataFrame by reading data from a file with read.df(), you may get the string "NA" rather than R's logical constant NA. The string "NA" is not automatically converted to NULL.

On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel <deborah.sie...@gmail.com> wrote:

> Maybe not ideal, but since read.df is inferring all columns from the csv
> containing "NA" as type of strings, one could filter them rather than using
> dropna().
>
> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
> head(filtered_aq)
>
> Perhaps it would be better to have an option for read.df to convert any
> "NA" it encounters into null types, like createDataFrame does for <NA>, and
> then one would be able to use dropna() etc.
>
> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Yes, you are right.
>>
>> I think the problem is with the reading of csv files. read.df is not
>> considering NAs in the CSV file.
>>
>> So what would be a workable solution for dealing with NAs in csv files?
>>
>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com>
>> wrote:
>>
>>> Hi Devesh,
>>>
>>> I'm not certain why that's happening, and it looks like it doesn't
>>> happen if you use createDataFrame directly:
>>>
>>> aq <- createDataFrame(sqlContext,airquality)
>>> head(dropna(aq,how="any"))
>>>
>>> If I had to guess.. dropna(), I believe, drops null values. I suppose
>>> it's possible that createDataFrame converts R's <NA> values to null, so
>>> dropna() works with that. But perhaps read.df() does not convert R <NA>s to
>>> null, as those are most likely interpreted as strings when they come in
>>> from the csv.
>>>
>>> Just a guess, can anyone confirm?
>>>
>>> Deb
>>>
>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh
>>> <raj.deves...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have applied the following code to the airquality dataset available in R,
>>>> which has some missing values. I want to omit the rows which have NAs.
>>>>
>>>> library(SparkR)
>>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>
>>>> sc <- sparkR.init("local",sparkHome = "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>
>>>> sqlContext <- sparkRSQL.init(sc)
>>>>
>>>> path<-"/Users/devesh/work/airquality/"
>>>>
>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>> header="true", inferSchema="true")
>>>>
>>>> head(dropna(aq,how="any"))
>>>>
>>>> I am getting the output as
>>>>
>>>>   Ozone Solar_R Wind Temp Month Day
>>>> 1    41     190  7.4   67     5   1
>>>> 2    36     118  8.0   72     5   2
>>>> 3    12     149 12.6   74     5   3
>>>> 4    18     313 11.5   62     5   4
>>>> 5    NA      NA 14.3   56     5   5
>>>> 6    28      NA 14.9   66     5   6
>>>>
>>>> The NAs still exist in the output. Am I missing something here?
>>>>
>>>> --
>>>> Warm regards,
>>>> Devesh.
>>
>> --
>> Warm regards,
>> Devesh.

--
Warm regards,
Devesh.
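
For anyone following the thread, the difference between a real NA and the string "NA" can be reproduced in plain R without Spark. The sketch below is only a local analogy, not SparkR code. It assumes that a CSV reader which does not treat "NA" as missing behaves roughly like read.csv() with na.strings = "" and colClasses = "character". The names tmp, raw, fixed, and complete_rows are made up for illustration.

```r
# Local-R analogy only: no Spark is involved here.
# Write the built-in airquality data to a CSV; write.csv() emits NA as the text NA.
tmp <- tempfile(fileext = ".csv")
write.csv(airquality, tmp, row.names = FALSE)

# Read it back WITHOUT treating "NA" as missing, so every column is a string.
raw <- read.csv(tmp, na.strings = "", colClasses = "character")

# The missing-value test finds nothing, because "NA" is an ordinary string here.
n_missing_raw <- sum(is.na(raw$Ozone))
n_string_na   <- sum(raw$Ozone == "NA")

# Convert the "NA" strings to real missing values, then restore numeric types.
fixed <- raw
fixed[fixed == "NA"] <- NA
fixed[] <- lapply(fixed, as.numeric)

# Now the usual missing-value tools work again.
n_missing_fixed <- sum(is.na(fixed$Ozone))
complete_rows   <- fixed[complete.cases(fixed), ]
```

In SparkR the analogous step would be to turn the "NA" strings into nulls before calling dropna(); how best to do that depends on the SparkR and spark-csv versions in use.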