Hm, as far as I remember, you can set the string to be treated as null with the *nullValue* option. I am hitting network issues with GitHub right now, so I can't check this, but please try that option as described in https://github.com/databricks/spark-csv.
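If the option works the way I remember, the call would look roughly like the sketch below. It is untested: the helper name read_csv_with_nulls is made up for illustration, and I have not checked which spark-csv release first supports nullValue, so you may need a newer version than the 1.2.0 in your --packages argument. The other arguments are copied from your original read.df() call.

```r
# Hypothetical helper (the name is mine, not part of SparkR or spark-csv).
# It repeats the read.df() call from the original post and adds nullValue,
# so that fields equal to na_token should be loaded as null, not as a string.
# Assumption: the spark-csv version on the classpath supports nullValue.
read_csv_with_nulls <- function(sqlContext, path, na_token = "NA") {
  SparkR::read.df(sqlContext, path,
                  source = "com.databricks.spark.csv",
                  header = "true",
                  inferSchema = "true",
                  nullValue = na_token)
}

# Usage, with sqlContext and path set up as in the original post:
#   aq <- read_csv_with_nulls(sqlContext, path)
#   head(SparkR::dropna(aq, how = "any"))
```

Once the "NA" strings arrive as nulls, dropna() should be able to drop those rows the same way it does for a DataFrame built with createDataFrame().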

2016-01-28 0:55 GMT+09:00 Felix Cheung <felixcheun...@hotmail.com>:

> That's correct - and because spark-csv as Spark package is not
> specifically aware of R's notion of NA and interprets it as a string value.
>
> On the other hand, R native NA is converted to NULL on Spark when creating
> a Spark DataFrame from a R data.frame.
> https://eradiating.wordpress.com/2016/01/04/whats-new-in-sparkr-1-6-0/
>
> _____________________________
> From: Devesh Raj Singh <raj.deves...@gmail.com>
> Sent: Wednesday, January 27, 2016 3:19 AM
> Subject: Re: NA value handling in sparkR
> To: Deborah Siegel <deborah.sie...@gmail.com>
> Cc: <user@spark.apache.org>
>
> Hi,
>
> While dealing with missing values with R and SparkR I observed the
> following. Please tell me if I am right or wrong?
>
> Missing values in native R are represented with a logical constant-NA.
> SparkR DataFrames represents missing values with NULL. If you use
> createDataFrame() to turn a local R data.frame into a distributed SparkR
> DataFrame, SparkR will automatically convert NA to NULL.
>
> However, if you are creating a SparkR
> DataFrame by reading in data from a file using read.df(), you may have
> strings of "NA", but not R logical constant NA missing value
> representations. String "NA" is not automatically converted to NULL.
>
> On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel <deborah.sie...@gmail.com>
> wrote:
>
>> Maybe not ideal, but since read.df is inferring all columns from the csv
>> containing "NA" as type of strings, one could filter them rather than using
>> dropna().
>>
>> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
>> head(filtered_aq)
>>
>> Perhaps it would be better to have an option for read.df to convert any
>> "NA" it encounters into null types, like createDataFrame does for <NA>, and
>> then one would be able to use dropna() etc.
>>
>> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh <raj.deves...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Yes you are right.
>>>
>>> I think the problem is with reading of csv files. read.df is not
>>> considering NAs in the CSV file
>>>
>>> So what would be a workable solution in dealing with NAs in csv files?
>>>
>>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <deborah.sie...@gmail.com> wrote:
>>>
>>>> Hi Devesh,
>>>>
>>>> I'm not certain why that's happening, and it looks like it doesn't
>>>> happen if you use createDataFrame directly:
>>>> aq <- createDataFrame(sqlContext,airquality)
>>>> head(dropna(aq,how="any"))
>>>>
>>>> If I had to guess.. dropna(), I believe, drops null values. I suppose
>>>> its possible that createDataFrame converts R's <NA> values to null, so
>>>> dropna() works with that. But perhaps read.df() does not convert R <NA>s to
>>>> null, as those are most likely interpreted as strings when they come in
>>>> from the csv. Just a guess, can anyone confirm?
>>>>
>>>> Deb
>>>>
>>>> On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <raj.deves...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have applied the following code on airquality dataset available in R,
>>>>> which has some missing values. I want to omit the rows which has NAs
>>>>>
>>>>> library(SparkR)
>>>>> Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>>>>>
>>>>> sc <- sparkR.init("local",sparkHome =
>>>>> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>>>>>
>>>>> sqlContext <- sparkRSQL.init(sc)
>>>>>
>>>>> path<-"/Users/devesh/work/airquality/"
>>>>>
>>>>> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
>>>>> header="true", inferSchema="true")
>>>>>
>>>>> head(dropna(aq,how="any"))
>>>>>
>>>>> I am getting the output as
>>>>>
>>>>>   Ozone Solar_R Wind Temp Month Day
>>>>> 1    41     190  7.4   67     5   1
>>>>> 2    36     118  8.0   72     5   2
>>>>> 3    12     149 12.6   74     5   3
>>>>> 4    18     313 11.5   62     5   4
>>>>> 5    NA      NA 14.3   56     5   5
>>>>> 6    28      NA 14.9   66     5   6
>>>>>
>>>>> The NAs still exist in the output. Am I missing something here?
>>>>>
>>>>> --
>>>>> Warm regards,
>>>>> Devesh.
>>>>>
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>
> --
> Warm regards,
> Devesh.