Yeah, I ask because you might notice that by default the column types for CSV 
tables read in by read.df() are all strings (due to limitations in type 
inference in the Databricks spark-csv package). There was a separate discussion 
about schema inference, and Shivaram recently merged support for specifying your 
own schema as an argument to read.df(). The schema is defined as a structType. 
To see how such a schema is declared, check out Hossein Falaki’s response in 
this thread [1].
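
For illustration, here’s a minimal sketch of passing a custom schema (the file 
name and column names are hypothetical; see the thread in [1] for Hossein’s 
exact example):

# Sketch: declare a schema so the CSV columns come back as doubles, not strings
customSchema <- structType(structField("x", "double"),
                           structField("y", "double"))
df <- read.df(sqlContext, "data.csv",
              source = "com.databricks.spark.csv",
              schema = customSchema)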

— Alek

[1] -- 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkR-DataFrame-Column-Casts-esp-from-CSV-Files-td12589.html

From: Wei Zhou <zhweisop...@gmail.com>
Date: Thursday, June 25, 2015 at 4:38 PM
To: Aleksander Eskilson <alek.eskil...@cerner.com>
Cc: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>, 
"user@spark.apache.org" <user@spark.apache.org>
Subject: Re: sparkR could not find function "textFile"

I tried out the solution using the spark-csv package, and it works fine now :) 
Thanks. Yes, I'm playing with a file whose columns are all strings, but the real 
data I want to process is all doubles. I'm just exploring what sparkR can do 
versus regular Scala Spark, as I am at heart an R person.

2015-06-25 14:26 GMT-07:00 Eskilson,Aleksander <alek.eskil...@cerner.com>:
Sure, I had a similar question that Shivaram was able to answer quickly for me; 
the solution is implemented using a separate Databricks library. Check out this 
thread from the email archives [1], and the read.df() command [2]. CSV files can 
be a bit tricky, especially with inferring their schemas. Are you using just 
strings as your column types right now?
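
For reference, loading that package and reading a CSV looks roughly like this 
(the file name and package version here are illustrative):

# Start the shell with the spark-csv package on the classpath, e.g.:
#   ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
df <- read.df(sqlContext, "cars.csv",
              source = "com.databricks.spark.csv", header = "true")
head(df)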

Alek

[1] -- http://apache-spark-developers-list.1001551.n3.nabble.com/CSV-Support-in-SparkR-td12559.html
[2] -- https://spark.apache.org/docs/latest/api/R/read.df.html

From: Wei Zhou <zhweisop...@gmail.com>
Date: Thursday, June 25, 2015 at 4:15 PM
To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
Cc: Aleksander Eskilson <alek.eskil...@cerner.com>, 
"user@spark.apache.org" <user@spark.apache.org>
Subject: Re: sparkR could not find function "textFile"

Thanks to both Shivaram and Alek. Then if I want to create a DataFrame from 
comma-separated flat files, what would you recommend I do? One way I can think 
of is to first read the data as you would in R, using read.table(), and then 
create a Spark DataFrame from that R data frame, but that is obviously not 
scalable.
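
For concreteness, the non-scalable route I have in mind would be something like 
this (file name hypothetical), which pulls everything into local R memory first:

localDF <- read.table("data.csv", sep = ",", header = TRUE)  # plain R, single machine
df <- createDataFrame(sqlContext, localDF)  # convert local data frame to a Spark DataFrame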


2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman <shiva...@eecs.berkeley.edu>:
The `head` function is not supported for the RRDD that is returned by 
`textFile`. You can run `take(lines, 5L)` instead. I should add a warning here 
that the RDD API in SparkR is private because we might not support it in 
upcoming releases, so if you can use the DataFrame API for your application, you 
should try that out.
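
Concretely (assuming `lines` was created with SparkR:::textFile as in your 
snippet):

lines <- SparkR:::textFile(sc, "README.md")
take(lines, 5L)  # returns the first 5 lines as an R list, where head() fails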

Thanks
Shivaram

On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou <zhweisop...@gmail.com> wrote:
Hi Alek,

Just a follow-up question. This is what I did in the sparkR shell:

lines <- SparkR:::textFile(sc, "./README.md")
head(lines)

And I am getting the error:

"Error in x[seq_len(n)] : object of type 'S4' is not subsettable"

I'm wondering what I did wrong. Thanks in advance.

Wei

2015-06-25 13:44 GMT-07:00 Wei Zhou <zhweisop...@gmail.com>:
Hi Alek,

Thanks for the explanation, it is very helpful.

Cheers,
Wei

2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander <alek.eskil...@cerner.com>:
Hi there,

The tutorial you’re reading was written before the merge of SparkR for Spark 
1.4.0. In the merge, the RDD API (which includes the textFile() function) was 
made private, as the devs felt many of its functions were too low level. They 
focused instead on finishing the DataFrame API, which supports local, HDFS, and 
Hive/HBase file reads. In the meantime, the devs are trying to determine which 
functions of the RDD API, if any, should be made public again. You can see the 
rationale behind this decision in the issue’s JIRA [1].

You can still make use of those now-private RDD functions by prefixing the 
function call with the SparkR private namespace; for example, you’d use
SparkR:::textFile(…).
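
As a rough sketch (the private API may change between releases, and the path is 
just an example), the word count you mention would look something like the 
pre-merge SparkR word count example:

lines <- SparkR:::textFile(sc, "README.md")
words <- SparkR:::flatMap(lines, function(line) strsplit(line, " ")[[1]])
pairs <- SparkR:::lapply(words, function(word) list(word, 1L))
counts <- SparkR:::reduceByKey(pairs, "+", 2L)
SparkR:::collect(counts)  # list of (word, count) pairs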

Hope that helps,
Alek

[1] -- https://issues.apache.org/jira/browse/SPARK-7230

From: Wei Zhou <zhweisop...@gmail.com>
Date: Thursday, June 25, 2015 at 3:33 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: sparkR could not find function "textFile"

Hi all,

I am exploring sparkR by launching its shell and following the tutorial here: 
https://amplab-extras.github.io/SparkR-pkg/

And when I tried to read in a local file with textFile(sc, "file_location"), it 
gave the error: could not find function "textFile".

By reading through the sparkR docs for 1.4, it seems that we need a sqlContext 
to import data, for example:

people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")

And we need to specify the file type.

My question is: did sparkR stop supporting importing files of arbitrary type? If 
not, I would appreciate any help on how to do this.

PS: I am trying to recreate the word count example in sparkR, and want to import 
the README.md file, or really any text file, into sparkR.

Thanks in advance.

Best,
Wei
