You can use the Spark CSV reader to read flat CSV files into a DataFrame. See https://gist.github.com/shivaram/d0cd4aa5c4381edd6f85 for an example.
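For reference, a minimal sketch of that approach from the sparkR shell. It assumes the shell was started with the spark-csv package on the classpath; the package version shown and the file path `./data/people.csv` are placeholders I made up for illustration, so adjust both to your setup.

```r
# Start the shell with the CSV data source available, e.g.:
#   ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# (the version above is an assumption; use whichever release matches your Spark build)

# `sqlContext` is created for you by the sparkR shell.
# "./data/people.csv" is a hypothetical path to a comma-separated file with a header row.
df <- read.df(sqlContext, "./data/people.csv",
              source = "com.databricks.spark.csv",
              header = "true")

# Inspect the result through the DataFrame API
printSchema(df)
head(df)
```

Unlike going through read.table() and an R data.frame first, the file is read by Spark itself here, so the data does not have to fit in the memory of the local R process.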
Shivaram

On Thu, Jun 25, 2015 at 2:15 PM, Wei Zhou <zhweisop...@gmail.com> wrote:

> Thanks to both Shivaram and Alek. Then if I want to create DataFrame from
> comma separated flat files, what would you recommend me to do? One way I
> can think of is first reading the data as you would do in r, using
> read.table(), and then create spark DataFrame out of that R dataframe, but
> it is obviously not scalable.
>
> 2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman <shiva...@eecs.berkeley.edu>:
>
>> The `head` function is not supported for the RRDD that is returned by
>> `textFile`. You can run `take(lines, 5L)`. I should add a warning here that
>> the RDD API in SparkR is private because we might not support it in the
>> upcoming releases. So if you can use the DataFrame API for your application
>> you should try that out.
>>
>> Thanks
>> Shivaram
>>
>> On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou <zhweisop...@gmail.com> wrote:
>>
>>> Hi Alek,
>>>
>>> Just a follow up question. This is what I did in sparkR shell:
>>>
>>> lines <- SparkR:::textFile(sc, "./README.md")
>>> head(lines)
>>>
>>> And I am getting error:
>>>
>>> "Error in x[seq_len(n)] : object of type 'S4' is not subsettable"
>>>
>>> I'm wondering what did I do wrong. Thanks in advance.
>>>
>>> Wei
>>>
>>> 2015-06-25 13:44 GMT-07:00 Wei Zhou <zhweisop...@gmail.com>:
>>>
>>>> Hi Alek,
>>>>
>>>> Thanks for the explanation, it is very helpful.
>>>>
>>>> Cheers,
>>>> Wei
>>>>
>>>> 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander <alek.eskil...@cerner.com>:
>>>>
>>>>> Hi there,
>>>>>
>>>>> The tutorial you’re reading there was written before the merge of
>>>>> SparkR for Spark 1.4.0
>>>>> For the merge, the RDD API (which includes the textFile() function)
>>>>> was made private, as the devs felt many of its functions were too low
>>>>> level. They focused instead on finishing the DataFrame API which supports
>>>>> local, HDFS, and Hive/HBase file reads.
>>>>> In the meantime, the devs are
>>>>> trying to determine which functions of the RDD API, if any, should be made
>>>>> public again. You can see the rationale behind this decision on the
>>>>> issue’s JIRA [1].
>>>>>
>>>>> You can still make use of those now private RDD functions by
>>>>> prepending the function call with the SparkR private namespace, for
>>>>> example, you’d use SparkR:::textFile(…).
>>>>>
>>>>> Hope that helps,
>>>>> Alek
>>>>>
>>>>> [1] -- https://issues.apache.org/jira/browse/SPARK-7230
>>>>>
>>>>> From: Wei Zhou <zhweisop...@gmail.com>
>>>>> Date: Thursday, June 25, 2015 at 3:33 PM
>>>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>>>> Subject: sparkR could not find function "textFile"
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I am exploring sparkR by activating the shell and following the
>>>>> tutorial here https://amplab-extras.github.io/SparkR-pkg/
>>>>>
>>>>> And when I tried to read in a local file with textFile(sc,
>>>>> "file_location"), it gives an error could not find function "textFile".
>>>>>
>>>>> By reading through sparkR doc for 1.4, it seems that we need
>>>>> sqlContext to import data, for example.
>>>>>
>>>>> people <- read.df(sqlContext,
>>>>> "./examples/src/main/resources/people.json", "json")
>>>>>
>>>>> And we need to specify the file type.
>>>>>
>>>>> My question is does sparkR stop supporting general type file
>>>>> importing? If not, would appreciate any help on how to do this.
>>>>>
>>>>> PS, I am trying to recreate the word count example in sparkR, and
>>>>> want to import README.md file, or just any file into sparkR.
>>>>>
>>>>> Thanks in advance.
>>>>>
>>>>> Best,
>>>>> Wei