I tried out the solution using the spark-csv package, and it works fine now :)
Thanks. Yes, I'm playing with a file with all columns as String, but the
real data I want to process are all doubles. I'm just exploring what SparkR
can do versus regular Scala Spark, as I am an R person at heart.
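
For anyone finding this thread later, here is a minimal sketch of the spark-csv route (the package coordinates, file name, and option values below are assumptions; adjust them to your Spark and Scala versions):

```r
# Launch the SparkR shell with the spark-csv package available, e.g.:
#   sparkR --packages com.databricks:spark-csv_2.10:1.2.0

# Read a CSV file into a Spark DataFrame. "header" and "inferSchema"
# are spark-csv options; with inferSchema enabled, numeric columns can
# come in as doubles rather than strings.
df <- read.df(sqlContext, "data.csv",
              source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")
printSchema(df)
```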

2015-06-25 14:26 GMT-07:00 Eskilson,Aleksander <alek.eskil...@cerner.com>:

>  Sure, I had a similar question that Shivaram was able to answer quickly
> for me; the solution is implemented using a separate Databricks library.
> Check out this thread from the email archives [1], and the read.df() command
> [2]. CSV files can be a bit tricky, especially with inferring their schemas.
> Are you using just strings as your column types right now?
>
>  Alek
>
>  [1] --
> http://apache-spark-developers-list.1001551.n3.nabble.com/CSV-Support-in-SparkR-td12559.html
> [2] -- https://spark.apache.org/docs/latest/api/R/read.df.html
>
>   From: Wei Zhou <zhweisop...@gmail.com>
> Date: Thursday, June 25, 2015 at 4:15 PM
> To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
> Cc: Aleksander Eskilson <alek.eskil...@cerner.com>, "user@spark.apache.org"
> <user@spark.apache.org>
> Subject: Re: sparkR could not find function "textFile"
>
>   Thanks to both Shivaram and Alek. If I want to create a DataFrame from
> comma-separated flat files, what would you recommend I do? One way I can
> think of is first reading the data as you would in R, using read.table(),
> and then creating a Spark DataFrame out of that R data frame, but that is
> obviously not scalable.
>
>
> 2015-06-25 13:59 GMT-07:00 Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu>:
>
>> The `head` function is not supported for the RRDD that is returned by
>> `textFile`. You can run `take(lines, 5L)`. I should add a warning here that
>> the RDD API in SparkR is private because we might not support it in the
>> upcoming releases. So if you can use the DataFrame API for your application
>> you should try that out.
>>
>>  Thanks
>>  Shivaram
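
A quick sketch of both calls side by side, for reference (the README.md path is just an example):

```r
# Private RDD API; may change or disappear in later releases.
lines <- SparkR:::textFile(sc, "README.md")

# take() works on the RRDD where head(lines) fails.
take(lines, 5L)
```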
>>
>> On Thu, Jun 25, 2015 at 1:49 PM, Wei Zhou <zhweisop...@gmail.com> wrote:
>>
>>> Hi Alek,
>>>
>>>  Just a follow up question. This is what I did in sparkR shell:
>>>
>>>  lines <- SparkR:::textFile(sc, "./README.md")
>>>  head(lines)
>>>
>>>  And I am getting error:
>>>
>>> "Error in x[seq_len(n)] : object of type 'S4' is not subsettable"
>>>
>>> I'm wondering what I did wrong. Thanks in advance.
>>>
>>> Wei
>>>
>>> 2015-06-25 13:44 GMT-07:00 Wei Zhou <zhweisop...@gmail.com>:
>>>
>>>> Hi Alek,
>>>>
>>>>  Thanks for the explanation, it is very helpful.
>>>>
>>>>  Cheers,
>>>> Wei
>>>>
>>>> 2015-06-25 13:40 GMT-07:00 Eskilson,Aleksander <
>>>> alek.eskil...@cerner.com>:
>>>>
>>>>>  Hi there,
>>>>>
>>>>>  The tutorial you’re reading was written before the merge of SparkR
>>>>> into Spark 1.4.0. In that merge, the RDD API (which includes the
>>>>> textFile() function) was made private, as the devs felt many of its
>>>>> functions were too low-level. They focused instead on finishing the
>>>>> DataFrame API, which supports local, HDFS, and Hive/HBase file reads. In
>>>>> the meantime, the devs are trying to determine which functions of the
>>>>> RDD API, if any, should be made public again. You can see the rationale
>>>>> behind this decision in the issue’s JIRA [1].
>>>>>
>>>>>  You can still make use of those now private RDD functions by
>>>>> prepending the function call with the SparkR private namespace, for
>>>>> example, you’d use
>>>>> SparkR:::textFile(…).
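>>>>>
For example, the word-count use case Wei mentions below can be sketched entirely with these private functions (a sketch only; the names below follow the pre-merge SparkR RDD API and may differ or disappear in later releases):

```r
lines <- SparkR:::textFile(sc, "README.md")

# Split each line into words, map each word to a (word, 1) pair,
# then sum the counts per word across 2 partitions.
words <- SparkR:::flatMap(lines, function(line) strsplit(line, " ")[[1]])
pairs <- SparkR:::lapply(words, function(word) list(word, 1L))
counts <- SparkR:::reduceByKey(pairs, "+", 2L)

head(SparkR:::collect(counts))
```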
>>>>>
>>>>>  Hope that helps,
>>>>> Alek
>>>>>
>>>>>  [1] -- https://issues.apache.org/jira/browse/SPARK-7230
>>>>>
>>>>>   From: Wei Zhou <zhweisop...@gmail.com>
>>>>> Date: Thursday, June 25, 2015 at 3:33 PM
>>>>> To: "user@spark.apache.org" <user@spark.apache.org>
>>>>> Subject: sparkR could not find function "textFile"
>>>>>
>>>>>   Hi all,
>>>>>
>>>>>  I am exploring sparkR by activating the shell and following the
>>>>> tutorial here https://amplab-extras.github.io/SparkR-pkg/
>>>>>
>>>>>  And when I tried to read in a local file with textFile(sc,
>>>>> "file_location"), it gave the error: could not find function "textFile".
>>>>>
>>>>>  By reading through the SparkR docs for 1.4, it seems that we need a
>>>>> sqlContext to import data, for example:
>>>>>
>>>>> people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
>>>>>
>>>>> And we need to specify the file type.
>>>>>
>>>>>  My question is: has SparkR stopped supporting the import of files of
>>>>> arbitrary type? If not, I would appreciate any help on how to do this.
>>>>>
>>>>>  PS: I am trying to recreate the word-count example in SparkR, and
>>>>> want to import the README.md file, or any file at all, into SparkR.
>>>>>
>>>>>  Thanks in advance.
>>>>>
>>>>>  Best,
>>>>> Wei
>>>>>
>>>>
>>>>
>>>
>>
>
