Ah, how have you configured your machine for Spark? Inside a Docker container?
The .numRows will actually need to run through the entire file in sequence (it calls rdd.count() under the hood) - 10 minutes sounds a little long but not unreasonable if on a single machine.

On Thu, Nov 21, 2013 at 9:24 AM, sudhir vaidya <[email protected]> wrote:

> Hey Evan,
>
> I do get the output when i load the file. I also see an output when i do
> the "x.take(5)" command.
>
> But x.numRows takes a long time to execute.. i waited for like 10 mins ...
> and had to do a Ctrl + C. My take on that is.. since the file is around 40
> Gigs and I am running it on a quadcore machine (not a very high end machine
> and its just one machine and not a cluster).. maybe it takes a lot more
> time... I am not sure though...
>
> Regards,
> Sudhir
>
>
> On Thu, Nov 21, 2013 at 11:18 AM, Evan R. Sparks <[email protected]> wrote:
>
>> What happens when you do:
>> val x = mc.loadFile("/enwiki_txt")
>>
>> and then
>> x.numRows
>> or
>> x.take(5)
>>
>> Do you see output there?
>>
>>
>>
>> On Wed, Nov 20, 2013 at 4:41 PM, sudhir vaidya <[email protected]> wrote:
>>
>>> I am a beginner and have started to go through the Mlbase exercises.
>>>
>>> But i get a java.io.indexoutofbounds.exception when i run the first
>>> command of step 2.1 here :
>>>
>>> http://ampcamp.berkeley.edu/3/exercises/mli-document-categorization.html
>>>
>>> All i am doing is Copying the command and pasting it to the spark shell
>>> interface.
>>>
>>> I tried splitting the command by loading the data set initially and
>>> filtering subsequently.. but that didnt work.
>>>
>>> I also tried to change value of "r(0)" to "r(1)" in that step. But i
>>> still get the same error.
>>>
>>> Any help is really appreciated.
>>>
>>> -Sudhir
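
P.S. To illustrate why x.take(5) comes back quickly while x.numRows does not, here is a rough plain-Scala analogy. It does not use Spark or MLI at all: a lazy Iterator stands in for the file, and the object and method names (CountVsTake, source, linesReadByTake, linesReadByCount) are made up for this sketch. It only shows the general idea that taking a few elements can stop early, whereas counting has to visit every element.

```scala
// Plain-Scala analogy only; this is NOT how Spark or MLI is implemented.
object CountVsTake {

  // Hypothetical stand-in for a large text file: produces `total` fake lines
  // lazily and calls `onRead` each time a line is actually materialized.
  private def source(total: Int)(onRead: () => Unit): Iterator[String] =
    Iterator.range(0, total).map { i =>
      onRead()
      s"line $i"
    }

  // How many lines get read when we only ask for the first `n`?
  def linesReadByTake(total: Int, n: Int): Int = {
    var read = 0
    source(total)(() => read += 1).take(n).toList // stops after n elements
    read
  }

  // How many lines get read when we count all of them?
  def linesReadByCount(total: Int): Int = {
    var read = 0
    // Fold over every element, the way a full count must.
    source(total)(() => read += 1).foldLeft(0L)((acc, _) => acc + 1)
    read
  }

  def main(args: Array[String]): Unit = {
    val total = 1000000
    println(s"take(5) read ${linesReadByTake(total, 5)} of $total lines")
    println(s"count   read ${linesReadByCount(total)} of $total lines")
  }
}
```

On a 40 GB file the same asymmetry applies: the count is bounded by how fast one machine can read the whole file off disk, so if you only need a sanity check, take(5) is the cheaper call.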
