org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist

2014-09-16 Thread Hui Li
Hi,

I am new to Spark. I just set up a small cluster and wanted to run some
simple MLlib examples. Following the instructions at
https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1,
I could run everything successfully until the SVMWithSGD step, where I got
the error message below. I don't know why
file:/root/test/sample_svm_data.txt "does not exist", since I had already
read it in, printed it, converted it into labeled data, and passed the
parsed data to SVMWithSGD.
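
For reference, here is roughly what I ran, following the
binary-classification example from that guide (sc is the SparkContext
provided by spark-shell; only the input path differs from the guide's):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Load and parse the data file (with a file:/ path, each worker
    // reads the file from its own local filesystem).
    val data = sc.textFile("file:/root/test/sample_svm_data.txt")
    val parsedData = data.map { line =>
      val parts = line.split(' ')
      LabeledPoint(parts(0).toDouble, parts.tail.map(_.toDouble))
    }

    // Train the SVM; this is the step that fails.
    val numIterations = 20
    val model = SVMWithSGD.train(parsedData, numIterations)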

Has anyone else run into the same issue?

Thanks,

Emily

 val model = SVMWithSGD.train(parsedData, numIterations)
14/09/16 10:55:21 INFO SparkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121
14/09/16 10:55:21 INFO DAGScheduler: Got job 11 (first at GeneralizedLinearAlgorithm.scala:121) with 1 output partitions (allowLocal=true)
14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 11 (first at GeneralizedLinearAlgorithm.scala:121)
14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
14/09/16 10:55:21 INFO DAGScheduler: Computing the requested partition locally
14/09/16 10:55:21 INFO HadoopRDD: Input split: file:/root/test/sample_svm_data.txt:0+19737
14/09/16 10:55:21 INFO SparkContext: Job finished: first at GeneralizedLinearAlgorithm.scala:121, took 0.002697478 s
14/09/16 10:55:21 INFO SparkContext: Starting job: count at DataValidators.scala:37
14/09/16 10:55:21 INFO DAGScheduler: Got job 12 (count at DataValidators.scala:37) with 2 output partitions (allowLocal=false)
14/09/16 10:55:21 INFO DAGScheduler: Final stage: Stage 12 (count at DataValidators.scala:37)
14/09/16 10:55:21 INFO DAGScheduler: Parents of final stage: List()
14/09/16 10:55:21 INFO DAGScheduler: Missing parents: List()
14/09/16 10:55:21 INFO DAGScheduler: Submitting Stage 12 (FilteredRDD[26] at filter at DataValidators.scala:37), which has no missing parents
14/09/16 10:55:21 INFO DAGScheduler: Submitting 2 missing tasks from Stage 12 (FilteredRDD[26] at filter at DataValidators.scala:37)
14/09/16 10:55:21 INFO TaskSchedulerImpl: Adding task set 12.0 with 2 tasks
14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:0 as TID 24 on executor 2: eecvm0206.demo.sas.com (PROCESS_LOCAL)
14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:0 as 1733 bytes in 0 ms
14/09/16 10:55:21 INFO TaskSetManager: Starting task 12.0:1 as TID 25 on executor 5: eecvm0203.demo.sas.com (PROCESS_LOCAL)
14/09/16 10:55:21 INFO TaskSetManager: Serialized task 12.0:1 as 1733 bytes in 0 ms
14/09/16 10:55:21 WARN TaskSetManager: Lost TID 24 (task 12.0:0)
14/09/16 10:55:21 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: File file:/root/test/sample_svm_data.txt does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:156)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:109)
    at org.apache.spark.scheduler.Task.run(Task.scala:53)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
    at ...

A problem running MLlib in the Amazon cloud

2014-09-08 Thread Hui Li
I am running a very simple example using SVMWithSGD on Amazon EMR, and I
have not gotten any result after a full hour.

My instance type is: m3.large
  instance count is: 3
Dataset: sample_svm_data, the file provided with Apache MLlib

The number of iterations is 2, and all other options are left at their
default values.
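
For concreteness, the training call is roughly the following (parsedData is
the RDD of LabeledPoint parsed as in the MLlib guide):

    // Two iterations of SGD; every other SVMWithSGD parameter
    // (step size, regularization, mini-batch fraction) is left
    // at its default.
    val numIterations = 2
    val model = SVMWithSGD.train(parsedData, numIterations)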

Is there anyone who can help me out with this?

Thanks,

Hui