Hmm, well if that’s so then you are also able to see the data, since you’re 
reading and writing to the same S3 location in either case. The only difference 
is the Spark master, so perhaps it’s a Spark issue? I’m not sure I can help 
much more since I don’t have access to the same setup as you. Is the Spark 
community able to help, or at least throw the ball back in my court?

Does the debug output indicate that the read and computation went OK? Does it 
look the same as when running locally? Any new warnings earlier in the run? BTW, 
to get local mode to use multiple cores, run with the master set to something 
like “local[4]”.
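
As a rough sketch of that last suggestion, here is the cluster command quoted further down with only the master changed. The CORES variable and the mahout_local_cmd helper are hypothetical names used for illustration. The script only prints the command line so it can be reviewed; nothing is executed, and the JAVA_HOME, SPARK_HOME and MAHOUT_HOME settings from the quoted command would still be needed for a real run.

```shell
#!/bin/sh
# Hypothetical helper variable: number of local cores to use.
CORES="${CORES:-4}"
MASTER="local[${CORES}]"

# Print the command line for review; this function does not run Mahout.
# The master is shown in single quotes because an unquoted [4] is a glob
# pattern to the shell and could match a file named "local4".
mahout_local_cmd() {
  printf '%s ' bin/mahout spark-itemsimilarity \
    --input s3n://recommendation-logs/2014/09/06 \
    --output s3n://recommendation-outputs/2014/09/06 \
    --filenamePattern "'.*'" --recursive \
    --master "'${MASTER}'"
  printf '\n'
}

mahout_local_cmd
```

Copying the printed line into the shell on the box should then run the job on four local cores instead of one.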

On Sep 16, 2014, at 1:22 PM, Phil Wills <[email protected]> wrote:

No, by local I mean running on a large ec2 box spun up by the same
script, but running the 'mahout spark-itemsimilarity' command without a
master specified, so that it runs locally on that box. I'm therefore
confident the versions are the same in both the local and the
cluster-distributed modes. Apologies for the lack of clarity.

Phil

On Tue, Sep 16, 2014 at 7:48 PM, Pat Ferrel <[email protected]> wrote:

> By local I assume you are talking about your dev machine, not one of the
> cluster machines.
> 
> Excuse me if I’m stating the obvious but you are using two completely
> different Spark and Hadoop installations, one local and one remote. They
> could be completely different codebases. Just because you have configured
> Spark and Hadoop to execute locally doesn’t mean they work remotely. It
> sounds like you are using the CLI on your dev machine, which is set to run
> locally, and passing a remote Spark master URI and S3 URI to the local
> Mahout script. I would install and set up Mahout on your cluster master,
> make sure MAHOUT_LOCAL is not set there since you will be using a cluster,
> and execute the CLI from there.
> 
> Furthermore are you sure that the remote Spark cluster can see the S3
> data? Ssh to the Spark master and do something like “hadoop fs -ls” or
> supply the URI to verify that the Hadoop config on the remote cluster,
> which is what the remote Spark will use, can get to the data.
> 
> 
> On Sep 15, 2014, at 2:28 PM, Phil Wills <[email protected]> wrote:
> 
> The data and s3n file system are OK: when I run 'locally' (that is,
> without a master specified, but otherwise identically) it works fine. I've
> been using the spark-ec2 scripts to retrieve spark and hadoop, so had
> assumed that meant they were operating compatible versions, but I'm not
> specifying which hadoop to use explicitly, so I don't know if that has an
> effect.
> 
> Phil
> 
> On Mon, Sep 15, 2014 at 7:25 PM, Pat Ferrel <[email protected]> wrote:
> 
>> It should handle this input, no surprise.
>> 
>> Spark must be compiled for the correct version of Hadoop that you are
>> using (Mahout also). I’d make sure Spark is working properly with your HDFS
>> by trying one of their examples if you haven’t already. Running locally may
>> not be using the same version of Hadoop; have you checked that?
>> 
>> A filenamePattern of ‘.*’ will get all files in
>> s3n://recommendation-logs/2014/09/06 and you have it set to search
>> recursively. Check to make sure this is what you want. Did you use the same
>> dir structure as you have on s3n when you ran locally? Since this driver
>> looks at text files, it can think it is working on data if it finds “[\t, ]”
>> (a tab, comma, or space) in the line when it’s actually reading garbage, so
>> you should be sure it is working on only the files you want. Tell it to look
>> for only a tab if that’s what you are using, or use a regex to match the
>> entire filename like “^part.*” or “.*log”.
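
To make the two points above concrete, here is a small sketch using only grep and awk; it does not invoke the driver. The file names and the "garbage" line are hypothetical. grep -x is used on the assumption that the pattern has to match the whole filename, and awk merely stands in for splitting a line on the delimiter.

```shell
#!/bin/sh
# Hypothetical directory listing, one name per line.
FILES="part-00000
part-00001
_SUCCESS
access.log
.part-00000.crc"

# List the names that a pattern matches in full (grep -x = whole line).
match() { printf '%s\n' "$FILES" | grep -Ex -- "$1"; }

# Count the fields a line splits into for a given delimiter regex.
fields() { printf '%s\n' "$2" | awk -F"$1" '{ print NF }'; }

TAB=$(printf '\t')
GOOD="user1${TAB}http://www.example.com/a"   # made-up tab-separated record
JUNK="not a log line"                        # made-up line from a stray file

echo "'.*' matches:";      match '.*'
echo "'^part.*' matches:"; match '^part.*'
echo "'.*log' matches:";   match '.*log'

echo "fields in GOOD with [tab, comma, space]: $(fields '[\t, ]' "$GOOD")"
echo "fields in JUNK with [tab, comma, space]: $(fields '[\t, ]' "$JUNK")"
echo "fields in JUNK with tab only:            $(fields '\t' "$JUNK")"
```

In this sketch the catch-all pattern also picks up the marker and checksum files, and the stray line still splits into several fields under the permissive delimiter, while a tab-only delimiter leaves it as a single field.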
>> 
>> I have not tested with s3n:// URIs. I assume you can read all these with
>> the hadoop tools like “hadoop fs -ls s3n://recommendation-logs/2014/09/06”?
>> 
>> Off-list I’ll send a link to epinions data formatted for Mahout. You can
>> try putting that in HDFS via s3n and running it because I have tested that
>> on a cluster. It is all in one file though, so if there is a problem in file
>> discovery it won’t show up.
>> 
>> 
>> On Sep 15, 2014, at 9:10 AM, Phil Wills <[email protected]> wrote:
>> 
>> Tried running locally on a reasonably beefy machine and it worked fine.
>> Which is the toy data, you're referring to?
>> 
>> JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64 SPARK_HOME=/root/spark
>> MAHOUT_HOME=. bin/mahout spark-itemsimilarity --input
>> s3n://recommendation-logs/2014/09/06 --output
>> s3n://recommendation-outputs/2014/09/06 --filenamePattern '.*' --recursive
>> --master spark://ec2-54-75-13-36.eu-west-1.compute.amazonaws.com:7077
>> --sparkExecutorMem 6g
>> 
>> and the working version running locally on a beefier box:
>> 
>> JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64 SPARK_HOME=/root/spark
>> MAHOUT_HOME=. MAHOUT_HEAPSIZE=16000 bin/mahout spark-itemsimilarity --input
>> s3n://ophan-recommendation-logs/2014/09/06 --output
>> s3n://ophan-recommendation-outputs/2014/09/06 --filenamePattern '.*'
>> --recursive --sparkExecutorMem 16g
>> 
>> Sample input:
>> 
>> nnS1dIIBBtTnehVD79lgYeBw        http://www.example.com/world/2014/sep/05/malaysia-airlines-mh370-six-months-chinese-families-lack-answers
>> ikFSk14vHrTPqjSISvMihDUg        http://www.example.com/world/2014/sep/05/obama-core-coalition-10-countries-to-fight-isis
>> edqu8kfgsFSg2w3MhV5rUwuQ        http://www.example.com/lifeandstyle/wordofmouth/2014/sep/05/food-and-drink2?CMP=fb_gu
>> pfnmfONG1DQWG_EOOIxUASow        http://www.example.com/world/live/2014/sep/05/unresponsive-plane-f15-jets-aircraft-live-updates
>> pfUil_W0s2TZSqojMQrVcxVw        http://www.example.com/football/blog/2014/sep/05/jose-mourinho-bargain-loic-remy-chelsea-france
>> nxTJnpyenFSP-tqWSLHQdW8w        http://www.example.com/books/2014/sep/05/were-we-happier-in-the-stone-age
>> lba37jwJVQS5GbiSuus1i6tA        http://www.example.com/stage/2014/sep/05/titus-andronicus-review-visually-striking-but-flawed
>> bEHaOzZPbtQz-X2K1wortBQQ        http://www.example.com/cities/2014/sep/05/death-america-suburban-dream-ferguson-missouri-resegregation
>> gjTGzDXiDOT5W2SThhm0tUmg        http://www.example.com/world/2014/sep/05/man-jailed-phoning-texting-ex-21807-times
>> pfFbQ5ddvBRhm0XLZbN6Xd2A        http://www.example.com/sport/2014/sep/05/gloucester-northampton-premiership-rugby
>> 
>> 
>> 
>> On Sun, Sep 14, 2014 at 4:06 PM, Pat Ferrel <[email protected]> wrote:
>> 
>>> I wonder if it’s trying to write an empty rdd to a text file. Can you give
>>> the CLI options and a snippet of data?
>>> 
>>> Also have you successfully run this on the toy data in the resource dir?
>>> There is a script to run it locally that you can adapt for running on a
>>> cluster. This will eliminate any cluster problem.
>>> 
>>> 
>>> On Sep 13, 2014, at 1:13 PM, Phil Wills <[email protected]> wrote:
>>> 
>>> Here's the master log from the line with the stack trace to termination:
>>> 
>>> 14/09/12 15:54:55 INFO scheduler.DAGScheduler: Failed to run saveAsTextFile at TextDelimitedReaderWriter.scala:288
>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:3 failed 4 times, most recent failure: TID 448 on host ip-10-105-176-77.eu-west-1.compute.internal failed for unknown reason
>>> Driver stacktrace:
>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>>> at scala.Option.foreach(Option.scala:236)
>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> 14/09/12 15:54:55 INFO scheduler.DAGScheduler: Executor lost: 8 (epoch 20)
>>> 14/09/12 15:54:55 INFO storage.BlockManagerMasterActor: Trying to remove executor 8 from BlockManagerMaster.
>>> 14/09/12 15:54:55 INFO storage.BlockManagerMaster: Removed 8 successfully in removeExecutor
>>> 14/09/12 15:54:55 INFO storage.BlockManagerInfo: Registering block manager ip-10-105-176-77.eu-west-1.compute.internal:58803 with 3.4 GB RAM
>>> 14/09/12 15:54:55 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://[email protected]:56590/user/Executor#1456047585] with ID 9
>>> 
>>> On Sat, Sep 13, 2014 at 4:21 PM, Pat Ferrel <[email protected]> wrote:
>>> 
>>>> It’s not an error I’ve seen but they can tend to be pretty cryptic. Could
>>>> you post more of the stack trace?
>>>> 
>>>> On Sep 12, 2014, at 2:55 PM, Phil Wills <[email protected]> wrote:
>>>> 
>>>> I've tried on 1.0.1 and 1.0.2, updating the pom to 1.0.2 when running on
>>>> that. I used the spark-ec2 scripts to set up the cluster.
>>>> 
>>>> I might be able to share the data. I'll mull it over the weekend to make
>>>> sure there's nothing sensitive, or see if there's a way I can transform it
>>>> to that point.
>>>> 
>>>> Phil
>>>> 
>>>> 
>>>> On Fri, Sep 12, 2014 at 6:30 PM, Pat Ferrel <[email protected]> wrote:
>>>> 
>>>>> The mahout pom says 1.0.1 but I’m running fine on 1.0.2
>>>>> 
>>>>> 
>>>>> On Sep 12, 2014, at 10:08 AM, Pat Ferrel <[email protected]> wrote:
>>>>> 
>>>>> Is it a mature Spark cluster? What version of Spark?
>>>>> 
>>>>> If you can share the data I can try it on mine.
>>>>> 
>>>>> On Sep 12, 2014, at 9:42 AM, Phil Wills <[email protected]> wrote:
>>>>> 
>>>>> I've been experimenting with the fairly new ItemSimilarityDriver, which is
>>>>> working fine up until the point it tries to write out its results.
>>>>> Initially I was getting an issue with the akka frameSize being too small,
>>>>> but after expanding that I'm now getting a much more cryptic error:
>>>>> 
>>>>> 14/09/12 15:54:55 INFO scheduler.DAGScheduler: Failed to run saveAsTextFile at TextDelimitedReaderWriter.scala:288
>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:3 failed 4 times, most recent failure: TID 448 on host ip-10-105-176-77.eu-west-1.compute.internal failed for unknown reason
>>>>> 
>>>>> This is from the master node, but there doesn't seem to be anything more
>>>>> intelligible in the slave node logs.
>>>>> 
>>>>> I've tried writing to the local file system as well as s3n and can see it's
>>>>> not an access problem, as I am seeing a zero length file appear.
>>>>> 
>>>>> Thanks for any pointers and apologies if this would be better to ask on the
>>>>> Spark list,
>>>>> 
>>>>> Phil
