It should handle this input with no surprises.

Spark must be compiled for the correct version of Hadoop that you are using
(and so must Mahout). I’d make sure Spark is working properly with your HDFS by
trying one of their examples if you haven’t already. Running locally may not be
using the same version of Hadoop; have you checked that?
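
For example, a quick check from the Spark shell exercises both the cluster and
HDFS (a minimal sketch; the HDFS path here is hypothetical, the master URI is
the one from your command):

$SPARK_HOME/bin/spark-shell --master spark://ec2-54-75-13-36.eu-west-1.compute.amazonaws.com:7077
scala> sc.textFile("hdfs:///user/hadoop/some-file.txt").count()

A Hadoop version mismatch usually surfaces here as an IPC/protocol version
error, which is much easier to read than a failed task on a later stage.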

A filenamePattern of ‘.*’ will get all files in
s3n://recommendation-logs/2014/09/06, and you have it set to search recursively.
Check to make sure this is what you want. Did you use the same dir structure on
s3n as you had when you ran locally? Since this driver reads text files, it can
think it is working on real data whenever it finds a match for “[\t, ]” (a tab,
comma, or space) in a line, even when the line is garbage, so you should be
sure it is working on only the files you want. Tell it to look for only a tab
if that’s what you are using, or use a regex that matches the entire filename,
like “^part.*” or “.*log”.
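
For example, something like this (a sketch, untested; “^part.*” assumes your
files are named part-xxxxx, and I believe the delimiter option is --inDelim,
but check bin/mahout spark-itemsimilarity --help for the exact name):

bin/mahout spark-itemsimilarity \
  --input s3n://recommendation-logs/2014/09/06 \
  --output s3n://recommendation-outputs/2014/09/06 \
  --filenamePattern '^part.*' --recursive \
  --inDelim '\t' \
  --master spark://ec2-54-75-13-36.eu-west-1.compute.amazonaws.com:7077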

I have not tested with s3n:// URIs. I assume you can read all of these with the
Hadoop tools, e.g. “hadoop fs -ls s3n://recommendation-logs/2014/09/06”?
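
A recursive listing will also show you exactly what the pattern is going to
pick up (-lsr is the older spelling; newer Hadoop versions use -ls -R):

hadoop fs -lsr s3n://recommendation-logs/2014/09/06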

Off-list I’ll send a link to the Epinions data formatted for Mahout. You can
try putting that in HDFS via s3n and running it, because I have tested that on
a cluster. It is all in one file, though, so if there is a problem in file
discovery it won’t show up.


On Sep 15, 2014, at 9:10 AM, Phil Wills <[email protected]> wrote:

Tried running locally on a reasonably beefy machine and it worked fine.
Which is the toy data you're referring to?

JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64 SPARK_HOME=/root/spark \
MAHOUT_HOME=. bin/mahout spark-itemsimilarity \
  --input s3n://recommendation-logs/2014/09/06 \
  --output s3n://recommendation-outputs/2014/09/06 \
  --filenamePattern '.*' --recursive \
  --master spark://ec2-54-75-13-36.eu-west-1.compute.amazonaws.com:7077 \
  --sparkExecutorMem 6g

and the working version running locally on a beefier box:

JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64 SPARK_HOME=/root/spark \
MAHOUT_HOME=. MAHOUT_HEAPSIZE=16000 bin/mahout spark-itemsimilarity \
  --input s3n://ophan-recommendation-logs/2014/09/06 \
  --output s3n://ophan-recommendation-outputs/2014/09/06 \
  --filenamePattern '.*' --recursive \
  --sparkExecutorMem 16g

Sample input:

nnS1dIIBBtTnehVD79lgYeBw	http://www.example.com/world/2014/sep/05/malaysia-airlines-mh370-six-months-chinese-families-lack-answers
ikFSk14vHrTPqjSISvMihDUg	http://www.example.com/world/2014/sep/05/obama-core-coalition-10-countries-to-fight-isis
edqu8kfgsFSg2w3MhV5rUwuQ	http://www.example.com/lifeandstyle/wordofmouth/2014/sep/05/food-and-drink2?CMP=fb_gu
pfnmfONG1DQWG_EOOIxUASow	http://www.example.com/world/live/2014/sep/05/unresponsive-plane-f15-jets-aircraft-live-updates
pfUil_W0s2TZSqojMQrVcxVw	http://www.example.com/football/blog/2014/sep/05/jose-mourinho-bargain-loic-remy-chelsea-france
nxTJnpyenFSP-tqWSLHQdW8w	http://www.example.com/books/2014/sep/05/were-we-happier-in-the-stone-age
lba37jwJVQS5GbiSuus1i6tA	http://www.example.com/stage/2014/sep/05/titus-andronicus-review-visually-striking-but-flawed
bEHaOzZPbtQz-X2K1wortBQQ	http://www.example.com/cities/2014/sep/05/death-america-suburban-dream-ferguson-missouri-resegregation
gjTGzDXiDOT5W2SThhm0tUmg	http://www.example.com/world/2014/sep/05/man-jailed-phoning-texting-ex-21807-times
pfFbQ5ddvBRhm0XLZbN6Xd2A	http://www.example.com/sport/2014/sep/05/gloucester-northampton-premiership-rugby



On Sun, Sep 14, 2014 at 4:06 PM, Pat Ferrel <[email protected]> wrote:

> I wonder if it’s trying to write an empty rdd to a text file. Can you give
> the CLI options and a snippet of data?
> 
> Also have you successfully run this on the toy data in the resource dir?
> There is a script to run it locally that you can adapt for running on a
> cluster. This will eliminate any cluster problem.
> 
> 
> On Sep 13, 2014, at 1:13 PM, Phil Wills <[email protected]> wrote:
> 
> Here's the master log from the line with the stack trace to termination:
> 
> 14/09/12 15:54:55 INFO scheduler.DAGScheduler: Failed to run saveAsTextFile at TextDelimitedReaderWriter.scala:288
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:3 failed 4 times, most recent failure: TID 448 on host ip-10-105-176-77.eu-west-1.compute.internal failed for unknown reason
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/09/12 15:54:55 INFO scheduler.DAGScheduler: Executor lost: 8 (epoch 20)
> 14/09/12 15:54:55 INFO storage.BlockManagerMasterActor: Trying to remove executor 8 from BlockManagerMaster.
> 14/09/12 15:54:55 INFO storage.BlockManagerMaster: Removed 8 successfully in removeExecutor
> 14/09/12 15:54:55 INFO storage.BlockManagerInfo: Registering block manager ip-10-105-176-77.eu-west-1.compute.internal:58803 with 3.4 GB RAM
> 14/09/12 15:54:55 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://[email protected]:56590/user/Executor#1456047585] with ID 9
> 
> On Sat, Sep 13, 2014 at 4:21 PM, Pat Ferrel <[email protected]> wrote:
> 
>> It’s not an error I’ve seen but they can tend to be pretty cryptic. Could
>> you post more of the stack trace?
>> 
>> On Sep 12, 2014, at 2:55 PM, Phil Wills <[email protected]> wrote:
>> 
>> I've tried on 1.0.1 and 1.0.2, updating the pom to 1.0.2 when running on
>> that.  I used the spark-ec2 scripts to set up the cluster.
>> 
>> I might be able to share the data; I'll mull it over the weekend to make
>> sure there's nothing sensitive, or see if there's a way I can transform it
>> to that point.
>> 
>> Phil
>> 
>> 
>> On Fri, Sep 12, 2014 at 6:30 PM, Pat Ferrel <[email protected]>
> wrote:
>> 
>>> The Mahout pom says Spark 1.0.1, but I’m running fine on 1.0.2
>>> 
>>> 
>>> On Sep 12, 2014, at 10:08 AM, Pat Ferrel <[email protected]> wrote:
>>> 
>>> Is it a mature Spark cluster? What version of Spark?
>>> 
>>> If you can share the data I can try it on mine.
>>> 
>>> On Sep 12, 2014, at 9:42 AM, Phil Wills <[email protected]> wrote:
>>> 
>>> I've been experimenting with the fairly new ItemSimilarityDriver, which is
>>> working fine up until the point it tries to write out its results.
>>> Initially I was getting an issue with the akka frameSize being too small,
>>> but after expanding that I'm now getting a much more cryptic error:
>>> 
>>> 14/09/12 15:54:55 INFO scheduler.DAGScheduler: Failed to run saveAsTextFile at TextDelimitedReaderWriter.scala:288
>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 8.0:3 failed 4 times, most recent failure: TID 448 on host ip-10-105-176-77.eu-west-1.compute.internal failed for unknown reason
>>> 
>>> This is from the master node, but there doesn't seem to be anything more
>>> intelligible in the slave node logs.
>>>
>>> I've tried writing to the local file system as well as s3n and can see it's
>>> not an access problem, as I am seeing a zero-length file appear.
>>> 
>>> Thanks for any pointers, and apologies if this would be better to ask on
>>> the Spark list,
>>>
>>> Phil
