OK, got it to reproduce. This is not what I expected: it's too many columns in a vector, hmm. Found the other user's issue, which was null input, not a bug.
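The "too many columns" remark lines up with the stack traces below, where org.apache.mahout.math.AbstractVector.viewPart throws an IndexException as soon as a requested slice runs past the vector's declared cardinality. A minimal, illustrative sketch of that failure mode, using the numbers from the failing run and assuming mahout-math is on the classpath (e.g. in the Mahout spark-shell); this is only an illustration, not the actual code path inside AtA:

    import org.apache.mahout.math.RandomAccessSparseVector

    // A row vector sized to the number of unique itemIds in the failing run.
    val row = new RandomAccessSparseVector(24166)

    // A view that stays inside [0, 24166) is fine.
    row.viewPart(0, 1000)

    // A view reaching column 24190 throws:
    // org.apache.mahout.math.IndexException: Index 24190 is outside allowable range of [0,24166)
    row.viewPart(24166, 24)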
BTW when did you update Mahout? Just put in the ability to point to dirs so I assume recently?

On Apr 3, 2015, at 9:08 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Yeah, that’s exactly what the other user is doing. This should be a common architecture in the future. I’m already looking at the other so will add this too. Thanks a bunch for the data.

On Apr 3, 2015, at 8:58 AM, Michael Kelly <mich...@onespot.com> wrote:

Yes, we are using a Spark streaming job to create the input, and I wasn't repartitioning it, so there were a lot of parts. I'm testing it out now with repartitioning to see if that works (see the coalesce sketch at the end of this thread).

This is just a single interaction type.

Thanks again,
Michael

On Fri, Apr 3, 2015 at 4:52 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> This sounds like a bug. Thanks for the sample input and narrowing it down.
> I’ll look at it today.
>
> I got a similar question from another user with a lot of part files. A Spark
> streaming job creates the part files. Is that what you are doing?
>
> Is this a single interaction type?
>
>
> On Apr 3, 2015, at 6:23 AM, Michael Kelly <mich...@onespot.com> wrote:
>
> Hi Pat,
>
> I've done some further digging and it looks like the problem is
> occurring when the input files are split up into parts. The input
> to the item-similarity job is the output from a Spark job and it
> ends up in about 2000 parts (on the Hadoop file system). I have
> reproduced the error locally using a small subset of the rows.
>
> This is a snippet of the file I am using -
>
> ...
> 5138353282348067470,1891081885
> 4417954190713934181,1828065687
> 133682221673920382,1454844406
> 133682221673920382,1129053737
> 133682221673920382,548627241
> 133682221673920382,1048452021
> 8547417492653230933,1121310481
> 7693904559640861382,1333374361
> 7204049418352603234,606209305
> 139299176617553863,467181330
> ...
>
> When I run item-similarity against a single input file which
> contains all the rows, the job succeeds without error.
>
> When I break the input file up into 100 parts and use the directory
> containing them as input, I get the 'Index outside allowable
> range' exception.
>
> Here are the input files that I used, tarred and gzipped -
>
> https://s3.amazonaws.com/static.onespot.com/mahout/passing_single_file.tar.gz
> https://s3.amazonaws.com/static.onespot.com/mahout/failing_split_into_100_parts.tar.gz
>
> There are 44067 rows in total, 11858 unique userIds and 24166 unique itemIds.
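Those counts are easy to double-check from a spark-shell before running the job. A minimal sketch, assuming the comma-delimited pairs above have been unpacked into a local directory (the path is a placeholder):

    // Count rows and distinct ids in the user,item tuple files.
    val lines = sc.textFile("failing_split_into_100_parts/part-*")
    val pairs = lines.map(_.split(",")).map(a => (a(0), a(1)))

    pairs.count()                     // expected: 44067 rows
    pairs.keys.distinct().count()     // expected: 11858 unique userIds
    pairs.values.distinct().count()   // expected: 24166 unique itemIds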
>
> This is the exception that I see on the 100 part run -
>
> 15/04/03 12:07:09 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 707)
> org.apache.mahout.math.IndexException: Index 24190 is outside allowable range of [0,24166)
> at org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
> at org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
> at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
> at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> I tried splitting the file up into 10, 20 and 50 parts and the job completed.
> Also, should the resulting similarity matrix be the same whether the
> input is split up or not? I passed in the same random seed for the
> Spark job, but the matrices were different.
>
> Thanks,
>
> Michael
>
>
> On Thu, Apr 2, 2015 at 6:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>> The input must be tuples (if not using a filter) so the CLI you have expects
>> user and item ids that are
>>
>> user-id1,item-id1
>> user-id500,item-id3000
>> …
>>
>> The ids must be tokenized because it doesn’t use a full csv parser, only
>> lines of delimited text.
>>
>> If this doesn’t help, can you supply a snippet of the input?
>>
>>
>> On Apr 2, 2015, at 10:39 AM, Michael Kelly <mich...@onespot.com> wrote:
>>
>> Hi all,
>>
>> I'm running the spark-itemsimilarity job from the CLI on an AWS EMR
>> cluster, and I'm running into an exception.
>>
>> The input file format is
>> UserId<tab>ItemId1<tab>ItemId2<tab>ItemId3......
>>
>> There is only one row per user, and a total of 97,000 rows.
>>
>> I also tried input with one row per UserId/ItemId pair, which had
>> about 250,000 rows, but I also saw a similar exception, this time the
>> out of bounds index was around 110,000.
>>
>> The input is stored in HDFS and this is the command I used to start the job -
>>
>> mahout spark-itemsimilarity --input userItems --output output --master yarn-client
>>
>> Any idea what the problem might be?
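As an aside on the tuple format Pat describes above: a hedged sketch of flattening the one-row-per-user, tab-separated input into one user,item pair per line before handing it to spark-itemsimilarity. The input and output paths are placeholders, and this only illustrates the expected shape of the input, nothing more:

    // spark-shell: expand UserId<tab>ItemId1<tab>ItemId2... rows into user,item pairs.
    val rows = sc.textFile("userItems")
    val pairs = rows.flatMap { line =>
      val fields = line.split("\t")
      val user = fields.head
      fields.tail.map(item => user + "," + item)
    }
    pairs.saveAsTextFile("userItemPairs")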
>>
>> Thanks,
>>
>> Michael
>>
>>
>> 15/04/02 16:37:40 WARN TaskSetManager: Lost task 1.0 in stage 10.0
>> (TID 7631, ip-XX.XX.ec2.internal):
>> org.apache.mahout.math.IndexException: Index 22050 is outside allowable range of [0,21997)
>> org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
>> org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
>> org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
>> org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
>> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
>> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
>> scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
>> scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
>> scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
>> scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
>> scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
>> scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
>> scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
>> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> org.apache.spark.scheduler.Task.run(Task.scala:54)
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:745)
>
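Finally, a minimal sketch of the workaround Michael mentions trying above (repartitioning the streaming job's output so it lands in fewer part files). The DStream name, paths and partition count are hypothetical; it only illustrates coalescing before the write:

    // Inside the Spark streaming job that produces the user,item pairs:
    // shrink the number of output partitions before writing each batch.
    pairsStream.foreachRDD { rdd =>
      rdd.coalesce(10).saveAsTextFile("hdfs:///userItems/batch-" + System.currentTimeMillis)
    }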