No, I don't think there's a workaround; it needs a fix. However, the public version needs many more fixes, so I think this part will be refactored completely in 0.10.1.
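For context, the failing frames in the traces below all point at the physical A'A operator behind the Samsara DSL (org.apache.mahout.sparkbindings.blas.AtA). A minimal sketch that exercises that same path, assuming a local Mahout Spark context; the matrix contents and partition count are purely illustrative:

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "ata-check")

// A small DRM deliberately spread over several partitions, so that the
// A'A computation has to combine partial products across tasks.
val drmA = drmParallelize(dense((1, 0, 1), (0, 1, 1), (1, 1, 0)), numPartitions = 3)

// drmA.t %*% drmA is rewritten to the physical A'A plan, i.e. the
// AtA.scala code that appears in the stack traces below.
val ata = (drmA.t %*% drmA).collect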
On Fri, Apr 3, 2015 at 12:38 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> OK, it was. Is there a workaround I can try?
>
> On Apr 3, 2015, at 12:22 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> Although... I am not aware of one in A'A.
>
> It could be a faulty vector length in a matrix, if the matrix was created
> by drmWrap with an explicit specification of ncol.
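A minimal sketch of that failure mode: wrapping a row RDD with drmWrap while declaring an ncol smaller than the column indices actually present. This assumes a SparkContext sc in scope (as in the Mahout spark-shell), and all names and dimensions here are illustrative:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.sparkbindings._

// One row whose vector carries a column index (150) beyond the
// geometry we are about to declare.
val v: Vector = new RandomAccessSparseVector(200)
v.setQuick(150, 1.0)
val rows = sc.parallelize(Seq(0 -> v))

// Declaring ncol = 100 even though index 150 exists does not fail here;
// downstream operators that trust the declared geometry (A'A takes a
// viewPart over [0, ncol)) throw IndexException much later instead.
val drmA = drmWrap(rdd = rows, ncol = 100)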
> On Fri, Apr 3, 2015 at 12:20 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > It's a bug. There are a number of similar ones in operator A'B.
> >
> > On Fri, Apr 3, 2015 at 6:23 AM, Michael Kelly <mich...@onespot.com> wrote:
> >
> >> Hi Pat,
> >>
> >> I've done some further digging, and it looks like the problem occurs
> >> when the input files are split up into parts. The input to the
> >> item-similarity job is the output of a Spark job, and it ends up in
> >> about 2000 parts (on the Hadoop file system). I have reproduced the
> >> error locally using a small subset of the rows.
> >>
> >> This is a snippet of the file I am using:
> >>
> >> ...
> >> 5138353282348067470,1891081885
> >> 4417954190713934181,1828065687
> >> 133682221673920382,1454844406
> >> 133682221673920382,1129053737
> >> 133682221673920382,548627241
> >> 133682221673920382,1048452021
> >> 8547417492653230933,1121310481
> >> 7693904559640861382,1333374361
> >> 7204049418352603234,606209305
> >> 139299176617553863,467181330
> >> ...
> >>
> >> When I run item-similarity against a single input file that contains
> >> all the rows, the job succeeds without error. When I break the input
> >> file up into 100 parts and use the directory containing them as the
> >> input, I get the 'Index outside allowable range' exception.
> >>
> >> Here are the input files I used, tarred and gzipped:
> >>
> >> https://s3.amazonaws.com/static.onespot.com/mahout/passing_single_file.tar.gz
> >> https://s3.amazonaws.com/static.onespot.com/mahout/failing_split_into_100_parts.tar.gz
> >>
> >> There are 44067 rows in total, 11858 unique userIds and 24166 unique itemIds.
> >>
> >> This is the exception that I see on the 100-part run:
> >>
> >> 15/04/03 12:07:09 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 707)
> >> org.apache.mahout.math.IndexException: Index 24190 is outside allowable range of [0,24166)
> >> at org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
> >> at org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
> >> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
> >> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
> >> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
> >> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
> >> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >> at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
> >> at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
> >> at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
> >> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >> at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
> >> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
> >> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> >> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> >> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> >> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >> I tried splitting the file up into 10, 20 and 50 parts, and the job completed.
> >> Also, should the resulting similarity matrix be the same whether the
> >> input is split up or not? I passed in the same random seed for the
> >> Spark job, but the matrices were different.
> >>
> >> Thanks,
> >>
> >> Michael
> >>
> >> On Thu, Apr 2, 2015 at 6:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>> The input must be tuples (if not using a filter), so the CLI you have
> >>> expects user and item ids like:
> >>>
> >>> user-id1,item-id1
> >>> user-id500,item-id3000
> >>> …
> >>>
> >>> The ids must be tokenized because it doesn't use a full CSV parser,
> >>> only lines of delimited text.
> >>>
> >>> If this doesn't help, can you supply a snippet of the input?
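Roughly what that means per line, as an illustrative sketch (not the actual reader code): each line is split on the delimiter, and the resulting id strings are mapped through string-to-integer dictionaries that determine the matrix geometry, which is what the [0,24166) bound in the trace above suggests (24166 unique itemIds):

// Plain delimited text, not full CSV: no quoting or escaping,
// just a split on the delimiter.
val line = "5138353282348067470,1891081885"
val Array(userId, itemId) = line.split(",")
// userId and itemId are then looked up in (or added to) string ->
// integer-index dictionaries; ncol comes from the item dictionary size.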
> >>> On Apr 2, 2015, at 10:39 AM, Michael Kelly <mich...@onespot.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I'm running the spark-itemsimilarity job from the CLI on an AWS EMR
> >>> cluster, and I'm running into an exception.
> >>>
> >>> The input file format is
> >>> UserId<tab>ItemId1<tab>ItemId2<tab>ItemId3...
> >>>
> >>> There is only one row per user, and a total of 97,000 rows.
> >>>
> >>> I also tried input with one row per UserId/ItemId pair, which had
> >>> about 250,000 rows, but I saw a similar exception; this time the
> >>> out-of-bounds index was around 110,000.
> >>>
> >>> The input is stored in HDFS, and this is the command I used to start the job:
> >>>
> >>> mahout spark-itemsimilarity --input userItems --output output --master yarn-client
> >>>
> >>> Any idea what the problem might be?
> >>>
> >>> Thanks,
> >>>
> >>> Michael
> >>>
> >>> 15/04/02 16:37:40 WARN TaskSetManager: Lost task 1.0 in stage 10.0
> >>> (TID 7631, ip-XX.XX.ec2.internal):
> >>> org.apache.mahout.math.IndexException: Index 22050 is outside
> >>> allowable range of [0,21997)
> >>> org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
> >>> org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
> >>> org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
> >>> org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
> >>> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >>> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >>> scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
> >>> scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
> >>> scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >>> scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >>> scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
> >>> scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
> >>> scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
> >>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >>> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
> >>> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> >>> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> >>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> >>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> >>> org.apache.spark.scheduler.Task.run(Task.scala:54)
> >>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>> java.lang.Thread.run(Thread.java:745)
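For completeness, the mechanics of the exception itself are easy to reproduce in-core; this is just the viewPart bounds check from the traces, using the numbers from the 100-part run:

import org.apache.mahout.math.RandomAccessSparseVector

// A vector whose cardinality matches the declared ncol in the trace.
val v = new RandomAccessSparseVector(24166)

// Requesting a view that ends at 24190 overruns the cardinality and throws
// org.apache.mahout.math.IndexException:
//   "Index 24190 is outside allowable range of [0,24166)"
v.viewPart(0, 24190)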