No, I don't think there's a workaround; it needs a fix. However, the public version needs many more fixes, so I think this part will be refactored completely in 0.10.1.
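For context, the failing frames in the traces below all point at the physical A'A operator behind the Samsara DSL (org.apache.mahout.sparkbindings.blas.AtA). A minimal sketch that exercises that same path, assuming a local Mahout Spark context; the matrix contents and partition count are purely illustrative:

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "ata-check")

// A small DRM deliberately spread over several partitions, so that the
// A'A computation has to combine partial products across tasks.
val drmA = drmParallelize(dense((1, 0, 1), (0, 1, 1), (1, 1, 0)), numPartitions = 3)

// drmA.t %*% drmA is rewritten to the physical A'A plan, i.e. the
// AtA.scala code that appears in the stack traces below.
val ata = (drmA.t %*% drmA).collect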
On Fri, Apr 3, 2015 at 12:38 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> OK, it was. Is there a workaround I can try?
>
> On Apr 3, 2015, at 12:22 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> Although... I am not aware of one in A'A.
>
> It could be a faulty vector length in a matrix, if the matrix was created
> by drmWrap with an explicit specification of ncol.
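A minimal sketch of that failure mode: wrapping a row RDD with drmWrap while declaring an ncol smaller than the column indices actually present. This assumes a SparkContext sc in scope (as in the Mahout spark-shell), and all names and dimensions here are illustrative:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.sparkbindings._

// One row whose vector carries a column index (150) beyond the
// geometry we are about to declare.
val v: Vector = new RandomAccessSparseVector(200)
v.setQuick(150, 1.0)
val rows = sc.parallelize(Seq(0 -> v))

// Declaring ncol = 100 even though index 150 exists does not fail here;
// downstream operators that trust the declared geometry (A'A takes a
// viewPart over [0, ncol)) throw IndexException much later instead.
val drmA = drmWrap(rdd = rows, ncol = 100)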
> On Fri, Apr 3, 2015 at 12:20 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > It's a bug. There are a number of similar ones in operator A'B.
> >
> > On Fri, Apr 3, 2015 at 6:23 AM, Michael Kelly <mich...@onespot.com> wrote:
> >
> >> Hi Pat,
> >>
> >> I've done some further digging, and it looks like the problem occurs
> >> when the input files are split up into parts. The input to the
> >> item-similarity job is the output of a Spark job, and it ends up in
> >> about 2000 parts (on the Hadoop file system). I have reproduced the
> >> error locally using a small subset of the rows.
> >>
> >> This is a snippet of the file I am using:
> >>
> >> ...
> >> 5138353282348067470,1891081885
> >> 4417954190713934181,1828065687
> >> 133682221673920382,1454844406
> >> 133682221673920382,1129053737
> >> 133682221673920382,548627241
> >> 133682221673920382,1048452021
> >> 8547417492653230933,1121310481
> >> 7693904559640861382,1333374361
> >> 7204049418352603234,606209305
> >> 139299176617553863,467181330
> >> ...
> >>
> >> When I run item-similarity against a single input file that contains
> >> all the rows, the job succeeds without error. When I break the input
> >> file up into 100 parts and use the directory containing them as the
> >> input, I get the 'Index outside allowable range' exception.
> >>
> >> Here are the input files I used, tarred and gzipped:
> >>
> >> https://s3.amazonaws.com/static.onespot.com/mahout/passing_single_file.tar.gz
> >> https://s3.amazonaws.com/static.onespot.com/mahout/failing_split_into_100_parts.tar.gz
> >>
> >> There are 44067 rows in total, 11858 unique userIds and 24166 unique itemIds.
> >>
> >> This is the exception that I see on the 100-part run:
> >>
> >> 15/04/03 12:07:09 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 707)
> >> org.apache.mahout.math.IndexException: Index 24190 is outside allowable range of [0,24166)
> >> at org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
> >> at org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
> >> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
> >> at org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
> >> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >> at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
> >> at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
> >> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >> at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >> at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
> >> at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
> >> at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
> >> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >> at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
> >> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
> >> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> >> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> >> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> >> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >> I tried splitting the file up into 10, 20 and 50 parts, and the job completed.
> >> Also, should the resulting similarity matrix be the same whether the
> >> input is split up or not? I passed in the same random seed for the
> >> Spark job, but the matrices were different.
> >>
> >> Thanks,
> >>
> >> Michael
> >>
> >> On Thu, Apr 2, 2015 at 6:56 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>> The input must be tuples (if not using a filter), so the CLI you have
> >>> expects user and item ids like:
> >>>
> >>> user-id1,item-id1
> >>> user-id500,item-id3000
> >>> …
> >>>
> >>> The ids must be tokenized because it doesn't use a full CSV parser,
> >>> only lines of delimited text.
> >>>
> >>> If this doesn't help, can you supply a snippet of the input?
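Roughly what that means per line, as an illustrative sketch (not the actual reader code): each line is split on the delimiter, and the resulting id strings are mapped through string-to-integer dictionaries that determine the matrix geometry, which is what the [0,24166) bound in the trace above suggests (24166 unique itemIds):

// Plain delimited text, not full CSV: no quoting or escaping,
// just a split on the delimiter.
val line = "5138353282348067470,1891081885"
val Array(userId, itemId) = line.split(",")
// userId and itemId are then looked up in (or added to) string ->
// integer-index dictionaries; ncol comes from the item dictionary size.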
> >>> On Apr 2, 2015, at 10:39 AM, Michael Kelly <mich...@onespot.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I'm running the spark-itemsimilarity job from the CLI on an AWS EMR
> >>> cluster, and I'm running into an exception.
> >>>
> >>> The input file format is
> >>> UserId<tab>ItemId1<tab>ItemId2<tab>ItemId3...
> >>>
> >>> There is only one row per user, and a total of 97,000 rows.
> >>>
> >>> I also tried input with one row per UserId/ItemId pair, which had
> >>> about 250,000 rows, but I saw a similar exception; this time the
> >>> out-of-bounds index was around 110,000.
> >>>
> >>> The input is stored in HDFS, and this is the command I used to start the job:
> >>>
> >>> mahout spark-itemsimilarity --input userItems --output output --master yarn-client
> >>>
> >>> Any idea what the problem might be?
> >>>
> >>> Thanks,
> >>>
> >>> Michael
> >>>
> >>> 15/04/02 16:37:40 WARN TaskSetManager: Lost task 1.0 in stage 10.0
> >>> (TID 7631, ip-XX.XX.ec2.internal):
> >>> org.apache.mahout.math.IndexException: Index 22050 is outside
> >>> allowable range of [0,21997)
> >>> org.apache.mahout.math.AbstractVector.viewPart(AbstractVector.java:147)
> >>> org.apache.mahout.math.scalabindings.VectorOps.apply(VectorOps.scala:37)
> >>> org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:152)
> >>> org.apache.mahout.sparkbindings.blas.AtA$$anonfun$5$$anonfun$apply$6.apply(AtA.scala:149)
> >>> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >>> scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376)
> >>> scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
> >>> scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
> >>> scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >>> scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980)
> >>> scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969)
> >>> scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969)
> >>> scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974)
> >>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> >>> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
> >>> org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> >>> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> >>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> >>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> >>> org.apache.spark.scheduler.Task.run(Task.scala:54)
> >>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>> java.lang.Thread.run(Thread.java:745)
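For completeness, the mechanics of the exception itself are easy to reproduce in-core; this is just the viewPart bounds check from the traces, using the numbers from the 100-part run:

import org.apache.mahout.math.RandomAccessSparseVector

// A vector whose cardinality matches the declared ncol in the trace.
val v = new RandomAccessSparseVector(24166)

// Requesting a view that ends at 24190 overruns the cardinality and throws
// org.apache.mahout.math.IndexException:
//   "Index 24190 is outside allowable range of [0,24166)"
v.viewPart(0, 24190)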