The input is the Wikipedia article data set, as recommended by the example.
It was downloaded unchanged from
http://users.on.net/~henry/pagerank/links-simple-sorted.zip. I just
unzipped it and put it into HDFS at the path input/input.txt before
running the command line I mentioned previously. The following are the
first few lines of the Wikipedia data set file.


1: 1664968
2: 3 747213 1664968 1691047 4095634 5535664
3: 9 77935 79583 84707 564578 594898 681805 681886 835470 880698 1109091
1125108 1279972 1463445 1497566 1783284 1997564 2006526 2070954 2250217
2268713 2276203 2374802 2571397 2640902 2647217 2732378 2821237 3088028
3092827 3211549 3283735 3491412 3492254 3498305 3505664 3547201 3603437
3617913 3793767 3907547 4021634 4025897 4086017 4183126 4184025 4189168
4192731 4395141 4899940 4987592 4999120 5017477 5149173 5149311 5158741
5223097 5302153 5474252 5535280
4: 145
5: 8 57544 58089 60048 65880 284186 313376 564578 717529 729993 1097284
1204280 1204407 1255317 1670218 1720928 1850305 2269887 2333350 2359764
2640693 2743982 3303009 3322952 3492254 3573013 3721693 3797343 3797349
3797359 3849461 4033556 4173124 4189215 4207986 4669945 4817900 4901416
5010479 5062062 5072938 5098953 5292042 5429924 5599862 5599863 5689049
6: 8
7: 8
8: 5 57544 58089 59375 64985 313376 704624 717529 729993 1204280 1204407
1254637 1255317 1497566 1720928 1850305 2269887 2333350 2359764 2496900
2640848 2743982 3303009 3322952 3492254 3573013 3797343 3797349 3797359
4033556 4173124 4189168 4206743 4207986 4393611 4813259 4901416 5010479
5062062 5072938 5098953 5292042 5429924 5599862 5599863
9: 3 74106 75221 275656 313376 1279972 1565872 1613838 1997564 2640650
3092827 3491412 3492254 3956845 3973207 4025897 4189168 4189215 4813259
10: 3
11: 60956 313376 322893 497519 499246 594399 801968 806840 1123171 1228259
1463265 1892998 2022036 2070954 2639079 3492254 3594794 3967074 4096317
4189168 4189215 4273212 4611415 4708418 4813259 5300058 5575496
12: 5
13: 5534647
14: 4116750
15: 4095634
16: 5534647
17: 5703728
18: 4207272
19: 2402613
20: 2402613
22: 4095634
23: 5688890
24: 205444 530901 1601519 2583882 3072654 3492254 3498305 4096317 4189168
4638601 4751151 5242252


As you can see, it is not in the user,item,rating format. I had assumed
that format would be produced along the way by one of the several stages
of MR jobs, but I guess I am wrong.
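
If the file does need to be converted first, here is a minimal sketch of how
that could look. It is only an illustration under my own assumptions: the
source page ID is treated as the "user", each linked page ID as the "item",
and every link gets a constant rating of 1.0. The function name
adjacency_to_triples is made up for this example and is not part of Mahout.
Lines without a "source:" prefix are treated as continuations of the previous
record, in case the records are wrapped the way they appear above.

```python
from typing import Iterable, Iterator, Optional


def adjacency_to_triples(lines: Iterable[str], rating: float = 1.0) -> Iterator[str]:
    """Turn 'source: target target ...' lines into 'source,target,rating' lines.

    Assumption: the source ID plays the role of the user and each target ID
    plays the role of an item; the constant rating is a placeholder value.
    """
    source: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        head, sep, rest = line.partition(":")
        if sep:
            # New record: remember its source ID for any wrapped lines that follow.
            source, targets = head.strip(), rest
        else:
            # No colon: treat the line as a continuation of the previous record.
            targets = line
        if source is None:
            continue  # continuation text before any record; nothing to attach it to
        for target in targets.split():
            yield f"{source},{target},{rating}"


if __name__ == "__main__":
    sample = ["1: 1664968", "2: 3 747213", "1664968", "4: 145"]
    for triple in adjacency_to_triples(sample):
        print(triple)
```

The output could then be written to a new file and put into HDFS in place of
input/input.txt. Whether a constant rating is meaningful for link data is a
separate question that this sketch does not answer.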

Just for the fun of it, I tried running with the 0.7 version and got the
same ArrayIndexOutOfBoundsException, so it does appear to be data related,
as Sean is suggesting.

Any ideas?

-Jonathan



On Tue, Jun 19, 2012 at 1:05 AM, Sean Owen <[email protected]> wrote:

> ... but this is a problem to do with bad input, it seems. And the book
> examples go with 0.5. The bug you are thinking of does not affect anything
> written about in the book. 0.5 is the right version to use as far as the
> book examples are concerned.
>
> What format is your input? should be "user,item,rating". This evidently
> isn't the format you are using.
>
> On Tue, Jun 19, 2012 at 7:33 AM, Sebastian Schelter <[email protected]>
> wrote:
>
> > Please use a later version of mahout! The 0.5 release has a major bug
> > in the recommendation code.
> >
> >
>
