OMG, sorry but I am a complete idiot. RowId creates a "docIndex" in the matrix dir. Once I specified the full path to the Distributed Row Matrix, everything was fine.
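For anyone else who hits this: rowid writes two sequence files into its output directory, and only one of them is the Distributed Row Matrix. A rough sketch of the layout and the fix, using the b2/matrix path from later in this thread (the key/value classes shown are what RowIdJob writes, as far as I know):

    # rowid output directory:
    #   b2/matrix/docIndex   <IntWritable, Text>             row id -> original document key
    #   b2/matrix/matrix     <IntWritable, VectorWritable>   the DRM itself
    #
    # Point ssvd at the matrix file, not the directory, so docIndex
    # (whose values are Text) is never picked up as SSVD input:
    mahout ssvd -i b2/matrix/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t 10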
On Aug 20, 2012, at 11:09 AM, Dmitriy Lyubimov <[email protected]> wrote:

On Mon, Aug 20, 2012 at 11:03 AM, Dmitriy Lyubimov <[email protected]> wrote:
> Ok, this just means that something in the A input is not really adhering to the
> <Writable,VectorWritable> specification. In particular, there seems to be a file
> in the input path that has a <?,VectorWritable> pair in its input.

sorry, this should read

> there seems to be a file in the input path that has a <?,Text> pair in its input.

Input seems to have Text values somewhere.

> Can you check your input files for key/value types? Note that this includes the
> entire subtree of sequence files, not just the files in the input directory.
>
> Usually it is visible in the header of the sequence file (usually even if it is
> using compression).
>
> I am not quite sure what you mean by "rowid" processing.
>
>
> On Sun, Aug 19, 2012 at 7:40 PM, Pat Ferrel <[email protected]> wrote:
>> Getting an odd error on SSVD.
>>
>> Starting with the QJob I get 9 map tasks for the data set; 8 are run on the mini
>> cluster in parallel. Most of them complete with no errors, but there are usually
>> two map task failures for each QJob. They die with this error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
>> org.apache.mahout.math.VectorWritable
>>   at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.map(QJob.java:74)
>>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>   at java.security.AccessController.doPrivileged(Native Method)
>>   at javax.security.auth.Subject.doAs(Subject.java:416)
>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>
>> The data was created using seq2sparse and then running rowid to create the input
>> matrix. The data was encoded as named vectors. These are the two differences I
>> could think of between how I ran it from the API and from the CLI.
>>
>>
>> On Aug 18, 2012, at 7:29 PM, Pat Ferrel <[email protected]> wrote:
>>
>> -t Param
>>
>> I'm no Hadoop expert, but there are a couple of parameters for each node in a
>> cluster that specify the default number of mappers and reducers for that node.
>> There is a rule of thumb about how many mappers and reducers to run per core.
>> You can tweak them either way depending on your typical jobs.
>>
>> No idea what you mean about the total reducers being 1 for most configs. My very
>> small cluster at home, with 10 cores in three machines, is configured to produce
>> a conservative 10 mappers and 10 reducers, which is about what happens with
>> balanced jobs. The "reducers = 1" is probably for a non-clustered, one-machine
>> setup.
>>
>> I'm suspicious that the -t parameter is not needed, but I would definitely defer
>> to a Hadoop master. In any case I set it to 10 for my mini cluster.
>>
>> Variance Retained
>>
>> If one batch of data yields a greatly different estimate of VR than another, it
>> would be worth noticing, even if we don't know the actual error in it. To say
>> that your estimate of VR is valueless would require that we have some experience
>> with it, no?
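Regarding Dmitriy's suggestion above to check the key/value types of every sequence file under the input path: the class names are stored as plain strings near the top of each file's header, so a quick check needs nothing more than standard tools. A sketch, using the paths from this thread (the printed class names are only illustrative):

    # list every file under the input path, then peek at a header;
    # the key and value class names are readable even with compression on:
    hadoop fs -lsr b2/matrix
    hadoop fs -cat b2/matrix/docIndex | head -c 300 | strings
    #   org.apache.hadoop.io.IntWritable
    #   org.apache.hadoop.io.Text

mahout seqdumper will also report the key and value classes of a sequence file before dumping its entries, if you prefer a Mahout-native check.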
>>
>> On Aug 18, 2012, at 10:39 AM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> On Aug 18, 2012 8:32 AM, "Pat Ferrel" <[email protected]> wrote:
>>>
>>> Switching from API to CLI
>>>
>>> The parameter -t is described in the PDF:
>>>
>>> --reduceTasks <int-value> optional. The number of reducers to use (where
>>> applicable); depends on the size of the Hadoop cluster. At this point it could
>>> also be overwritten by a standard Hadoop property using the -D option.
>>> 4. Probably always needs to be specified, as by default Hadoop would set it to
>>> 1, which is certainly far below the cluster capacity. Recommended value for
>>> this option: ~95% or ~190% of available reducer capacity to allow for
>>> opportunistic executions.
>>>
>>> The description above seems to say it will be taken from the Hadoop config if
>>> not specified, which is probably all most people would ever want. I am unclear
>>> why this is needed. I cannot run SSVD without specifying it; in other words, it
>>> does not seem to be optional?
>>
>> This parameter was made mandatory because people were repeatedly forgetting to
>> set the number of reducers and kept coming back with questions about why it was
>> running so slow. So there was an issue in 0.7 where I made it mandatory. I am
>> actually not sure how other Mahout methods ensure the number of reducers is ever
>> set to anything other than 1.
>>
>>>
>>> As a first try using the CLI I'm running with 295625 rows and 337258 columns
>>> using the following parameters, to get a sort of worst-case run time with
>>> best-case data output. The parameters will be tweaked later to get better
>>> dimensional reduction and runtime.
>>>
>>> mahout ssvd -i b2/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t (depends on cluster)
>>>
>>> Is there work being done to calculate the variance retained for the output, or
>>> should I calculate it myself?
>>
>> No, there's no work done on that, since it implies you are building your own
>> pipeline for a particular purpose. It also takes a lot of assumptions that may
>> or may not hold in a particular case, such as that you do something repeatedly
>> and the corpuses are of a similar nature. Also, I know of no paper that would do
>> it exactly the way I described, so there's no error estimate on either the
>> inequality approach or any sort of decay interpolation.
>>
>> It is not very difficult, though, to experiment a little with a subset of the
>> corpus and see what may work.
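To make the "~95% or ~190% of available reducer capacity" recommendation quoted above concrete, here is roughly how -t would be picked for the small cluster described earlier in the thread (10 reduce slots in total, i.e. whatever mapred.tasktracker.reduce.tasks.maximum sums to across the nodes). The numbers are just a worked example, not a benchmark:

    # reducer capacity = total reduce slots across the cluster, e.g. 10 here
    #   ~95%  of capacity -> -t 9    (one full wave of reducers)
    #   ~190% of capacity -> -t 19   (two waves, leaving room for opportunistic execution)
    mahout ssvd -i b2/matrix/matrix -k 500 -q 1 -pca -o b2/ssvd-out -t 19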
