All dirs. Output, input, temp.
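
(To illustrate what "absolute paths" means here: a minimal sketch, assuming the Hadoop 1.x-era FileSystem API. The class name, directory layout, and variable names below are illustrative only and are not taken from the thread.)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: qualify the SSVD input, output, and temp dirs so they are
// absolute and carry an explicit file system scheme (file:/// locally,
// hdfs://... on a cluster) before handing them to the solver.
public class QualifySsvdPaths {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Relative paths like "b/ssvd" resolve against the working directory,
    // which differs between the IDE, MAHOUT_LOCAL, and a real cluster.
    Path input = fs.makeQualified(new Path("b/matrix"));   // hypothetical layout
    Path output = fs.makeQualified(new Path("b/ssvd"));
    Path temp = fs.makeQualified(new Path("b/ssvd-tmp"));

    System.out.println(input + " " + output + " " + temp);
  }
}
```

On Sep 1, 2012 11:02 AM, "Pat Ferrel" <[email protected]> wrote: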
> I got a bigger data set, which made no difference in error #1. I tried
> changing the temp dir to an absolute address. Here I assume you are talking
> about the temp dir supplied to SSVD? That had no effect.
>
> Since it was the SSVD base dir that was not correctly located, I tried
> changing that to an absolute path, but this had no effect either.
>
> Oh, well. I guess I'll split up the code to run SSVD in hdfs and other
> bits locally for debugging. Also there may be a good way to use the debugger
> with multi-threaded + hdfs, which will take a bit of research.
>
> Thanks
>
> On Sep 1, 2012, at 10:11 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Another guess I have is that perhaps you used relative paths when
>> specifying the temp dir? Try to use absolute paths.
>>
>> On Sep 1, 2012 10:09 AM, "Pat Ferrel" <[email protected]> wrote:
>>
>>> Yes, I understand why #2 failed. I guess I'm asking how to get this to
>>> succeed. Without a way to run SSVD single threaded it's hard to debug my
>>> surrounding code.
>>>
>>> I am gathering a larger crawl, maybe that will work.
>>>
>>> On Sep 1, 2012, at 8:39 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> Regardless of confusion between k and p (I was confused as well), you
>>>> still can't set the sum to more than the minimum size of your data. Here
>>>> you have set it larger. And it breaks.
>>>>
>>>> On Sat, Sep 1, 2012 at 11:09 AM, Pat Ferrel <[email protected]> wrote:
>>>>
>>>>> Oh, sorry, below I meant to say k (the number to reduce to), not p.
>>>>>
>>>>> In both cases p = 1; in the first case k = 20, in the second case k = 100.
>>>>>
>>>>> Also, the first error does seem to be running with local hadoop. The error
>>>>> is from looking for a temp file that does exist in the file system, but
>>>>> not in the hadoop tmp based files.
>>>>>
>>>>> On Sep 1, 2012, at 7:53 AM, Ted Dunning <[email protected]> wrote:
>>>>>
>>>>>> With 57 crawled docs, you can't reasonably set p > 57. That is your
>>>>>> second error.
>>>>>>
>>>>>> On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel <[email protected]> wrote:
>>>>>>
>>>>>>> I have a small data set that I am using in local mode for debugging
>>>>>>> purposes. The data is 57 crawled docs with something like 2200 terms. I
>>>>>>> run this through seq2sparse, then my own cloned version of rowid to get a
>>>>>>> distributed row matrix, then into SSVD. I realize this is not a production
>>>>>>> environment, but you need to debug somewhere, and single threaded
>>>>>>> execution is ideal for debugging. As I said, this works in hadoop
>>>>>>> clustered mode.
>>>>>>>
>>>>>>> The error looks like some code is expecting hdfs to be running, no? Here
>>>>>>> is the exception stack from the IDE with p = 20:
>>>>>>>
>>>>>>> 12/09/01 07:22:55 WARN mapred.LocalJobRunner: job_local_0002
>>>>>>> java.io.FileNotFoundException: File
>>>>>>> /tmp/hadoop-pat/mapred/local/archive/6590995089539988730_1587570556_37122331/file/Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-00000
>>>>>>> does not exist.
>>>>>>>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
>>>>>>>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>>>>>>>     at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.<init>(SequenceFileDirValueIterator.java:92)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:219)
>>>>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>> Exception in thread "main" java.io.IOException: Bt job unsuccessful.
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:609)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:397)
>>>>>>>     at com.finderbots.analysis.AnalysisPipeline.SSVDTransformAndBack(AnalysisPipeline.java:257)
>>>>>>>     at com.finderbots.analysis.AnalysisJob.run(AnalysisJob.java:20)
>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>>> Disconnected from the target VM, address: '127.0.0.1:54588', transport: 'socket'
>>>>>>>     at com.finderbots.analysis.AnalysisJob.main(AnalysisJob.java:34)
>>>>>>>
>>>>>>> Process finished with exit code 1
>>>>>>>
>>>>>>> With p = 100-200 I get the following:
>>>>>>>
>>>>>>> 12/09/01 07:30:33 ERROR common.IOUtils: new m can't be less than n
>>>>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.cleanup(QRFirstStep.java:233)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.close(QRFirstStep.java:89)
>>>>>>>     at org.apache.mahout.common.IOUtils.close(IOUtils.java:128)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.cleanup(QJob.java:158)
>>>>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>> 12/09/01 07:30:33 WARN mapred.LocalJobRunner: job_local_0001
>>>>>>> java.lang.IllegalArgumentException: new m can't be less than n
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.GivensThinSolver.adjust(GivensThinSolver.java:109)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.cleanup(QRFirstStep.java:233)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.qr.QRFirstStep.close(QRFirstStep.java:89)
>>>>>>>     at org.apache.mahout.common.IOUtils.close(IOUtils.java:128)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.QJob$QMapper.cleanup(QJob.java:158)
>>>>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>>>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>> Exception in thread "main" java.io.IOException: Q job unsuccessful.
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:230)
>>>>>>>     at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:376)
>>>>>>>     at com.finderbots.analysis.AnalysisPipeline.SSVDTransformAndBack(AnalysisPipeline.java:257)
>>>>>>>     at com.finderbots.analysis.AnalysisJob.run(AnalysisJob.java:20)
>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>>>     at com.finderbots.analysis.AnalysisJob.main(AnalysisJob.java:34)
>>>>>>> Disconnected from the target VM, address: '127.0.0.1:54614', transport: 'socket'
>>>>>>>
>>>>>>> Process finished with exit code 1
>>>>>>>
>>>>>>> On Aug 31, 2012, at 4:21 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>
>>>>>>>> Perhaps if you give more info about the stack etc. I might get a
>>>>>>>> better idea though.
>>>>>>>>
>>>>>>>> On Fri, Aug 31, 2012 at 4:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I am not sure, I haven't used it that way.
>>>>>>>>>
>>>>>>>>> I know it works fully distributed AND when embedded with a local job
>>>>>>>>> tracker (e.g. its tests are basically MR jobs with a "local" job
>>>>>>>>> tracker), which probably is not the same as Mahout local mode. The
>>>>>>>>> "local" job tracker is not good for much though: it doesn't even use
>>>>>>>>> multicore parallelism, since it doesn't support multiple reducers, so
>>>>>>>>> pragmatically this code is really meant for a real cluster. There's also
>>>>>>>>> Ted's implementation of non-distributed SSVD in Mahout, which does not
>>>>>>>>> require Hadoop dependencies, but it is a different API with no PCA
>>>>>>>>> option (not sure about power iterations).
>>>>>>>>>
>>>>>>>>> I am not sure why this very particular error appears in your setup.
>>>>>>>>>
>>>>>>>>> On Fri, Aug 31, 2012 at 3:02 PM, Pat Ferrel <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Running on the local file system inside IDEA with MAHOUT_LOCAL set and
>>>>>>>>>> performing an SSVD, I get the error below. Notice that R-m-00000 exists
>>>>>>>>>> in the local file system, and running it outside the debugger in
>>>>>>>>>> pseudo-cluster mode with HDFS works. Does SSVD work in local mode?
>>>>>>>>>>
>>>>>>>>>> java.io.FileNotFoundException: File
>>>>>>>>>> /tmp/hadoop-pat/mapred/local/archive/5543644668644532045_1587570556_2120541978/file/Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-00000
>>>>>>>>>> does not exist.
>>>>>>>>>>
>>>>>>>>>> Maclaurin:big-data pat$ ls -al b/ssvd/Q-job/
>>>>>>>>>> total 72
>>>>>>>>>> drwxr-xr-x  10 pat  staff   340 Aug 31 13:35 .
>>>>>>>>>> drwxr-xr-x   4 pat  staff   136 Aug 31 13:35 ..
>>>>>>>>>> -rw-r--r--   1 pat  staff    80 Aug 31 13:35 .QHat-m-00000.crc
>>>>>>>>>> -rw-r--r--   1 pat  staff    28 Aug 31 13:35 .R-m-00000.crc
>>>>>>>>>> -rw-r--r--   1 pat  staff     8 Aug 31 13:35 ._SUCCESS.crc
>>>>>>>>>> -rw-r--r--   1 pat  staff    12 Aug 31 13:35 .part-m-00000.deflate.crc
>>>>>>>>>> -rwxrwxrwx   1 pat  staff  9154 Aug 31 13:35 QHat-m-00000
>>>>>>>>>> -rwxrwxrwx   1 pat  staff  2061 Aug 31 13:35 R-m-00000
>>>>>>>>>> -rwxrwxrwx   1 pat  staff     0 Aug 31 13:35 _SUCCESS
>>>>>>>>>> -rwxrwxrwx   1 pat  staff     8 Aug 31 13:35 part-m-00000.deflate
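
(To make the sizing point from the thread concrete: SSVD projects onto k + p random directions, so the decomposition cannot use more than min(rows, cols) of them; with 57 documents, k = 100 and p = 1 asks for 101, which appears to be what "new m can't be less than n" is complaining about. Below is a minimal, illustrative sanity check; the numbers come from the thread, but the check itself is not part of Mahout's API.)

```java
// Illustrative sanity check, not Mahout API: the sketch width k + p must not
// exceed the smaller dimension of the input matrix.
public class CheckSsvdRank {
  public static void main(String[] args) {
    int rows = 57;    // crawled docs (from the thread)
    int cols = 2200;  // terms (from the thread)
    int k = 100;      // requested decomposition rank
    int p = 1;        // oversampling

    int minDim = Math.min(rows, cols);
    if (k + p > minDim) {
      throw new IllegalArgumentException(
          "k + p = " + (k + p) + " exceeds min(rows, cols) = " + minDim
          + "; lower k (and/or p) to at most " + minDim);
    }
    System.out.println("k + p = " + (k + p) + " fits within " + minDim);
  }
}
```

With k = 20 and p = 1 the check passes, which matches the thread: that configuration fails only with the earlier FileNotFoundException from the local job runner, not with the QR error.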
