Hi Kris,

I think the best way would be to manually join the names to the result after executing the job.
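For example, something along these lines (a rough sketch in plain Java, outside of Hadoop; the index-to-name dictionary and the parsed job output are assumed inputs here, however you happen to load them — `NameJoiner` is just an illustrative name, not Mahout code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: join NamedVector names back onto RowSimilarityJob output.
// The job keys rows by int index, so keep an index -> name dictionary
// from when the input vectors were written, then translate the results.
public class NameJoiner {

  // rowIndex -> document name (built while creating the job's input)
  private final Map<Integer, String> names;

  public NameJoiner(Map<Integer, String> names) {
    this.names = names;
  }

  // similarities: rowIndex -> (otherRowIndex -> similarity score),
  // i.e. the job's output after parsing, modelled here as a map of maps
  public Map<String, Map<String, Double>> join(
      Map<Integer, Map<Integer, Double>> similarities) {
    Map<String, Map<String, Double>> named = new HashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : similarities.entrySet()) {
      Map<String, Double> namedRow = new HashMap<>();
      for (Map.Entry<Integer, Double> entry : row.getValue().entrySet()) {
        namedRow.put(names.get(entry.getKey()), entry.getValue());
      }
      named.put(names.get(row.getKey()), namedRow);
    }
    return named;
  }
}
```

The only real work is persisting the index-to-name mapping when the input vectors are created, so it can be looked up afterwards.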
--sebastian

Am 02.07.2010 18:22, schrieb Kris Jack:
> Hi Sebastian,
>
> I am currently using your code with NamedVectors in my input. In the
> output, however, the names seem to be missing. Would there be a way to
> include them?
>
> Thanks,
> Kris
>
> 2010/6/29 Sebastian Schelter <[email protected]>
>
>> Hi Kris,
>>
>> I'm glad I could help you and it's really cool that you are testing my
>> patches on real data. I'm looking forward to hearing more!
>>
>> -sebastian
>>
>> Am 29.06.2010 11:25, schrieb Kris Jack:
>>
>>> Hi Sebastian,
>>>
>>> You really are very kind! I have taken your code and run it to print
>>> out the contents of the output file. There are indeed only 37,952
>>> results, so that gives me more confidence in the vector dumper. I'm
>>> not sure why there was a memory problem, though, seeing as it seems
>>> to have output the results correctly. Now I just have to match them
>>> up with my original Lucene ids and see how it is performing. I'll
>>> keep you posted with the results.
>>>
>>> Thanks,
>>> Kris
>>>
>>> 2010/6/28 Sebastian Schelter <[email protected]>
>>>
>>>> Hi Kris,
>>>>
>>>> Unfortunately I'm not familiar with the VectorDumper code (and a
>>>> quick look didn't help either), so I can't help you with the
>>>> OutOfMemoryError.
>>>>
>>>> It is possible that only 37,952 results are found for an input of
>>>> 500,000 vectors; it really depends on the actual data. If you're
>>>> sure that there should be more results, you could provide me with a
>>>> sample input file and I'll try to find out why there aren't more.
>>>>
>>>> I wrote a small class for you that dumps the output file of the job
>>>> to the console (I tested it with the output of my unit tests); maybe
>>>> that can help us find the source of the problem.
>>>>
>>>> -sebastian
>>>>
>>>> public class MatrixReader extends AbstractJob {
>>>>
>>>>   public static void main(String[] args) throws Exception {
>>>>     ToolRunner.run(new MatrixReader(), args);
>>>>   }
>>>>
>>>>   @Override
>>>>   public int run(String[] args) throws Exception {
>>>>
>>>>     addInputOption();
>>>>
>>>>     Map<String,String> parsedArgs = parseArguments(args);
>>>>     if (parsedArgs == null) {
>>>>       return -1;
>>>>     }
>>>>
>>>>     Configuration conf = getConf();
>>>>     FileSystem fs = FileSystem.get(conf);
>>>>
>>>>     Path vectorFile = fs.listStatus(getInputPath(),
>>>>         TasteHadoopUtils.PARTS_FILTER)[0].getPath();
>>>>
>>>>     SequenceFile.Reader reader = null;
>>>>     try {
>>>>       reader = new SequenceFile.Reader(fs, vectorFile, conf);
>>>>       IntWritable key = new IntWritable();
>>>>       VectorWritable value = new VectorWritable();
>>>>
>>>>       while (reader.next(key, value)) {
>>>>         System.out.print(String.valueOf(key.get()) + ": ");
>>>>         Iterator<Element> elementsIterator = value.get().iterateNonZero();
>>>>         String separator = "";
>>>>         while (elementsIterator.hasNext()) {
>>>>           Element element = elementsIterator.next();
>>>>           System.out.print(separator + String.valueOf(element.index())
>>>>               + "," + String.valueOf(element.get()));
>>>>           separator = ";";
>>>>         }
>>>>         System.out.print("\n");
>>>>       }
>>>>     } finally {
>>>>       if (reader != null) {
>>>>         reader.close();
>>>>       }
>>>>     }
>>>>     return 0;
>>>>   }
>>>> }
>>>>
>>>> Am 28.06.2010 17:18, schrieb Kris Jack:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am now using the version of
>>>>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that
>>>>> Sebastian has written and has been added to the trunk. Thanks again
>>>>> for that! I can generate an output file that should contain a list
>>>>> of documents with their top 100 most similar documents. I am having
>>>>> problems, however, in converting the output file into a readable
>>>>> format using mahout's vectordump:
>>>>>
>>>>> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
>>>>>
>>>>> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
>>>>> Input Path: /home/kris/similarRows
>>>>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>>>>   at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
>>>>>   at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
>>>>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>>>>>   at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
>>>>>   at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
>>>>>
>>>>> What is this doing that takes up so much memory? A file is produced
>>>>> with 37,952 readable rows, but I'm expecting more like 500,000
>>>>> results, since I have that number of documents. Should I be using
>>>>> something else to read the output file of the RowSimilarityJob?
>>>>>
>>>>> Thanks,
>>>>> Kris
>>>>>
>>>>> 2010/6/18 Sebastian Schelter <[email protected]>
>>>>>
>>>>>> Hi Kris,
>>>>>>
>>>>>> maybe you want to give the patch from
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not
>>>>>> yet tested it with larger data, but I would be happy to get some
>>>>>> feedback on it and maybe it helps you with your use case.
>>>>>>
>>>>>> -sebastian
>>>>>>
>>>>>> Am 18.06.2010 18:46, schrieb Kris Jack:
>>>>>>
>>>>>>> Thanks Ted,
>>>>>>>
>>>>>>> I got that working. Unfortunately, the matrix multiplication job
>>>>>>> is taking far longer than I hoped. With just over 10 million
>>>>>>> documents, 10 mappers and 10 reducers, I can't get it to complete
>>>>>>> the job in under 48 hours. Perhaps you have an idea for speeding
>>>>>>> it up? I have already been quite ruthless with making the vectors
>>>>>>> sparse. I did not include terms that appeared in over 1% of the
>>>>>>> corpus and only kept terms that appeared at least 50 times. Is it
>>>>>>> normal that the matrix multiplication map-reduce task should take
>>>>>>> so long to process with this quantity of data and resources
>>>>>>> available, or do you think that my system is not configured
>>>>>>> properly?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kris
>>>>>>>
>>>>>>> 2010/6/15 Ted Dunning <[email protected]>
>>>>>>>
>>>>>>>> Thresholds are generally dangerous. It is usually preferable to
>>>>>>>> specify the sparseness you want (1%, 0.2%, whatever), sort the
>>>>>>>> results in descending score order using Hadoop's built-in
>>>>>>>> capabilities and just drop the rest.
>>>>>>>>
>>>>>>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I was wondering if there was an interesting way to do this with
>>>>>>>>> the current mahout code, such as requesting that the Vector
>>>>>>>>> accumulator returns only elements that have values greater than
>>>>>>>>> a given threshold, sorting the vector by value rather than key,
>>>>>>>>> or something else?
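The top-N approach Ted describes in the quoted reply above can be sketched in plain Java (a minimal illustration only, not Mahout or Hadoop code; `TopNPerRow` and the `{index, score}` pair encoding are made up for the example — the idea is a bounded min-heap that keeps the N highest-scoring entries per row instead of applying a score threshold):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: keep only the top-N entries per row by score.
// A min-heap of size N makes each candidate cost O(log N),
// independent of the total row length.
public class TopNPerRow {

  // entries are {index, score} pairs; returns the n largest by score,
  // highest score first
  public static List<double[]> topN(List<double[]> entries, int n) {
    PriorityQueue<double[]> heap =
        new PriorityQueue<>(n, (a, b) -> Double.compare(a[1], b[1]));
    for (double[] entry : entries) {
      if (heap.size() < n) {
        heap.offer(entry);
      } else if (entry[1] > heap.peek()[1]) {
        heap.poll();      // evict the current smallest score
        heap.offer(entry);
      }
    }
    List<double[]> result = new ArrayList<>(heap);
    result.sort((a, b) -> Double.compare(b[1], a[1]));
    return result;
  }
}
```

This keeps the result size bounded per row regardless of the score distribution, which is why it is safer than a fixed threshold.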
