Look at examples/bin/classify-reuters.sh. It runs seq2sparse on the reuters corpus.
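One way to work through a script like that one step at a time is to check that each stage actually wrote something before moving on. The sketch below is only an illustration: `count_files` and `run_step` are hypothetical helper names, not part of Mahout, and they assume the output directory is on the local filesystem.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): count regular files under a
# local directory. Prints 0 if the directory does not exist.
count_files() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo 0
        return 0
    fi
    # find prints one line per regular file; wc -l counts the lines.
    # tr strips the padding some wc implementations add.
    find "$dir" -type f | wc -l | tr -d ' '
}

# Hypothetical wrapper: run one command, then fail loudly if the named
# output directory is still empty afterwards.
# Usage: run_step OUTPUT_DIR COMMAND [ARGS...]
run_step() {
    out_dir="$1"
    shift
    "$@" || return 1
    n=$(count_files "$out_dir")
    if [ "$n" -eq 0 ]; then
        echo "step produced no files in $out_dir: $*" >&2
        return 1
    fi
    echo "step wrote $n file(s) to $out_dir"
}
```

With the command from this thread, that would be `run_step reuters-seqfiles bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles`. This only looks at the local filesystem. The log quoted in this thread says "Running on hadoop", so the output may have gone to HDFS instead, in which case `hadoop fs -ls reuters-seqfiles` would be the place to look.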
I'm an incrementalist - I find an example that works and then change things one at a time. There is so much behind the scenes in the text jobs.

On Fri, Feb 24, 2012 at 4:01 PM, Temese Szalai <[email protected]> wrote:
> Hi Lance -
>
> Thanks for the follow up. I appreciate it. No, unfortunately, I don't think
> that is what I want.
>
> I am really just trying to get the basics down and one of those basics
> includes generating vectors from text files, which starts with generating
> sequence files from a directory of text documents.
> I believe mahout is installed properly as I successfully did the synthetic
> control data example and it seemed to work without any glitches. "Creating
> vectors from text" is linked to from the Quickstart page on the wiki.
>
> what i've tried is really a classic text book example that i have seen
> referenced a bunch of places. i've downloaded and extracted the corpus
> (reuters). i'm doing this so that i can figure out the plumbing so that i
> can build vectors from my own text documents.
> the command i have used and has failed is:
>
> $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/
> -o reuters-seqfiles
>
> but, for me, it doesn't generate sequence files. (terminal output of that
> command is earlier in the thread.)
>
> i'm just stuck. frankly, i am starting to consider other non-mahout
> approaches to clustering ... or at least generating vectors (or possibly
> just sequence files) outside of mahout. i've also downloaded and installed
> mahout-0.6. i guess i'll try again with that too.
>
> I build taxonomic text classification solutions. People are always asking
> me if there is a way to use automated tools to help "detect" possible
> categories in a large text data set. My previous experiences with text
> clustering have yielded results that weren't satisfactory.
> With a tool set like Mahout now available, I am hoping that has changed to
> some extent. I'd love for that answer to be "yes - it's mahout clustering.
> here's how."
> but, i have to see that it works and i can't even get the text book demo
> example to do something useful yet. It's probably because of something
> silly that I overlooked somewhere
> or something that should be obvious and I know i will kick myself when i
> find out what it is. But, just the same, I'm still stuck.
>
> Thoughts?
>
> Thanks again.
> Temese
>
> On Thu, Feb 23, 2012 at 7:47 PM, Lance Norskog <[email protected]> wrote:
>
>> What does this do? And is it what you want?
>>
>> org.apache.mahout.text.PrefixAdditionFilter
>>
>> You can run these apps from inside Eclipse/IntelliJ, and single step
>> where it walks files.
>>
>> On Wed, Feb 22, 2012 at 7:01 PM, Temese Szalai <[email protected]> wrote:
>> > Hello -
>> >
>> > I'm new to Mahout and I'm not having any luck trying to use seqdirectory to
>> > create seqfiles so that i can then generate vectors from text files.
>> > Seems like this operation should work like a charm.
>> >
>> > Here is the command that I used to attempt to process the Reuters corpus
>> > into seqfiles and the output that I got in the terminal.
>> >
>> > $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o
>> > reuters-seqfiles
>> > Running on hadoop, using
>> > HADOOP_HOME=/Users/temeseszalai/Desktop/hadoop-0.20.203.0
>> > No HADOOP_CONF_DIR set, using
>> > /Users/temeseszalai/Desktop/hadoop-0.20.203.0/src/conf
>> > 12/02/22 16:29:01 INFO common.AbstractJob: Command line arguments:
>> > {--charset=UTF-8, --chunkSize=64, --endPhase=2147483647,
>> > --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter,
>> > --input=examples/reuters-extracted/, --keyPrefix=,
>> > --output=reuters-seqfiles, --startPhase=0, --tempDir=temp}
>> > 12/02/22 16:29:02 INFO driver.MahoutDriver: Program took 418 ms
>> >
>> > I am using mahout-distribution-0.5 on Mac OSX (10.7.3).
>> > I don't get any error messages from seqdirectory. I just don't get any
>> > seqfiles.
>> >
>> > the output directory is always empty and the time it takes to run is always
>> > minimal.. have tried with different data, different paths, have had someone
>> > else with
>> > considerably more java experience sanity check and still no luck.
>> >
>> > I'm clearly doing something wrong ... No idea what ... I've tried poking
>> > around to see if anyone else has had the same issue and haven't turned up
>> > much that is useful.
>> >
>> > Any thoughts? Guidance would definitely be appreciated.
>> >
>> > Thanks in advance.
>> > Temese
>>
>> --
>> Lance Norskog
>> [email protected]

--
Lance Norskog
[email protected]
