Hi Lance -

Thanks for the follow-up. I appreciate it. No, unfortunately, I don't think
that is what I want.

I am really just trying to get the basics down, and one of those basics is
generating vectors from text files, which starts with generating sequence
files from a directory of text documents.
I believe Mahout is installed properly, since I ran the synthetic control
data example successfully and it seemed to work without any glitches.
"Creating vectors from text" is linked from the Quickstart page on the wiki.

What I've tried is really a classic textbook example that I have seen
referenced in a bunch of places. I've downloaded and extracted the corpus
(Reuters). I'm doing this to figure out the plumbing so that I can build
vectors from my own text documents.
The command I used, which failed, is:

$ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles

But, for me, it doesn't generate sequence files. (The terminal output of that
command is earlier in the thread.)
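
To rule out the silly stuff, here is a minimal sketch of a sanity check on
the two directories named in the command. It is plain Python, not part of
Mahout; the helper name count_files is made up for illustration. It only
looks at the local filesystem, relative to the directory the command was run
from. Since the terminal output says "Running on hadoop", it is possible
(an assumption on my part, not something I have verified) that the paths are
being resolved against Hadoop's configured filesystem instead, in which case
this check would not see what Mahout sees.

```python
import os


def count_files(root):
    """Count regular files under root, recursively.

    Returns None if root does not exist or is not a directory.
    (Hypothetical helper for illustration; not a Mahout API.)
    """
    if not os.path.isdir(root):
        return None
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


if __name__ == "__main__":
    # Paths copied from the seqdirectory command, relative to the
    # directory where bin/mahout was run (local filesystem only).
    for label, path in [("input", "examples/reuters-extracted/"),
                        ("output", "reuters-seqfiles")]:
        n = count_files(path)
        if n is None:
            print(f"{label}: {path} does not exist locally")
        else:
            print(f"{label}: {path} contains {n} file(s)")
```

If the input count is zero or the directory is reported missing, the problem
is with the path I am passing in rather than with seqdirectory itself.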

I'm just stuck. Frankly, I am starting to consider other non-Mahout
approaches to clustering ... or at least generating vectors (or possibly
just sequence files) outside of Mahout. I've also downloaded and installed
mahout-0.6. I guess I'll try again with that too.

I build taxonomic text classification solutions. People are always asking
me if there is a way to use automated tools to help "detect" possible
categories in a large text data set. My previous experiences with text
clustering have yielded results that weren't satisfactory.
With a tool set like Mahout now available, I am hoping that has changed to
some extent. I'd love for the answer to be "yes - it's Mahout clustering.
Here's how."
But I have to see that it works, and I can't even get the textbook demo
example to do something useful yet. It's probably because of something
silly that I overlooked somewhere, or something that should be obvious, and
I know I will kick myself when I find out what it is. But, just the same,
I'm still stuck.

Thoughts?

Thanks again.
Temese

On Thu, Feb 23, 2012 at 7:47 PM, Lance Norskog <[email protected]> wrote:

> What does this do? And is it what you want?
>
> org.apache.mahout.text.PrefixAdditionFilter
>
> You can run these apps from inside Eclipse/IntelliJ, and single step
> where it walks files.
>
> On Wed, Feb 22, 2012 at 7:01 PM, Temese Szalai <[email protected]> wrote:
> > Hello -
> >
> > I'm new to Mahout and I'm not having any luck trying to use seqdirectory
> > to create seqfiles so that i can then generate vectors from text files.
> > Seems like this operation should work like a charm.
> >
> > Here is the command that I used to attempt to process the Reuters corpus
> > into seqfiles and the output that I got in the terminal.
> >
> > *$ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o
> > reuters-seqfiles*
> > *Running on hadoop, using
> > HADOOP_HOME=/Users/temeseszalai/Desktop/hadoop-0.20.203.0*
> > *No HADOOP_CONF_DIR set, using
> > /Users/temeseszalai/Desktop/hadoop-0.20.203.0/src/conf *
> > *12/02/22 16:29:01 INFO common.AbstractJob: Command line arguments:
> > {--charset=UTF-8, --chunkSize=64, --endPhase=2147483647,
> > --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter,
> > --input=examples/reuters-extracted/, --keyPrefix=,
> > --output=reuters-seqfiles, --startPhase=0, --tempDir=temp}*
> > *12/02/22 16:29:02 INFO driver.MahoutDriver: Program took 418 ms*
> >
> > I am using mahout-distribution-0.5 on Mac OSX (10.7.3).
> > I don't get any error messages from seqdirectory. I just don't get any
> > seqfiles.
> >
> > the output directory is always empty and the time it takes to run is
> > always minimal.. have tried with different data, different paths, have
> > had someone else with considerably more java experience sanity check and
> > still no luck.
> >
> > I'm clearly doing something wrong ... No idea what ... I've tried poking
> > around to see if anyone else has had the same issue and haven't turned up
> > much that is useful.
> >
> > Any thoughts? Guidance would definitely be appreciated.
> >
> > Thanks in advance.
> > Temese
>
>
>
> --
> Lance Norskog
> [email protected]
>
