Look at examples/bin/classify-reuters.sh. It runs seq2sparse on the
reuters corpus.

I'm an incrementalist: I find an example that works and then change
things one at a time. There is so much going on behind the scenes in
the text jobs.
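
In that incremental spirit, here is a minimal sketch of a sanity check
to wrap around the seqdirectory step. It reuses the command and paths
from the message quoted below; the count_files helper is a hypothetical
name introduced here and is not part of Mahout. It only looks at the
local filesystem, so if the job runs against a Hadoop filesystem the
output may land somewhere else and the local count can still be 0.

```shell
#!/bin/sh
# count_files is a hypothetical helper (not part of Mahout): it prints the
# number of regular files under a directory, or 0 if the directory is missing.
count_files() {
  if [ -d "$1" ]; then
    find "$1" -type f | wc -l | tr -d ' '
  else
    echo 0
  fi
}

IN=examples/reuters-extracted   # input path taken from the thread
OUT=reuters-seqfiles            # output path taken from the thread

# Step 1: confirm the input directory really contains documents.
echo "input files:  $(count_files "$IN")"

# Step 2: run the same command as in the thread, but only if the Mahout
# launcher script exists in the current directory.
if [ -x bin/mahout ]; then
  bin/mahout seqdirectory -c UTF-8 -i "$IN" -o "$OUT"
fi

# Step 3: check whether anything was written locally.
echo "output files: $(count_files "$OUT")"
```

If the input count is 0, the problem is the path rather than Mahout. If
the input count is fine and the output count stays at 0, the next thing
to change is where the job is reading and writing.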

On Fri, Feb 24, 2012 at 4:01 PM, Temese Szalai <[email protected]> wrote:
> Hi Lance -
>
> Thanks for the follow up. I appreciate it. No, unfortunately, I don't think
> that is what I want.
>
> I am really just trying to get the basics down and one of those basics
> includes generating vectors from text files, which starts with generating
> sequence files from a directory of text documents.
> I believe mahout is installed properly as I successfully did the synthetic
> control data example and it seemed to work without any glitches. "Creating
> vectors from text" is linked to from the Quickstart page on the wiki.
>
> what i've tried is really a classic text book example that i have seen
> referenced a bunch of places. i've downloaded and extracted the corpus
> (reuters).  i'm doing this so that i can figure out the plumbing so that i
> can build vectors from my own text documents.
> The command I used, which failed, is:
>
> $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
>
> but, for me, it doesn't generate sequence files. (terminal output of that
> command is earlier in the thread.)
>
> i'm just stuck. frankly, i am starting to consider other non-mahout
> approaches to clustering ... or at least generating vectors (or possibly
> just sequence files) outside of mahout. i've also downloaded and installed
> mahout-0.6. i guess i'll try again with that too.
>
> I build taxonomic text classification solutions. People are always asking
> me if there is a way to use automated tools to help "detect" possible
> categories in a large text data set. My previous experiences with text
> clustering have yielded results that weren't satisfactory.
> With a tool set like Mahout now available, I am hoping that has changed to
> some extent. I'd love for that answer to be "yes - it's mahout clustering.
> here's how."
> but, i have to see that it works and i can't even get the text book demo
> example to do something useful yet. It's probably because of something
> silly that I overlooked somewhere
> or something that should be obvious and I know i will kick myself when i
> find out what it is. But, just the same, I'm still stuck.
>
> Thoughts?
>
> Thanks again.
> Temese
> On Thu, Feb 23, 2012 at 7:47 PM, Lance Norskog <[email protected]> wrote:
>
>> What does this do? And is it what you want?
>>
>> org.apache.mahout.text.PrefixAdditionFilter
>>
>> You can run these apps from inside Eclipse/IntelliJ, and single step
>> where it walks files.
>>
>> On Wed, Feb 22, 2012 at 7:01 PM, Temese Szalai <[email protected]>
>> wrote:
>> > Hello -
>> >
>> > I'm new to Mahout and I'm not having any luck trying to use
>> > seqdirectory to create seqfiles so that i can then generate vectors
>> > from text files.
>> > Seems like this operation should work like a charm.
>> >
>> > Here is the command that I used to attempt to process the Reuters corpus
>> > into seqfiles and the output that I got in the terminal.
>> >
>> > $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
>> > Running on hadoop, using HADOOP_HOME=/Users/temeseszalai/Desktop/hadoop-0.20.203.0
>> > No HADOOP_CONF_DIR set, using /Users/temeseszalai/Desktop/hadoop-0.20.203.0/src/conf
>> > 12/02/22 16:29:01 INFO common.AbstractJob: Command line arguments:
>> > {--charset=UTF-8, --chunkSize=64, --endPhase=2147483647,
>> > --fileFilterClass=org.apache.mahout.text.PrefixAdditionFilter,
>> > --input=examples/reuters-extracted/, --keyPrefix=,
>> > --output=reuters-seqfiles, --startPhase=0, --tempDir=temp}
>> > 12/02/22 16:29:02 INFO driver.MahoutDriver: Program took 418 ms
>> >
>> > I am using mahout-distribution-0.5 on Mac OSX (10.7.3).
>> > I don't get any error messages from seqdirectory. I just don't get any
>> > seqfiles.
>> >
>> > the output directory is always empty and the time it takes to run is
>> > always minimal. have tried with different data, different paths, have
>> > had someone else with considerably more java experience sanity check
>> > and still no luck.
>> >
>> > I'm clearly doing something wrong ... No idea what ... I've tried poking
>> > around to see if anyone else has had the same issue and haven't turned up
>> > much that is useful.
>> >
>> > Any thoughts? Guidance would definitely be appreciated.
>> >
>> > Thanks in advance.
>> > Temese
>>
>>
>>
>> --
>> Lance Norskog
>> [email protected]
>>



-- 
Lance Norskog
[email protected]
