Indeed, and our seq2sparse utility enables this directly. If you ask
seq2sparse for command-line help, you'll see a bunch of options you may not
be using:
$ ./bin/mahout seq2sparse -h
Error: Could not find or load main class classpath
Running on hadoop, using /usr/local/Cellar/hadoop/0.20.1/libexec/bin/hadoop
and HADOOP_CONF_DIR=
MAHOUT-JOB:
/Users/jake/open_src/gitrepo/mahout-twitter/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
Usage:
[--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
<chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
<maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
--minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
--overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport       (Optional) Minimum Support. Default Value: 2
  --analyzerName (-a) analyzerName   The class name of the analyzer
  --chunkSize (-chunk) chunkSize     The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output               The directory pathname for output.
  --input (-i) input                 Path to job input directory.
  --minDF (-md) minDF                The minimum document frequency. Default is 1
  --maxDFSigma (-xs) maxDFSigma      What portion of the tf (tf-idf) vectors to
                                     be used, expressed in times the standard
                                     deviation (sigma) of the document
                                     frequencies of these vectors. Can be used
                                     to remove really high frequency terms.
                                     Expressed as a double value. Good value to
                                     be specified is 3.0. In case the value is
                                     less than 0 no vectors will be filtered
                                     out. Default is -1.0. Overrides
                                     maxDFPercent
  --maxDFPercent (-x) maxDFPercent   The max percentage of docs for the DF.
                                     Can be used to remove really high
                                     frequency terms. Expressed as an integer
                                     between 0 and 100. Default is 99. If
                                     maxDFSigma is also set, it will override
                                     this value.
  --weight (-wt) weight              The kind of weight to use. Currently TF
                                     or TFIDF
  --norm (-n) norm                   The norm to use, expressed as either a
                                     float or "INF" if you want to use the
                                     Infinite norm. Must be greater or equal
                                     to 0. The default is not to normalize
  --minLLR (-ml) minLLR              (Optional) The minimum Log Likelihood
                                     Ratio (Float). Default is 1.0
  --numReducers (-nr) numReducers    (Optional) Number of reduce tasks.
                                     Default Value: 1
  --maxNGramSize (-ng) ngramSize     (Optional) The maximum size of ngrams to
                                     create (2 = bigrams, 3 = trigrams, etc).
                                     Default Value: 1
  --overwrite (-ow)                  If set, overwrite the output directory
  --help (-h)                        Print out help
  --sequentialAccessVector (-seq)    (Optional) Whether output vectors should
                                     be SequentialAccessVectors. If set true,
                                     else false
  --namedVector (-nv)                (Optional) Whether output vectors should
                                     be NamedVectors. If set true, else false
  --logNormalize (-lnorm)            (Optional) Whether output vectors should
                                     be log-normalized. If set true, else false
13/08/06 04:44:45 INFO driver.MahoutDriver: Program took 158 ms (Minutes:
0.0026333333333333334)
--------
In particular, "--maxNGramSize 3" says you don't want just raw tokens, but
also bigrams like "Arnold Schwarzenegger" and trigrams like "New York
Yankees". To decide *which* ones to keep (because there are *way* too many
2- and 3-grams if you take all of them), the simple techniques this utility
offers are: a) filter by document frequency, either on the high end, getting
rid of features [either tokens/unigrams or ngrams for n > 1] which occur too
frequently, by setting --maxDFPercent 95 [this drops the 5% most commonly
occurring features] or --maxDFSigma 3.0 [this drops all features whose
document frequency is more than 3 sigma above the mean], or on the low end,
getting rid of features which occur too rarely, with --minDF 2: this makes
sure that features which appear in fewer than 2 documents get dropped; b)
filter by log-likelihood ratio: --minLLR 10.0 requires the LLR of an ngram
to be at least 10.0. For a more detailed explanation of ngrams and LLR,
Ted's classic blog post
<http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> may
be helpful.
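For a rough feel for what that LLR score measures: it's computed from the
2x2 contingency table of how often the two words of a bigram occur together
and apart. A minimal sketch of Dunning's formula in Python (not Mahout's
actual implementation, just the same math):

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio from a 2x2 contingency table:
    k11 = count of the bigram (A followed by B),
    k12 = count of A without B,
    k21 = count of B without A,
    k22 = count of everything else."""
    def raw_entropy(*counts):
        # sum of c * ln(c / total), skipping empty cells
        total = sum(counts)
        return sum(c * math.log(c / total) for c in counts if c > 0)
    return 2.0 * (raw_entropy(k11, k12, k21, k22)
                  - raw_entropy(k11 + k12, k21 + k22)   # row sums
                  - raw_entropy(k11 + k21, k12 + k22))  # column sums

# Independent words score near zero; strong co-occurrence scores high:
print(llr(10, 10, 10, 10))       # ~0.0
print(llr(100, 5, 5, 100000))    # large positive value
```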
The TL;DR of it is that for practical purposes, you really want to try a
simple run with, say, "--minLLR 1.0", see how many ngrams are left, how
good they look, and what their LLR is, and then bump the value up to
something which gets rid of more of the crappy ones. Maybe it's 10.0, maybe
it's 15.0, maybe it's 25.0; it depends on your data, and on how many
features you want to end up with at the end of the day.
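Putting the flags above together, a run that builds up-to-trigram tf-idf
vectors with both kinds of filtering might look like the following (the
input/output paths are just placeholders for your own):

```shell
# Build up-to-trigram tf-idf vectors; drop very common features
# (maxDFPercent 95), very rare ones (minSupport 2, minDF 2), and
# low-LLR ngrams (minLLR 10.0).
./bin/mahout seq2sparse \
  --input /path/to/seqfiles \
  --output /path/to/vectors \
  --weight tfidf \
  --maxNGramSize 3 \
  --minSupport 2 \
  --minDF 2 \
  --maxDFPercent 95 \
  --minLLR 10.0 \
  --namedVector \
  --overwrite
```

Then you can eyeball which tokens and ngrams survived the filters by
dumping the dictionary seq2sparse writes under the output directory, e.g.
with "./bin/mahout seqdumper -i /path/to/vectors/dictionary.file-0".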
On Tue, Aug 6, 2013 at 3:05 AM, Marco <[email protected]> wrote:
> Is it possible to have vectors components from raw text samples with more
> than one word?
>
> Example:
>
> Key: California: Value: "Arnold Schwarzenegger" "San Andreas Fault"
>
> (I've put quotation marks just to show how I'd like to group vector's
> values)
>
--
-jake