Re: Vectors (from raw text) with more than one word values

Suneel Marthi Tue, 06 Aug 2013 07:20:15 -0700

Marco,

The actual tokenization is done by the Lucene Analyzer you specify with the 
option "--analyzerName" (the default being Lucene's StandardAnalyzer) when 
invoking seq2sparse.

Off the top of my head, I don't think there is a ready-made Lucene tokenizer 
for "quoted" text, but it should be easy to create one.
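
To make the idea concrete, here is a minimal sketch of the splitting logic 
such a tokenizer would need. It is plain Python, not Lucene code; the function 
name quoted_tokenize is made up for illustration, and lowercasing is only an 
assumption about what you would want downstream.

```python
import re

# Either a double-quoted phrase (captured without its quotes) or a run of
# characters that are neither whitespace nor a quote.
_TOKEN_RE = re.compile(r'"([^"]+)"|([^\s"]+)')

def quoted_tokenize(text):
    """Split text into tokens, keeping each "quoted phrase" as one token.

    Hypothetical helper for illustration only; not part of Lucene or Mahout.
    """
    tokens = []
    for phrase, word in _TOKEN_RE.findall(text):
        token = phrase if phrase else word
        # Lowercase and collapse inner whitespace so that
        # "San  Andreas Fault" and "san andreas fault" become the same feature.
        tokens.append(" ".join(token.lower().split()))
    return tokens

print(quoted_tokenize('"Arnold Schwarzenegger" "San Andreas Fault" earthquake'))
```

An unbalanced quote is simply skipped and the words after it are emitted one 
by one, so malformed input degrades to ordinary word tokens. A real solution 
would put the same logic behind a Lucene Analyzer class and pass that class 
name to --analyzerName.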

________________________________
From: Marco <[email protected]>
To: "[email protected]" <[email protected]> 
Sent: Tuesday, August 6, 2013 9:08 AM
Subject: Re: Vectors (from raw text) with more than one word values

wow! this is a hell of an answer!
thanks very much for it.

i also thought about something else: since i'm the one also producing the 
sequence files that'll then be "seq2sparsed", i figured i could "wrap" my 
n-grams (say using quotation marks or whatever) so that then seq2sparse would 
not break them into smaller pieces.

any chance this is possible? does it depend on the separator i use?

----- Original message -----
From: Jake Mannix <[email protected]>
To: "[email protected]" <[email protected]>; Marco <[email protected]>
Cc: 
Sent: Tuesday, August 6, 2013 14:32
Subject: Re: Vectors (from raw text) with more than one word values

Indeed, and our seq2sparse utility enables this directly.  If you ask for
cmdline help from seq2sparse, you'll see a bunch of options you maybe don't
use:

$ ./bin/mahout seq2sparse -h
Error: Could not find or load main class classpath
Running on hadoop, using /usr/local/Cellar/hadoop/0.20.1/libexec/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /Users/jake/open_src/gitrepo/mahout-twitter/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
Usage:
 [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
 <chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
 <maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
 --minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
 --overwrite --help --sequentialAccessVector --namedVector --logNormalize]
Options
  --minSupport (-s) minSupport        (Optional) Minimum Support. Default
                                      Value: 2
  --analyzerName (-a) analyzerName    The class name of the analyzer
  --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000 MB
  --output (-o) output                The directory pathname for output.
  --input (-i) input                  Path to job input directory.
  --minDF (-md) minDF                 The minimum document frequency. Default
                                      is 1
  --maxDFSigma (-xs) maxDFSigma       What portion of the tf (tf-idf) vectors
                                      to be used, expressed in times the
                                      standard deviation (sigma) of the
                                      document frequencies of these vectors.
                                      Can be used to remove really high
                                      frequency terms. Expressed as a double
                                      value. Good value to be specified is 3.0.
                                      In case the value is less than 0 no
                                      vectors will be filtered out. Default is
                                      -1.0.  Overrides maxDFPercent
  --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF.
                                      Can be used to remove really high
                                      frequency terms. Expressed as an integer
                                      between 0 and 100. Default is 99.  If
                                      maxDFSigma is also set, it will override
                                      this value.
  --weight (-wt) weight               The kind of weight to use. Currently TF
                                      or TFIDF
  --norm (-n) norm                    The norm to use, expressed as either a
                                      float or "INF" if you want to use the
                                      Infinite norm.  Must be greater or equal
                                      to 0.  The default is not to normalize
  --minLLR (-ml) minLLR               (Optional)The minimum Log Likelihood
                                      Ratio(Float)  Default is 1.0
  --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
                                      Default Value: 1
  --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
                                      create (2 = bigrams, 3 = trigrams, etc)
                                      Default Value:1
  --overwrite (-ow)                   If set, overwrite the output directory
  --help (-h)                         Print out help
  --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
                                      be SequentialAccessVectors. If set true
                                      else false
  --namedVector (-nv)                 (Optional) Whether output vectors should
                                      be NamedVectors. If set true else false
  --logNormalize (-lnorm)             (Optional) Whether output vectors should
                                      be logNormalize. If set true else false
13/08/06 04:44:45 INFO driver.MahoutDriver: Program took 158 ms (Minutes: 0.0026333333333333334)

--------
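
As a rough illustration of what the --maxNGramSize option in the help text 
above asks for, the sketch below enumerates every candidate n-gram up to a 
maximum size from a token list. It is plain Python rather than Mahout code, 
the name ngrams_up_to is made up, and it only shows candidate generation, 
before any frequency or LLR pruning.

```python
def ngrams_up_to(tokens, max_n):
    """Return all contiguous n-grams of size 1..max_n, joined by spaces.

    Hypothetical helper for illustration; not how seq2sparse is implemented.
    """
    out = []
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams_up_to(["new", "york", "yankees"], 3))
```

Even this three-token input yields six candidates, which is why some pruning 
of the candidate set is needed on real text.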

In particular, "--maxNGramSize 3" says you don't want just raw tokens, but
also bigrams like "Arnold Schwarzenegger" and trigrams like "New York
Yankees". To decide *which* ones to keep (because there are *way* too many
2- and 3-grams if you take all of them), this utility offers two simple
techniques:

a) Filter by document frequency. On the high end, get rid of features
[either tokens/unigrams or ngrams for n > 1] which occur too frequently, by
setting --maxDFPercent 95 [this drops features that occur in more than 95%
of the documents] or --maxDFSigma 3.0 [this drops all features with doc
frequency > 3 sigma higher than the mean]. On the low end, get rid of
features which occur too rarely, with --minDF 2: this makes sure that
features which occur in fewer than 2 documents get dropped.

b) Filter by log likelihood ratio: --minLLR 10.0 requires an ngram to have
an LLR of at least 10.0.

For a more detailed explanation of ngrams and LLR, Ted's classic blog post
<http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> may
be helpful.
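
For readers who want to see the two filters spelled out, here is a minimal 
Python sketch. The document-frequency part assumes "frequency" means the 
number of documents a feature appears in, and the LLR part is the standard 
2x2 contingency-table formulation of the log likelihood ratio. The function 
names and the toy documents are made up, and this is not claimed to match 
seq2sparse's internals line for line.

```python
import math
from collections import Counter

def filter_by_df(docs, min_df=1, max_df_percent=99):
    """Keep features whose document frequency lies in [min_df, max_df]."""
    df = Counter()
    for features in docs:
        df.update(set(features))  # count each feature once per document
    max_df = len(docs) * max_df_percent / 100.0
    return {f for f, n in df.items() if min_df <= n <= max_df}

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)) over the cell counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log likelihood ratio for a bigram "A B".

    k11: A followed by B        k12: A followed by something else
    k21: something else, then B k22: neither A nor B
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

# Toy corpus, invented for illustration.
docs = [["the", "new york"], ["the", "yankees"],
        ["the", "new york"], ["the", "boston"]]
print(filter_by_df(docs, min_df=2, max_df_percent=95))
print(llr(10, 0, 0, 10), llr(10, 10, 10, 10))
```

In the toy corpus, "the" appears in every document and is dropped by the high 
end, while the features seen only once are dropped by the low end. For the 
LLR, words that always occur together score high, and words that co-occur 
exactly as often as chance predicts score essentially zero.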

  The TL;DR of it is that for practical purposes, you really want to try a
simple run with, say, "--minLLR 1.0", see how many ngrams are left, how
good they look, and what their LLR is, and then bump the value up to
something which gets rid of more of the crappy ones. Maybe it's 10.0, maybe
it's 15.0, maybe it's 25.0; it depends on your data and on how many features
you want to end up with at the end of the day.
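
If you have the LLR scores from such a first run in hand, picking the 
threshold can be as simple as counting survivors at a few candidate values. 
The sketch below assumes you have already extracted the scores into a dict; 
the function name and the scores are made up for illustration and do not come 
from a real run.

```python
def survivors_by_threshold(llr_scores, thresholds):
    """Map each candidate minLLR value to the number of ngrams that pass it."""
    return {t: sum(1 for s in llr_scores.values() if s >= t)
            for t in thresholds}

# Invented scores, for illustration only.
scores = {"new york": 42.0, "san andreas": 18.5, "of the": 1.3, "in a": 0.4}
print(survivors_by_threshold(scores, [1.0, 10.0, 15.0, 25.0]))
```

Looking at which ngrams fall out between two thresholds is usually more 
informative than the counts alone.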

On Tue, Aug 6, 2013 at 3:05 AM, Marco <[email protected]> wrote:

> Is it possible to have vectors components from raw text samples with more
> than one word?
>
> Example:
>
> Key: California: Value: "Arnold Schwarzenegger" "San Andreas Fault"
>
> (I've put quotation marks just to show how I'd like to group vector's
> values)
>

-- 

  -jake
