On Tue, Aug 6, 2013 at 7:18 AM, Suneel Marthi <[email protected]> wrote:
> Marco,
>
> The actual tokenization is done by the Lucene Analyzer you specify with
> the option "--analyzerName" (the default being Lucene's StandardAnalyzer)
> while invoking seq2sparse.
>
> Off the top of my head, I don't think there is a custom Lucene tokenizer
> for "quoted" text, but it should be real easy to create one.

+1 to Suneel's remarks. I'm not aware of an off-the-shelf tokenizer which
looks for "grouping" tokens and automatically crams the grouped tokens
together. That would be very helpful, however, as then you could first run
e.g. Stanford's named entity recognizer (or LingPipe, etc.) over the text
and have it annotate things like noun phrases with your predetermined
grouping tokens.

Alternatively, the "hacky" solution would be to force the typical Lucene
tokenizer to leave things alone by jamming them together: in your
preprocessing code which decides that "New York Yankees" is a trigram you
don't want split apart, just have it replace this string with
NewYorkYankees, and most Lucene tokenizers will leave it alone. For
seq2sparse, you'll get exactly the vectors you want, but your dictionary
will contain these "lame" concatenated ngrams and so forth, which you
could write some simple code to un-scramble after the fact.
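To make the hacky version concrete, here's a minimal sketch in plain Java
(the class name and phrase list are placeholders I made up; in practice
the phrases would come from whatever NER / noun-phrase step you run first):

import java.util.LinkedHashMap;
import java.util.Map;

public class PhraseJoiner {

  // Maps each multi-word phrase to its concatenated single-token form.
  private final Map<String, String> phrases = new LinkedHashMap<>();

  public PhraseJoiner(Iterable<String> phraseList) {
    for (String phrase : phraseList) {
      phrases.put(phrase, phrase.replaceAll("\\s+", ""));
    }
  }

  // Apply to raw text before writing the sequence files fed to seq2sparse,
  // e.g. "I love the New York Yankees" -> "I love the NewYorkYankees".
  public String join(String text) {
    for (Map.Entry<String, String> e : phrases.entrySet()) {
      text = text.replace(e.getKey(), e.getValue());
    }
    return text;
  }

  // Apply to dictionary terms afterwards to un-scramble them,
  // e.g. "NewYorkYankees" -> "New York Yankees".
  public String unjoin(String term) {
    for (Map.Entry<String, String> e : phrases.entrySet()) {
      if (term.equals(e.getValue())) {
        return e.getKey();
      }
    }
    return term;
  }
}

(One caveat: if some phrases are prefixes of other phrases, insert the
longer ones first so they win the replacement.)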
> ________________________________
> From: Marco <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Tuesday, August 6, 2013 9:08 AM
> Subject: Re: Vectors (from raw text) with more than one word values
>
> wow! this is a hell of an answer!
> thanks very much for it.
>
> i also thought about something else: since i'm the one also producing the
> sequence files that'll then be "seq2sparsed", i figured i could "wrap" my
> n-grams (say using quotation marks or whatever) so that seq2sparse would
> not break them into smaller pieces.
>
> any chance this is possible? does it depend on the separator i use?
>
>
> ----- Original Message -----
> From: Jake Mannix <[email protected]>
> To: "[email protected]" <[email protected]>; Marco <[email protected]>
> Cc:
> Sent: Tuesday, August 6, 2013 2:32 PM
> Subject: Re: Vectors (from raw text) with more than one word values
>
> Indeed, and our seq2sparse utility enables this directly. If you ask for
> command-line help from seq2sparse, you'll see a bunch of options you
> maybe don't use:
>
> $ ./bin/mahout seq2sparse -h
> Error: Could not find or load main class classpath
> Running on hadoop, using /usr/local/Cellar/hadoop/0.20.1/libexec/bin/hadoop
> and HADOOP_CONF_DIR=
> MAHOUT-JOB: /Users/jake/open_src/gitrepo/mahout-twitter/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
> Usage:
>  [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
>   <chunkSize> --output <output> --input <input> --minDF <minDF>
>   --maxDFSigma <maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight>
>   --norm <norm> --minLLR <minLLR> --numReducers <numReducers>
>   --maxNGramSize <ngramSize> --overwrite --help --sequentialAccessVector
>   --namedVector --logNormalize]
> Options
>   --minSupport (-s) minSupport      (Optional) Minimum Support. Default
>                                     Value: 2
>   --analyzerName (-a) analyzerName  The class name of the analyzer
>   --chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes. 100-10000 MB
>   --output (-o) output              The directory pathname for output.
>   --input (-i) input                Path to job input directory.
>   --minDF (-md) minDF               The minimum document frequency. Default
>                                     is 1
>   --maxDFSigma (-xs) maxDFSigma     What portion of the tf (tf-idf) vectors
>                                     to be used, expressed in times the
>                                     standard deviation (sigma) of the
>                                     document frequencies of these vectors.
>                                     Can be used to remove really high
>                                     frequency terms. Expressed as a double
>                                     value. Good value to be specified is
>                                     3.0. In case the value is less than 0,
>                                     no vectors will be filtered out.
>                                     Default is -1.0. Overrides maxDFPercent
>   --maxDFPercent (-x) maxDFPercent  The max percentage of docs for the DF.
>                                     Can be used to remove really high
>                                     frequency terms. Expressed as an
>                                     integer between 0 and 100. Default is
>                                     99. If maxDFSigma is also set, it will
>                                     override this value.
>   --weight (-wt) weight             The kind of weight to use. Currently
>                                     TF or TFIDF
>   --norm (-n) norm                  The norm to use, expressed as either a
>                                     float or "INF" if you want to use the
>                                     Infinite norm. Must be greater or equal
>                                     to 0. The default is not to normalize
>   --minLLR (-ml) minLLR             (Optional) The minimum Log Likelihood
>                                     Ratio (Float). Default is 1.0
>   --numReducers (-nr) numReducers   (Optional) Number of reduce tasks.
>                                     Default Value: 1
>   --maxNGramSize (-ng) ngramSize    (Optional) The maximum size of ngrams
>                                     to create (2 = bigrams, 3 = trigrams,
>                                     etc.) Default Value: 1
>   --overwrite (-ow)                 If set, overwrite the output directory
>   --help (-h)                       Print out help
>   --sequentialAccessVector (-seq)   (Optional) Whether output vectors
>                                     should be SequentialAccessVectors. If
>                                     set true else false
>   --namedVector (-nv)               (Optional) Whether output vectors
>                                     should be NamedVectors. If set true
>                                     else false
>   --logNormalize (-lnorm)           (Optional) Whether output vectors
>                                     should be logNormalized. If set true
>                                     else false
> 13/08/06 04:44:45 INFO driver.MahoutDriver: Program took 158 ms (Minutes:
> 0.0026333333333333334)
>
> --------
>
> In particular, "--maxNGramSize 3" says you don't want just raw tokens, but
> bigrams like "Arnold Schwarzenegger" and trigrams like "New York Yankees".
> To decide *which* ones to use (because there are *way* too many 2- and
> 3-grams if you take all of them), the simple technique we have in this
> utility is to a) filter by document frequency, either on the high end (get
> rid of features [either tokens/unigrams or ngrams for n > 1] which occur
> too frequently, by setting --maxDFPercent 95 [this drops the 5% most
> commonly occurring features] or --maxDFSigma 3.0 [this drops all tokens
> with doc frequency > 3 sigma above the mean]) or on the low end (get rid
> of features which occur too rarely, with --minDF 2, which makes sure that
> features occurring fewer than 2 times get dropped); or b) filter by log
> likelihood ratio: --minLLR 10.0 sets the minimum LLR for an ngram to be at
> least 10.0. For a more detailed explanation of ngrams and LLR, Ted's
> classic blog post
> <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> may
> be helpful.
>
> The TL;DR of it is that for practical purposes, you really want to try a
> simple run with, say, "--minLLR 1.0", see how many ngrams are left, how
> good they look, and what their LLR is, and then bump the value up to
> something which gets rid of more of the crappy ones - maybe it's 10.0,
> maybe it's 15.0, maybe it's 25.0; it depends on your data, and how many
> features you want to end up with at the end of the day.
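To add to my quoted explanation above: if you want a feel for what that LLR
number actually measures before picking a cutoff, you can compute it from a
2x2 contingency table of cooccurrence counts with the same class Mahout
uses internally, org.apache.mahout.math.stats.LogLikelihood. The counts
below are invented purely for illustration:

import org.apache.mahout.math.stats.LogLikelihood;

public class LlrExample {
  public static void main(String[] args) {
    // Hypothetical counts for the bigram "new york" in a corpus of
    // 100000 adjacent token pairs:
    long k11 = 100;                       // "new" followed by "york"
    long k12 = 1000;                      // "new" followed by anything else
    long k21 = 200;                       // anything else followed by "york"
    long k22 = 100000 - k11 - k12 - k21;  // pairs involving neither word
    double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
    System.out.println("LLR = " + llr);
  }
}

A strongly associated pair like this scores far above the usual --minLLR
cutoffs, while a pair of words that occur independently of each other
scores near 0.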
> On Tue, Aug 6, 2013 at 3:05 AM, Marco <[email protected]> wrote:
>
> > Is it possible to have vector components from raw text samples with
> > more than one word?
> >
> > Example:
> >
> > Key: California; Value: "Arnold Schwarzenegger" "San Andreas Fault"
> >
> > (I've put quotation marks just to show how I'd like to group the
> > vector's values)
>
> --
>   -jake

--
  -jake
