Marco,

The actual tokenization is done by the Lucene Analyzer you specify with the "--analyzerName" option when invoking seq2sparse (the default is Lucene's StandardAnalyzer).
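Since the analyzer decides where the token boundaries fall, keeping a quoted n-gram in one piece comes down to the splitting rule it applies. As a rough illustration only, here is that rule in plain Python. This is not Lucene code and not an existing Mahout analyzer; the function name `quote_aware_tokens` and the lowercasing step are assumptions made for the sketch.

```python
import re

# Hypothetical helper, not part of Lucene or Mahout. It only illustrates the
# splitting rule a quote-aware analyzer would have to implement.
# First alternative: a double-quoted phrase. Second: any run of non-whitespace.
_TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def quote_aware_tokens(text):
    """Return quoted phrases as single tokens; split everything else on whitespace."""
    tokens = []
    for quoted, bare in _TOKEN.findall(text):
        # Exactly one of the two groups is non-empty for each match.
        token = (quoted if quoted else bare).strip().lower()
        if token:
            tokens.append(token)
    return tokens


print(quote_aware_tokens('"Arnold Schwarzenegger" governor "San Andreas Fault"'))
# ['arnold schwarzenegger', 'governor', 'san andreas fault']
```

A real implementation would put the same rule inside a custom Lucene Tokenizer and hand the resulting Analyzer class name to "--analyzerName".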
Off the top of my head, I don't think there is a custom Lucene tokenizer for "quoted" text, but it should be really easy to create one.

________________________________
From: Marco <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Tuesday, August 6, 2013 9:08 AM
Subject: Re: Vectors (from raw text) with more than one word values

wow! this is a hell of an answer! thanks very much for it.

i also thought about something else: since i'm the one also producing the sequence files that'll then be "seq2sparsed", i figured i could "wrap" my n-grams (say using quotation marks or whatever) so that seq2sparse would not break them into smaller pieces.

any chance this is possible? does it depend on the separator i use?

----- Original message -----
From: Jake Mannix <[email protected]>
To: "[email protected]" <[email protected]>; Marco <[email protected]>
Cc:
Sent: Tuesday, 6 August 2013 14:32
Subject: Re: Vectors (from raw text) with more than one word values

Indeed, and our seq2sparse utility enables this directly. If you ask for cmdline help from seq2sparse, you'll see a bunch of options you maybe don't use:

  $ ./bin/mahout seq2sparse -h
  Error: Could not find or load main class classpath
  Running on hadoop, using /usr/local/Cellar/hadoop/0.20.1/libexec/bin/hadoop and HADOOP_CONF_DIR=
  MAHOUT-JOB: /Users/jake/open_src/gitrepo/mahout-twitter/examples/target/mahout-examples-0.9-SNAPSHOT-job.jar
  Usage:
   [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize
   <chunkSize> --output <output> --input <input> --minDF <minDF> --maxDFSigma
   <maxDFSigma> --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm>
   --minLLR <minLLR> --numReducers <numReducers> --maxNGramSize <ngramSize>
   --overwrite --help --sequentialAccessVector --namedVector --logNormalize]
  Options
    --minSupport (-s) minSupport
        (Optional) Minimum Support. Default Value: 2
    --analyzerName (-a) analyzerName
        The class name of the analyzer
    --chunkSize (-chunk) chunkSize
        The chunkSize in MegaBytes.
        100-10000 MB
    --output (-o) output
        The directory pathname for output.
    --input (-i) input
        Path to job input directory.
    --minDF (-md) minDF
        The minimum document frequency. Default is 1
    --maxDFSigma (-xs) maxDFSigma
        What portion of the tf (tf-idf) vectors to be used, expressed in
        times the standard deviation (sigma) of the document frequencies of
        these vectors. Can be used to remove really high frequency terms.
        Expressed as a double value. Good value to be specified is 3.0. In
        case the value is less than 0 no vectors will be filtered out.
        Default is -1.0. Overrides maxDFPercent
    --maxDFPercent (-x) maxDFPercent
        The max percentage of docs for the DF. Can be used to remove really
        high frequency terms. Expressed as an integer between 0 and 100.
        Default is 99. If maxDFSigma is also set, it will override this
        value.
    --weight (-wt) weight
        The kind of weight to use. Currently TF or TFIDF
    --norm (-n) norm
        The norm to use, expressed as either a float or "INF" if you want
        to use the Infinite norm. Must be greater or equal to 0. The
        default is not to normalize
    --minLLR (-ml) minLLR
        (Optional)The minimum Log Likelihood Ratio(Float) Default is 1.0
    --numReducers (-nr) numReducers
        (Optional) Number of reduce tasks. Default Value: 1
    --maxNGramSize (-ng) ngramSize
        (Optional) The maximum size of ngrams to create (2 = bigrams,
        3 = trigrams, etc) Default Value:1
    --overwrite (-ow)
        If set, overwrite the output directory
    --help (-h)
        Print out help
    --sequentialAccessVector (-seq)
        (Optional) Whether output vectors should be SequentialAccessVectors.
        If set true else false
    --namedVector (-nv)
        (Optional) Whether output vectors should be NamedVectors. If set
        true else false
    --logNormalize (-lnorm)
        (Optional) Whether output vectors should be logNormalize.
        If set true else false
  13/08/06 04:44:45 INFO driver.MahoutDriver: Program took 158 ms (Minutes: 0.0026333333333333334)

--------

In particular, "--maxNGramSize 3" says you don't want just raw tokens, but also bigrams like "Arnold Schwarzenegger" and trigrams like "New York Yankees". To decide *which* ones to use (because there are *way* too many 2- and 3-grams if you take all of them), the simple technique we have in this utility is to

a) filter by document frequency, either on the high end: get rid of features [either tokens/unigrams or ngrams for n > 1] which occur too frequently, by setting --maxDFPercent 95 [this drops features that appear in more than 95% of the documents] or --maxDFSigma 3.0 [this drops all tokens with doc frequency > 3 sigma higher than the mean], or on the low end: get rid of features which occur too rarely, with --minDF 2, which makes sure that features occurring in fewer than 2 documents get dropped,

b) filter by log likelihood ratio: --minLLR 10.0 requires an ngram's LLR to be at least 10.0 for it to be kept.

For a more detailed explanation of ngrams and LLR, Ted's classic blog post <http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html> may be helpful. The TL;DR of it is that for practical purposes, you really want to try a simple run with, say, "--minLLR 1.0", see how many ngrams are left, how good they look, and what their LLR is, and then bump the value up to something which gets rid of more of the crappy ones. Maybe it's 10.0, maybe it's 15.0, maybe it's 25.0; it depends on your data and on how many features you want to end up with at the end of the day.

On Tue, Aug 6, 2013 at 3:05 AM, Marco <[email protected]> wrote:

> Is it possible to have vector components from raw text samples with more
> than one word?
>
> Example:
>
> Key: California: Value: "Arnold Schwarzenegger" "San Andreas Fault"
>
> (I've put quotation marks just to show how I'd like to group the vector's
> values)
>

--

  -jake
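
To get a feel for the numbers that "--minLLR" is compared against, here is a small Python sketch of the log-likelihood ratio statistic described in Ted's post, computed from a 2x2 table of counts for a candidate bigram "A B". It uses the usual entropy form of the statistic; I'm not claiming it is line-for-line what seq2sparse runs. The names `x_log_x`, `entropy`, `llr` and the k11..k22 cell labels are assumptions made for the sketch.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: times A is followed by B       k12: times A is followed by something else
    k21: times B follows something else k22: all remaining bigrams
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Rounding can push an independent table slightly below zero.
    return max(0.0, 2.0 * (row + col - mat))


# Counts below are made up purely for illustration.
print(llr(10, 10, 10, 10))    # A and B look independent
print(llr(100, 5, 5, 10000))  # A and B almost always occur together
```

A table where A and B are independent scores about zero, while one where they nearly always co-occur scores far above any of the thresholds mentioned above, which is why raising "--minLLR" prunes the accidental pairs first.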
