Scott,

How much data do you have? How much do you plan to have?

On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote <[email protected]> wrote:
> Hello All,
>
> I have two questions (Q1, Q2).
>
> Q1: I am digging into text analysis and am wrestling with competing
> strategies for maintaining the analyzed data.
>
> NOTE: my text comes from a very narrowly focused source.
>
> - I am currently crunching the data (batch) using the following scheme:
> 1. Load the source text as rows in a MySQL database.
> 2. Create named TF-IDF vectors from the source text using a custom analyzer
>    (stopword removal, lowercasing, standard filter, ...).
> 3. Perform canopy clustering and then k-means clustering using an enhanced
>    cosine metric (derived from a custom metric found in MiA).
> 4. Load references to the clusters (cluster id, top terms) into SOLR (core1),
>    along with the full cluster data into Mongo (a cluster is a doc).
> 5. Then load the source text into SOLR (core2) using the same custom analyzer
>    with an appropriate boost, along with the reference cluster id.
> NOTE: in all cases, the id of the source text is preserved throughout the
> flow, in the vector naming process, etc.
>
> So now I have a MySQL table, two SOLR cores, and a Mongo document collection,
> all tied together with the text id as the common name.
>
> - Now, when a new document enters the system after "batch" has been
> performed, I use core2 to test the top SOLR matches (the custom analyzer
> normalizes the new doc) to find the best cluster within a tolerance. If a
> cluster is found, I place the text in that cluster; if not, I start a new
> group (my word for a cluster not generated via k-means). Either way, the doc
> makes its way into both core1 and core2. I keep track of the number of group
> creations/document placements so that if a threshold is crossed, I can
> re-batch the data.
>
> MiA (I think ch. 11) suggests that a user could run the canopy clustering
> routine to assign new entries to the clusters (instead of what I am doing).
> Does the author mean to regenerate a new dictionary, frequencies, etc. for
> the corpus for every inbound document? My observation has been that this has
> been a very speedy process, but I'm hoping that I'm just too much of a novice
> and haven't thought of a way to simply update the dictionary/frequencies.
> (This process also calls for the eventual re-batching of the clusters.)
>
> While I was very early in my "implement what I have read" process, Suneel
> and Ted recommended that I examine the Streaming KMeans process. Would that
> process sidestep much of what I'm doing?
>
> Q2: I need to really understand the lexicon of my corpus. How do I see the
> list of terms that have been omitted, due either to being in too many
> documents or to not being in enough documents for consideration?
>
> Please know that I know I can look at the dictionary to see which terms are
> covered. And since my custom analyzer uses the StandardAnalyzer stop words,
> those are obvious also. If there isn't an option to emit the omitted words,
> where would be the natural place to capture that data and save it into yet
> another data store (sequence file, etc.)?
>
> Thanks in advance for the guidance,
>
> SCott
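
A couple of rough sketches below, in case they help make the two questions
concrete. Both are in Java, both use made-up names (IncrementalAssigner,
tolerance, OmittedTermsReport, minSupport, maxDfPercent), and neither is meant
as "the" Mahout way to do it.

On the incremental part of Q1 (drop a new doc into the closest existing
cluster within a tolerance, otherwise open a new group): if you keep the
k-means centroids around, you can vectorize the new doc against the existing
dictionary and compare it to the centroids directly, with no need to
regenerate the dictionary or frequencies per inbound document. New terms in
the doc simply get ignored until the next re-batch, which is the same
trade-off your SOLR-based placement already makes. A minimal sketch, using the
stock CosineDistanceMeasure as a stand-in for your enhanced metric:

import java.util.List;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Vector;

// Rough sketch only: the class name and "tolerance" are made up, and
// CosineDistanceMeasure stands in for the enhanced cosine metric.
public class IncrementalAssigner {

  private final CosineDistanceMeasure cosine = new CosineDistanceMeasure();
  private final double tolerance; // max acceptable distance to a centroid

  public IncrementalAssigner(double tolerance) {
    this.tolerance = tolerance;
  }

  // Returns the index of the closest centroid within tolerance,
  // or -1 to signal "start a new group".
  public int assign(Vector newDocTfidf, List<Vector> centroids) {
    int bestId = -1;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = cosine.distance(centroids.get(i), newDocTfidf);
      if (d < bestDist) {
        bestDist = d;
        bestId = i;
      }
    }
    return bestDist <= tolerance ? bestId : -1;
  }
}

The tolerance here plays the same role as the SOLR score cutoff you are
already using.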

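On Q2: as far as I can tell, the seq2sparse-style vectorization does not emit
the terms it prunes, so one option is a small side job that re-runs the same
custom analyzer over the corpus, counts document frequencies itself, and
writes whatever falls outside the thresholds to a sequence file. A rough
sketch, with minSupport and maxDfPercent standing in for whatever cutoffs you
used when you built the vectors:

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Rough sketch only: minSupport/maxDfPercent are placeholders for whatever
// thresholds were used when the TF-IDF vectors were built.
public class OmittedTermsReport {

  // Document frequency per term, using the same analyzer that built the vectors.
  static Map<String, Integer> documentFrequencies(Analyzer analyzer, List<String> docs)
      throws IOException {
    Map<String, Integer> df = new HashMap<String, Integer>();
    for (String doc : docs) {
      Set<String> seen = new HashSet<String>();
      TokenStream ts = analyzer.tokenStream("text", new StringReader(doc));
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        seen.add(term.toString());
      }
      ts.end();
      ts.close();
      for (String t : seen) {
        Integer c = df.get(t);
        df.put(t, c == null ? 1 : c + 1);
      }
    }
    return df;
  }

  // Write the terms that fall outside [minSupport, maxDfPercent] to a sequence file.
  static void writeOmitted(Map<String, Integer> df, int numDocs, int minSupport,
      double maxDfPercent, Path out, Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, IntWritable.class);
    try {
      for (Map.Entry<String, Integer> e : df.entrySet()) {
        double pct = 100.0 * e.getValue() / numDocs;
        if (e.getValue() < minSupport || pct > maxDfPercent) {
          writer.append(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
      }
    } finally {
      writer.close();
    }
  }
}

Checking that output against the dictionary you already inspect is a quick
sanity check that the two agree on what was kept versus dropped.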