Right now I'm dealing with only 40,000 documents, but we will eventually grow more than 10x (put on the manager hat and say 1 million docs), where a doc is usually no longer than 20 or 30 words. To make Q1 and Q2 concrete, I've also pasted a couple of rough sketches below the quoted thread.
SCott

On 2/14/14 12:46 PM, "Ted Dunning" <[email protected]> wrote:

>Scott,
>
>How much data do you have?
>
>How much do you plan to have?
>
>
>On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote <[email protected]> wrote:
>
>> Hello All,
>>
>> I have two questions (Q1, Q2).
>>
>> Q1: I am digging into text analysis and am wrestling with competing
>> analyzed-data maintenance strategies.
>>
>> NOTE: my text comes from a very narrowly focused source.
>>
>> I am currently crunching the data (batch) using the following scheme:
>> 1. Load the source text as rows in a MySQL database.
>> 2. Create named TF-IDF vectors from the source text using a custom
>>    analyzer (stopword removal, lowercasing, standard filter, ...).
>> 3. Perform canopy clustering and then k-means clustering using an
>>    enhanced cosine metric (derived from a custom metric found in MiA).
>> 4. Load references to the clusters (cluster id, top terms) into SOLR
>>    (core1), and the full cluster data into Mongo (a cluster is a doc).
>> 5. Load the source text into SOLR (core2) using the same custom
>>    analyzer with appropriate boost, along with the reference cluster id.
>> NOTE: in all cases, the id of the source text is preserved throughout
>> the flow, in the vector naming process, etc.
>>
>> So now I have a MySQL table, two SOLR cores, and a Mongo document
>> collection, all tied together with the text id as the common key.
>>
>> Now, when a new document enters the system after "batch" has been
>> performed, I use core2 to test the top SOLR matches (the custom
>> analyzer normalizes the new doc) to find the best cluster within a
>> tolerance. If a cluster is found, I place the text in that cluster;
>> if not, I start a new group (my word for a cluster not generated via
>> k-means). Either way, the doc makes its way into both cores (core1
>> and core2). I keep track of the number of group creations/document
>> placements so that if a threshold is crossed, I can re-batch the data.
>>
>> MiA (I think ch. 11) suggests that a user could run the canopy
>> clustering routine to assign new entries to the clusters (instead of
>> what I am doing). Does the author mean to regenerate a new dictionary,
>> frequencies, etc. for the corpus for every inbound document? My
>> observation has been that this is a very speedy process, but I'm
>> hoping that I'm just too much of a novice and haven't thought of a way
>> to simply update the dictionary/frequencies. (This process also calls
>> for the eventual re-batching of the clusters.)
>>
>> While I was very early in my "implement what I have read" process,
>> Suneel and Ted recommended that I examine the Streaming KMeans
>> process. Would that process sidestep much of what I'm doing?
>>
>> Q2: I need to really understand the lexicon of my corpus. How do I see
>> the list of terms that have been omitted, due either to being in too
>> many documents or to not being in enough documents for consideration?
>>
>> Please know that I know I can look at the dictionary to see which
>> terms are covered. And since my custom analyzer uses the
>> StandardAnalyzer stop words, those are obvious also. If there isn't an
>> option to emit the omitted words, where would be the natural place to
>> capture that data and save it into yet another data store (sequence
>> file, etc.)?
>>
>> Thanks in advance for the guidance,
>>
>> SCott
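
PS: To make Q1 concrete, here is roughly the shape of the single-document
assignment step I described, done against the existing batch output instead
of re-running canopy for each inbound doc. This is only a sketch: the class
and field names are placeholders, and loading the dictionary (term -> index),
the df counts, numDocs, and the k-means centroids from the batch output is
assumed to happen elsewhere.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    /** Assign one new doc to the nearest existing k-means cluster. */
    public class IncrementalAssigner {

      private final Analyzer analyzer;                 // same custom analyzer as the batch run
      private final Map<String, Integer> dictionary;   // term -> dimension index (batch dictionary)
      private final Map<Integer, Long> docFreq;        // dimension index -> document frequency
      private final long numDocs;                      // corpus size at batch time
      private final Map<Integer, Vector> centroids;    // cluster id -> centroid from the batch clusters

      public IncrementalAssigner(Analyzer analyzer, Map<String, Integer> dictionary,
          Map<Integer, Long> docFreq, long numDocs, Map<Integer, Vector> centroids) {
        this.analyzer = analyzer;
        this.dictionary = dictionary;
        this.docFreq = docFreq;
        this.numDocs = numDocs;
        this.centroids = centroids;
      }

      /** Vectorize against the EXISTING dictionary; unseen terms are simply dropped. */
      public Vector vectorize(String text) throws IOException {
        Map<Integer, Double> counts = new HashMap<Integer, Double>();
        TokenStream ts = analyzer.tokenStream("text", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          Integer index = dictionary.get(term.toString());
          if (index != null) {
            Double prev = counts.get(index);
            counts.put(index, prev == null ? 1.0 : prev + 1.0);
          }
        }
        ts.end();
        ts.close();
        Vector v = new RandomAccessSparseVector(dictionary.size());
        for (Map.Entry<Integer, Double> e : counts.entrySet()) {
          Long df = docFreq.get(e.getKey());
          double idf = Math.log((double) numDocs / (df == null ? 1L : df));
          v.setQuick(e.getKey(), e.getValue() * idf);   // tf * idf, same weighting as batch
        }
        return v;
      }

      /** Return the closest cluster id, or -1 if nothing is within the tolerance. */
      public int nearestCluster(Vector doc, double tolerance) {
        CosineDistanceMeasure measure = new CosineDistanceMeasure();
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<Integer, Vector> entry : centroids.entrySet()) {
          double d = measure.distance(entry.getValue(), doc);
          if (d < bestDistance) {
            bestDistance = d;
            best = entry.getKey();
          }
        }
        return bestDistance <= tolerance ? best : -1;   // -1 means "start a new group"
      }
    }

The idea is that a single inbound doc would only touch the frozen
dictionary/df counts and the centroids; the dictionary and frequencies would
only be regenerated when the re-batch threshold is crossed anyway.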
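And for Q2, this is roughly where I imagined capturing the omitted terms:
run the same analyzer over the corpus, count document frequencies, and diff
the result against the dictionary that the vectorization step wrote. Again
just a sketch with placeholder names; loading the documents and the
dictionary terms is assumed to be done elsewhere.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class OmittedTermReport {

      /**
       * Terms the analyzer emits but the dictionary lacks were filtered out
       * of the vectors (either too rare or in too many documents). The
       * returned map keeps each omitted term's document frequency so you can
       * see which side of the cut it fell on before saving it to another store.
       */
      public static Map<String, Integer> omittedTerms(Iterable<String> documents,
          Analyzer analyzer, Set<String> dictionaryTerms) throws IOException {
        Map<String, Integer> docFreq = new HashMap<String, Integer>();
        for (String text : documents) {
          Set<String> seen = new HashSet<String>();          // count each term once per doc
          TokenStream ts = analyzer.tokenStream("text", new StringReader(text));
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            seen.add(term.toString());
          }
          ts.end();
          ts.close();
          for (String t : seen) {
            Integer prev = docFreq.get(t);
            docFreq.put(t, prev == null ? 1 : prev + 1);
          }
        }
        Map<String, Integer> omitted = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
          if (!dictionaryTerms.contains(e.getKey())) {
            omitted.put(e.getKey(), e.getValue());
          }
        }
        return omitted;
      }
    }

The stop words never reach the analyzer's output, so they won't show up in
this report, but as I said those are already obvious from the analyzer config.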
