Hello All, I have two questions (Q1, Q2).
Q1: I'm digging into text analysis and wrestling with competing strategies for maintaining analyzed data. NOTE: my text comes from a very narrowly focused source.

I'm currently crunching the data in batch using the following scheme:

1. Load the source text as rows in a MySQL database.
2. Create named TF-IDF vectors from the source text using a custom analyzer (stopword removal, lowercasing, standard filter, etc.); a rough sketch of the analyzer is in the P.S. below.
3. Run Canopy clustering and then k-means clustering using an enhanced cosine metric (derived from a custom metric found in MiA); also sketched in the P.S.
4. Load cluster references (cluster id, top terms) into Solr (core1) and the full cluster data into Mongo (one cluster per document).
5. Load the source text into Solr (core2) using the same custom analyzer with appropriate boosts, along with the reference cluster id.

NOTE: in all cases the id of the source text is preserved throughout the flow (in the vector naming process, etc.), so I now have a MySQL table, two Solr cores, and a Mongo document collection, all tied together by the text id as the common name.

When a new document enters the system after the batch has been performed, I query core2 for the top Solr matches (the custom analyzer normalizes the new doc) to find the best cluster within a tolerance (sketched in the P.S.). If a cluster is found, I place the text in that cluster; if not, I start a new "group" (my word for a cluster not generated via k-means). Either way, the doc makes its way into both core1 and core2. I keep track of the number of group creations/document placements so that if a threshold is crossed I can re-batch the data.

MiA (I think ch. 11) suggests that a user could run the canopy clustering routine to assign new entries to the existing clusters (instead of what I am doing). Does the author mean regenerating the dictionary, frequencies, etc. for the corpus for every inbound document? My current process has been very speedy, but I'm hoping I'm just too much of a novice and have missed a way to simply update the dictionary/frequencies. (That approach would also call for eventually re-batching the clusters.) When I was very early in my "implement what I have read" process, Suneel and Ted recommended that I examine Streaming k-means. Would that process sidestep much of what I'm doing?

Q2: I need to really understand the lexicon of my corpus. How do I see the list of terms that were omitted either because they appear in too many documents or because they don't appear in enough documents to be considered? I know I can look at the dictionary to see which terms are covered, and since my custom analyzer uses the StandardAnalyzer stop words, those are obvious as well. If there isn't an option to emit the omitted words, where would be the natural place to capture that data and save it into yet another data store (sequence file, etc.)? I've sketched the kind of workaround I had in mind in the P.S.

Thanks in advance for the guidance,
Scott
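P.S. A few minimal sketches of the pieces referenced above. These are not my exact code: class, core, and field names, version constants, and thresholds are placeholders, and I'm assuming Lucene/Solr 4.x and the current Mahout APIs.

1) The custom analyzer (tokenizer -> standard filter -> lowercase -> StandardAnalyzer stop words):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Tokenize, apply the standard filter, lowercase, then drop StandardAnalyzer's stop words.
public final class SourceTextAnalyzer extends Analyzer {
  // Match this to whatever Lucene version the Solr/Mahout build uses.
  private static final Version MATCH_VERSION = Version.LUCENE_45;

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(MATCH_VERSION, reader);
    TokenStream stream = new StandardFilter(MATCH_VERSION, tokenizer);
    stream = new LowerCaseFilter(MATCH_VERSION, stream);
    stream = new StopFilter(MATCH_VERSION, stream, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(tokenizer, stream);
  }
}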
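2) The shape of the enhanced cosine metric, plugged into canopy/k-means via the distance-measure option; the MiA-derived re-weighting itself is elided here:

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Vector;

// Cosine distance plus a corpus-specific adjustment (details elided).
public class EnhancedCosineDistanceMeasure extends CosineDistanceMeasure {
  @Override
  public double distance(Vector v1, Vector v2) {
    double base = super.distance(v1, v2);
    // ... domain-specific re-weighting goes here ...
    return base;
  }
  // Note: the distance(centroidLengthSquare, centroid, v) overload may need the same adjustment.
}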
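3) Assigning an incoming document via core2 (SolrJ; the URL, score tolerance, and cluster_id field are illustrative, and the raw text is used as the query here only for brevity):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class ClusterAssigner {
  private static final float SCORE_TOLERANCE = 0.4f; // tuned empirically

  // Returns the cluster id of the best core2 match, or null if below tolerance
  // (in which case the caller starts a new "group").
  public String findCluster(String newDocText) throws SolrServerException {
    HttpSolrServer core2 = new HttpSolrServer("http://localhost:8983/solr/core2");
    SolrQuery query = new SolrQuery(newDocText); // analyzed by the same custom analyzer on the core2 field
    query.setFields("id", "cluster_id", "score");
    query.setRows(1);
    QueryResponse response = core2.query(query);
    SolrDocumentList results = response.getResults();
    if (results.isEmpty() || results.getMaxScore() == null || results.getMaxScore() < SCORE_TOLERANCE) {
      return null;
    }
    return (String) results.get(0).getFieldValue("cluster_id");
  }
}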
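4) For Q2, the workaround I had in mind if there is no built-in option to emit the pruned terms: run the same analyzer over the corpus to build the full term set (not shown), diff it against dictionary.file-0 (which, if I read the seq2sparse output right, is a SequenceFile of Text term -> IntWritable index), and persist the leftovers to another sequence file:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class OmittedTermsDump {

  // Load the terms that survived pruning from seq2sparse's dictionary.file-0.
  public static Set<String> loadDictionary(Configuration conf, Path dictPath) throws IOException {
    Set<String> kept = new HashSet<String>();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictPath, conf);
    try {
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        kept.add(term.toString());
      }
    } finally {
      reader.close();
    }
    return kept;
  }

  // Persist the terms the analyzer emitted but the dictionary dropped (too frequent / too rare).
  public static void writeOmitted(Configuration conf, Set<String> analyzedTerms, Set<String> kept, Path out)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, NullWritable.class);
    try {
      for (String term : analyzedTerms) {
        if (!kept.contains(term)) {
          writer.append(new Text(term), NullWritable.get());
        }
      }
    } finally {
      writer.close();
    }
  }
}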
