In-memory ball k-means should solve your problem pretty well right now. In-memory streaming k-means followed by ball k-means will take you well beyond your scaled case.
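If it helps, the in-memory path is only a few lines against the 0.9 streaming classes. This is a sketch from memory, so treat the constructor arguments (searcher parameters, sketch size, distance cutoff) as illustrative and check them against the javadoc before leaning on them:

import java.util.List;

import com.google.common.collect.Lists;
import org.apache.mahout.clustering.streaming.cluster.BallKMeans;
import org.apache.mahout.clustering.streaming.cluster.StreamingKMeans;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.neighborhood.ProjectionSearch;
import org.apache.mahout.math.neighborhood.UpdatableSearcher;

public class InMemoryClustering {
  public static UpdatableSearcher cluster(List<Vector> tfidf, int k) {
    // Wrap each named TF-IDF vector as a weighted point, keyed by position.
    List<Centroid> data = Lists.newArrayList();
    for (int i = 0; i < tfidf.size(); i++) {
      data.add(new Centroid(i, tfidf.get(i), 1.0));
    }

    // Streaming pass: one scan over the data that leaves a sketch of
    // roughly k*log(n) weighted centroids instead of n points.
    StreamingKMeans streaming = new StreamingKMeans(
        new ProjectionSearch(new CosineDistanceMeasure(), 3, 10),
        20 * k,      // sketch size, a small multiple of k
        1.0e-6);     // initial distance cutoff; it adapts as points arrive
    streaming.cluster(data);

    // Ball k-means pass over the (now small) sketch for the final k centroids.
    BallKMeans ball = new BallKMeans(
        new ProjectionSearch(new CosineDistanceMeasure(), 3, 10), k, 20);
    return ball.cluster(Lists.newArrayList(streaming));
  }
}

The point is that only the sketch has to fit comfortably in memory, which is why this scales well past a million short documents.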
At 1 million documents, you should be able to do your clustering in a few minutes, depending on whether some of the sparse matrix performance issues got fixed in the clustering code (I think they did).

On Fri, Feb 14, 2014 at 10:50 AM, Scott C. Cote <[email protected]> wrote:

> Right now I'm dealing with only 40,000 documents, but we will eventually
> grow more than 10x (put on the manager hat and say 1 million docs), where
> a doc is usually no longer than 20 or 30 words.
>
> SCott
>
> On 2/14/14 12:46 PM, "Ted Dunning" <[email protected]> wrote:
>
> >Scott,
> >
> >How much data do you have?
> >
> >How much do you plan to have?
> >
> >
> >On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote <[email protected]>
> >wrote:
> >
> >> Hello All,
> >>
> >> I have two questions (Q1, Q2).
> >>
> >> Q1: I am digging into text analysis and am wrestling with competing
> >> analyzed-data maintenance strategies.
> >>
> >> NOTE: my text comes from a very narrowly focused source.
> >>
> >> I am currently crunching the data (batch) using the following scheme:
> >> 1. Load the source text as rows in a MySQL database.
> >> 2. Create named TF-IDF vectors from the source text using a custom
> >> analyzer (stopword removal, lowercasing, standard filter, ...).
> >> 3. Perform canopy clustering and then k-means clustering using an
> >> enhanced cosine metric (derived from a custom metric found in MiA).
> >> 4. Load references to the clusters into SOLR (core1): cluster id and
> >> top terms, with the full cluster data going into Mongo (a cluster is a
> >> doc).
> >> 5. Then load the source text into SOLR (core2) using the same custom
> >> analyzer with appropriate boost, along with the reference cluster id.
> >> NOTE: in all cases, the id of the source text is preserved throughout
> >> the flow, in the vector naming process, etc.
> >>
> >> So now I have a MySQL table, two SOLR cores, and a Mongo document
> >> collection (all tied together with the text id as the common name).
> >>
> >> Now when a new document enters the system after "batch" has been
> >> performed, I use core2 to test the top SOLR matches (the custom
> >> analyzer normalizes the new doc) to find the best cluster within a
> >> tolerance. If a cluster is found, then I place the text in that
> >> cluster; if not, then I start a new group (my word for a cluster not
> >> generated via k-means). Either way, the doc makes its way into both
> >> core1 and core2. I keep track of the number of group
> >> creations/document placements so that if a threshold is crossed, I can
> >> re-batch the data.
> >>
> >> MiA (I think ch. 11) suggests that a user could run the canopy
> >> clustering routine to assign new entries to the clusters (instead of
> >> what I am doing). Does he mean to regenerate a new dictionary,
> >> frequencies, etc. for the corpus for every inbound document? My
> >> observations have been that this has been a very speedy process, but
> >> I'm hoping that I'm just too much of a novice and haven't thought of a
> >> way to simply update the dictionary/frequencies. (This process also
> >> calls for the eventual re-batching of the clusters.)
> >>
> >> While I was very early in my "implement what I have read" process,
> >> Suneel and Ted recommended that I examine the streaming k-means
> >> process. Would that process sidestep much of what I'm doing?
> >>
> >> Q2: I need to really understand the lexicon of my corpus. How do I see
> >> the list of terms that have been omitted due either to being in too
> >> many documents or to not being in enough documents for consideration?
> >>
> >> Please know that I know that I can look at the dictionary to see what
> >> terms are covered. And since my custom analyzer is using the
> >> StandardAnalyzer stop words, those are obvious also. If there isn't an
> >> option to emit the omitted words, where would be the natural place to
> >> capture that data and save it into yet another data store (sequence
> >> file, etc.)?
> >>
> >> Thanks in advance for the guidance,
> >>
> >> SCott
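On the MiA question: you should not need to regenerate the dictionary and frequencies per inbound document. Freeze the dictionary, the IDF weights, and the centroids from the last batch, fold each new doc in against those, and only re-batch when your drift threshold trips (which you are already tracking). A rough sketch of the fold-in step follows; the class and the three frozen maps are hypothetical stand-ins for wherever you keep your batch outputs, and the weighting is simplified relative to what the vectorizer actually computes:

import java.util.Map;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class FoldIn {
  private final Map<String, Integer> dictionary;  // term -> index, frozen at batch time
  private final Map<String, Double> idf;          // term -> IDF weight, frozen at batch time
  private final Map<Integer, Vector> centroids;   // cluster id -> centroid from the batch
  private final CosineDistanceMeasure cosine = new CosineDistanceMeasure();

  FoldIn(Map<String, Integer> dictionary, Map<String, Double> idf,
         Map<Integer, Vector> centroids) {
    this.dictionary = dictionary;
    this.idf = idf;
    this.centroids = centroids;
  }

  // Vectorize against the frozen dictionary; terms unseen at batch time are
  // simply dropped until the next re-batch.
  Vector vectorize(Iterable<String> tokens) {
    Vector v = new RandomAccessSparseVector(dictionary.size());
    for (String t : tokens) {
      Integer ix = dictionary.get(t);
      if (ix != null) {
        v.setQuick(ix, v.getQuick(ix) + idf.get(t));
      }
    }
    return v;
  }

  // Nearest centroid within tolerance, or -1 to signal "start a new group".
  int assign(Vector doc, double tolerance) {
    int best = -1;
    double bestDistance = tolerance;
    for (Map.Entry<Integer, Vector> e : centroids.entrySet()) {
      double d = cosine.distance(doc, e.getValue());
      if (d < bestDistance) {
        bestDistance = d;
        best = e.getKey();
      }
    }
    return best;
  }
}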

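On Q2: as far as I know, the vectorizer doesn't write out the terms it prunes, but you can recover them by diffing. Everything your analyzer emits, minus everything that made it into the dictionary, is exactly what was dropped by the min/max document frequency settings. Assuming you produce vectors with seq2sparse (or anything else that writes the standard dictionary.file-0, a sequence file of Text term to IntWritable index), the diff looks like this:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class OmittedTerms {
  // Read the surviving terms out of dictionary.file-0.
  static Set<String> dictionaryTerms(Configuration conf, Path dict) throws IOException {
    Set<String> terms = new HashSet<String>();
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), dict, conf);
    try {
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        terms.add(term.toString());
      }
    } finally {
      reader.close();
    }
    return terms;
  }

  // allTokens: every distinct token your custom analyzer emitted over the
  // corpus. The set difference is the omitted vocabulary, which you can then
  // write to a sequence file or whatever other store you like.
  static Set<String> omitted(Set<String> allTokens, Set<String> dictionary) {
    Set<String> out = new HashSet<String>(allTokens);
    out.removeAll(dictionary);
    return out;
  }
}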