In-memory ball k-means should solve your problem pretty well right now.
In-memory streaming k-means followed by ball k-means will take you well
beyond your scaled case.

At 1 million documents, you should be able to do your clustering in a few
minutes, depending on whether some of the sparse matrix performance issues
got fixed in the clustering code (I think they did).
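
In case it helps to see the shape of that, here is a rough sketch of the
two-pass flow using the streaming clustering classes that ship with Mahout
0.9.  Class and constructor names are from memory, so treat it as a
starting point rather than something to paste in; "docs" stands for your
named TF-IDF vectors wrapped as Centroids (key = document id, weight = 1):

import java.util.ArrayList;
import java.util.List;

import org.apache.mahout.clustering.streaming.cluster.BallKMeans;
import org.apache.mahout.clustering.streaming.cluster.StreamingKMeans;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.neighborhood.ProjectionSearch;
import org.apache.mahout.math.neighborhood.UpdatableSearcher;

public class StreamingThenBallSketch {

  // docs = one Centroid per document, k = number of final clusters you want
  public static UpdatableSearcher cluster(List<Centroid> docs, int k) {
    // Pass 1: streaming k-means makes a single pass over the data and keeps
    // an in-memory "sketch" of roughly k * log(n) weighted centroids.
    int sketchSize = (int) (k * Math.log(docs.size()));
    // Using your cosine metric here; the stock Mahout examples use
    // (squared) Euclidean.  BruteSearch also works in place of ProjectionSearch.
    UpdatableSearcher sketchIndex =
        new ProjectionSearch(new CosineDistanceMeasure(), 3, 10);
    UpdatableSearcher sketch =
        new StreamingKMeans(sketchIndex, sketchSize).cluster(docs);

    // Pass 2: ball k-means reduces the small weighted sketch to the final k
    // centers; this stays comfortably in memory even at 1M short documents.
    List<Centroid> sketchCentroids = new ArrayList<Centroid>();
    for (Vector v : sketch) {
      sketchCentroids.add((Centroid) v);
    }
    UpdatableSearcher finalIndex =
        new ProjectionSearch(new CosineDistanceMeasure(), 3, 10);
    return new BallKMeans(finalIndex, k, 20).cluster(sketchCentroids);
  }
}

Each Centroid keeps your document id as its key, so the naming trick you
already rely on survives the clustering.  At 40,000 short documents you
could probably skip the streaming pass entirely and hand the raw vectors
straight to BallKMeans.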




On Fri, Feb 14, 2014 at 10:50 AM, Scott C. Cote <[email protected]> wrote:

> Right now I'm dealing with only 40,000 documents, but we will eventually
> grow more than 10x (put on the manager hat and say 1 million docs), where a
> doc is usually no longer than 20 or 30 words.
>
> SCott
>
> On 2/14/14 12:46 PM, "Ted Dunning" <[email protected]> wrote:
>
> >Scott,
> >
> >How much data do you have?
> >
> >How much do you plan to have?
> >
> >
> >
> >On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote <[email protected]>
> >wrote:
> >
> >> Hello All,
> >>
> >> I have two questions (Q1, Q2).
> >>
> >> Q1: I am digging into text analysis and am wrestling with competing
> >> analyzed-data maintenance strategies.
> >>
> >> NOTE: my text comes from a very narrowly focused source.
> >>
> >> - I am currently crunching the data (batch) using the following scheme:
> >> 1. Load source text as rows in a MySQL database.
> >> 2. Create named TF-IDF vectors from the source text using a custom
> >> analyzer (stop-word removal, lowercasing, standard filter, ...).
> >> 3. Perform canopy clustering and then k-means clustering using an
> >> enhanced cosine metric (derived from a custom metric found in MiA).
> >> 4. Load references to the clusters into SOLR (core1) - cluster id, top
> >> terms - along with the full cluster data into Mongo (a cluster is a doc).
> >> 5. Then load the source text into SOLR (core2) using the same custom
> >> analyzer with appropriate boost, along with the reference cluster id.
> >> NOTE: in all cases, the id of the source text is preserved throughout
> >> the flow (in the vector naming process, etc.).
> >>
> >> So now I have a MySQL table, two SOLR cores, and a Mongo document
> >> collection (all tied together with the text id as the common name).
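
Regarding the custom analyzer in step 2: for anyone following along, a
minimal Lucene analyzer along those lines might look like the sketch below.
I am assuming the Lucene 4.x API that Mahout 0.9 builds against, the class
name here is made up, and the real analyzer presumably does more than this:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class NarrowDomainAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName,
                                                   Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
    TokenStream stream = new StandardFilter(Version.LUCENE_46, source); // std filter
    stream = new LowerCaseFilter(Version.LUCENE_46, stream);            // lowercase
    stream = new StopFilter(Version.LUCENE_46, stream,
        StandardAnalyzer.STOP_WORDS_SET);                               // stop words
    return new TokenStreamComponents(source, stream);
  }
}
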
> >>
> >> - Now, when a new document enters the system after "batch" has been
> >> performed, I use core2 to test the top SOLR matches (the custom analyzer
> >> normalizes the new doc) to find the best cluster within a tolerance.  If
> >> a cluster is found, I place the text in that cluster; if not, I start a
> >> new group (my word for a cluster not generated via k-means).  Either
> >> way, the doc makes its way into both core1 and core2.  I keep track of
> >> the number of group creations/document placements so that if a threshold
> >> is crossed, I can re-batch the data.
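
Incidentally, that cluster-or-new-group decision is close in spirit to what
streaming k-means does with its distance cutoff (there the decision is
probabilistic rather than a hard threshold).  In plain Mahout vector terms,
with no SOLR involved, the assignment step is roughly the sketch below; the
centroid map and the threshold are stand-ins for whatever you keep in Mongo:

import java.util.Map;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

public class NearestClusterAssigner {

  private final DistanceMeasure measure = new CosineDistanceMeasure();
  private final Map<String, Vector> centroids;  // cluster id -> centroid
  private final double threshold;               // "within a tolerance"

  public NearestClusterAssigner(Map<String, Vector> centroids,
                                double threshold) {
    this.centroids = centroids;
    this.threshold = threshold;
  }

  // Returns the id of the closest existing cluster, or null if the new
  // document is farther than the threshold from all of them (new group).
  public String assign(Vector newDocTfidf) {
    String best = null;
    double bestDistance = Double.MAX_VALUE;
    for (Map.Entry<String, Vector> e : centroids.entrySet()) {
      double d = measure.distance(e.getValue(), newDocTfidf);
      if (d < bestDistance) {
        bestDistance = d;
        best = e.getKey();
      }
    }
    return bestDistance <= threshold ? best : null;
  }
}
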
> >>
> >> MiA (I think ch. 11) suggests that a user could run the canopy
> >> clustering routine to assign new entries to the clusters (instead of
> >> what I am doing).  Does he mean to regenerate a new dictionary,
> >> frequencies, etc., for the corpus for every inbound document?  My
> >> observations have been that this has been a very speedy process, but I'm
> >> hoping that I'm just too much of a novice and haven't thought of a way
> >> to simply update the dictionary/frequencies.  (This process also calls
> >> for the eventual re-batching of the clusters.)
> >>
> >> While I was very early in my "implement what I have read" process,
> >> Suneel and Ted recommended that I examine the streaming k-means
> >> process.  Would that process sidestep much of what I'm doing?
> >>
> >> Q2: I really need to understand the lexicon of my corpus.  How do I see
> >> the list of terms that have been omitted, due either to being in too
> >> many documents or to not being in enough documents for consideration?
> >>
> >> Please know that I know I can look at the dictionary to see what terms
> >> are covered.  And since my custom analyzer is using the StandardAnalyzer
> >> stop words, those are obvious also.  If there isn't an option to emit
> >> the omitted words, where would be the natural place to capture that data
> >> and save it into yet another data store (sequence file, etc.)?
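
On Q2: if the dictionary you mean is the one seq2sparse writes, it is just
a SequenceFile of term -> index pairs, so one cheap way to get at the
omitted words is to read that dictionary into a set and diff it against the
tokens your analyzer actually emits.  A sketch of the reading side,
assuming the standard dictionary.file-0 layout (the class name is made up):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DictionaryReader {

  // Reads the term -> index dictionary written by seq2sparse
  // (e.g. <output>/dictionary.file-0) into a set of retained terms.
  public static Set<String> readTerms(Configuration conf, Path dictionary)
      throws IOException {
    Set<String> terms = new HashSet<String>();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictionary, conf);
    try {
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        terms.add(term.toString());
      }
    } finally {
      reader.close();
    }
    return terms;
  }
}

Any token the analyzer produces that is not in that set was dropped
somewhere along the way (stop word, minimum support, or document-frequency
pruning), which is probably the list you are after.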
> >>
> >> Thanks in Advance for the Guidance,
> >>
> >> SCott
> >>
> >>
> >>
>
>
>
