Hello All, I have two questions (Q1, Q2).
Q1: I'm digging into text analysis and wrestling with competing strategies for maintaining analyzed data. NOTE: my text comes from a very narrowly focused source.

I'm currently crunching the data in batch using the following scheme:

1. Load the source text as rows in a MySQL database.
2. Create named TF-IDF vectors from the source text using a custom analyzer (stopword removal, lowercasing, standard filter, etc.); a rough sketch of the analyzer is in the P.S. below.
3. Run Canopy clustering and then k-means clustering using an enhanced cosine metric (derived from a custom metric found in MiA); also sketched in the P.S.
4. Load cluster references (cluster id, top terms) into Solr (core1) and the full cluster data into Mongo (one cluster per document).
5. Load the source text into Solr (core2) using the same custom analyzer with appropriate boosts, along with the reference cluster id.

NOTE: in all cases the id of the source text is preserved throughout the flow (in the vector naming process, etc.), so I now have a MySQL table, two Solr cores, and a Mongo document collection, all tied together by the text id as the common name.

When a new document enters the system after the batch has been performed, I query core2 for the top Solr matches (the custom analyzer normalizes the new doc) to find the best cluster within a tolerance (sketched in the P.S.). If a cluster is found, I place the text in that cluster; if not, I start a new "group" (my word for a cluster not generated via k-means). Either way, the doc makes its way into both core1 and core2. I keep track of the number of group creations/document placements so that if a threshold is crossed I can re-batch the data.

MiA (I think ch. 11) suggests that a user could run the canopy clustering routine to assign new entries to the existing clusters (instead of what I am doing). Does the author mean regenerating the dictionary, frequencies, etc. for the corpus for every inbound document? My current process has been very speedy, but I'm hoping I'm just too much of a novice and have missed a way to simply update the dictionary/frequencies. (That approach would also call for eventually re-batching the clusters.) When I was very early in my "implement what I have read" process, Suneel and Ted recommended that I examine Streaming k-means. Would that process sidestep much of what I'm doing?

Q2: I need to really understand the lexicon of my corpus. How do I see the list of terms that were omitted either because they appear in too many documents or because they don't appear in enough documents to be considered? I know I can look at the dictionary to see which terms are covered, and since my custom analyzer uses the StandardAnalyzer stop words, those are obvious as well. If there isn't an option to emit the omitted words, where would be the natural place to capture that data and save it into yet another data store (sequence file, etc.)? I've sketched the kind of workaround I had in mind in the P.S.

Thanks in advance for the guidance,
Scott
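P.S. A few minimal sketches of the pieces referenced above. These are not my exact code: class, core, and field names, version constants, and thresholds are placeholders, and I'm assuming Lucene/Solr 4.x and the current Mahout APIs.

1) The custom analyzer (tokenizer -> standard filter -> lowercase -> StandardAnalyzer stop words):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Tokenize, apply the standard filter, lowercase, then drop StandardAnalyzer's stop words.
public final class SourceTextAnalyzer extends Analyzer {
  // Match this to whatever Lucene version the Solr/Mahout build uses.
  private static final Version MATCH_VERSION = Version.LUCENE_45;

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(MATCH_VERSION, reader);
    TokenStream stream = new StandardFilter(MATCH_VERSION, tokenizer);
    stream = new LowerCaseFilter(MATCH_VERSION, stream);
    stream = new StopFilter(MATCH_VERSION, stream, StandardAnalyzer.STOP_WORDS_SET);
    return new TokenStreamComponents(tokenizer, stream);
  }
}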
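2) The shape of the enhanced cosine metric, plugged into canopy/k-means via the distance-measure option; the MiA-derived re-weighting itself is elided here:

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.Vector;

// Cosine distance plus a corpus-specific adjustment (details elided).
public class EnhancedCosineDistanceMeasure extends CosineDistanceMeasure {
  @Override
  public double distance(Vector v1, Vector v2) {
    double base = super.distance(v1, v2);
    // ... domain-specific re-weighting goes here ...
    return base;
  }
  // Note: the distance(centroidLengthSquare, centroid, v) overload may need the same adjustment.
}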
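3) Assigning an incoming document via core2 (SolrJ; the URL, score tolerance, and cluster_id field are illustrative, and the raw text is used as the query here only for brevity):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

public class ClusterAssigner {
  private static final float SCORE_TOLERANCE = 0.4f; // tuned empirically

  // Returns the cluster id of the best core2 match, or null if below tolerance
  // (in which case the caller starts a new "group").
  public String findCluster(String newDocText) throws SolrServerException {
    HttpSolrServer core2 = new HttpSolrServer("http://localhost:8983/solr/core2");
    SolrQuery query = new SolrQuery(newDocText); // analyzed by the same custom analyzer on the core2 field
    query.setFields("id", "cluster_id", "score");
    query.setRows(1);
    QueryResponse response = core2.query(query);
    SolrDocumentList results = response.getResults();
    if (results.isEmpty() || results.getMaxScore() == null || results.getMaxScore() < SCORE_TOLERANCE) {
      return null;
    }
    return (String) results.get(0).getFieldValue("cluster_id");
  }
}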
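4) For Q2, the workaround I had in mind if there is no built-in option to emit the pruned terms: run the same analyzer over the corpus to build the full term set (not shown), diff it against dictionary.file-0 (which, if I read the seq2sparse output right, is a SequenceFile of Text term -> IntWritable index), and persist the leftovers to another sequence file:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class OmittedTermsDump {

  // Load the terms that survived pruning from seq2sparse's dictionary.file-0.
  public static Set<String> loadDictionary(Configuration conf, Path dictPath) throws IOException {
    Set<String> kept = new HashSet<String>();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictPath, conf);
    try {
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        kept.add(term.toString());
      }
    } finally {
      reader.close();
    }
    return kept;
  }

  // Persist the terms the analyzer emitted but the dictionary dropped (too frequent / too rare).
  public static void writeOmitted(Configuration conf, Set<String> analyzedTerms, Set<String> kept, Path out)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, NullWritable.class);
    try {
      for (String term : analyzedTerms) {
        if (!kept.contains(term)) {
          writer.append(new Text(term), NullWritable.get());
        }
      }
    } finally {
      writer.close();
    }
  }
}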
