Scott,

How much data do you have?

How much do you plan to have?



On Fri, Feb 14, 2014 at 8:04 AM, Scott C. Cote <[email protected]> wrote:

> Hello All,
>
> I have two questions (Q1, Q2).
>
> Q1: I am digging into text analysis and am wrestling with competing
> strategies for maintaining the analyzed data.
>
> NOTE: my text comes from a very narrowly focused source.
>
> - I am currently crunching the data in batch using the following scheme:
> 1. Load source text as rows in a mysql database.
> 2. Create named TF-IDF vectors from the source text using a custom analyzer
> (stopword removal, lowercasing, standard filter, etc.; a rough sketch of the
> analyzer appears after this list)
> 3. Perform canopy clustering and then k-means clustering using an enhanced
> cosine metric (derived from a custom metric found in MiA)
> 4. Load references to the clusters into SOLR (core1) - cluster id and top
> terms - along with the full cluster data into Mongo (a cluster is a doc)
> 5. Then load the source text into SOLR (core2) using the same custom
> analyzer with an appropriate boost, along with the reference cluster id
> NOTE: in all cases, the id of the source text is preserved throughout the
> flow in the vector naming process, etc.
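>
> For reference, the custom analyzer in step 2 is roughly the sketch below.
> The class name and exact filter chain are simplified placeholders, and it
> assumes Lucene 4.x (where createComponents still takes a Reader):
>
>   import java.io.Reader;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.core.LowerCaseFilter;
>   import org.apache.lucene.analysis.core.StopFilter;
>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>   import org.apache.lucene.analysis.standard.StandardFilter;
>   import org.apache.lucene.analysis.standard.StandardTokenizer;
>   import org.apache.lucene.util.Version;
>
>   public class NarrowSourceAnalyzer extends Analyzer {
>     private static final Version V = Version.LUCENE_45;
>
>     @Override
>     protected TokenStreamComponents createComponents(String field, Reader reader) {
>       StandardTokenizer tokenizer = new StandardTokenizer(V, reader);
>       TokenStream stream = new StandardFilter(V, tokenizer);   // standard filter
>       stream = new LowerCaseFilter(V, stream);                 // lowercase
>       stream = new StopFilter(V, stream,
>           StandardAnalyzer.STOP_WORDS_SET);                    // stopword removal
>       return new TokenStreamComponents(tokenizer, stream);
>     }
>   }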
>
> So now I have a mysql table, two SOLR cores, and a Mongo document
> collection (all tied together with the text id as the common name).
>
> - Now, when a new document enters the system after "batch" has been
> performed, I use core2 to test the top SOLR matches (the custom analyzer
> normalizes the new doc) to find the best cluster within a tolerance.  If a
> cluster is found, then I place the text in that cluster; if not, then I
> start a new group (my word for a cluster not generated via k-means).  Either
> way, the doc makes its way into both cores (core1 and core2). I keep track
> of the number of group creations/document placements so that if a threshold
> is crossed, I can re-batch the data.
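>
> Concretely, the core2 lookup is roughly the following (SolrJ 4.x; the
> "text" and "cluster_id" field names, the URL, and the score threshold are
> simplified placeholders for what I actually use):
>
>   import org.apache.solr.client.solrj.SolrQuery;
>   import org.apache.solr.client.solrj.impl.HttpSolrServer;
>   import org.apache.solr.client.solrj.response.QueryResponse;
>   import org.apache.solr.common.SolrDocument;
>   import org.apache.solr.common.SolrDocumentList;
>
>   public class ClusterLookup {
>     private static final float TOLERANCE = 0.4f;  // placeholder threshold
>
>     /** Returns the best matching cluster id, or null if nothing clears the tolerance. */
>     public static String bestClusterId(String newDocText) throws Exception {
>       HttpSolrServer core2 = new HttpSolrServer("http://localhost:8983/solr/core2");
>       SolrQuery q = new SolrQuery();
>       // the field's analyzer (same custom analyzer) normalizes the query text;
>       // real code should escape special query characters first
>       q.setQuery("text:(" + newDocText + ")");
>       q.setRows(5);
>       q.setIncludeScore(true);
>
>       QueryResponse rsp = core2.query(q);
>       SolrDocumentList hits = rsp.getResults();
>       if (!hits.isEmpty() && hits.getMaxScore() >= TOLERANCE) {
>         SolrDocument top = hits.get(0);
>         return (String) top.getFieldValue("cluster_id");  // join back via core1/Mongo
>       }
>       return null;  // caller starts a new "group" instead
>     }
>   }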
>
> MiA (I think ch. 11) suggests that a user could run the canopy clustering
> routine to assign new entries to the clusters (instead of what I am doing).
> Does the author mean regenerating the dictionary, frequencies, etc. for the
> whole corpus for every inbound document?  My observations have been that
> this has been a very speedy process, but I'm hoping that I'm just too much
> of a novice and haven't thought of a way to simply update the
> dictionary/frequencies.  (This process also calls for the eventual
> re-batching of the clusters.)
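>
> What I have in mind by "simply update" is something like the sketch below:
> vectorize one inbound document against the existing dictionary and document
> frequencies from the batch run instead of regenerating them.  The paths, the
> "text" field name, and the use of Mahout's TFIDF weight class are my
> assumptions about the seq2sparse output layout (dictionary.file-0 as
> Text/IntWritable, df-count as IntWritable/LongWritable):
>
>   import java.io.StringReader;
>   import java.util.HashMap;
>   import java.util.Map;
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.SequenceFile;
>   import org.apache.hadoop.io.Text;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>   import org.apache.mahout.math.NamedVector;
>   import org.apache.mahout.math.RandomAccessSparseVector;
>   import org.apache.mahout.math.Vector;
>   import org.apache.mahout.vectorizer.TFIDF;
>
>   public class IncrementalVectorizer {
>     private final Map<String, Integer> dictionary = new HashMap<String, Integer>(); // term -> index
>     private final Map<Integer, Long> docFreqs = new HashMap<Integer, Long>();       // index -> df
>     private final Analyzer analyzer;
>     private final int numDocs;  // corpus size from the batch run
>
>     public IncrementalVectorizer(Analyzer analyzer, Configuration conf, Path dictFile,
>                                  Path dfFile, int numDocs) throws Exception {
>       this.analyzer = analyzer;
>       this.numDocs = numDocs;
>       FileSystem fs = FileSystem.get(conf);
>       // load the batch dictionary once
>       SequenceFile.Reader dict = new SequenceFile.Reader(fs, dictFile, conf);
>       Text term = new Text();
>       IntWritable index = new IntWritable();
>       while (dict.next(term, index)) {
>         dictionary.put(term.toString(), index.get());
>       }
>       dict.close();
>       // load the batch document frequencies once
>       SequenceFile.Reader dfs = new SequenceFile.Reader(fs, dfFile, conf);
>       IntWritable idx = new IntWritable();
>       LongWritable df = new LongWritable();
>       while (dfs.next(idx, df)) {
>         docFreqs.put(idx.get(), df.get());
>       }
>       dfs.close();
>     }
>
>     /** Builds a named TF-IDF vector for one new document from the batch dictionary/df counts. */
>     public Vector vectorize(String docId, String text) throws Exception {
>       Map<Integer, Integer> termFreqs = new HashMap<Integer, Integer>();
>       TokenStream ts = analyzer.tokenStream("text", new StringReader(text));
>       CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
>       ts.reset();
>       int length = 0;
>       while (ts.incrementToken()) {
>         Integer idx = dictionary.get(termAtt.toString());
>         if (idx != null) {                    // terms unseen in the batch are simply dropped
>           Integer tf = termFreqs.get(idx);
>           termFreqs.put(idx, tf == null ? 1 : tf + 1);
>           length++;
>         }
>       }
>       ts.end();
>       ts.close();
>       TFIDF weighter = new TFIDF();
>       Vector v = new RandomAccessSparseVector(dictionary.size());
>       for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
>         long df = docFreqs.containsKey(e.getKey()) ? docFreqs.get(e.getKey()) : 1L;
>         v.setQuick(e.getKey(), weighter.calculate(e.getValue(), (int) df, length, numDocs));
>       }
>       return new NamedVector(v, docId);       // keep the text id, as in the batch flow
>     }
>   }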
>
> While I was still very early in my "implement what I have read" process,
> Suneel and Ted recommended that I examine the streaming k-means process.
> Would that process sidestep much of what I'm doing?
>
> Q2: I need to really understand the lexicon of my corpus.  How do I see the
> list of terms that have been omitted, either because they appear in too many
> documents or because they do not appear in enough documents to be
> considered?
>
> Please know that I know I can look at the dictionary to see which terms are
> covered.  And since my custom analyzer uses the StandardAnalyzer stop words,
> those are obvious as well.  If there isn't an option to emit the omitted
> words, where would be the natural place to capture that data and save it
> into yet another data store (sequence file, etc.)?
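>
> If there is no built-in option, my current thought is to capture the pruned
> terms myself by diffing the analyzer output against dictionary.file-0 and
> then writing the difference wherever is convenient.  A rough sketch,
> assuming dictionary.file-0 is a (Text, IntWritable) sequence file and using
> a placeholder "text" field name:
>
>   import java.io.StringReader;
>   import java.util.HashSet;
>   import java.util.Set;
>   import java.util.TreeSet;
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.IntWritable;
>   import org.apache.hadoop.io.SequenceFile;
>   import org.apache.hadoop.io.Text;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>
>   public class OmittedTerms {
>
>     /** Terms the analyzer emits that never made it into the dictionary,
>         i.e. pruned by the DF bounds (stopwords never reach this point). */
>     public static Set<String> find(Analyzer analyzer, Iterable<String> docs,
>                                    Path dictionaryFile, Configuration conf) throws Exception {
>       Set<String> kept = new HashSet<String>();
>       SequenceFile.Reader reader =
>           new SequenceFile.Reader(FileSystem.get(conf), dictionaryFile, conf);
>       Text term = new Text();
>       IntWritable index = new IntWritable();
>       while (reader.next(term, index)) {
>         kept.add(term.toString());
>       }
>       reader.close();
>
>       Set<String> omitted = new TreeSet<String>();
>       for (String doc : docs) {
>         TokenStream ts = analyzer.tokenStream("text", new StringReader(doc));
>         CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
>         ts.reset();
>         while (ts.incrementToken()) {
>           String t = termAtt.toString();
>           if (!kept.contains(t)) {
>             omitted.add(t);
>           }
>         }
>         ts.end();
>         ts.close();
>       }
>       return omitted;   // write to a sequence file / extra mysql table as needed
>     }
>   }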
>
> Thanks in Advance for the Guidance,
>
> SCott
>
>
>
