Right, I was able to avoid the cardinality exception. This was the approach:
Batch:

1. Had the articles stored as XML, each XML file with 500 articles. Used some custom code to convert those XML files to SequenceFiles.
2. Created vectors from these sequence files using the command-line tool "SparseVectorsFromSequenceFiles". This created a dictionary file as well.
3. Clustered and classified the articles using KMeansDriver.

Incremental:

1. Converted the new articles to SequenceFiles with custom code.
2. Generated vectors using SparseVectorsFromSequenceFiles. A new, smaller dictionary was created as well.
3. Created a MapReduce job to rewrite those vectors: the cardinality was adjusted to match the cardinality of the cluster centroids generated in the batch phase, and the vector indexes, which referred to the new dictionary, were translated to the dictionary created in the batch phase (rough sketches of the dictionary loading and the remapping are below).
4. Used a modified version of the method 'emitPointToNearestCluster' from the KMeansClusterer class to find the cluster for each new article.

The new articles were successfully classified against the clusters, but there are some big issues with this approach:

- The number of clusters remains fixed, because of k-means. So if new articles arrive about a new subject, they are forced to fit into the existing clusters. One option is to use a threshold: if a new article's distance to its nearest cluster is greater than the threshold, create a new cluster on the fly (see the classification sketch below). That would mean dealing with two sets of clusters, the one from the batch phase and the ones from the various incremental calculations.
- Dictionary processing: every incremental vectorization outputs a dictionary that can contain terms the full dictionary doesn't have. The vector-rewriting approach must deal with this difference, which implies maintaining an incremental dictionary for the new terms. Also, the fact that new terms are introduced that weren't present during the initial cluster calculation certainly won't help produce a high-quality incremental clustering.
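For reference, here is a rough sketch of how the dictionaries can be loaded before the rewrite. It assumes the dictionary files are the Text -> IntWritable SequenceFiles that SparseVectorsFromSequenceFiles writes (dictionary.file-0); the class name is just a placeholder:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Loads a seq2sparse dictionary (term -> index) into memory. */
public final class DictionaryLoader {

  private DictionaryLoader() {
  }

  public static Map<String, Integer> load(Configuration conf, Path dictionaryFile)
      throws IOException {
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    // Assumption: the dictionary is a SequenceFile with Text keys (terms)
    // and IntWritable values (term indexes), as written by seq2sparse.
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), dictionaryFile, conf);
    try {
      Text term = new Text();
      IntWritable index = new IntWritable();
      while (reader.next(term, index)) {
        dictionary.put(term.toString(), index.get());
      }
    } finally {
      reader.close();
    }
    return dictionary;
  }
}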
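The rewrite itself, stripped of the MapReduce wiring, boils down to the sketch below. It assumes both dictionaries have already been loaded into maps (the incremental one inverted so a term can be looked up by index) and uses the plain Vector API; "VectorRemapper" is a made-up name, not a Mahout class:

import java.util.Iterator;
import java.util.Map;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

/**
 * Rewrites a vector built against the incremental dictionary so that it uses
 * the batch dictionary's indexes and cardinality. Sketch only: the MapReduce
 * job wrapping this logic and the dictionary loading are omitted.
 */
public class VectorRemapper {

  private final Map<Integer, String> incrementalIndexToTerm; // inverted incremental dictionary
  private final Map<String, Integer> termToBatchIndex;       // batch dictionary
  private final int batchCardinality;                        // size of the batch dictionary

  public VectorRemapper(Map<Integer, String> incrementalIndexToTerm,
                        Map<String, Integer> termToBatchIndex,
                        int batchCardinality) {
    this.incrementalIndexToTerm = incrementalIndexToTerm;
    this.termToBatchIndex = termToBatchIndex;
    this.batchCardinality = batchCardinality;
  }

  public Vector remap(Vector incremental) {
    // The cardinality must match the batch centroids, otherwise the distance
    // measure throws the CardinalityException discussed in this thread.
    Vector remapped = new RandomAccessSparseVector(batchCardinality);
    for (Iterator<Vector.Element> it = incremental.iterateNonZero(); it.hasNext();) {
      Vector.Element e = it.next();
      String term = incrementalIndexToTerm.get(e.index());
      Integer batchIndex = termToBatchIndex.get(term);
      if (batchIndex != null) {
        remapped.setQuick(batchIndex, e.get());
      }
      // Terms unknown to the batch dictionary are simply dropped here; keeping
      // them would require the incremental dictionary mentioned above.
    }
    return remapped;
  }
}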
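And the classification step, in the spirit of emitPointToNearestCluster but reduced to its core, with the threshold idea from the first issue bolted on. The centroids are assumed to have been read from the clusters-n directory into plain Vectors, and newClusterThreshold is a made-up parameter, not something Mahout provides:

import java.util.List;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

/** Finds the nearest batch centroid for a remapped article vector. */
public class IncrementalClassifier {

  /**
   * Returns the index of the nearest centroid, or -1 if even the nearest one
   * is farther away than newClusterThreshold, signalling the caller to start
   * a new incremental cluster instead of forcing a fit.
   */
  public int nearestCluster(Vector point,
                            List<Vector> centroids,
                            DistanceMeasure measure,
                            double newClusterThreshold) {
    int best = -1;
    double bestDistance = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = measure.distance(centroids.get(i), point);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return bestDistance <= newClusterThreshold ? best : -1;
  }
}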
Gustavo

On Wed, Nov 24, 2010 at 5:28 PM, Jeff Eastman <[email protected]> wrote:

> It likely means that your cluster's cardinality is different from your input
> vector's cardinality. If your input vectors are term vectors computed from
> Lucene, then this could occur if a new term is introduced, increasing the
> size of the input vector. I can also see some problems if you are using
> seq2sparse for just the new vector, as that builds a new term dictionary.
> Also, TF-IDF wants to analyze the term frequencies over the entire corpus,
> which won't work incrementally.
>
> I think you can fool the clustering by setting the sizes of your input
> vectors to be max_int, but that won't help you with the other issues above.
> Our text processing algorithms will take some adjustments to handle this
> preprocessing correctly.
>
> -----Original Message-----
> From: Edoardo Tosca [mailto:[email protected]]
> Sent: Wednesday, November 24, 2010 9:16 AM
> To: [email protected]
> Subject: Re: (Near) Realtime clustering
>
> Thank you,
> I am trying to add new documents but I'm stuck with an exception.
> Basically I copied some code from KMeansDriver, and I execute the
> clusterDataSeq method. I have seen that clusterDataSeq accepts a clusterIn
> Path parameter that should be the path that contains the already generated
> clusters. Am I right?
>
> When it tries to emitPointToNearestCluster, and in particular when it
> calculates the distance, a CardinalityException is thrown: what does it
> mean?
>
> BTW I'm creating the vectors by getting documents from a Lucene index.
>
> > On Wed, Nov 24, 2010 at 5:00 PM, Jeff Eastman <[email protected]> wrote:
> >
> > Note that the clustering drivers all have a static clusterData() method to
> > run just the clustering (classification) of points. You would have to call
> > this from your own driver as the current CLI does not offer just this
> > option, but something like this should work:
> >
> > - Input documents are vectorized into sequence files which have timestamps
> > so you know when to delete documents which have aged
> > - Run full clustering over all remaining documents to produce clusters-n
> > and clusteredPoints. This is the batch job over the entire corpus.
> > - As new documents are received, use the clusterData() method to classify
> > them using the previous clusters-n. This can be run using -xm sequential
> > so it is all done in memory.
> > - Periodically, add all the new documents to the corpus, delete any which
> > have aged out of your time window, and start over
> >
> > -----Original Message-----
> > From: Divya [mailto:[email protected]]
> > Sent: Tuesday, November 23, 2010 6:32 PM
> > To: [email protected]
> > Subject: RE: (Near) Realtime clustering
> >
> > Hi,
> >
> > I also have a similar requirement. Can someone please provide me the
> > steps of the hybrid approach?
> >
> > Regards,
> > Divya
> >
> > -----Original Message-----
> > From: Jeff Eastman [mailto:[email protected]]
> > Sent: Wednesday, November 24, 2010 2:19 AM
> > To: [email protected]
> > Subject: RE: (Near) Realtime clustering
> >
> > I'd suggest a hybrid approach: run the batch clustering periodically over
> > the entire corpus to update the cluster centers, and then use those
> > centers for real-time clustering (classification) of new documents as
> > they arrive. You can use the sequential execution mode of the clustering
> > job to classify documents in real time. This will suffer from the fact
> > that new news topics will not immediately materialize new clusters until
> > the batch job runs again.
> >
> > -----Original Message-----
> > From: Gustavo Fernandes [mailto:[email protected]]
> > Sent: Tuesday, November 23, 2010 9:58 AM
> > To: [email protected]
> > Subject: (Near) Realtime clustering
> >
> > Hello, we have a mission to implement a system to cluster news articles
> > in near real-time mode. We have a large number of articles (millions),
> > and we started using k-means to create clusters based on a fixed value of
> > "k". The problem is that we have a constant incoming flow of news
> > articles and we can't afford to rely on a batch process; we need to be
> > able to present users with clustered articles as soon as they arrive in
> > our database. So far our clusters are saved into a SequenceFile, as
> > normally output by the k-means driver. What would be the recommended way
> > of approaching this problem with Mahout? Is it possible to manipulate the
> > generated clusters and incrementally add new articles to them, or even
> > form new clusters without incurring the penalty of recalculating for
> > every vector again? Is starting with k-means the right way? What would be
> > the right combination of algorithms to provide incremental and fast
> > clustering calculation?
> >
> > TIA,
> > Gustavo
