Hi Deejay,

2012/6/8 Deejay <[email protected]>
> Hi all,
>
> I recently discovered Apache UIMA, and it looks like a very large project! I
> was hoping that someone more experienced with it than I could comment on
> whether there are parts of the project that could help with my problem.
>
> I need to go over many millions of objects (Protocol Buffers in HBase, as it
> happens), and cluster them according to their similarity. Once each cluster is
> formed, I need to 'collapse' each property of the objects to find the most
> prevalent value. After this, the collapsed object will be added to a Solr
> index.

I think you could take advantage of UIMA Collection Processing Engine [1],
particularly by using a UIMA-AS based architecture since it looks like you are
handling huge collections [2].
Apart from the specific algorithms used for clustering / collapsing, which
would define the UIMA pipeline implementations/configurations, you could use
SolrCas [3] to finally write data in the index.

> Would any part of Apache UIMA be useful for the clustering or collapsing, or
> have I misunderstood the nature of the project?

HTH
Tommaso

[1] : http://uima.apache.org/d/uimaj-2.4.0/tutorials_and_users_guides.html#ugr.tug.cpe
[2] : http://uima.apache.org/doc-uimaas-what.html
[3] : http://uima.apache.org/sandbox.html#solrcas.consumer
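
P.S. To make the 'collapse' step concrete, here is a minimal sketch in plain Python. It is independent of UIMA and only illustrates the "most prevalent value per property" idea. It assumes each clustered object can be viewed as a flat mapping from property name to a hashable value; the function name `collapse_cluster` and the sample properties are hypothetical, not part of any UIMA or Solr API.

```python
from collections import Counter


def collapse_cluster(cluster):
    """Collapse a cluster of objects into one object.

    `cluster` is a list of dicts (assumed shape: property name -> hashable
    value). For every property, the value that occurs most often across
    the cluster is kept. Missing or None values are ignored.
    """
    counts_by_property = {}
    for obj in cluster:
        for name, value in obj.items():
            if value is None:
                continue  # a missing value should not win the vote
            counts_by_property.setdefault(name, Counter())[value] += 1

    # most_common(1) returns a list holding the single (value, count) pair
    # with the highest count.
    return {
        name: counts.most_common(1)[0][0]
        for name, counts in counts_by_property.items()
    }


# Hypothetical cluster of three near-duplicate records.
cluster = [
    {"name": "Acme", "city": "Leeds"},
    {"name": "Acme", "city": "London"},
    {"name": "ACME", "city": "Leeds"},
]
collapsed = collapse_cluster(cluster)
```

In a UIMA pipeline, logic of this kind would probably sit in the component that runs once a cluster is complete, just before the collapsed object is handed to the Solr writer; how ties and nested properties are handled is a design choice left open here.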
