My 2c:

Start with getting all the relevant texts into one place, namely a search index. A good prototyping tool would be Solr. You will need something like ManifoldCF:
http://incubator.apache.org/connectors/
for collecting documents from the various environments.

Here is Erik Hatcher's "Rapid Prototyping With Solr":
http://www.slideshare.net/erikhatcher/rapid-prototyping-with-solr-4312681

Once you get enough stuff into Solr, you will be able to search it easily. Next, you can start using Mahout:
http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/

I would go for an iterative design, first taking a small sample of documents from each environment, trying the systems out, and then scaling.

Good luck,
Yuval
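A rough sketch of the "small sample from each environment" step, for illustration only. It assumes documents have already been extracted into plain dictionaries (in practice a connector such as ManifoldCF would do the collecting), and it assumes a Solr instance whose JSON update handler accepts an array of documents. The field names (`id`, `source`, `title`, `text`), the core name `prototype`, and the URL are all hypothetical and would have to match your own schema and installation.

```python
import json
import urllib.request
from collections import defaultdict

# Hypothetical sample documents; "source" names the environment they came from.
DOCS = [
    {"id": "svn-1", "source": "subversion", "title": "SRS v1", "text": "Old requirements spec"},
    {"id": "svn-2", "source": "subversion", "title": "SRS v2", "text": "Newer requirements spec"},
    {"id": "wiki-1", "source": "wiki", "title": "SRS template", "text": "Organizational template"},
    {"id": "fs-1", "source": "fileserver", "title": "Design notes", "text": "Functional design description"},
    {"id": "fs-2", "source": "fileserver", "title": "Manual", "text": "Process definition"},
]


def sample_per_source(docs, n):
    """Keep at most n documents from each source, preserving input order."""
    taken = defaultdict(int)
    sample = []
    for doc in docs:
        if taken[doc["source"]] < n:
            taken[doc["source"]] += 1
            sample.append(doc)
    return sample


def build_update_payload(docs):
    """Serialize the documents as a UTF-8 encoded JSON array."""
    return json.dumps(docs).encode("utf-8")


def post_to_solr(payload, url="http://localhost:8983/solr/prototype/update?commit=true"):
    """POST the payload to a Solr update handler (URL and core name are assumptions)."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    sample = sample_per_source(DOCS, 1)
    payload = build_update_payload(sample)
    print(len(sample), "documents,", len(payload), "bytes ready to index")
    # post_to_solr(payload)  # uncomment once a Solr instance is running
```

Once the small sample searches well, the same functions can be run with a larger `n` per source before moving on to clustering with Mahout.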
On Tue, Nov 15, 2011 at 9:12 AM, Burcu Buyukkagnici <[email protected]> wrote:

> Hi,
>
> I'm new to this community. I want to use Mahout as a component of an
> enterprise search project. The project is at the conceptual phase. My
> business need is to be able to find everything about a related task and
> reorganize the output as a new view. The results should be actionable.
> Also, the system should be integrated with software development
> environment tools: Subversion, JIRA and Redmine, SharePoint blogs, wikis
> and people (Active Directory).
> "Everything" means files, tools and people. Files are mostly text based
> (Word, PDF, source files); searching audio and video files is a further
> need.
>
> Where do Mahout, Lucene/Solr and the UIMA framework fit in the following
> scenario? And what are the system requirements to set up a development
> environment?
>
> X is a new project team member in a software development firm. Her
> project is mainly a 10-year-old maintenance project; however, customers
> want small development requests on that platform. Her boss wants her to
> prepare a software requirements specification (SRS) document for a new
> request. Since she hasn't prepared an SRS before, she wants to find
> previously prepared documents, and asks her colleagues to give her a
> sample. Her friend gives her a sample based on a very ancient version of
> the SRS from her local computer. The company has a Windows file server
> and a new content management system (portal); some projects also use
> Subversion to store the docs, and also wikis.
>
> 1. There should be a platform that can search files in all these
>    environments.
> 2. The system should understand that an SRS is an outcome of the
>    software requirements engineering or analysis process. The system
>    should understand that SRS, software requirements specification and
>    functional design descriptions are similar terms.
> 3. The company has manuals, templates and process definitions about
>    requirements engineering and has an SRS template which supersedes
>    other versions. While searching, the system should list
>    organizational docs and then project docs related to SRSes.
> 4. The project has different SRSes written over 10 years. So the
>    system should list that specific project's SRS templates, indicating
>    version conflicts between org. document templates and projects...
> 5. Also, the system should list the people who were previously involved
>    in the requirements engineering process, first in that project, then
>    in other projects.
> 6. Also, the system should have a suggestion mechanism. The system
>    should know the domain of the project X is working on and its sub
>    parts. For example, X is working on an e-commerce project, and the
>    new request is about mobile payments. In the same company but in a
>    different project, a project team is working on e-wallet projects
>    for a bank. Based on her profile, the system should be able to
>    suggest people, tools and outcomes from the other project relating
>    to the payments domain.
>
> Identifying the domain and grouping the related docs, tools and people
> in an existing system is nearly impossible to do manually. I want the
> system to identify and cluster the related things itself, and also to
> learn and improve the results from user feedback. Also, some people
> should give input to the system by classifying the concepts for it. For
> example: I have organizational assets, documents, tools and people. The
> documents are project docs and organizational docs, and they are
> related. This can be guidance for the system.
>
> I think Carrot2 is doing something very similar to what I describe, but
> it has a file limitation. Anyway, I need a roadmap to initiate a project
> like this. Where should I start?
>
> Thanks,
