Guys,

Assuming that you have a training dataset, machine learning would be a good way of classifying a document. Apache Mahout could be used, or an API like https://github.com/DigitalPebble/TextClassification would work as well. We used it in Nutch for some projects and it worked fine (but I am biased as it is ours). GATE is a useful resource as well, though probably a bit of an overkill for a task like this unless you want to use it to generate more intelligent features.
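
To make the idea a bit more concrete, here is a minimal sketch of supervised classification, assuming a small hand-labelled training set. It is a plain multinomial Naive Bayes written in Python with only the standard library; it is not the Mahout or TextClassification API, and the function names (tokenize, train, classify), the labels and the toy documents are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Build per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> token frequencies
    doc_counts = Counter()              # label -> number of training docs
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (label_total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data; a real system needs far more examples.
TRAINING = [
    ("the church held a mass and the priest read the gospel", "christianity"),
    ("the bishop spoke about the bible at the cathedral", "christianity"),
    ("the imam led prayers at the mosque during ramadan", "islam"),
    ("pilgrims travel to mecca and recite the quran", "islam"),
    ("the team won the football match in extra time", "not_religious"),
    ("stock markets fell as investors sold technology shares", "not_religious"),
]

model = train(TRAINING)
print(classify("The priest quoted the Bible in church", model))  # prints: christianity
```

In a Nutch setup something like classify would presumably run on the parsed text after fetching, with the predicted label stored as a field for indexing; that wiring is not shown here.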
HTH

Julien

On 29 June 2012 13:18, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Jim,
>
> The thing about this problem is that I assume religion information
> would not be included in the document metadata therefore it's not a
> simple case of using one of the existing implementations e.g.
> parse-metatags to grab this data...
>
> I think it would be something more along the lines of text processing
> post (or @runtime) fetching. Documents could then be classified
> accordingly. I recently spoke with someone who undertook such an
> exercise but not using Nutch I must admit. If you are familiar with
> GATE [0] you could create some kind of plugin to identify this kind of
> information but I am not familiar with the process of retaining it for
> indexing as I have not thoroughly tried the concept.
>
> hth
>
> Lewis
>
> [0] http://gate.ac.uk/
>
> On Fri, Jun 29, 2012 at 12:36 PM, Jim Chandler <[email protected]> wrote:
> > Lewis,
> >
> > I work with George. What we are trying to do is identify whether or not a
> > document is religious in nature or not. And if so what that religion is.
> > We are aware this could be a difficult undertaking, and we would like not
> > to reinvent the wheel.
> >
> > HTH
> > Jim
> >
> > On Thu, Jun 28, 2012 at 5:16 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >
> >> Hi George,
> >>
> >> Where are each of these fields present within the document?
> >>
> >> Lewis
> >>
> >> > On Wed, Jun 27, 2012 at 7:59 PM, JAB <[email protected]> wrote:
> >> >> I've written some simple Nutch plug-ins to detect a document's Author,
> >> >> Publication Date, and if it's an article about Religion (including what
> >> >> religion it's talking about). I was wondering if anyone knows of any open
> >> >> source plug-ins any group has written to cover these plug-in issues, rather
> >> >> than me relying on my own custom solutions. I'm new to Nutch/Gate
> >> >> development.
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662.html
> >> >> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> >> >
> >> > --
> >> > Lewis
> >>
> >> --
> >> Lewis
>
> --
> Lewis

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

