Guys,

Assuming that you have a training dataset, machine learning would be a good
way of classifying a document. Apache Mahout could be used, or an API like
https://github.com/DigitalPebble/TextClassification would work as well. We
used it in Nutch for some projects and it worked fine (but I am biased, as
it is ours). GATE is a useful resource as well, though probably a bit of
overkill for a task like this unless you want to use it to generate more
intelligent features.
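
To make the idea concrete, here is a minimal sketch of supervised text
classification, assuming a small labelled training set. It is a plain
multinomial Naive Bayes written with the Python standard library only; it is
not the Mahout or TextClassification API, and the class name, labels and
training sentences are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.vocabulary = set()                  # every token seen in training

    def train(self, labelled_docs):
        """Count documents and tokens per label from (text, label) pairs."""
        for text, label in labelled_docs:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def classify(self, text):
        """Return the label with the highest log posterior for the text."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            # Log prior: share of training documents carrying this label.
            score = math.log(n_docs / total_docs)
            total_tokens = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                # Smoothed log likelihood of the token under this label.
                score += math.log((count + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set; a real one needs many documents per label.
TRAINING = [
    ("the church held a mass and the priest read the gospel", "christianity"),
    ("the imam led prayers at the mosque during ramadan", "islam"),
    ("the rabbi read from the torah at the synagogue", "judaism"),
    ("the team won the match after scoring two late goals", "not_religious"),
    ("the new phone has a faster processor and a better camera", "not_religious"),
]

classifier = NaiveBayesTextClassifier()
classifier.train(TRAINING)
print(classifier.classify("the priest spoke about the gospel in church"))
```

A single multi-class model like this answers both questions at once (is the
document religious, and if so which religion), provided the training set
includes a non-religious label. In Nutch the same logic would presumably sit
in a parse or indexing plugin, but that wiring is not shown here.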

HTH

Julien

On 29 June 2012 13:18, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Jim,
>
> The thing about this problem is that I assume religion information
> would not be included in the document metadata therefore it's not a
> simple case of using one of the existing implementations e.g.
> parse-metatags to grab this data...
>
> I think it would be something more along the lines of text processing
> post-fetching (or at runtime). Documents could then be classified
> accordingly. I recently spoke with someone who undertook such an
> exercise, though not using Nutch, I must admit. If you are familiar with
> GATE [0] you could create some kind of plugin to identify this kind of
> information, but I am not familiar with the process of retaining it for
> indexing as I have not thoroughly tried the concept.
>
> hth
>
> Lewis
>
> [0] http://gate.ac.uk/
>
> On Fri, Jun 29, 2012 at 12:36 PM, Jim Chandler <[email protected]> wrote:
> > Lewis,
> >
> > I work with George.  What we are trying to do is identify whether or not
> > a document is religious in nature, and if so, what that religion is.
> > We are aware this could be a difficult undertaking, and we would like
> > not to reinvent the wheel.
> >
> > HTH
> > Jim
> >
> > On Thu, Jun 28, 2012 at 5:16 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Hi George,
> >>
> >> Where are each of these fields present within the document?
> >>
> >> Lewis
> >>
> >> > On Wed, Jun 27, 2012 at 7:59 PM, JAB <[email protected]> wrote:
> >> >> I've written some simple Nutch plug-ins to detect a document's Author,
> >> >> Publication Date, and whether it's an article about Religion
> >> >> (including which religion it's talking about). I was wondering if
> >> >> anyone knows of any open source plug-ins that any group has written
> >> >> to cover these issues, rather than me relying on my own custom
> >> >> solutions. I'm new to Nutch/GATE development.
> >> >>
> >> >> --
> >> >> View this message in context:
> >> >> http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662.html
> >> >> Sent from the Nutch - Dev mailing list archive at Nabble.com.
> >> >
> >> >
> >> >
> >> > --
> >> > Lewis
> >>
> >>
> >>
> >> --
> >> Lewis
> >>
>
>
>
> --
> Lewis
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
