Jonathan has it right here. Don't forget to include some structural information as well as the text content itself. By structural information, I mean things like the domain, how much boilerplate there is, possibly some sort of classifier that runs on the boilerplate, and so on. You may also want to include aggregate features like the number of different kinds of markup elements, the CSS classes that are used, the amount of JavaScript on the page, the domains linked to, and more.
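
Just to illustrate, pulling those aggregate counts out of a page could look something like the rough jsoup-based sketch below. The feature names and the domain-matching logic are arbitrary choices of mine, not anything from Mahout or Tika:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StructuralFeatures {

    // Rough sketch: aggregate structural features for one page; feature names are arbitrary.
    public static Map<String, Double> extract(String html, String pageDomain) {
        // Setting a base URI lets relative links resolve so linked domains can be counted.
        Document doc = Jsoup.parse(html, "http://" + pageDomain + "/");
        Map<String, Double> f = new HashMap<>();

        // How many markup elements there are, and how many distinct kinds of them.
        f.put("numElements", (double) doc.getAllElements().size());
        f.put("numDistinctTags", (double) doc.getAllElements().stream()
                .map(Element::tagName).distinct().count());

        // Amount of JavaScript: script blocks, total script text, inline event handlers.
        f.put("numScriptBlocks", (double) doc.select("script").size());
        f.put("scriptChars", (double) doc.select("script").html().length());
        f.put("numInlineHandlers", (double) doc.select("[onclick], [onload], [onerror]").size());

        // CSS usage: distinct class names referenced on the page.
        Set<String> cssClasses = new HashSet<>();
        for (Element e : doc.select("[class]")) {
            cssClasses.addAll(e.classNames());
        }
        f.put("numCssClasses", (double) cssClasses.size());

        // Domains linked to, and how many of them are off-site.
        Set<String> linkedDomains = new HashSet<>();
        for (Element a : doc.select("a[href]")) {
            String host = a.absUrl("href").replaceFirst("^https?://", "").replaceFirst("[/:].*$", "");
            if (!host.isEmpty()) {
                linkedDomains.add(host.toLowerCase());
            }
        }
        f.put("numLinkedDomains", (double) linkedDomains.size());
        f.put("numExternalDomains", (double) linkedDomains.stream()
                .filter(d -> !d.endsWith(pageDomain)).count());

        return f;
    }
}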
These additional features make classification of certain kinds of text almost trivial. Comment spam, for instance, will often have links to particular domains. A compromised web page might have inline JavaScript that no other page normally has.

Numerical features will make using Naive Bayes a bit problematic. You might be able to mitigate this by binning the features and converting the numbers into deciles, where each decile is a different symbol; a rough sketch of what I mean is below the quoted messages.

On Fri, Sep 5, 2014 at 10:24 AM, Jonathan Cooper-Ellis <[email protected]> wrote:

> Hi Grzegorz,
>
> You can use the boilerpipe library to extract main content from your sites
> (Tika supports this) and pass that to a NB classifier and probably get
> pretty good results.
>
> Hope that helps!
>
> On Friday, September 5, 2014, Grzegorz Ewald <[email protected]> wrote:
>
> > Hi Mahout users!
> >
> > I'm starting to deal with unstructured text classification, namely
> > classification of web pages of unknown structure. The number of possible
> > categories would probably be quite small (as for now I believe that three
> > categories are enough).
> >
> > Later I would add another level of data processing based on document
> > structure (existence of meta tags and so on).
> >
> > Do you have any experience or suggestions? Somehow I don't feel like using
> > a bag of words approach (but maybe I am wrong?).
> >
> > --
> > Regards,
> > Grzegorz
> >
> > <mailto:[email protected]>
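
To make the decile idea concrete, here is a rough sketch (the class and method names are made up for illustration, not part of Mahout): learn the decile boundaries for each numeric feature from the training data, then map every value to a token like "scriptChars_d7" that the Naive Bayes model can treat as an ordinary word.

import java.util.Arrays;

// Rough sketch: map a numeric feature onto one of ten decile symbols. Names are illustrative only.
public class DecileBinner {

    private final double[] boundaries = new double[9]; // 10th, 20th, ..., 90th percentile cut points

    // Learn the decile boundaries from the training values of one feature.
    public DecileBinner(double[] trainingValues) {
        double[] sorted = trainingValues.clone();
        Arrays.sort(sorted);
        for (int i = 1; i <= 9; i++) {
            int idx = (int) Math.min(sorted.length - 1, Math.round(i * sorted.length / 10.0));
            boundaries[i - 1] = sorted[idx];
        }
    }

    // Convert a value to a symbol such as "scriptChars_d3".
    public String bin(String featureName, double value) {
        int decile = 0;
        while (decile < boundaries.length && value > boundaries[decile]) {
            decile++;
        }
        return featureName + "_d" + decile; // d0 .. d9
    }
}

The resulting symbols can then simply be appended to the page's token stream alongside the regular words before vectorization, so the classifier just sees them as more "words".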
