Jonathan has it right here. Don't forget to include some structural information as well as the text content itself. By structural information, I mean things like the domain, how much boilerplate there is, possibly some sort of classifier that runs on the boilerplate, and so on. You may also want to include aggregate features like the number of different kinds of markup elements, the CSS classes that are used, the amount of JavaScript on the page, the domains linked to, and more.
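
Just to illustrate, pulling those aggregate counts out of a page could look something like the rough jsoup-based sketch below. The feature names and the domain-matching logic are arbitrary choices of mine, not anything from Mahout or Tika:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StructuralFeatures {

    // Rough sketch: aggregate structural features for one page; feature names are arbitrary.
    public static Map<String, Double> extract(String html, String pageDomain) {
        // Setting a base URI lets relative links resolve so linked domains can be counted.
        Document doc = Jsoup.parse(html, "http://" + pageDomain + "/");
        Map<String, Double> f = new HashMap<>();

        // How many markup elements there are, and how many distinct kinds of them.
        f.put("numElements", (double) doc.getAllElements().size());
        f.put("numDistinctTags", (double) doc.getAllElements().stream()
                .map(Element::tagName).distinct().count());

        // Amount of JavaScript: script blocks, total script text, inline event handlers.
        f.put("numScriptBlocks", (double) doc.select("script").size());
        f.put("scriptChars", (double) doc.select("script").html().length());
        f.put("numInlineHandlers", (double) doc.select("[onclick], [onload], [onerror]").size());

        // CSS usage: distinct class names referenced on the page.
        Set<String> cssClasses = new HashSet<>();
        for (Element e : doc.select("[class]")) {
            cssClasses.addAll(e.classNames());
        }
        f.put("numCssClasses", (double) cssClasses.size());

        // Domains linked to, and how many of them are off-site.
        Set<String> linkedDomains = new HashSet<>();
        for (Element a : doc.select("a[href]")) {
            String host = a.absUrl("href").replaceFirst("^https?://", "").replaceFirst("[/:].*$", "");
            if (!host.isEmpty()) {
                linkedDomains.add(host.toLowerCase());
            }
        }
        f.put("numLinkedDomains", (double) linkedDomains.size());
        f.put("numExternalDomains", (double) linkedDomains.stream()
                .filter(d -> !d.endsWith(pageDomain)).count());

        return f;
    }
}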
These additional features make classification of certain kinds of text almost trivial. Comment spam, for instance, will often have links to particular domains. A compromised web page might have inline JavaScript that no other page normally has.

Numerical features will make using Naive Bayes a bit problematic. You might be able to mitigate this by binning the features and converting the numbers into deciles, where each decile is a different symbol; a rough sketch of what I mean is below the quoted messages.

On Fri, Sep 5, 2014 at 10:24 AM, Jonathan Cooper-Ellis <[email protected]> wrote:

> Hi Grzegorz,
>
> You can use the boilerpipe library to extract main content from your sites
> (Tika supports this) and pass that to a NB classifier and probably get
> pretty good results.
>
> Hope that helps!
>
> On Friday, September 5, 2014, Grzegorz Ewald <[email protected]> wrote:
>
> > Hi Mahout users!
> >
> > I'm starting to deal with unstructured text classification, namely
> > classification of web pages of unknown structure. The number of possible
> > categories would probably be quite small (as for now I believe that three
> > categories are enough).
> >
> > Later I would add another level of data processing based on document
> > structure (existence of meta tags and so on).
> >
> > Do you have any experience or suggestions? Somehow I don't feel like using
> > a bag of words approach (but maybe I am wrong?).
> >
> > --
> > Regards,
> > Grzegorz
> >
> > <mailto:[email protected]>
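
To make the decile idea concrete, here is a rough sketch (the class and method names are made up for illustration, not part of Mahout): learn the decile boundaries for each numeric feature from the training data, then map every value to a token like "scriptChars_d7" that the Naive Bayes model can treat as an ordinary word.

import java.util.Arrays;

// Rough sketch: map a numeric feature onto one of ten decile symbols. Names are illustrative only.
public class DecileBinner {

    private final double[] boundaries = new double[9]; // 10th, 20th, ..., 90th percentile cut points

    // Learn the decile boundaries from the training values of one feature.
    public DecileBinner(double[] trainingValues) {
        double[] sorted = trainingValues.clone();
        Arrays.sort(sorted);
        for (int i = 1; i <= 9; i++) {
            int idx = (int) Math.min(sorted.length - 1, Math.round(i * sorted.length / 10.0));
            boundaries[i - 1] = sorted[idx];
        }
    }

    // Convert a value to a symbol such as "scriptChars_d3".
    public String bin(String featureName, double value) {
        int decile = 0;
        while (decile < boundaries.length && value > boundaries[decile]) {
            decile++;
        }
        return featureName + "_d" + decile; // d0 .. d9
    }
}

The resulting symbols can then simply be appended to the page's token stream alongside the regular words before vectorization, so the classifier just sees them as more "words".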
