Hi Grzegorz,

You can use the boilerpipe library to extract the main content from your
pages (Tika supports this), pass that text to a Naive Bayes classifier, and
probably get pretty good results.
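
To make the classification half of that concrete, here is a rough sketch of
a multinomial Naive Bayes classifier over word counts, written from scratch
in Python rather than with Mahout's own trainer. It assumes the main text of
each page has already been extracted upstream (for example by boilerpipe);
the class name, the three labels and the tiny training strings are all made
up for illustration and are not from any real data set.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training carry no information, so skip them.
        words = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: already-extracted main text plus a label.
clf = NaiveBayes()
clf.train("buy cheap shoes online free shipping discount", "shop")
clf.train("add to cart checkout price sale discount", "shop")
clf.train("government election minister vote parliament", "news")
clf.train("breaking report minister says election results", "news")
clf.train("my weekend trip photos and thoughts", "blog")
clf.train("thoughts on my favourite recipes this weekend", "blog")

predicted = clf.predict("discount price on shoes, free checkout")
print(predicted)
```

Note that this is a bag-of-words model: word order and page structure are
ignored, which is why stripping navigation and other boilerplate before
classification tends to matter so much.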

Hope that helps!

On Friday, September 5, 2014, Grzegorz Ewald <[email protected]>
wrote:

> Hi Mahout users!
>
> I'm starting to work on unstructured text classification, namely
> classification of web pages of unknown structure. The number of possible
> categories would probably be quite small (for now I believe that three
> categories are enough).
>
> Later I would add another level of data processing based on document
> structure (existence of meta tags and so on).
>
> Do you have any experience or suggestions? Somehow I don't feel like using
> a bag-of-words approach (but maybe I am wrong?).
>
> --
> Regards,
> Grzegorz
>
