Hi Grzegorz,

You can use the boilerpipe library to extract the main content from your pages (Tika supports this), then pass the extracted text to a Naive Bayes (NB) classifier. With only a few categories, that will probably give pretty good results.
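To make the classification step concrete, here is a minimal sketch. It assumes the main text has already been extracted (for example with boilerpipe) and is passed in as a plain string. This is a hand-rolled multinomial Naive Bayes with add-one smoothing, not Mahout's implementation; the class name, the three labels, and the training strings are all made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Mahout, Tika, or boilerpipe.
public class NaiveBayesSketch {
    // label -> (word -> count of that word in documents with this label)
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    // label -> total number of word occurrences seen for this label
    private final Map<String, Integer> totalWords = new HashMap<>();
    // label -> number of training documents with this label
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase and split on anything that is not a letter or digit.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{N}]+");
    }

    public void train(String label, String text) {
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    // Returns the label with the highest log-probability, or null if untrained.
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log prior: fraction of training documents with this label
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            double denom = totalWords.getOrDefault(label, 0) + vocabulary.size();
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                // add-one (Laplace) smoothing so unseen words do not zero out a class
                score += Math.log((counts.getOrDefault(token, 0) + 1.0) / denom);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("shop", "buy cheap shoes online free shipping");
        nb.train("shop", "discount prices add to cart checkout");
        nb.train("news", "government announces new election results");
        nb.train("news", "reporters cover breaking story in the capital");
        nb.train("blog", "today I want to share my thoughts on travel");
        nb.train("blog", "my personal diary about cooking and travel");
        System.out.println(nb.classify("cheap shoes with free shipping"));
        System.out.println(nb.classify("election results announced"));
    }
}
```

In practice you would train on many extracted pages per category and most likely use Mahout's own Naive Bayes rather than this toy version. The point is only that extracted text plus word counts is already enough for a first baseline.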
Hope that helps!

On Friday, September 5, 2014, Grzegorz Ewald <[email protected]> wrote:

> Hi Mahout users!
>
> I'm starting to deal with unstructured text classification, namely
> classification of web pages of unknown structure. The number of possible
> categories would probably be quite small (as for now I believe that three
> categories are enough).
>
> Later I would add another level of data processing based on document
> structure (existence of meta tags and so on).
>
> Do you have any experience or suggestions? Somehow I don't feel like using
> bag of words approach (but maybe i am wrong?).
>
> --
> Regards,
> Grzegorz
