Hi Mahout users! I'm starting to work on unstructured text classification, namely classifying web pages of unknown structure. The number of possible categories will probably be quite small (for now I believe three categories are enough).
Later I would add another level of data processing based on document structure (the presence of meta tags and so on). Do you have any experience or suggestions? Somehow I don't feel like using a bag-of-words approach (but maybe I am wrong?).

--
Regards,
Grzegorz
