Hello,

Previously I built a primitive crawler in Java, extracting certain information from each HTML page using XPaths. Then I discovered Nutch, and now I want to be able to extract certain elements from the DOM via XPath, with multiple XPaths per site.
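To make "multiple XPaths per site" concrete, here is roughly the shape of my current extractor, stripped down to a minimal sketch (the URL, field names and XPath expressions below are just placeholders):

import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class SiteExtractor {

    public static void main(String[] args) throws Exception {
        // Per-site configuration: output field name -> XPath expression.
        Map<String, String> xpathsForSite = new LinkedHashMap<String, String>();
        xpathsForSite.put("product_name", "//h1[@class='product-title']");
        xpathsForSite.put("spec_weight", "//table[@class='specs']//tr[2]/td[2]");

        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(new URL("http://www.example.com/product/123"));

        for (Map.Entry<String, String> entry : xpathsForSite.entrySet()) {
            // HtmlCleaner's built-in XPath engine (covers only part of XPath 1.0)
            Object[] hits = root.evaluateXPath(entry.getValue());
            if (hits.length > 0 && hits[0] instanceof TagNode) {
                String value = ((TagNode) hits[0]).getText().toString().trim();
                System.out.println(entry.getKey() + " = " + value);
            }
        }
    }
}

The idea is that each of the sites would get its own map like this.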
I am crawling a number of web sites, let's say 16, and I would like to be able to write multiple XPaths per site and then index the output of each extraction in Solr as a separate field. I have googled for a while, and I understand a plugin can be developed that acts as a custom HTML parser; I understand another path is using Tika. I have also experimented with the boilerpipe library, and it was insufficient to extract the data I want (I am extracting specifications of certain products, usually in tables, and fragmented across the page).

One difficulty with my HtmlCleaner-based XPath evaluator was that real-world HTML is sometimes broken, and even after I cleaned it, HtmlCleaner would not find XPaths taken from Firebug.

Which way should I start? Any ideas / help / recommendations greatly appreciated.

Best Regards,
C.B.
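P.S. In case code makes the Firebug problem clearer: the direction I have been sketching is to let HtmlCleaner do only the cleaning, then serialize its output to a standard W3C DOM and evaluate the expressions with javax.xml.xpath, which implements full XPath 1.0 (again, the URL and the XPath below are placeholders):

import java.net.URL;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class FirebugXPathSketch {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();

        // Clean the (possibly broken) real-world HTML first...
        TagNode root = cleaner.clean(new URL("http://www.example.com/product/123"));

        // ...then convert to a W3C DOM so the standard XPath engine applies.
        Document doc = new DomSerializer(props).createDOM(root);

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList cells = (NodeList) xpath.evaluate(
                "//table[@class='specs']//td", doc, XPathConstants.NODESET);

        for (int i = 0; i < cells.getLength(); i++) {
            System.out.println(cells.item(i).getTextContent().trim());
        }
    }
}

I also wonder whether part of my problem is that the cleaned tree differs from what Firebug shows, since browsers insert elements like tbody that are not in the raw HTML.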

