Hello,

Previously I built a primitive crawler in Java, extracting
certain information per HTML page using XPaths. Then I discovered
Nutch, and now I want to be able to extract certain elements from the
DOM through XPath, with multiple XPaths per site.

I am crawling a number of web sites, let's say 16, and I would like to
be able to write multiple XPaths per site, and then index the output
of each extraction in Solr as a separate field.
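
To make the question concrete, here is roughly the per-site
configuration I have in mind. This is only a sketch; the class, host,
field names and XPaths are all made up, nothing here is Nutch or Solr
API:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SiteXPathConfig {

    // host -> (solr field name -> XPath that fills it)
    private final Map<String, Map<String, String>> rules = new HashMap<>();

    public void addRule(String host, String solrField, String xpath) {
        rules.computeIfAbsent(host, h -> new HashMap<>()).put(solrField, xpath);
    }

    public Map<String, String> rulesFor(String host) {
        return rules.getOrDefault(host, Collections.emptyMap());
    }

    public static void main(String[] args) {
        SiteXPathConfig config = new SiteXPathConfig();
        // Two XPaths for one site, each feeding a separate Solr field.
        config.addRule("www.example-shop.com", "product_name", "//h1[@class='title']");
        config.addRule("www.example-shop.com", "spec_table", "//table[@id='specs']//td");
        System.out.println(config.rulesFor("www.example-shop.com"));
    }
}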

I have googled for a while, and I understand that a custom plugin can
be developed that acts as a custom HTML parser. I understand that
another path is using Tika.
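
From what I have read, the plugin route would look something like the
sketch below, based on the Nutch 1.x HtmlParseFilter interface as I
understand it; please correct me if the API differs in your version.
The XPath expression and the "spec_table" metadata key are
placeholders. The idea is to stash extractions in the parse metadata,
so that an indexing filter can later map them to Solr fields:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.NodeList;

public class XPathExtractFilter implements HtmlParseFilter {

    private Configuration conf;

    @Override
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Placeholder expression; in reality I would look it up
            // per host from my own config (see the sketch above).
            NodeList cells = (NodeList) xpath.evaluate(
                    "//table//td", doc, XPathConstants.NODESET);

            StringBuilder specs = new StringBuilder();
            for (int i = 0; i < cells.getLength(); i++) {
                specs.append(cells.item(i).getTextContent()).append(' ');
            }

            // Stash the extraction in the parse metadata; an indexing
            // filter can then copy it into a Solr field.
            Parse parse = parseResult.get(content.getUrl());
            if (parse != null) {
                parse.getData().getParseMeta().add("spec_table", specs.toString().trim());
            }
        } catch (Exception e) {
            // A failed extraction should not kill the whole parse.
        }
        return parseResult;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}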

I have also experimented with the Boilerpipe library, and it was
insufficient to extract the data I want. (I am extracting
specifications of certain products, usually in tables, and
fragmented.)

One difficulty with my HtmlCleaner-based XPath evaluator was that
real-world HTML was sometimes broken, and even after cleaning,
HtmlCleaner would not match XPaths taken from Firebug.
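
For the broken-HTML problem, one workaround I am considering is to let
HtmlCleaner repair the page, serialize its tree to a standard W3C DOM
with DomSerializer, and evaluate the XPath with the stock
javax.xml.xpath engine. I also suspect part of my mismatch is that
Firebug reports paths against the browser's DOM, which inserts
elements like tbody that the raw HTML never contained. A small sketch
of what I mean (the sample HTML and expression are invented):

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class LenientXPathDemo {
    public static void main(String[] args) throws Exception {
        // Deliberately broken markup: unclosed td/tr tags.
        String brokenHtml =
            "<html><body><table id='specs'><tr><td>Weight<td>2 kg</table>";

        CleanerProperties props = new CleanerProperties();
        TagNode cleaned = new HtmlCleaner(props).clean(brokenHtml);

        // Convert HtmlCleaner's tree into a W3C Document so the
        // standard javax.xml.xpath engine can query it.
        Document dom = new DomSerializer(props).createDOM(cleaned);

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Note: no /tbody/ step here, even if Firebug shows one.
        NodeList cells = (NodeList) xpath.evaluate(
                "//table[@id='specs']//td", dom, XPathConstants.NODESET);

        for (int i = 0; i < cells.getLength(); i++) {
            System.out.println(cells.item(i).getTextContent());
        }
    }
}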

Which approach should I start with?

Any ideas / help / recommendations greatly appreciated.

Best Regards,
C.B.
