Hi everyone,

Would anyone find a parser for collecting outlinks from CSS stylesheets
useful?

As far as I can tell, Tika doesn't offer this (it looks like Tika 1.12 parses
CSS as plain text; correct me if I'm wrong). Modern CSS often contains
"url(...)" links to content needed to properly style pages (e.g. fonts,
images). I have a simple, working, tested "parse-css" plugin that uses
http://cssparser.sourceforge.net/ and parses only outlinks, but if it's not
something that belongs in Nutch that's fine. Otherwise I'll happily open a
pull request.
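
To illustrate the kind of extraction I mean, here is a minimal sketch. It is
not the plugin's code and it does not use the CSS Parser library: it is a
plain regex over "url(...)" tokens, and the class name CssOutlinkSketch, the
method extractOutlinks, and the example.com URLs are made up for this
example. It resolves each link against the stylesheet's own URL and skips
inline data: URIs.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
final class CssOutlinkSketch {

    // Matches url(...) with an optional matching pair of single or double quotes.
    private static final Pattern URL_TOKEN = Pattern.compile(
            "url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns absolute link targets found in the CSS text, in document order.
    static List<String> extractOutlinks(String css, String stylesheetUrl) {
        List<String> outlinks = new ArrayList<>();
        URI base;
        try {
            base = URI.create(stylesheetUrl);
        } catch (IllegalArgumentException e) {
            return outlinks; // unusable base URL: nothing can be resolved
        }
        Matcher m = URL_TOKEN.matcher(css);
        while (m.find()) {
            String target = m.group(2).trim();
            // Skip empty tokens and inline data: URIs, which are not outlinks.
            if (target.isEmpty() || target.toLowerCase().startsWith("data:")) {
                continue;
            }
            try {
                // Relative links are resolved against the stylesheet location.
                outlinks.add(base.resolve(target).toString());
            } catch (IllegalArgumentException e) {
                // Malformed link target: ignore it and keep going.
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        String css = "@font-face { src: url(\"fonts/a.woff2\"); } "
                + "body { background: url( '../img/bg.png' ); }";
        for (String link : extractOutlinks(css, "http://example.com/css/site.css")) {
            System.out.println(link);
        }
    }
}
```

A regex like this will also pick up "url(...)" inside CSS comments and misses
'@import "file.css"' rules written without url(), which is one reason to use
a real CSS parser instead.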

Thanks,

Joe
