[ http://issues.apache.org/jira/browse/NUTCH-162?page=all ]
KuroSaka TeruHiko updated NUTCH-162:
It seems many .html files are actually generated by the ant target generate-docs
in build.xml, and only these four changes are needed to fix this bug:
mv
[ http://issues.apache.org/jira/browse/NUTCH-385?page=comments#action_12444162 ]
Ken Krugler commented on NUTCH-385:
---
There is a middle ground, though we don't know yet how important it is to
address.
When we crawl partner sites, we
[ http://issues.apache.org/jira/browse/NUTCH-185?page=all ]
Rida Benjelloun updated NUTCH-185:
--
Attachment: parse-xml.zip
Hi,
The parse-xml plugin has been updated. I have tested it with version 0.8.1. The
plugin also fixes the bug related to the
[ http://issues.apache.org/jira/browse/NUTCH-185?page=all ]
Rida Benjelloun updated NUTCH-185:
--
Affects Version/s: 0.8.1
                   0.8
XMLParser is a configurable XML parser plugin.
During fetching, OutlinkExtractor.getOutlinks() finds lots of junk, such as
the following:
rdf:about=
xmlns:pdf=
http://ns.adobe.com/pdf/1.3/
pdf:Producer
pdf:Producer
rdf:Description
rdf:Description
rdf:about=
xmlns:xap=
http://ns.adobe.com/xap/1.0/
xap:CreatorTool
xap:CreatorTool
xap:ModifyDate
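
To illustrate the kind of screening that would reject the junk listed above,
here is a minimal sketch in Java. It is not the Nutch implementation and does
not change OutlinkExtractor; the class OutlinkJunkFilter, its method names, and
the namespace prefix list are all hypothetical. It assumes a candidate is worth
keeping only if it parses as an absolute http, https, or ftp URI with a host,
and that known XML namespace URIs (such as the ns.adobe.com ones in the sample)
are excluded by prefix.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
final class OutlinkJunkFilter {

    // Assumption: prefixes of XML namespace URIs that should never be
    // treated as outlinks. A real list would need to be configurable.
    private static final List<String> NAMESPACE_PREFIXES =
            Arrays.asList("http://ns.adobe.com/");

    // Returns true if the candidate looks like a fetchable link.
    static boolean looksLikeLink(String candidate) {
        if (candidate == null) {
            return false;
        }
        String s = candidate.trim();
        for (String prefix : NAMESPACE_PREFIXES) {
            if (s.startsWith(prefix)) {
                return false; // XML namespace identifier, not a real link
            }
        }
        try {
            URI uri = new URI(s);
            String scheme = uri.getScheme();
            if (scheme == null) {
                return false; // relative or scheme-less token
            }
            scheme = scheme.toLowerCase(Locale.ROOT);
            boolean fetchable = scheme.equals("http")
                    || scheme.equals("https")
                    || scheme.equals("ftp");
            // Tokens such as "rdf:about=" parse as opaque URIs with an
            // unknown scheme and no host, so they are rejected here.
            return fetchable && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false; // not a syntactically valid URI
        }
    }

    // Keeps only the candidates that pass looksLikeLink, in order.
    static List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            if (looksLikeLink(c)) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "rdf:about=", "xmlns:pdf=", "http://ns.adobe.com/pdf/1.3/",
                "pdf:Producer", "http://example.com/page.html");
        System.out.println(filter(sample));
    }
}
```

Note that the scheme and host check alone would still accept
http://ns.adobe.com/pdf/1.3/, since it is a well-formed URL; only the assumed
prefix list rejects it.
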
[ http://issues.apache.org/jira/browse/NUTCH-185?page=comments#action_12444205 ]
nutch.newbie commented on NUTCH-185:
Thank you very much! I will be giving it a go now.
Will this plugin be added to the Nutch trunk as part of the distribution?