In nutch-site.xml I modified plugin.includes:
parse-(html) works
parse-(tika) does not
I also need to parse PDFs, so I need both features. I tried
parse-(html|tika) to see whether html would be selected before tika,
but that did not work.
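For reference, a minimal sketch of the setting described above. Only the parse-(html|tika) group comes from this message; the other plugin ids in the value are illustrative assumptions and do not form a complete or recommended list.

```xml
<!-- nutch-site.xml (fragment, goes inside <configuration>).
     Assumption: every plugin id except parse-(html|tika) is a placeholder. -->
<property>
  <name>plugin.includes</name>
  <!-- The value is a regular expression matched against plugin ids. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic</value>
</property>
```

As far as I understand, the order of alternatives inside the parentheses only affects which plugins are loaded, not which parser handles a given content type; that choice is made per MIME type in the parse plugin mapping file.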
On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
If you are using the nightly build, did you make the same change in parse-plugins.xml?
Uncomment this:
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
Hopefully this helps.
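A sketch of how such a mapping could look when HTML and PDF should go to different parsers. The text/html entry is the one quoted above; the application/pdf entry routed to parse-tika is an assumption added for illustration and is not taken from this thread.

```xml
<!-- parse-plugins.xml (fragment, goes inside <parse-plugins>). -->

<!-- From the message above: send HTML to parse-html. -->
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

<!-- Assumption: send PDF to parse-tika. -->
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
```

Both plugin ids presumably also need to match plugin.includes in nutch-site.xml and to have an alias entry in the same file, so check the existing aliases section before relying on this.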
On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius
mgris...@comcast.net wrote:
in nutch-site.xml I modified
I am also facing the same problem.
I thought of developing a plugin that returns null when such a URL is
encountered. As a result, that URL won't be indexed.
But I am not sure what criteria I should use to decide which URLs
to discard.
I hope my approach is
Hello everyone,
I'm using Nutch v0.9. I'm able to crawl, fetch, and parse HTML and .pdf.
With .ppt, .xls, .rtf and .doc I don't get any errors, but when I use
SegmentReader to get the information for each URL, I don't find any
parsetext for these formats. I configured the plugins and
Hi Matthew,
There is an open issue with Tika (e.g.
https://issues.apache.org/jira/browse/TIKA-379) that could explain the
differences between parse-html and parse-tika. Note that you can specify
parse-(html|pdf) in order to get both HTML and PDF files.
Could you please open an issue in JIRA
I've installed Nutch 1.0 in Eclipse (Windows XP). I performed crawling (on
the local filesystem, mostly HTML files present in a directory) and it worked
fine, but when I run the search program with a query, it always reports
"Total hits 0", no matter what the query is.
Can anyone guess or know what