Re: nutch crawl issue

2010-04-29 Thread matthew a. grisius
In nutch-site.xml I modified plugin.includes: parse-(html) works, parse-(tika) does not. I also need to parse PDFs, so I need both features. I tried parse-(html| tika) to see whether html would be selected before tika, and that did not work. On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:

Re: nutch crawl issue

2010-04-29 Thread arpit khurdiya
If you are using the nightly build, did you change the same in parse-plugins.xml? Uncomment this: <mimeType name="text/html"> <plugin id="parse-html" /> </mimeType> Hopefully this helps. On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius mgris...@comcast.net wrote: in nutch-site.xml I modified

Re: why does nutch interpret directory as URL

2010-04-29 Thread arpit khurdiya
I am also facing the same problem. I thought of developing a plugin that returns null when such a URL is encountered, so that the URL won't be indexed. But I am not sure what criteria I should use to discard the URL. I hope my approach is

Parsing .ppt, .xls, .rtf and .doc

2010-04-29 Thread nachonieto3
Hello everyone, I'm using Nutch v0.9. I'm able to crawl, fetch and parse HTML and .pdf. When I try .ppt, .xls, .rtf and .doc, the crawl runs without any errors, but when I use SegmentReader to get the information for each URL I don't find any parsetext for these formats. I configured the plugins and

Re: nutch crawl issue

2010-04-29 Thread Julien Nioche
Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify parse-(html|pdf) in order to get both HTML and PDF files. Could you please open an issue in JIRA
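
For illustration, the parse-(html|pdf) suggestion above might look like this in nutch-site.xml. This is only a sketch: the parse-(html|pdf) part comes from the message, while the other plugin ids in the value are assumptions based on a typical Nutch 1.0 setup and should be replaced with whatever your existing plugin.includes already lists.

```xml
<!-- nutch-site.xml (sketch). Only parse-(html|pdf) is taken from the thread;
     the remaining plugin ids are assumed and may differ in your installation. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
```

The value is read as a regular expression over plugin ids, so whitespace inside a group, as in parse-(html| tika), becomes part of the alternative and would not match the id parse-tika.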

Re: Search problem in nutch on eclipse (win XP)

2010-04-29 Thread Harish Kumar
I've installed Nutch 1.0 in Eclipse (Windows XP). I performed a crawl (on the local filesystem, mostly HTML files in a directory) and it worked fine, but when I run the search program with a query, it always reports Total hits 0, no matter what the query is. Can anyone guess or does anyone know what