Getting MalformedURLException
I am trying to index files on a local intranet using Nutch 1.0, so I am giving the path file:hostname/shared/ as the seed. When I use the adaptive scheduler (AdaptiveFetchSchedule) and crawl the intranet for the first time, it works fine, but when I recrawl it throws a MalformedURLException. With the default scheduler it works well. Any idea what is going wrong?

-- Regards, Arpit Khurdiya
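For reference, a minimal file-system crawl setup for Nutch 1.0 looks roughly like the sketch below. The share path, the abbreviated plugin list, and the filter line are placeholders and assumptions, not the poster's actual configuration; the main point is that seed entries need to be URLs that java.net.URL can parse (file: with slashes), since MalformedURLException is thrown when a URL string cannot be parsed.

    # urls/seed.txt -- example well-formed file: seeds (placeholder paths)
    file:///shared/docs/
    file://hostname/shared/

    # conf/nutch-site.xml -- enable the file protocol and the adaptive schedule
    <property>
      <name>plugin.includes</name>
      <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>
    <property>
      <name>db.fetch.schedule.class</name>
      <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
    </property>

    # conf/crawl-urlfilter.txt -- the stock filter skips file: URLs, so allow them
    +^file: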
Re: nutch crawl issue
If you are using a nightly build, did you change the same thing in parse-plugins.xml? Uncomment this:

    <mimeType name="text/html">
      <plugin id="parse-html" />
    </mimeType>

Hopefully this helps.

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius mgris...@comcast.net wrote:

In nutch-site.xml I modified plugin.includes: parse-(html) works, parse-(tika) does not. I also need to parse PDFs, so I need both features. I tried parse-(html|tika) to see if html would be selected before tika, and that did not work.

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:

Using Nutch nightly build nutch-2010-04-27_04-00-28, I am trying to bin/nutch crawl a single html file generated by javadoc, and no links are followed. I verified this with bin/nutch readdb and bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base seed doc specified is processed. I searched and reviewed the nutch-user archive and tried several different settings, but none of them appear to have any effect. I then downloaded maven-2.2.1 so that I could mvn install tika and produce tika-app-0.7.jar to extract information about the html javadoc file from the command line. I am not familiar with Tika, but the command-line version doesn't return any metadata, e.g. no 'src=' links from the html 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how Nutch uses Tika; maybe it's not related . . . Has anyone crawled javadoc files, or have any suggestions? Thanks. -m.

-- Regards, Arpit Khurdiya
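To make the parse-plugins.xml suggestion concrete, here is a rough sketch of the relevant pieces for a nightly build that ships parse-tika: keep text/html mapped to parse-html, let parse-tika handle PDFs, and include both plugins in plugin.includes. The plugin list shown is abbreviated and should be adapted to your own conf files rather than copied verbatim.

    # conf/parse-plugins.xml -- prefer parse-html for HTML, parse-tika for PDF
    <mimeType name="text/html">
      <plugin id="parse-html" />
    </mimeType>
    <mimeType name="application/pdf">
      <plugin id="parse-tika" />
    </mimeType>

    # conf/nutch-site.xml -- include both parsers (abbreviated plugin list)
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)</value>
    </property>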
Re: why does nutch interpret directory as URL
I am also facing the same problem. I thought of developing a plugin that returns null when such a URL is encountered, so that the URL won't be indexed. But I was wondering what criteria I should use to decide which URLs to discard. I hope my approach is correct.

On Thu, Apr 29, 2010 at 9:59 AM, xiao yang yangxiao9...@gmail.com wrote:

Because it's a URL indeed. You can either filter this kind of URL by configuring crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the regular expression) or filter the search results (you need to develop a Nutch plugin). Thanks! Xiao

On Thu, Apr 29, 2010 at 4:33 AM, BK bk4...@gmail.com wrote:

While indexing files on the local file system, why does Nutch interpret the directory as a URL, fetching file:/C:/temp/html/? This causes the index page of this directory to show up in search results. Any solutions for this issue? Bharteesh Kulkarni

-- Regards, Arpit Khurdiya
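If you want to try the URL-filter route before writing a plugin, a sketch of a crawl-urlfilter.txt fragment that drops directory-style file: URLs (anything ending in a slash) is below. Note the trade-off: filtering directories at crawl time also stops Nutch from fetching directory listings, so it will no longer discover the files inside them; if you rely on listings for link discovery, filtering at index or search time (the plugin approach) is probably what you actually want.

    # conf/crawl-urlfilter.txt -- rules are applied top-down, first match wins
    # skip directory listings (file: URLs ending in "/")
    -^file:.*/$
    # accept other file: URLs
    +^file: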
Issues in recrawling
Hi, I am new to the world of Nutch. I am trying to crawl local file systems on a LAN using Nutch 1.0 and search them with Solr. Documents are rarely modified, but they are frequently added and deleted, so I recrawl once a day. I have a few questions about recrawling.

1. What is the major difference between the bin/nutch crawl command and the recrawl script given in the wiki? Is it just that the script merges the segments? I am mostly curious about the performance implications.

2. Is there any way to tell the Solr index to delete a particular document when that resource no longer exists after a recrawl? I don't want to create a new Solr index every time I crawl; I want to update my existing index.

3. Since documents are rarely modified, I want them to be fetched only when they have actually changed. But once the default fetch interval (db.fetch.interval.default) is exceeded, a document is fetched regardless of whether it has been modified. Is there any way to fetch only the documents that are newly added or modified?

Thanks a lot.

-- Regards, Arpit Khurdiya
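On question 2, one option is to delete gone documents from Solr directly through its XML update handler; a rough sketch is below. The Solr URL and the example document id are assumptions (Nutch's example Solr schema typically uses the document URL as the id), so adjust them to your own schema. On question 3, the AdaptiveFetchSchedule (db.fetch.schedule.class) adjusts a page's fetch interval up or down depending on whether its content changed between fetches, which gets closer to "fetch only when modified" than the fixed default interval.

    # delete a stale document from Solr by id, then commit
    # (the path below is a hypothetical example, not taken from the thread)
    curl "http://localhost:8983/solr/update" -H "Content-Type: text/xml" \
      --data-binary '<delete><id>file:/C:/shared/docs/old-report.pdf</id></delete>'
    curl "http://localhost:8983/solr/update" -H "Content-Type: text/xml" \
      --data-binary '<commit/>'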