getting malformed URL exception

2010-05-01 Thread arpit khurdiya
I am trying to index files on a local intranet using Nutch 1.0, so I am
giving the path file:hostname/shared/ as the seed.
When I use the AdaptiveScheduler and crawl the intranet for the first
time, it works fine, but when I recrawl, it gives me a MalformedURLException.
When I use the default scheduler it works well. Any idea what is going wrong?

-- 
Regards,
Arpit Khurdiya


Re: nutch crawl issue

2010-04-29 Thread arpit khurdiya
 If you are using the nightly build, did you change the same thing in parse-plugins.xml?
Uncomment this:
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
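
If you also need PDFs (as in the mail below), the same file can map them to Tika. A minimal sketch, assuming the nightly build ships the parse-tika plugin under that id, so treat it as a starting point rather than a definitive mapping:

 <mimeType name="application/pdf">
   <plugin id="parse-tika" />
 </mimeType>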

Hopefully this helps you.

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius
mgris...@comcast.net wrote:
 in nutch-site.xml I modified plugin.includes

 parse-(html) works
 parse-(tika) does not

 I also need to parse PDFs, so I need both features. I tried parse-(html|tika)
 to see if html would be selected before tika, and that did not work.

 On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
 using Nutch nightly build nutch-2010-04-27_04-00-28:

 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.

 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.

 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar, to extract information about the html javadoc
 file from the command line. I am not familiar with tika, but the command
 line version doesn't return any metadata, e.g. no 'src=' links from the
 html 'frame' tags. Perhaps I'm using it incorrectly; I am not sure how
 nutch uses tika, and maybe it's not related . . .

 Has anyone crawled javadoc files or have any suggestions? Thanks.

 -m.
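
One more thing worth checking for a local javadoc crawl: if the file is fetched via a file: URL, plugin.includes in nutch-site.xml needs protocol-file as well as the parser you want. A minimal sketch, assuming the stock plugin names; the exact default value differs between builds, so compare it with your nutch-default.xml first:

 <property>
   <name>plugin.includes</name>
   <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>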






-- 
Regards,
Arpit Khurdiya


Re: why does nutch interpret directory as URL

2010-04-29 Thread arpit khurdiya
I am also facing the same problem.

I thought of developing a plugin that returns null when such a URL is
encountered, so that the URL won't be indexed.

But I am not sure what criteria I should use to decide which URLs to
discard.

I hope my approach is correct.
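
For reference, the filter-based alternative mentioned below would look something like this in conf/crawl-urlfilter.txt, assuming the regex URL filter (urlfilter-regex) is enabled in plugin.includes; I have not tested the pattern, so please double-check it:

 # skip any URL that ends with a slash, i.e. bare directory listings
 -^.*/$

 # accept everything else that got this far
 +.

One caveat: if Nutch discovers files by following links on those directory pages, filtering them out at crawl time would also stop the files underneath from being found. In that case, discarding them at indexing time (the plugin approach) is probably the safer place.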

On Thu, Apr 29, 2010 at 9:59 AM, xiao yang yangxiao9...@gmail.com wrote:
 Because it's a URL indeed.
 You can either filter this kind of URL by configuring
 crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the
 regular expression) or filter the search results (for that you would
 need to develop a nutch plugin).
 Thanks!

 Xiao

 On Thu, Apr 29, 2010 at 4:33 AM, BK bk4...@gmail.com wrote:
 While indexing files on the local file system, why does Nutch interpret the
 directory itself as a URL, fetching file:/C:/temp/html/ ?
 This causes the index page of this directory to show up in search results.
 Any solutions for this issue?


 Bharteesh Kulkarni





-- 
Regards,
Arpit Khurdiya


Issues in recrawling

2010-04-27 Thread arpit khurdiya
Hi,

I am new to the world of Nutch. I am trying to crawl local file systems
on a LAN using Nutch 1.0 and search them with Solr. Documents are rarely
modified, but they are frequently added and deleted, so the recrawl
frequency is one day. I have a few questions about recrawling.

1. What is the major difference between the bin/nutch crawl command and
the recrawl script given in the wiki? Is it just that the script merges
the segments? I am mostly curious about the performance implications.

2. Is there any way to tell the Solr index to delete a particular
document when that resource no longer exists after a recrawl? I don't
want to create a new Solr index every time I crawl; I want to update the
existing one. (The delete-by-query sketch at the end of this mail is the
workaround I am considering.)

3. As documents are rarely modified, I want them to be fetched only when
they have actually changed. But once interval.default is exceeded, a
document is re-fetched regardless of whether it has been modified. Is
there a way to fetch only those documents that are newly added or that
have been modified? (The fetch-schedule sketch right below is the
direction I was thinking of.)
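
On (3), the direction I was thinking of is the adaptive fetch schedule, which lengthens the interval for pages that come back unmodified and shortens it for pages that change. A sketch of the nutch-site.xml properties I was planning to try; I am not certain these property names are exactly right for 1.0, so please correct me if they differ:

 <property>
   <name>db.fetch.schedule.class</name>
   <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
 </property>

 <property>
   <name>db.fetch.interval.default</name>
   <!-- consider documents for re-fetch once a day, matching the recrawl cycle -->
   <value>86400</value>
 </property>

 <property>
   <name>db.fetch.schedule.adaptive.min_interval</name>
   <value>86400</value>
 </property>

 <property>
   <name>db.fetch.schedule.adaptive.max_interval</name>
   <!-- documents that never change drift out towards a 30-day interval -->
   <value>2592000</value>
 </property>

Whether an unchanged document is really skipped also seems to depend on the protocol reporting it as not modified, and I am not sure the file protocol does that.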

Thanks a lot..
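
On (2), the workaround I am considering is deleting stale documents from Solr directly with a delete-by-query and then committing. A rough sketch of the XML I would post to Solr's /update handler; it assumes my schema's unique key field is called id and holds the document URL, so adjust both to your setup:

 <delete>
   <!-- hypothetical id; substitute the URL of the resource that no longer exists -->
   <query>id:"file://server/shared/removed-doc.pdf"</query>
 </delete>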

--
Regards,
Arpit Khurdiya