I want to use Nutch to search a single (!) internet site (example: www.mysite.de).
I am quite new to Nutch and have just checked it out. In the tutorial I found the
intranet crawl chapter.
I think that is what I need. I followed the example, everything works fine, and
I can search my site.
My questions:
- How do I
Hi,
I'm using Nutch 0.7.
Is it possible to crawl only a certain number of pages in a single crawl cycle
(depth)? I looked at the FetchListTool class and I think it would be nice if the
emitFetchList method had a piece of code in its main loop that would look
something like this
if (count
I will try to answer your questions. If I am wrong, I am sure one of the
more experienced developers can correct me ...:)
- How do I update/refresh the index? There is no explanation or example
about the intranet crawl!
The main index (in crawldir/index) is updated by the CrawlTool after every
I think the http://wiki.apache.org/nutch/WritingPluginExample tutorial shows
how to implement the Filter - you would be filtering the 'content' metatag
instead of the 'recommended'. Then it is up to you what other Filters you
enable/disable. Also look at the
Hi
Currently the trunk has a provision to set the modified time in the CrawlDatum.
This is not currently used. Will it be used in the future?
Rgds
Prabhu
Raghavendra Prabhu wrote:
Hi
Currently the trunk has a provision to set the modified time in the CrawlDatum.
This is not currently used. Will it be used in the future?
Yes, it was added in anticipation of the adaptive fetch interval
patches. However, now I think it should probably be stored
Thanks Andrzej
One more thing I wanted to know: fetching the last modified date is
done as part of the content parsing function.
Can't we split the whole thing into two functions?
1) The first function will get the modified date
2) The second function will get the content
Initially the first function
Raghavendra Prabhu wrote:
Thanks Andrzej
One more thing I wanted to know: fetching the last modified date is
done as part of the content parsing function.
Can't we split the whole thing into two functions?
1) The first function will get the modified date
2) The second function will get the content
Chris,
I bumped the maximum number of open file descriptors to 32k, but still
no luck:
...
060214 062901 reduce 9%
060214 062905 reduce 10%
060214 062908 reduce 11%
060214 062911 reduce 12%
060214 062914 reduce 11%
060214 062917 reduce 10%
060214 062918 reduce 9%
060214 062919 reduce
Hi Florent
Does the MapReduce job go in a loop?
Can you let us know the environment?
Are you running on Windows or Linux?
If on Windows, you should use Cygwin.
Rgds
Prabhu
On 2/14/06, Florent Gluck [EMAIL PROTECTED] wrote:
Chris,
I bumped the maximum number of open file descriptors to 32k,
Please ignore my earlier message.
I think it is due to some other reason.
Rgds
Prabhu
On 2/14/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote:
Hi Florent
Does the MapReduce job go in a loop?
Can you let us know the environment?
Are you running on Windows or Linux?
If on Windows, you should
Hi:
Google Mini internals... check it out -
http://www.anandtech.com/IT/showdoc.aspx?i=2523&p=3
Pentium 3 and old Dell memory?
Regards
Hi
What is the format in which we enter a date range query?
Can anyone tell me how to form the query?
Rgds
Prabhu
From the source of the query-more plugin:
// query syntax is defined as date:mmdd-mmdd
Jake.
-----Original Message-----
From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 14, 2006 12:54 PM
To: nutch-user@lucene.apache.org
Subject: format for range date query
Hi
No problem. Check out the Intranet configuration section of
the tutorial (http://lucene.apache.org/nutch/tutorial.html):
Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with
the name of the domain you wish to crawl. For example, if you wished to
limit the crawl
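
As a rough illustration of how such a filter line behaves, here is a minimal Python sketch. It is not Nutch's own code. It assumes rules of the form +pattern or -pattern that are tried in order, with the first matching pattern deciding whether a URL is kept. The rule list, the accepts helper, and the mysite.de domain are made up for this example.

```python
import re

# Hypothetical rules in the "+pattern" / "-pattern" style of
# conf/crawl-urlfilter.txt. mysite.de is only an example domain.
RULES = [
    r"+^http://([a-z0-9]*\.)*mysite\.de/",  # accept the site and its subdomains
    r"-.",                                   # reject everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern matches is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is ignored.
    return False


if __name__ == "__main__":
    for url in ("http://www.mysite.de/index.html", "http://www.example.com/"):
        print(url, accepts(url))
```

With these rules, pages on mysite.de and its subdomains pass the filter and every other URL is dropped, which is the effect the tutorial step is after.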
Would it be possible to add a link to
http://www.mail-archive.com/nutch-user%40lucene.apache.org/ on
http://lucene.apache.org/nutch/mailing_lists.html? I'd suggest it goes
in as another bullet point under the Users mailing list section, with the
link text "Search the List Archives".
Vanderdray, Jacob wrote:
Is there an HTTPS protocol implementation for nutch?
Yes, protocol-httpclient supports https.
--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
Is there an HTTPS protocol implementation for nutch?
If you use protocol-httpclient (versus protocol-http) then it should
support https.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers