intranet crwl update

2006-02-14 Thread Poettgen
I want to use Nutch to search one (!) internet site (example: www.mysite.de). I am quite new to Nutch and have just checked it out. In the tutorial I found the intranet crawl chapter, and I think that is what I need. I followed the example, everything works fine, and I can search my site. My questions: - How do I

Max pages in crawl cycle

2006-02-14 Thread Bostjan
Hi, I'm using Nutch 0.7. Is it possible to crawl only a certain number of pages in a single crawl cycle (depth)? I looked at the FetchListTool class, and I think it would be nice if the emitFetchList method had a piece of code in its main loop that would look something like this: if (count
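
The snippet above is cut off, so the exact condition Bostjan proposed is not shown. The following is only a rough sketch of the general idea (stop emitting entries once a counter reaches a limit). It is not Nutch code: `emit_fetch_list`, `candidates`, and `max_pages` are hypothetical names made up for this illustration.

```python
from typing import Iterable, Iterator


def emit_fetch_list(candidates: Iterable[str], max_pages: int) -> Iterator[str]:
    """Yield at most max_pages URLs from candidates (hypothetical helper)."""
    if max_pages <= 0:
        return
    count = 0
    for url in candidates:
        yield url
        count += 1
        # Leave the main loop once the per-cycle limit is reached.
        if count >= max_pages:
            break


if __name__ == "__main__":
    # example.org is a placeholder domain, not taken from the thread.
    urls = [f"http://www.example.org/page{i}.html" for i in range(10)]
    print(list(emit_fetch_list(urls, 3)))
```

Whether a cap like this belongs inside the fetch-list generator or in a separate step is a design question the truncated message does not settle.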

Re: intranet crwl update

2006-02-14 Thread Thomas Delnoij
I will try to answer your questions. If I am wrong, I am sure one of the more experienced developers can correct me ...:) - How do I update/refresh the index? There is no explanation or example about the intranet crawl! The main index (in crawldir/index) is updated by the CrawlTool after every

Re: index content within metatag only

2006-02-14 Thread Thomas Delnoij
I think the http://wiki.apache.org/nutch/WritingPluginExample tutorial shows how to implement the Filter - you would be filtering the 'content' metatag instead of the 'recommended'. Then it is up to you what other Filters you enable/disable. Also look at the

writing modified date in crawl datum

2006-02-14 Thread Raghavendra Prabhu
Hi. The trunk currently has a provision to set the modified time in the CrawlDatum, but it is not used at present. Will it be used in the future? Rgds, Prabhu

Re: writing modified date in crawl datum

2006-02-14 Thread Andrzej Bialecki
Raghavendra Prabhu wrote: Hi. The trunk currently has a provision to set the modified time in the CrawlDatum, but it is not used at present. Will it be used in the future? Yes, it was added in anticipation of the adaptive fetch interval patches. However, now I think it should probably be stored

Re: writing modified date in crawl datum

2006-02-14 Thread Raghavendra Prabhu
Thanks, Andrzej. One more thing I wanted to know: fetching the last-modified date is done as part of the content parsing function. Can't we split the whole thing into two functions? 1) The first function will get the modified date. 2) The second function will get the content. Initially the first function
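
The message is truncated, so the rest of the proposal is not shown. As a hedged illustration of the two-step idea only: over HTTP, a HEAD request returns the headers (including `Last-Modified`, if the server sends it) without the body, so the first step can decide whether the second step, downloading the content, is needed. The sketch below covers just that decision. `needs_refetch` and its parameters are hypothetical names, not part of Nutch, and `stored` is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_refetch(last_modified_header: Optional[str],
                  stored: Optional[datetime]) -> bool:
    """Decide from the Last-Modified header whether to download the content.

    stored is the modification time recorded at the previous fetch and
    must be timezone-aware (assumption of this sketch).
    """
    if last_modified_header is None or stored is None:
        return True  # nothing to compare against, so fetch to be safe
    try:
        remote = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return True  # unparseable header, so fetch to be safe
    if remote.tzinfo is None:
        # Treat a header without zone information as UTC.
        remote = remote.replace(tzinfo=timezone.utc)
    return remote > stored


if __name__ == "__main__":
    previous = datetime(2006, 2, 14, 12, 0, tzinfo=timezone.utc)
    print(needs_refetch("Tue, 14 Feb 2006 13:00:00 GMT", previous))
```

Not every server sends a reliable `Last-Modified` header, which is why the sketch falls back to fetching whenever the comparison cannot be made.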

Re: writing modified date in crawl datum

2006-02-14 Thread Andrzej Bialecki
Raghavendra Prabhu wrote: Thanks, Andrzej. One more thing I wanted to know: fetching the last-modified date is done as part of the content parsing function. Can't we split the whole thing into two functions? 1) The first function will get the modified date. 2) The second function will get the content.

Re: Error while indexing (mapred)

2006-02-14 Thread Florent Gluck
Chris, I bumped the maximum number of open file descriptors to 32k, but still no luck: ... 060214 062901 reduce 9% 060214 062905 reduce 10% 060214 062908 reduce 11% 060214 062911 reduce 12% 060214 062914 reduce 11% 060214 062917 reduce 10% 060214 062918 reduce 9% 060214 062919 reduce

Re: Error while indexing (mapred)

2006-02-14 Thread Raghavendra Prabhu
Hi Florent, Does the mapreduce job go into a loop? Can you let us know the environment? Are you running on Windows or Linux? If on Windows, you should use Cygwin. Rgds, Prabhu On 2/14/06, Florent Gluck [EMAIL PROTECTED] wrote: Chris, I bumped the maximum number of open file descriptors to 32k,

Re: Error while indexing (mapred)

2006-02-14 Thread Raghavendra Prabhu
Please ignore my earlier message. I think it is due to some other reason. Rgds, Prabhu On 2/14/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote: Hi Florent, Does the mapreduce job go into a loop? Can you let us know the environment? Are you running on Windows or Linux? If on Windows, you should

offtopic - disecting google mini

2006-02-14 Thread Nutch Newbie
Hi: Google mini internals... check it out - http://www.anandtech.com/IT/showdoc.aspx?i=2523p=3 Pentium 3 and old dell memory? Regards

format for range date query

2006-02-14 Thread Raghavendra Prabhu
Hi, What is the format for entering a date range query? Can anyone tell me how to form the query? Rgds, Prabhu

RE: format for range date query

2006-02-14 Thread Vanderdray, Jacob
From the source of the query-more plugin: // query syntax is defined as date:yyyymmdd-yyyymmdd Jake. -----Original Message----- From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]] Sent: Tuesday, February 14, 2006 12:54 PM To: nutch-user@lucene.apache.org Subject: format for range date query Hi
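
As a small sketch of forming such a query string, assuming each endpoint is an eight-digit yyyymmdd date (this is how I read the plugin's syntax comment; check the query-more source for your Nutch version). `date_range_query` is a hypothetical helper, not a Nutch API.

```python
from datetime import date


def date_range_query(start: date, end: date) -> str:
    """Build a date:<start>-<end> clause with yyyymmdd endpoints (assumed format)."""
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    return f"date:{start:%Y%m%d}-{end:%Y%m%d}"


if __name__ == "__main__":
    print(date_range_query(date(2006, 1, 1), date(2006, 2, 14)))
```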

RE: Nutch search engine can be used to search only on specific domain?

2006-02-14 Thread Vanderdray, Jacob
No problem. Check out the Intranet configuration section of the tutorial (http://lucene.apache.org/nutch/tutorial.html): Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl
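
To make the effect of such a filter file concrete, here is a simplified Python model of how a list of `+`/`-` prefixed regular expressions can restrict a crawl to one domain. It is a sketch under stated assumptions, not Nutch's implementation: it assumes the first matching rule decides and that a URL matching no rule is rejected. The domain `example.org`, the suffix rule, and the names `RULES_TEXT`, `parse_rules`, and `accepts` are hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical rules in the style of conf/crawl-urlfilter.txt.
RULES_TEXT = r"""
# skip a few file suffixes (illustrative list)
-\.(gif|jpg|png|css|zip)$
# accept hosts in example.org (placeholder for MY.DOMAIN.NAME)
+^http://([a-z0-9]*\.)*example\.org/
# skip everything else
-.
"""


def parse_rules(text: str) -> List[Tuple[bool, "re.Pattern[str]"]]:
    """Turn '+regex' / '-regex' lines into (include, compiled pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url: str, rules: List[Tuple[bool, "re.Pattern[str]"]]) -> bool:
    """First matching rule wins; a URL matching no rule is rejected."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in ("http://www.example.org/index.html",
                "http://www.example.org/logo.gif",
                "http://other.example.com/"):
        print(url, accepts(url, rules))
```

Because rule order matters in this model, the suffix exclusions come before the domain rule; otherwise the domain rule would accept image URLs first.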

Link to Search Interface for List

2006-02-14 Thread Vanderdray, Jacob
Would it be possible to add a link to http://www.mail-archive.com/nutch-user%40lucene.apache.org/ on http://lucene.apache.org/nutch/mailing_lists.html? I'd suggest adding it as another bullet point under the Users mailing list section, with the link text "Search the List Archives".

Re: HTTPS Protocol Implementation

2006-02-14 Thread Andrzej Bialecki
Vanderdray, Jacob wrote: Is there an HTTPS protocol implementation for nutch? Yes, protocol-httpclient supports https. -- Best regards, Andrzej Bialecki Information Retrieval, Semantic Web

Re: HTTPS Protocol Implementation

2006-02-14 Thread Ken Krugler
Is there an HTTPS protocol implementation for nutch? If you use protocol-httpclient (versus protocol-http) then it should support https. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 Find Code, Find Answers