Re: Regex filter - sanity check

2010-11-16 Thread Sebastian Nagel
Hi Eric, So, for the last part with the prefix URL filter: I can add http://Something.mydomain.com and http://mydomainc.com/seasonal/law_school_sucks, and that file will tell Nutch to follow those URLs as prefixes? Is that correct? The plugin urlfilter-prefix will filter out all URLs which do
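For reference, the urlfilter-prefix plugin reads a plain-text file of accepted prefixes, one per line. A minimal sketch of such a prefix-urlfilter.txt, using the two URLs from the question (the exact file name and location depend on your Nutch version and configuration):

    # prefix-urlfilter.txt: one accepted URL prefix per line.
    # URLs that do not start with any listed prefix are filtered out.
    http://Something.mydomain.com
    http://mydomainc.com/seasonal/law_school_sucks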

Re: how to crawl a page but not index it

2010-11-16 Thread ytthet
Hi All, I have similar requirements to Beats'. I need to crawl certain pages to extract URLs, but not to index them. For example, a blog home page contains snapshots of the latest pages and links to them. In that case, I need to extract only the links and not index the page. I cannot do as Jake

Re: how to crawl a page but not index it

2010-11-16 Thread Andrzej Bialecki
On 2010-11-16 12:13, ytthet wrote: Hi All, I have similar requirements to Beats'. I need to crawl certain pages to extract URLs, but not to index them. For example, a blog home page contains snapshots of the latest pages and links to them. In that case, I need to extract only the links and not
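Andrzej's reply is truncated above, but one common approach in Nutch 1.x (not necessarily the one he suggested) is a custom IndexingFilter that returns null for pages that should be crawled but not indexed; outlinks are extracted at parse time, so they are still followed in later fetch rounds. A sketch, assuming the Nutch 1.x plugin API and a hypothetical URL pattern standing in for "blog home page":

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class SkipPageIndexingFilter implements IndexingFilter {
      private Configuration conf;

      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Hypothetical rule: treat site roots as link-only pages.
        // Adjust the pattern to your own notion of "blog home page".
        if (url.toString().matches("https?://[^/]+/?")) {
          return null; // returning null drops the page from the index
        }
        return doc;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }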

Re: Skipping URL's with keyword

2010-11-16 Thread Dennis Kubes
For a requirement like that you probably need to do one of two things: 1) Write a very long URL regex with something like (word1|word2|word3...). Wouldn't be my first choice but should work. 2) Write your own URL filter plugin that can filter by word. Check out the domain URL filter as an
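As an illustration of option 1, a sketch of an exclusion rule for conf/regex-urlfilter.txt (word1, word2, word3 stand in for the actual keywords; rules are applied top to bottom, a leading '-' rejects and a leading '+' accepts, and matching is substring-based):

    # reject any URL containing one of the keywords
    -(word1|word2|word3)
    # accept everything else
    +.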

Re: how to cut indexes for distributed searching?

2010-11-16 Thread Dennis Kubes
There are two schools of thought on distributed tf/idf: the lazy way and the exact way. 1) The lazy way says that if you have a consistent number of docs in each shard (index), then your tf/idf should be approximately correct even though the scoring only pulls from each index individually during calculation. 2) Exact
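To make the difference concrete in Lucene-style notation (a sketch; the exact formula depends on the Similarity implementation): the lazy way scores shard s using only its local statistics, while the exact way gathers global document frequencies across all shards first:

    \mathrm{idf}_{\mathrm{lazy}}(t,s) = 1 + \log\frac{N_s}{\mathrm{df}_s(t)+1},
    \qquad
    \mathrm{idf}_{\mathrm{exact}}(t) = 1 + \log\frac{\sum_s N_s}{\sum_s \mathrm{df}_s(t)+1}

where N_s is the document count of shard s and df_s(t) is the number of its documents containing term t. When shards are similar in size and term distribution, the two values are close, which is the lazy way's justification.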

Fetch error during crawling

2010-11-16 Thread matinte
Hi, I am trying to crawl with a given seed URL but I'm getting the following error:
...
fetch of url failed with: java.io.EOFException
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
Do you have any idea? Thanks

Re: Fetch error during crawling

2010-11-16 Thread Ye T Thet
Matinte, I have encountered that before. In my experience, it is caused by the URL: the URL you are trying to crawl does not exist or the server is not responding. Warm Regards, YT Thet On Wed, Nov 17, 2010 at 12:44 AM, matinte miguel.ti...@gmail.com wrote: Hi, I am trying to crawl with a seed

Re: Fetch error during crawling

2010-11-16 Thread Markus Jelsma
That should generate an IOException if I'm not mistaken. On Tuesday 16 November 2010 18:16:45 Ye T Thet wrote: Matinte, I have encountered that before. In my experience, it is caused by the URL: the URL you are trying to crawl does not exist or the server is not responding. Warm Regards,

Re: Fetch error during crawling

2010-11-16 Thread matinte
The URL does exist, but, for example, when I try curl on it, it returns: curl: (56) Failure when receiving data from the peer. Could it be a problem with the server? 2010/11/16 Markus Jelsma-2 [via Lucene] ml-node+1912044-590307235-224...@n3.nabble.com
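A verbose curl run with an explicit timeout usually shows where the transfer breaks off (a generic sketch; http://example.com/ is a placeholder for the failing URL):

    # -v prints the full request/response exchange;
    # -m 30 aborts the transfer after 30 seconds instead of hanging.
    curl -v -m 30 http://example.com/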

Re: Fetch error during crawling

2010-11-16 Thread Ye T Thet
Thanks, Markus, for the correction; that may well be right. However, in my case the server was taking very long to respond (to serve the page), and I received something similar to the following when fetching the document: 2010-08-28 01:24:53,212 INFO fetcher.Fetcher - fetch of
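If the server is merely slow rather than broken, raising Nutch's HTTP timeout in conf/nutch-site.xml may avoid the truncated fetch (a sketch; http.timeout is the protocol-http timeout in milliseconds, default 10000):

    <property>
      <name>http.timeout</name>
      <value>60000</value>
      <description>The default network timeout, in milliseconds.</description>
    </property>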