Hi Eric,
So, the last part, with the prefix URL filter, is that I can add:
http://Something.mydomain.com
http://mydomainc.com/seasonal/law_school_sucks
and that file will tell Nutch to follow only URLs that start with those
prefixes? Is that correct?
The plugin urlfilter-prefix will filter out all URLs which do not start
with one of the prefixes listed in that file.
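For illustration, a minimal conf/prefix-urlfilter.txt could look like the
following (the URLs are just the ones from your example; lines starting
with # are comments):

    # one prefix per line; any URL that does not start with
    # one of these prefixes is rejected
    http://Something.mydomain.com
    http://mydomainc.com/seasonal/law_school_sucks

The plugin also has to be enabled, e.g. by making sure urlfilter-prefix
appears in the plugin.includes property in conf/nutch-site.xml.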
Hi All,
I have similar requirements to Beats's.
I need to crawl certain pages to extract URLs, but not to index them.
For example, a blog home page contains snapshots of the latest posts and
links to them. In that case, I need to extract only the links and not
index the page.
I cannot do as Jake
On 2010-11-16 12:13, ytthet wrote:
I need to crawl certain pages to extract URLs, but not to index them. [...]
For a requirement like that you probably need to do one of two things:
1) Write a very long regex URL filter entry with something like
(word1|word2|word3|...). It wouldn't be my first choice, but it should work.
2) Write your own URL filter plugin that can filter by word. Check out
the domain URL filter as an example. There are sketches of both options
below.
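For option 1, a single line like this in conf/regex-urlfilter.txt should
reject any URL containing one of the words (word1, word2, word3 are
placeholders):

    -.*(word1|word2|word3).*

For option 2, here is a minimal sketch assuming Nutch 1.x's URLFilter
interface (return the URL to keep it, null to drop it). The class name
and the hard-coded word list are made up for illustration; a real plugin
would read its word list from a configuration file, the way
urlfilter-domain reads its domain list:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    public class WordURLFilter implements URLFilter {

      private Configuration conf;

      // hypothetical word list; a real plugin would load it from a file
      private static final String[] BLOCKED_WORDS = { "word1", "word2", "word3" };

      public String filter(String urlString) {
        for (String word : BLOCKED_WORDS) {
          if (urlString.contains(word)) {
            return null;        // reject: URL contains a blocked word
          }
        }
        return urlString;       // accept: pass the URL through unchanged
      }

      public void setConf(Configuration conf) {
        this.conf = conf;
      }

      public Configuration getConf() {
        return conf;
      }
    }

The plugin would still need the usual plugin.xml descriptor and an entry
in plugin.includes before Nutch picks it up.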
There are two schools of thought on distributed tf/idf: the lazy way and
the exact way.
1) The lazy way says that if you have a consistent number of docs in each
shard (index), then your tf/idf should be approximately right, even though
the scoring only pulls from each index individually during calculation.
2) The exact way computes global document frequencies across all shards
first, so every shard scores against the same corpus-wide statistics (see
the toy example below).
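As a toy illustration of why the lazy way can drift, compare per-shard
and global idf for one term. The numbers are made up and the formula is
the simplified idf = ln(N/df), not Lucene's exact one:

    public class IdfDemo {

      // simplified idf; Lucene's classic formula differs slightly
      static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / docFreq);
      }

      public static void main(String[] args) {
        // the term is rare in shard A but common in shard B
        System.out.println("shard A idf: " + idf(1000, 10));  // ~4.61
        System.out.println("shard B idf: " + idf(1000, 500)); // ~0.69
        System.out.println("global  idf: " + idf(2000, 510)); // ~1.37
      }
    }

With per-shard scoring, the same term is weighted very differently in
each index; the exact way makes every shard use the global figure.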
Hi,
I am trying to crawl with a given seed URL, but I'm getting the following error:
...
fetch of url failed with: java.io.EOFException
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
Do you have any idea?
Thanks
Matinte,
I have encountered that before.
In my experience, it is caused by the URL: the URL you are trying to crawl
does not exist or the server is not responding.
Warm Regards,
YT Thet
On Wed, Nov 17, 2010 at 12:44 AM, matinte miguel.ti...@gmail.com wrote:
I am trying to crawl with a given seed URL [...]
That should generate an IOException, if I'm not mistaken.
On Tuesday 16 November 2010 18:16:45 Ye T Thet wrote:
In my experience, it is caused by the URL: the URL you are trying to
crawl does not exist or the server is not responding. [...]
The URL does exist, but when I try curl on it, for example, it returns:
curl: (56) Failure when receiving data from the peer
Could it be a problem with the server?
Thanks, Markus, for the correction.
You might be right.
However, in my case, the server was taking very long to respond (to serve
the page), and I received something similar to the following when fetching
the document:
2010-08-28 01:24:53,212 INFO fetcher.Fetcher - fetch of
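If the server is just slow rather than down, one thing that may help is
raising the fetcher's network timeout in conf/nutch-site.xml. The
http.timeout property is in milliseconds (10000 is the usual default; the
value below is only an example):

    <property>
      <name>http.timeout</name>
      <value>30000</value>
      <description>The default network timeout, in milliseconds.</description>
    </property>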