Hi,
I downloaded the SMB protocol plugin from the following location:
http://issues.apache.org/jira/browse/NUTCH-427
I configured it with Nutch (as described in read.txt). But when I try to
crawl, nothing gets crawled and I get the following exception in the hadoop log.
2009-12-21 16:25:04,728 FATAL
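For reference, a protocol plugin is normally activated through the plugin.includes property in conf/nutch-site.xml. A minimal sketch, assuming the plugin id shipped with NUTCH-427 is protocol-smb (that id and the other plugin names below are illustrative, not confirmed by this thread):

```xml
<!-- conf/nutch-site.xml: add the SMB plugin to the active plugin list -->
<property>
  <name>plugin.includes</name>
  <!-- "protocol-smb" is an assumed plugin id; adjust to match the plugin.xml in the patch -->
  <value>protocol-smb|urlfilter-regex|parse-(text|html)|index-basic</value>
</property>
```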
Hi,
Nutch is throwing errors while fetching large files (files larger than
100 MB). I have a website whose pages point to large files (sizes vary
from 10 MB to 500 MB), and there are several such files on that
website. I want to fetch all the files using Nutch, but Nutch is
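Not an answer given in the thread, but download truncation in Nutch is typically governed by the content-limit properties, so a setting along these lines in conf/nutch-site.xml may be relevant (the value -1 is illustrative; verify against nutch-default.xml for your version):

```xml
<!-- conf/nutch-site.xml: raise or disable the per-file download limit -->
<property>
  <name>file.content.limit</name>
  <!-- -1 disables the limit; the default truncates content after 64 KB -->
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
```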
On 2009-12-21 17:15, Sundara Kaku wrote:
Hi,
Nutch is throwing errors while fetching large files (files larger than
100 MB). I have a website whose pages point to large files (sizes vary
from 10 MB to 500 MB), and there are several such files on that
website. I want to fetch
Hi,
I found the db.ignore.external.links property.
How do I limit the crawl further by also excluding some links within the
same domain?
Thanks
http://www.fileformat.info/info/unicode/char/2029/index.htm
I have found that this Unicode character breaks JSON deserialization
when using Solr with AJAX.
It comes from PDF text.
Where should this character be filtered out or replaced? In the PDF
parser/text extractor, or in the Solr indexer?
regards
reinhard
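One place such a character can be handled is just before the JSON reaches the browser: U+2028 and U+2029 are legal inside JSON strings but are line terminators in JavaScript, which is why eval()-style deserializers choke on them. A minimal sketch of escaping them (the class and method names here are mine, not from Solr or Nutch):

```java
public class JsonSanitizer {

    // U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) are valid in
    // JSON strings but terminate lines in JavaScript, so replace the raw
    // characters with their \uXXXX escape sequences before an eval()-based
    // parser sees the response.
    public static String escapeJsLineTerminators(String json) {
        return json.replace("\u2028", "\\u2028")
                   .replace("\u2029", "\\u2029");
    }

    public static void main(String[] args) {
        String raw = "{\"text\":\"first paragraph\u2029second paragraph\"}";
        // The raw U+2029 character is turned into the six-character
        // escape sequence \u2029, which every JSON parser accepts.
        System.out.println(escapeJsLineTerminators(raw));
    }
}
```

Escaping at the serving layer keeps the original character in the index, so search results are unaffected; stripping it in the parser instead would lose the paragraph boundary.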
You should be able to do this using one of the variations of the *-urlfilter.txt
files. Instead of using + in front of the regex, you can tell it to
exclude URLs that match the regex with a -.
Just a guess, I haven't actually tried it, but you could probably use
something like the following. (I'm
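A sketch of what such rules might look like in conf/regex-urlfilter.txt (the domain is a placeholder and the patterns are untested, so treat this as a starting point only):

```
# exclude any URL containing "calendar", even on the allowed domain
-calendar
# accept URLs on the target domain
+^http://www\.example\.com/
# reject everything else
-.
```

Nutch evaluates these rules top to bottom and the first matching line decides, so the exclusion rules must appear before the broader + rule.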
But how can we tell Nutch to crawl this way every time?
I do not want to edit *-filter.txt every time.
Thanks,
Jun
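One way to avoid per-crawl edits (my assumption, building on the db.ignore.external.links property mentioned earlier in this thread rather than anything confirmed here) is to set it once in conf/nutch-site.xml, so every crawl stays within the hosts of the seed URLs:

```xml
<!-- conf/nutch-site.xml: drop outlinks that leave the seed hosts -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```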
-----Original Message-----
From: Jesse Hires [mailto:jhi...@gmail.com]
Sent: December 22, 2009 9:23
To: nutch-user@lucene.apache.org
Subject: Re: domain crawl using