You may want to consider using memcached -
http://www.danga.com/memcached/ - it's super simple and super
stable. I use it over at Simpy.com and the memcached daemon there
has been up for months without showing any signs of trouble.
We've had good luck with ehcache
On 9/7/06, David Wallace [EMAIL PROTECTED] wrote:
Just guessing, but could this be caused by session ids in the URL? Or
some other unimportant piece of data? If this is the case, then every
page would be added to the index when it's crawled, regardless of
whether it's already in there, with a
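If session IDs do turn out to be the problem, one common fix is a URL normalizer rule that strips them before URLs enter the crawl db. A hypothetical regex-normalize.xml entry along these lines (the pattern is only an illustration, not taken from the original mail) might look like:

  <regex-normalize>
    <regex>
      <!-- illustrative rule: strip a jsessionid parameter from URLs -->
      <pattern>([;?&amp;]jsessionid=[^&amp;]*)</pattern>
      <substitution></substitution>
    </regex>
  </regex-normalize>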
I want to customize the crawling process by modifying the way pages are
stored. As far as I know, Nutch stores web pages in a binary file, page
by page. After a link analysis step, Nutch will crawl to the destination
page and download it. When pages are stored, I want to write only the link to a
Tomi NA wrote:
On 9/7/06, David Wallace [EMAIL PROTECTED] wrote:
Just guessing, but could this be caused by session ids in the URL? Or
some other unimportant piece of data? If this is the case, then every
page would be added to the index when it's crawled, regardless of
whether it's already
(moved to nutch-user)
Tomi NA wrote:
On 9/7/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Tomi NA wrote:
On 9/7/06, Nick Burch [EMAIL PROTECTED] wrote:
On Thu, 7 Sep 2006, Tomi NA wrote:
On 9/7/06, Venkateshprasanna [EMAIL PROTECTED] wrote:
Is there any filter available for extracting
Thanks for your reply.
I have found that the method you mentioned looks into the HTTP header from
the web server. It looks for the charset and does the mapping. The Apache web
server that hosts the document is already configured with:
AddDefaultCharset Big5-HKSCS
The crawl engine does treat the
Hi,
I've been trying to get the Nutch fetcher to work for a couple of
days, but it always hangs on one of the reduce processes, and the job is
aborted. I am using numFetchers=24 during generate, 24 map tasks and 6
reduce tasks during fetch on a 3 machine cluster. The task that failed
was
Hi,
I would like to know how I could get all the subcollections and how many
documents belong to each subcollection after making a query.
The approach I took was to iterate over the results, getting details for
each one. The problem is that every query I make is limited by numHits [
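In case it helps, here is roughly what that iterate-and-count approach looks like with the NutchBean search API; the "subcollection" field name and the hard-coded numHits value are assumptions on my part, so this is only a sketch, not a drop-in solution:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.searcher.Hit;
  import org.apache.nutch.searcher.HitDetails;
  import org.apache.nutch.searcher.Hits;
  import org.apache.nutch.searcher.NutchBean;
  import org.apache.nutch.searcher.Query;
  import org.apache.nutch.util.NutchConfiguration;

  public class SubcollectionCounts {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      NutchBean bean = new NutchBean(conf);
      Query query = Query.parse(args[0], conf);

      // numHits caps how many results come back, which is exactly the
      // limitation mentioned above; a large value is only a blunt workaround.
      Hits hits = bean.search(query, 1000);

      // Tally how many of the returned hits fall into each subcollection.
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (int i = 0; i < hits.getLength(); i++) {
        Hit hit = hits.getHit(i);
        HitDetails details = bean.getDetails(hit);
        String sub = details.getValue("subcollection"); // assumed field name
        Integer n = counts.get(sub);
        counts.put(sub, n == null ? 1 : n + 1);
      }
      System.out.println(counts);
    }
  }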
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
(moved to nutch-user)
Tomi NA wrote:
On 9/7/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Tomi NA wrote:
On 9/7/06, Nick Burch [EMAIL PROTECTED] wrote:
On Thu, 7 Sep 2006, Tomi NA wrote:
On 9/7/06, Venkateshprasanna [EMAIL PROTECTED]
On 9/8/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Tomi NA wrote:
On 9/7/06, David Wallace [EMAIL PROTECTED] wrote:
Just guessing, but could this be caused by session ids in the URL? Or
some other unimportant piece of data? If this is the case, then every
page would be added to the index
I fetched two sites and indexed them separately. How could I merge them?
Thank you!
You would need to modify Fetcher line 433 to use a text output format
like this:
job.setOutputFormat(TextOutputFormat.class);
and you would need to modify Fetcher line 307 to collect only the
information you are looking for, maybe something like this:
Outlink[] links =
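For what it's worth, here is a rough sketch of the two changes in one place. This is not the real Fetcher source; the class and method names are invented for illustration, and the line numbers above refer to the poster's copy of Fetcher.java, not to this snippet:

  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextOutputFormat;
  import org.apache.nutch.parse.Outlink;
  import org.apache.nutch.parse.Parse;

  public class FetcherPatchSketch {

    // First edit: have the fetch job write plain text instead of the
    // default binary output format.
    static void useTextOutput(JobConf job) {
      job.setOutputFormat(TextOutputFormat.class);
    }

    // Second edit: keep only the outlinks of a fetched page rather than
    // its full content.
    static Outlink[] linksOnly(Parse parse) {
      return parse.getData().getOutlinks();
    }
  }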
Thanks.
I put the following in nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without
How many URLs are you fetching, and does each machine have the same
settings as below?
Remember that the number of fetchers is the number of fetcher threads per
task, per machine. So you would be running 2 tasks per machine * 12 threads *
3 machines = 72 fetchers.
Dennis
Vishal Shah wrote:
Hi,
You may be running into problems with regex stalls on filtering. Try
removing the regex filter from the nutch-site.xml plugin.includes
property. I was having similar problems before switching to just the
prefix and suffix filters, as below. I attached my prefix and suffix URL
filter files
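For reference, the kind of nutch-site.xml change I mean looks roughly like this; the plugin list below is only an example (it drops urlfilter-regex and enables urlfilter-prefix and urlfilter-suffix), and the exact set will differ depending on which protocol, parse, and query plugins you actually need:

  <property>
    <name>plugin.includes</name>
    <!-- regex URL filter removed; prefix and suffix filters used instead -->
    <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>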
Dear Nutch User List,
I am desperately trying to index an intranet with the following
characteristics:
1) Some sites require no authentication - these already work great!
2) Some sites require basic HTTP Authentication.
3) Some sites require NTLM Authentication.
4) No sites require both HTTP and
Dear Nutch Users,
Does anyone have experience indexing the contents of Windows file shares?
There's information on the Wiki about indexing the local disk, but nothing
about remote shares.
Also, does Nutch traverse directories on its own, or does it require
document links? Thanks in advance.
Assuming you have two separate WAR files deployed, it should be as easy
as setting the searcher.dir property in the nutch-site.xml file in the
different WEB-INF directories to the separate index locations. If you
want to go the distributed searching route, there is an in-depth
explanation on
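As a concrete illustration, each webapp's own nutch-site.xml would point searcher.dir at its own crawl directory; the path below is purely hypothetical:

  <property>
    <name>searcher.dir</name>
    <value>/data/crawl-siteA</value>
  </property>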