Is this an appropriate place to ask what hardware and OS people are running?
If not, sorry for the spam. :)
Right now I am experimenting with three Intel Atom 330 based computers
running Fedora Core.
Jesse
int GetRandomNumber()
{
return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com
I just decided to start everything over with the latest version of nutch
from the trunk. So far I am able to crawl and index ok, but I am having
trouble getting results back from a search.
I get the typical "0 results found" when the searchers/indexes cannot be
found, but I don't know where to look.
This also prevents things like over-indexing generated calendars, where the
next day/month/year link will always produce output no matter how far it
goes.
Jesse
Can I blow away crawldb, then inject a new set of URLs, without having to
rebuild the indexes?
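In other words, something along these lines (a hedged sketch; the crawl/ layout and the urls/ seed directory are just example paths, and I haven't verified that the indexes survive this):

```shell
# Drop only the crawl database; leave segments and indexes alone.
rm -r crawl/crawldb
# Inject the new seed list into a fresh crawldb (urls/ is an example path).
bin/nutch inject crawl/crawldb urls/
# crawl/segments and crawl/indexes are untouched, so searching should keep working.
```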
Jesse
In nutch-default.xml I have the following:

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.</description>
</property>
Yet after letting things run for some time, if I look at the
After letting my setup run for a while, I have quite the queue of unfetched
URLs: on the order of 10 unfetched for every one fetched.
Is there a way to trim the lowest-scoring unfetched URLs from Nutch?
Jesse
You should be able to do this using one of the variations of the *-urlfilter.txt
files. Instead of using + in front of the regex, you can tell it to
exclude URLs that match the regex with a -.
Just a guess, I haven't actually tried it, but you could probably use
something like the following. (I'm
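For example, something like this in regex-urlfilter.txt (the pattern is hypothetical and untested; adjust it to match your calendar URLs):

```
# skip anything with "calendar" in the URL (hypothetical pattern)
-calendar
# accept everything else
+.
```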
I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically, I am seeing www.domain.com and
domain.com being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting? If
not, what would you recommend?
Bialecki a...@getopt.org wrote:
On 2009-12-10 19:59, Jesse Hires wrote:
Check your tomcat logs to make sure it is finding things correctly (tail
-f on it while doing a search).
Also make sure the location of the index and segments is where the conf
files say it is.
Did you start bin/nutch server portnumber, where portnumber is the port
you specified in the conf?
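If not, roughly this (the port and the paths are just examples):

```shell
# Watch the tomcat log while issuing a search (log path is an example).
tail -f /usr/local/tomcat/logs/catalina.out
# Start the distributed search server; 9999 and the crawl dir are examples.
bin/nutch server 9999 /path/to/crawl
```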
Use the -topN flag to only grab a small number of URLs.
I also believe there is a setting you can put in nutch-site.xml that
can be used to slow down how many URLs you grab over time.
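For example, generate with something like `bin/nutch generate crawl/crawldb crawl/segments -topN 1000` (paths are examples), and, if I remember the property name right (treat this as a guess), throttle per-server fetching in nutch-site.xml:

```xml
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds between requests to the same server (example value).</description>
</property>
```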
Jesse
Thanks! Fixing how I was merging the indexes took care of the warning.
Jesse
On Tue, Dec 1, 2009 at 4:49 AM, Andrzej Bialecki a...@getopt.org wrote:
Jesse Hires wrote:
I am getting warnings in hadoop.log that segments.gen and segments_2 are not
directories, and as you can see by the listing, they are in fact files not
directories. I'm not sure what stage of the process this is happening in, as
I just now stumbled on them, but it concerns me that it says it is
at 8:57 AM, Andrzej Bialecki a...@getopt.org wrote:
Jesse Hires wrote:
/index2/segments.gen not a directory)
Jesse
On Mon, Nov 30, 2009 at 9:30 AM, Jesse Hires jhi...@gmail.com wrote:
Actually, searcher.dir is still the default crawl
Does bin/nutch merge only create a whole new index out of several smaller
indexes, or can it be used to incrementally update a single large index with
newly fetched and indexed smaller segments?
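For context, the invocation I have in mind is along these lines (paths are examples, and I may have the argument order wrong):

```shell
# Merge several smaller indexes into a single new output index.
bin/nutch merge crawl/index-new crawl/indexes/part-00000 crawl/indexes/part-00001
```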
Jesse
I seem to be running into a roadblock with the resources I have available.
The time it takes to split a segment into two segments using -slice goes off
the hook when there are over 500k unfetched urls.
I've been running generate/fetch for -topN 4000 and it has been
incrementally increasing in time
My apologies, I missed a patch option. :-P
Must need more coffee.
Jesse
On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires jhi...@gmail.com wrote:
Julien,
I tried to apply your
I have a two datanode and one namenode setup. One of my datanodes is slower
than the other, causing the fetch to run significantly longer on it. Is
there a way to balance this out?
Jesse
Thanks, I'll give that a shot!
Jesse
On Thu, Oct 29, 2009 at 5:53 AM, Andrzej Bialecki a...@getopt.org wrote:
Jesse Hires wrote:
On Sat, Oct 17, 2009 at 11:49 AM, Andrzej Bialecki a...@getopt.org wrote:
Jesse Hires wrote:
Does anyone have any insight into the following error I am seeing in the
hadoop logs? Is this something I should be concerned with, or is it expected
that this shows up in the logs from time to time? If it is not expected,
where can I look for more information on what is going on?
2009-10-16
On Wed, Sep 23, 2009 at 5:48 AM, Jesse Hires jhi...@gmail.com wrote:
Exactly! Sorry for being so confusing in my original question.
Jesse
On Wed, Sep
in nutch-site.xml to point to the search-servers.txt file, where you
entered the hosts and ports of your search servers (detailed description:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg12730.html).
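i.e., roughly this (host names, port, and path are examples based on my reading of that post, not verified):

```xml
<!-- search-servers.txt in that directory lists one "host port" pair per line, e.g.:
     node1 9999
     node2 9999 -->
<property>
  <name>searcher.dir</name>
  <value>/path/to/conf</value>
</property>
```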
Kind regards,
Martina
-----Original Message-----
From: Jesse