hardware questions?

2010-03-10 Thread Jesse Hires
Is this an appropriate place to ask what hardware and OS people are running? If not, sorry for the spam. :) Right now I am experimenting with three Intel Atom 330 based computers running Fedora Core. Jesse

help troubleshooting search problems.

2010-02-17 Thread Jesse Hires
I just decided to start everything over with the latest version of nutch from the trunk. So far I am able to crawl and index OK, but I am having trouble getting results back from a search. I get the typical "0 results found" when the searchers/indexes cannot be found, but I don't know where to look
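A reasonable first check (a sketch, assuming a default Nutch 1.x layout; paths are illustrative) is to confirm that the webapp's searcher.dir points at the crawl directory, and to watch the logs while searching:

    # illustrative paths -- adjust to your install
    grep -A 2 searcher.dir $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
    tail -f $CATALINA_HOME/logs/catalina.out   # run a search in another window
    bin/nutch readdb crawl/crawldb -stats      # sanity-check that pages were actually fetched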

Re: url normalization

2010-01-27 Thread Jesse Hires
This also prevents things like over-indexing generated calendars, where the next day/month/year link will always produce output no matter how far it goes. Jesse
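A hedged sketch of such a guard in conf/regex-urlfilter.txt (the pattern is invented for illustration; real calendar URLs vary):

    # skip anything that looks like a generated calendar page (illustrative)
    -[?&](year|month|day)=
    # allow everything else on the site
    +^http://([a-z0-9]*\.)*example.com/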

can I blow away crawldb?

2010-01-25 Thread Jesse Hires
Can I blow away crawldb, then inject a new set of URLs, without having to rebuild the indexes? Jesse
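The crawldb is a separate directory from the indexes and segments, so in principle yes; a minimal sketch assuming the default crawl/ layout (paths illustrative):

    hadoop fs -rmr crawl/crawldb          # or plain rm -r on a local filesystem
    bin/nutch inject crawl/crawldb urls   # seed a fresh crawldb from the urls/ dir
    # crawl/indexes and crawl/segments are left untouched and stay searchable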

How come I have so many retries listed in stats?

2010-01-09 Thread Jesse Hires
In nutch-default.xml I have the following property:

    <property>
      <name>db.fetch.retry.max</name>
      <value>3</value>
      <description>The maximum number of times a url that has encountered
      recoverable errors is generated for fetch.</description>
    </property>

Yet after letting things run for some time, if I look at the
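The retry counts being asked about come from the crawldb stats report; for reference (the output shape is approximate):

    bin/nutch readdb crawl/crawldb -stats
    # the report includes per-retry buckets, roughly:
    #   retry 0:    123456
    #   retry 1:    2345
    #   ...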

Is there a way to trim unfetched URLs?

2009-12-24 Thread Jesse Hires
After letting my setup run for a while, I have quite the queue of unfetched URLs, on the order of ten unfetched for every fetched one. Is there a way to trim the lowest-scoring unfetched URLs from nutch? Jesse
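There is no single built-in trim command that I know of, but the low-scoring unfetched entries can at least be inspected by dumping the crawldb (a sketch; the exact status text can differ between versions):

    bin/nutch readdb crawl/crawldb -dump dumpdir
    # each record shows a Status line; unfetched ones can be grepped out, e.g.:
    grep -B 1 db_unfetched dumpdir/part-00000 | less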

Re: domain crawl using bin/nutch

2009-12-21 Thread Jesse Hires
You should be able to do this using one of the variations of the *-urlfilter.txt files. Instead of using + in front of the regex, you can tell it to exclude URLs that match the regex with a -. Just a guess, I haven't actually tried it, but you could probably use something like the following. (I'm
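A hedged sketch of what such a filter might look like in conf/crawl-urlfilter.txt (the domain and path are placeholders, not the poster's original example):

    # exclude a subtree you don't want crawled (illustrative)
    -^http://www.example.com/private/
    # accept everything else on the domain
    +^http://([a-z0-9]*\.)*example.com/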

domain vs www.domain?

2009-12-10 Thread Jesse Hires
I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically, I am seeing www.domain.com and domain.com being recognized as two different sites. I imagine there is a setting to prevent this. If so, what is the setting? If not, what would you recommend
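One way this is commonly handled is a rule in conf/regex-normalize.xml; a sketch (whether folding www is safe depends on whether any sites serve different content on the two hosts):

    <!-- illustrative rule: fold www.domain.com into domain.com -->
    <regex>
      <pattern>^http://www\.</pattern>
      <substitution>http://</substitution>
    </regex>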

Re: domain vs www.domain?

2009-12-10 Thread Jesse Hires
Bialecki a...@getopt.org wrote: On 2009-12-10 19:59, Jesse Hires wrote: I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically, I am seeing www.domain.com and domain.com being recognized as two different sites. I imagine there is a setting

Re: nutch 1.0 - Front End not showing results.

2009-12-04 Thread Jesse Hires
Check your tomcat logs to make sure it is finding things correctly (tail -f on them while doing a search). Also make sure the location of the index and segments are where the conf files say they are. Did you start bin/nutch server <portnumber>, where portnumber is the port you specified in the conf
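A sketch of those checks (port and paths are illustrative):

    tail -f $CATALINA_HOME/logs/catalina.out   # watch while submitting a search
    # for distributed search, start the server on the port listed in
    # search-servers.txt and point it at the crawl directory:
    bin/nutch server 9999 /path/to/crawl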

Re: Can nutch pause, stop and start where it left off?

2009-12-04 Thread Jesse Hires
Use the -topN flag to only grab a small number of URLs. I also believe there is a setting you can put in nutch-site.xml that can be used to slow down how many URLs you grab over time. Jesse
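A sketch of both knobs (values are illustrative; the delay property below is the standard fetcher politeness setting):

    bin/nutch generate crawl/crawldb crawl/segments -topN 1000

    <!-- in nutch-site.xml: seconds to wait between requests to the same server -->
    <property>
      <name>fetcher.server.delay</name>
      <value>5.0</value>
    </property>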

Re: odd warnings

2009-12-02 Thread Jesse Hires
Thanks! Fixing how I was merging the indexes took care of the warning. Jesse

odd warnings

2009-11-30 Thread Jesse Hires
I am getting warnings in hadoop.log that segments.gen and segments_2 are not directories, and as you can see by the listing, they are in fact files, not directories. I'm not sure what stage of the process this is happening in, as I just now stumbled on them, but it concerns me that it says it is

Re: odd warnings

2009-11-30 Thread Jesse Hires
at 8:57 AM, Andrzej Bialecki a...@getopt.org wrote: Jesse Hires wrote: I am getting warnings in hadoop.log that segments.gen and segments_2 are not directories, and as you can see by the listing, they are in fact files not directories. I'm not sure what stage of the process this is happening

Re: odd warnings

2009-11-30 Thread Jesse Hires
/index2/segments.gen not a directory) Jesse On Mon, Nov 30, 2009 at 9:30 AM, Jesse Hires jhi...@gmail.com wrote: actually searcher.dir is still the default crawl

can you incrementally build an index?

2009-11-23 Thread Jesse Hires
Does bin/nutch merge only create a whole new index out of several smaller indexes, or can it be used to incrementally update a single large index with newly fetched and indexed smaller segments? Jesse
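For reference, the merger takes an output index plus existing input indexes and, as far as I can tell, writes a brand-new index rather than updating one in place; a sketch (paths illustrative):

    # combine the current index with the index built from a new segment
    bin/nutch merge crawl/index_merged crawl/index crawl/index_new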

Is there a way to create and index a segment that only has fetched URLs?

2009-11-14 Thread Jesse Hires
I seem to be running into a roadblock with the resources I have available. The time it takes to split a segment into two segments using -slice goes off the hook when there are over 500k unfetched urls. I've been running generate/fetch for -topN 4000 and it has been incrementally increasing in time
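For context, the slicing in question is done by the segment merger; a sketch (the slice size is illustrative):

    # rewrite the existing segments into slices of ~50k URLs each
    bin/nutch mergesegs crawl/segments_sliced -dir crawl/segments -slice 50000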

Re: Incremental Whole Web Crawling

2009-11-03 Thread Jesse Hires
My apologies, missed a patch option :-P Must need more coffee. Jesse On Tue, Nov 3, 2009 at 8:08 PM, Jesse Hires jhi...@gmail.com wrote: Julien, I tried to apply your

unbalanced fetching

2009-10-29 Thread Jesse Hires
I have a two-datanode, one-namenode setup. One of my datanodes is slower than the other, causing the fetch to run significantly longer on it. Is there a way to balance this out? Jesse
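One hedged approach is to cap the map slots on the slow node so Hadoop hands it less fetch work; a sketch for that node's hadoop-site.xml (property name from Hadoop 0.19/0.20-era configs):

    <!-- on the slower datanode only: fewer concurrent map tasks -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
    </property>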

Re: unbalanced fetching

2009-10-29 Thread Jesse Hires
Thanks, I'll give that a shot! Jesse

Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException

2009-10-18 Thread Jesse Hires
On Sat, Oct 17, 2009 at 11:49 AM, Andrzej Bialecki a...@getopt.org wrote: Jesse Hires wrote: Does anyone have any insight into the following error I am seeing in the hadoop logs? Is this something I should be concerned with, or is it expected that this shows up

ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException

2009-10-16 Thread Jesse Hires
Does anyone have any insight into the following error I am seeing in the hadoop logs? Is this something I should be concerned with, or is it expected that this shows up in the logs from time to time? If it is not expected, where can I look for more information on what is going on? 2009-10-16

Re: splitting an index (yes, again)

2009-09-25 Thread Jesse Hires
On Wed, Sep 23, 2009 at 5:48 AM, Jesse Hires jhi...@gmail.com wrote: Exactly! Sorry for being so confusing in my original question. Jesse

Re: splitting an index (yes, again)

2009-09-23 Thread Jesse Hires
in nutch-site.xml to point to the search-servers.txt file, where you entered the hosts and ports of your search servers (detailed description: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg12730.html). Kind regards, Martina -----Original Message----- From: Jesse
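A sketch of the two pieces described above (hosts and port are placeholders):

    # conf/search-servers.txt: one "host port" pair per search server
    node1 9999
    node2 9999

    <!-- nutch-site.xml: searcher.dir names the directory holding search-servers.txt -->
    <property>
      <name>searcher.dir</name>
      <value>/path/to/conf</value>
    </property>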