Re: Two Nutch parallel crawl with two conf folder.

2010-03-09 Thread Gora Mohanty
On Tue, 9 Mar 2010 14:36:33 +0100
MilleBii mille...@gmail.com wrote:

 Never tried... Also, you may want to check the $NUTCH_HOME variable,
 which should be different for each instance; otherwise it will
 only use one of the two conf dirs.
[...]

Had meant to reply to the original poster, but had forgotten.
We have indeed run multiple instances of Nutch in separate
directories, without any problems.

I presume that you are using the crawl.sh script, or a derivative
of it. If so, as pointed out above, a likely cause of what you
are seeing is that the NUTCH_HOME variable in the script is set
to the same directory for both crawls, so that only that one
configuration directory is ever picked up.
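
For reference, what has worked for us is simply two completely
separate, self-contained Nutch directories, each run with its own
copy of the crawl script and its own NUTCH_HOME, roughly along
these lines (the paths are only illustrative):

  # instance 1: its crawl script has NUTCH_HOME=/opt/nutch-instance1
  cd /opt/nutch-instance1 && ./bin/nutch crawl urls -dir crawl -depth 3

  # instance 2, in another shell: NUTCH_HOME=/opt/nutch-instance2
  cd /opt/nutch-instance2 && ./bin/nutch crawl urls -dir crawl -depth 3

That way each instance picks up only its own conf/ directory.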

Regards,
Gora


Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Gora Mohanty
On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]
 Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
 They are kept in Nutch crawldb to prevent their re-discovery
 (through stale links pointing to these URL-s from other pages).
 If you really want to remove them from CrawlDb you can filter
 them out (using CrawlDbMerger with just one input db, and setting
 your URLFilters appropriately).
[...]

Thank you for your help. Your suggestions look promising, but I
think that I did not make myself adequately clear. Once we have
completed a site crawl with Nutch, ideally I would like to be
able to find stale links without doing a complete recrawl, i.e.,
only by restarting the crawl from where it last left off. Is
that possible?
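
(As an aside, my reading of the CrawlDbMerger suggestion, which I
have not yet tried, is roughly:

  # regex-urlfilter.txt set up to reject the stale URLs; paths illustrative
  ./bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

i.e., rewrite the crawldb through the URLFilters into a new
directory. Please correct me if I have misunderstood.)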

I tried a simple test on a local webserver with five pages in a
three-level hierarchy. The crawl completes, and discovers all
five URLs as expected. Now, I remove a tertiary page. Ideally,
I would like to be able to run a recrawl, and have Nutch discover
the now-missing URL. However, when I try that, it finds no new
links, and exits. ./bin/nutch readdb crawl/crawldb -stats
shows me:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 5
retry 0:    5
min score:  0.333
avg score:  0.4664
max score:  1.0
status 2 (db_fetched):  5
CrawlDb statistics: done
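
(For completeness, the recrawl I attempted was essentially the
standard cycle, roughly:

  ./bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`    # pick the newest segment
  ./bin/nutch fetch $segment
  ./bin/nutch parse $segment                    # only if fetcher.parse is false
  ./bin/nutch updatedb crawl/crawldb $segment

and it exits without producing anything new to fetch.)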

Regards,
Gora


Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Gora Mohanty
On Tue, 27 Oct 2009 07:29:10 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]
 I assume you mean that the generate step produces no new URL-s
 to fetch? That's expected, because they become eligible for
 re-fetching only after Nutch considers them expired, i.e. after
 the fetchTime + fetchInterval, and the default fetchInterval is
 30 days.

Yes, it was indeed stopping at the generate step, and your
explanation makes sense.

 You can pretend that the time moved on using the -adddays
 parameter.
[...]

Thanks. This worked exactly as you said. I have tested it, and
the removed page indeed shows up with status db_gone, so I can
now script a solution to my problem with stale URLs, along the
lines that you have suggested.
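
For anyone finding this in the archives: with the default 30-day
fetch interval, the change amounts to passing -adddays at the
generate step of the recrawl, e.g.,

  # pretend that 31 days have passed since the last fetch
  ./bin/nutch generate crawl/crawldb crawl/segments -adddays 31

after which the rest of the fetch/updatedb cycle runs as usual,
and readdb -stats reports the removed page as db_gone.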

Thank you very much for this quick and thorough response. As
I imagine that this is a common requirement, I will write up
a brief blog entry on this by the weekend, along with a solution.

Regards,
Gora


Deleting stale URLs from Nutch/Solr

2009-10-26 Thread Gora Mohanty
Hi,

  We are using Nutch to crawl an internal site, and index content
to Solr. The issue is that the site is run through a CMS, and
occasionally pages are deleted, so that the corresponding URLs
become invalid. Is there any way that Nutch can discover stale
URLs during recrawls, or is the only solution a completely fresh
crawl? Also, is it possible to have Nutch automatically remove
such stale content from Solr?
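
The only workaround that occurs to me is deleting such documents
from Solr by hand, e.g. (assuming the schema shipped with Nutch,
where the document id is the page URL; the URL below is just a
placeholder):

  # Solr location and URL are placeholders only
  curl 'http://localhost:8983/solr/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><id>http://our.site/stale/page.html</id></delete>'

but that presupposes already knowing which URLs have gone stale,
which is precisely the part I do not know how to automate.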

  I am stumped by this problem, and would appreciate any pointers,
or even thoughts on this.

Regards,
Gora


Re: indexing just certain content

2009-10-09 Thread Gora Mohanty
On Fri, 9 Oct 2009 18:00:41 +0200
MilleBii mille...@gmail.com wrote:

 Don't think it will work because at the indexing filter stage all
 the HTML tags are gone from the text.
 
 I think you need to modify the HTML parser to filter out the tags
 you want to get rid of.
 
 For a use case of my own, I would like to perform 'intelligent
 indexing', i.e., use the tag information to extract specific fields
 to be indexed along with the main text; the reverse of your case.
 To date I have not found a way to do it, so if you find a
 solution I'm with you.
[...]

This is something that we would also be interested in. Actually,
we even have a working solution to extract content from between
start/stop tags, written by our colleagues from a partner company.

There are a couple of things that we would like to fix with this
solution:
(a) It directly modifies HtmlParser.java, which is obviously
unmaintainable.
(b) It is hard-wired to specific tags, rather than picking them
up from configuration parameters.
(c) We have not yet traced the complete execution path for Nutch,
i.e., when is the parser called, when are filters called, etc.
Is there a document anywhere about this? We were thinking of a
filter, but from what you say above, that is the wrong stage.
(d) Ideally, whatever solution we come up with would be contributed
back to Nutch, which also helps us from a maintenance
standpoint. Is there a defined process for getting external
plugins accepted into Nutch?

We are willing to put some time into this, starting in the coming
week. Where can we start a brainstorming Wiki page for this? Is
the Nutch Wiki the right place?

Regards,
Gora