Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Gora Mohanty
On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]
 Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
 They are kept in Nutch crawldb to prevent their re-discovery
 (through stale links pointing to these URL-s from other pages).
 If you really want to remove them from CrawlDb you can filter
 them out (using CrawlDbMerger with just one input db, and setting
 your URLFilters appropriately).
[...]

Thank you for your help. Your suggestions look promising, but I
think that I did not make myself adequately clear. Once we have
completed a site crawl with Nutch, ideally I would like to be
able to find stale links without doing a complete recrawl, i.e.,
only by restarting the crawl from where it last left off. Is
that possible?

I tried a simple test on a local webserver with five pages in a
three-level hierarchy. The crawl completes, and discovers all
five URLs as expected. Now, I remove a tertiary page. Ideally,
I would like to be able to run a recrawl and have Nutch discover
the now-missing URL. However, when I try that, it finds no new
links and exits. ./bin/nutch readdb crawl/crawldb -stats
shows me:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 5
retry 0:    5
min score:  0.333
avg score:  0.4664
max score:  1.0
status 2 (db_fetched):  5
CrawlDb statistics: done

Regards,
Gora


Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Andrzej Bialecki

Gora Mohanty wrote:

On Mon, 26 Oct 2009 17:26:23 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]

Stale (no longer existing) URLs are marked with STATUS_DB_GONE.
They are kept in Nutch crawldb to prevent their re-discovery
(through stale links pointing to these URL-s from other pages).
If you really want to remove them from CrawlDb you can filter
them out (using CrawlDbMerger with just one input db, and setting
your URLFilters appropriately).

[...]

Thank you for your help. Your suggestions look promising, but I
think that I did not make myself adequately clear. Once we have
completed a site crawl with Nutch, ideally I would like to be
able to find stale links without doing a complete recrawl, i.e.,
only by restarting the crawl from where it last left off. Is
that possible?

I tried a simple test on a local webserver with five pages in a
three-level hierarchy. The crawl completes, and discovers all
five URLs as expected. Now, I remove a tertiary page. Ideally,
I would like to be able to run a recrawl and have Nutch discover
the now-missing URL. However, when I try that, it finds no new
links and exits.


I assume you mean that the generate step produces no new URL-s to 
fetch? That's expected, because they become eligible for re-fetching 
only after Nutch considers them expired, i.e. after the fetchTime + 
fetchInterval, and the default fetchInterval is 30 days.


You can pretend that time has moved on by using the -adddays parameter.
Then Nutch will generate a new fetchlist, and when it discovers that the
page is missing it will mark it as gone. In fact, you could then take that
information directly from the Nutch segment: instead of processing the
CrawlDb, you could process the segment to collect a partial list of gone
pages.
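
For example, something along these lines should work (the crawl/ paths and
the dump directory are just placeholders, and the exact format of the
CrawlDb dump may differ between Nutch versions):

# 31 days is past the default 30-day fetchInterval, so every URL becomes due again
bin/nutch generate crawl/crawldb crawl/segments -adddays 31
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment

# dump the refreshed CrawlDb and pick out the entries now marked gone
bin/nutch readdb crawl/crawldb -dump crawldb_dump
grep -B1 db_gone crawldb_dump/part-00000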


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Deleting stale URLs from Nutch/Solr

2009-10-27 Thread Gora Mohanty
On Tue, 27 Oct 2009 07:29:10 +0100
Andrzej Bialecki a...@getopt.org wrote:
[...]
 I assume you mean that the generate step produces no new URL-s
 to fetch? That's expected, because they become eligible for
 re-fetching only after Nutch considers them expired, i.e. after
 the fetchTime + fetchInterval, and the default fetchInterval is
 30 days.

Yes, it was indeed stopping at the generate step, and your
explanation makes sense.

 You can pretend that time has moved on by using the -adddays
 parameter.
[...]

Thanks. This worked exactly as you said. I have tested this,
and the removed page indeed shows up with status db_gone, and
I can now script a solution for my problem with stale URLs,
along the lines that you have suggested.

Thank you very much for this quick and thorough response. As
I imagine that this is a common requirement, I will write up
a brief blog entry on this by the weekend, along with a solution.
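
A rough sketch of the kind of script I have in mind (the crawl paths, the
Solr URL, the example page, and the assumption that the stock Nutch Solr
schema exposes a url field are all placeholders of mine):

# collect the URLs whose CrawlDb status is db_gone
bin/nutch readdb crawl/crawldb -dump crawldb_dump
grep -B1 db_gone crawldb_dump/part-00000 | grep '^http' | cut -f1 > gone_urls.txt

# for each such URL, post a delete to Solr (and commit), e.g.:
curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" \
  --data-binary '<delete><query>url:"http://example.com/removed-page.html"</query></delete>'

# optionally, also drop the gone URLs from the CrawlDb itself via URL filters
# plus CrawlDbMerger, as suggested earlier:
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter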

Regards,
Gora


Re: How to index files only with specific type

2009-10-27 Thread Dmitriy Fundak
If I disable the html parser (remove parse-(html from the plugin.includes
property), the HTML files don't get parsed, so I don't get the outlinks
to the KML files from the HTML, and so I can't parse and index the KML
files. I might not be right, but I have a feeling that it's not possible
without modifying the source code.

thx

2009/10/26 BELLINI ADAM mbel...@msn.com:

 disable the html-parser in nutch-site and keep only your parser.
 You can also add this to your filter file: -(htm|html)$

 thx



 Date: Mon, 26 Oct 2009 17:53:11 +0300
 Subject: How to index files only with specific type
 From: dfun...@gmail.com
 To: nutch-user@lucene.apache.org

 Hi, I've created a parser and an indexer for a specific file type (geo XML
 meta file - KML).
 I am trying to crawl a couple of sites and index only files of this type.
 I don't want to index HTML or anything else.
 How can I achieve this?
 Thanks.

 _
 Save up to 84% on Windows 7 until Jan 3—eligible CDN College  University 
 students only. Hurry—buy it now for $39.99!
 http://go.microsoft.com/?linkid=9691635


Re: How to index files only with specific type

2009-10-27 Thread Andrzej Bialecki

Dmitriy Fundak wrote:

If I disable the html parser (remove parse-(html from the plugin.includes
property), the HTML files don't get parsed, so I don't get the outlinks
to the KML files from the HTML, and so I can't parse and index the KML
files. I might not be right, but I have a feeling that it's not possible
without modifying the source code.


It's possible to do this with a custom indexing filter - see other 
indexing filters to get a feeling of what's involved. Or you could do 
this with a scoring filter, too, although the scoring API looks more 
complicated.


Either way, when you execute the Indexer, these filters are run in a 
chain, and if one of them returns null then that document is discarded, 
i.e. it's not added to the output index. So, in your indexing filter it's
easy to examine the content type (or just the URL of the document) and
either pass the document on or reject it by returning null.
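
As an illustration only, a minimal filter could look roughly like this. It
is sketched against the 1.0-era plugin API, so method signatures may differ
slightly between versions, and the package name and the .kml check are just
examples:

package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Discards every document whose URL does not end in .kml. */
public class KmlOnlyIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // returning null drops the document from the indexing chain
    if (!url.toString().toLowerCase().endsWith(".kml")) {
      return null;
    }
    return doc;
  }

  // no extra index backend options are needed for this filter
  public void addIndexBackendOptions(Configuration conf) {
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}

The class would still need the usual plugin.xml/build wrapping and an entry
in plugin.includes, like any other Nutch plugin.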



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to index files only with specific type

2009-10-27 Thread Dmitriy Fundak
Checking the URL suffix and returning null if it's not one I need did the trick.
Thanks, Andrzej.

2009/10/27 Andrzej Bialecki a...@getopt.org:
 Dmitriy Fundak wrote:

 If I disable the html parser (remove parse-(html from the plugin.includes
 property), the HTML files don't get parsed, so I don't get the outlinks
 to the KML files from the HTML, and so I can't parse and index the KML
 files. I might not be right, but I have a feeling that it's not possible
 without modifying the source code.

 It's possible to do this with a custom indexing filter - see other indexing
 filters to get a feeling of what's involved. Or you could do this with a
 scoring filter, too, although the scoring API looks more complicated.

 Either way, when you execute the Indexer, these filters are run in a chain,
 and if one of them returns null then that document is discarded, i.e. it's
 not added to the output index. So, it's easy to examine in your indexing
 filter the content type (or just a URL of the document) and either pass the
 document on or reject it by returning null.


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




How to run fetch from local

2009-10-27 Thread saravan.krish

I generated the segments during the crawling process, then downloaded the
segments to local from the crawldb. Below are the four segments I generated
and downloaded from the crawldb. Now, if I run fetch on these four segments,
I get the error below. Please help me figure out how to run fetch locally.

[nu...@devcluster01 search]$ ls -lrt db/segments/crawled_22/segments/
total 32
drwxr-xr-x 8 nutch users 4096 Oct 23 03:17 20091022065049
drwxr-xr-x 8 nutch users 4096 Oct 23 03:17 20091022065828
drwxr-xr-x 8 nutch users 4096 Oct 23 03:17 20091022071136
drwxr-xr-x 8 nutch users 4096 Oct 23 03:17 20091022104701
[nu...@devcluster01 search]$ bin/nutch fetch
db/segments/crawled_22/segments/20091022065049
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: db/segments/crawled_22/segments/20091022065049
Exception in thread main org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist:
hdfs://devcluster01:9000/user/nutch/db/segments/crawled_22/segments/20091022065049/crawl_generate
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
at
org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:101)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:969)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1003)





Nutch indexes fewer pages than it fetches

2009-10-27 Thread caezar

Hi All,

I've got a strange problem: Nutch indexes far fewer URLs than it
fetches. For example, this URL:
http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
I assume that it was fetched successfully, because in the fetch logs it is
mentioned only once:
2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

But it was not sent to the indexer during the indexing phase (I'm using a
custom NutchIndexWriter, and it logs every page for which its write method
is executed). What could be the possible reason? Is there a way to browse
the crawldb to ensure that the page was really fetched? What else could I check?

Thanks



Redirect handling

2009-10-27 Thread caezar

Hi All,

I've done some googling but found different answers, so I would appreciate
it if you could tell me which is the correct one:
- when a page is redirected, the content of the target page is fetched and
associated with the source (initial) page URL
- when a page is redirected, a new entry with the redirect target URL and
its contents is added to the db

If the second option is the correct one, then one more question: when I
have a NutchDocument instance which represents the target URL, is it
possible to retrieve its redirect source URL somehow?

Thanks



Re: Redirect handling

2009-10-27 Thread Paul Tomblin
There are two different types of redirect.  When a web site returns a
301 status (redirect permanent), it means the url you requested is no
longer valid, don't ask for it again.  When it returns a 307 status
(temporary redirect), it means keep asking for the url you asked for,
and I'll tell you where to go from there.  In the first case, Nutch
should remove the first URL from its database and put the redirection
target in its place.  In the second case, Nutch should leave the
original URL in its database, but also go to the redirection target.
I don't know if that's actually what Nutch does, but I assume so.
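
For illustration, you can see which kind a given server sends with curl
(example.com is only a placeholder here):

# -I requests headers only; the status line and the Location header show
# whether the answer is a 301 (permanent) or a 302/307 (temporary) redirect
curl -sI http://example.com/old-page | grep -Ei '^(HTTP|Location)'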

On Tue, Oct 27, 2009 at 11:30 AM, caezar caeza...@gmail.com wrote:

 Hi All,

 I've done some googling but found different answers, so I would appreciate
 it if you could tell me which is the correct one:
 - when a page is redirected, the content of the target page is fetched and
 associated with the source (initial) page URL
 - when a page is redirected, a new entry with the redirect target URL and
 its contents is added to the db

 If the second option is the correct one, then one more question: when I
 have a NutchDocument instance which represents the target URL, is it
 possible to retrieve its redirect source URL somehow?

 Thanks





-- 
http://www.linkedin.com/in/paultomblin


Nutch in WebSphere

2009-10-27 Thread Joshua J Pavel


I'm very new at this, so forgive my novice questions.  I'm trying to
install nutch in WebSphere 6.1.  While I can see that others have done this
before, I've been unsuccessful.  I keep getting this error:

Error 500: java.lang.Error: java.lang.NoClassDefFoundError:
org.apache.jsp._search (wrong name:
   com/ibm/_jsp/_search)

I thought it was a conflict between the base WebSphere jars and the jars in
the nutch lib.  I attempted to resolve this by having the application's jars
load first, but I still get this error.

I'm not sure if this is complicated by the fact that I want to run my crawl
on a different node, and just use WebSphere to serve the results.  I'll be
exporting and importing the crawl directory next, so maybe I'll ask that
question as well... where should I place the crawl directory in relation to
my WebSphere war installation?  Inside the installedApps directory, or can
I specify where somehow?  Is there an install guide for WebSphere, instead
of Tomcat?

ERROR: Checksum Error

2009-10-27 Thread Eric Osgood

This is my second time receiving this error:

Map output lost, rescheduling: getMapOutput 
(attempt_200910271443_0012_m_01_0,0) failed :

org.apache.hadoop.fs.ChecksumException: Checksum Error
---
Does anyone know why I am getting this error and how to fix it? I  
tried deleting all my data nodes and formatting the namenode to no  
avail.

Thanks,

Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Nutch indexes fewer pages than it fetches

2009-10-27 Thread 皮皮
Check the parse data first; maybe the parse was unsuccessful.
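
For example, something like this shows what the crawl actually recorded
(the crawldb and segment paths are placeholders):

# status of a single URL as recorded in the CrawlDb
bin/nutch readdb crawl/crawldb -url http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

# dump a segment to inspect its fetch and parse output
bin/nutch readseg -dump crawl/segments/<segment> segdump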

2009/10/27 caezar caeza...@gmail.com


 Hi All,

 I've got a strange problem: Nutch indexes far fewer URLs than it
 fetches. For example, this URL:
 http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
 I assume that it was fetched successfully, because in the fetch logs it is
 mentioned only once:
 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
 http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm

 But it was not sent to the indexer during the indexing phase (I'm using a
 custom NutchIndexWriter, and it logs every page for which its write method
 is executed). What could be the possible reason? Is there a way to browse
 the crawldb to ensure that the page was really fetched? What else could I check?

 Thanks




Re: Nutch indexes fewer pages than it fetches

2009-10-27 Thread kevin chen
I have had a similar experience.

Reinhard Schwab responded with a possible fix. See the mail in this group
from Reinhard Schwab at Sun, 25 Oct 2009 10:03:41 +0100 (05:03 EDT).

I haven't had a chance to try it out yet.
 
On Tue, 2009-10-27 at 07:34 -0700, caezar wrote:
 Hi All,
 
 I've got a strange problem: Nutch indexes far fewer URLs than it
 fetches. For example, this URL:
 http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm.
 I assume that it was fetched successfully, because in the fetch logs it is
 mentioned only once:
 2009-10-26 10:01:46,502 INFO org.apache.nutch.fetcher.Fetcher: fetching
 http://www.1stdirectory.com/Companies/1627406_ins_Catering_Limited.htm
 
 But it was not sent to the indexer during the indexing phase (I'm using a
 custom NutchIndexWriter, and it logs every page for which its write method
 is executed). What could be the possible reason? Is there a way to browse
 the crawldb to ensure that the page was really fetched? What else could I check?
 
 Thanks