Re: Why cant I inject a google link to the database?

2009-07-17 Thread Jake Jacobson
Larsson85,

Please read the past responses.  Google is blocking all crawlers, not
just yours, from indexing its search results.  Because of the directives
in their robots.txt file you will not be able to do this.

If you placed a DO NOT ENTER sign on your house and I entered anyway,
you would be very upset.  That is what the robots.txt file does for a
site: it tells visiting bots what they may enter and what they may not.
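
For illustration, a compliant crawler such as Nutch fetches robots.txt
before anything else on a host and skips every path a matching Disallow
rule covers.  Google's file (quoted further down in this thread) boils
down to something like this, so the /search result pages are off limits
to every bot:

User-agent: *
Allow: /searchhistory/
Disallow: /search

Nutch matches those groups against the agent names in its
http.robots.agents setting, so changing the agent string does not help
when the only group is "User-agent: *".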

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Fri, Jul 17, 2009 at 9:32 AM, Larsson85 kristian1...@hotmail.com wrote:

 I think I need more help on how to do this.

 I tried using
 <property>
  <name>http.robots.agents</name>
  <value>Mozilla/5.0*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
 </property>

 If I don't have the star at the end I get the same as earlier, "No URLs to
 fetch". And if I do, I get "0 records selected for fetching, exiting".



 reinhard schwab wrote:

 Identify Nutch as a popular user agent such as Firefox.

 Larsson85 schrieb:
 Any workaround for this? Making Nutch identify itself as something else,
 or something similar?


 reinhard schwab wrote:

 http://www.google.se/robots.txt

 google disallows it.

 User-agent: *
 Allow: /searchhistory/
 Disallow: /search


 Larsson85 schrieb:

 Why isn't Nutch able to handle links from Google?

 I tried to start a crawl from the following URL:
 http://www.google.se/search?q=site:se&hl=sv&start=100&sa=N

 And all I get is "no more URLs to fetch".

 The reason I want to do this is that I thought maybe I could use Google
 to generate my start list of URLs by injecting pages of search results.

 Why won't this page be parsed and links extracted so the crawl can
 start?

 --
 View this message in context: 
 http://www.nabble.com/Why-cant-I-inject-a-google-link-to-the-database--tp24533162p24534522.html
 Sent from the Nutch - User mailing list archive at Nabble.com.




Re: Job failed help

2009-07-16 Thread Jake Jacobson
Any suggestions on this problem?

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Wed, Jul 15, 2009 at 8:41 AM, Jake Jacobson jakecjacob...@gmail.com wrote:
 Did this with the same results.

 In my home directory a directory named linkdb-1292468754 was
 created, which caused the process to run out of disk space.

 In hadoop-site.xml I have this set up:

 <configuration>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/webroot/oscrawlers/nutch/tmp/</value>
                <description>A base for other temporary
 directories.</description>
        </property>

 </configuration>

 I am using the following command line options to run Nutch 1.0:

 /webroot/oscrawlers/nutch/bin/nutch crawl
 /webroot/oscrawlers/nutch/urls/seed.txt -dir
 /webroot/oscrawlers/nutch/crawl -depth 10 >
 /webroot/oscrawlers/nutch/logs/crawl_log.txt

 In my log file I see this error message:

 LinkDb: adding segment:
 file:/webroot/oscrawlers/nutch/crawl/segments/20090714095100
 Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)

 Jake Jacobson

 http://www.linkedin.com/in/jakejacobson
 http://www.facebook.com/jakecjacobson
 http://twitter.com/jakejacobson

 Our greatest fear should not be of failure,
 but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



 On Mon, Jul 13, 2009 at 9:00 AM, SunGod sun...@cheemer.org wrote:
 If you use Hadoop to run Nutch,

 please add

 <property>
  <name>hadoop.tmp.dir</name>
  <value>/youtempfs/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
 </property>

 to your hadoop-site.xml

 2009/7/13 Jake Jacobson jakecjacob...@gmail.com

 Hi,

 I have tried to run Nutch 1.0 several times and it fails due to lack
 of disk space.  I have defined the crawl to place all files on a disk
 that has plenty of space, but when it starts building the linkdb it
 wants to put temp files in the home dir, which doesn't have enough
 space.  How can I force Nutch not to do this?

 Jake Jacobson

 http://www.linkedin.com/in/jakejacobson
 http://www.facebook.com/jakecjacobson
 http://twitter.com/jakejacobson

 Our greatest fear should not be of failure,
 but of succeeding at something that doesn't really matter.
   -- ANONYMOUS





Crawling with a PKI Cert

2009-07-16 Thread Jake Jacobson
Hi,

Has there been any work with Nutch to crawl with a PKI cert?  How
about sites that take username/password and set cookies?

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS


Re: how to crawl a page but not index it

2009-07-15 Thread Jake Jacobson
Hi,

Nutch should follow the meta robots directives, so in page A add this
meta directive:

<meta name="robots" content="noindex,follow">

http://www.seoresource.net/robots-metatags.htm
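
To make the scenario from the question below concrete, page A might look
roughly like this (the file name and outlink URL are made up); the
noindex,follow directive lets the crawler follow the link while keeping
page A itself out of the index:

<html>
  <head>
    <!-- page A: follow its links, but do not index this page -->
    <meta name="robots" content="noindex,follow">
  </head>
  <body>
    <a href="http://example.com/page-b.html">Page B</a> <!-- outlink: fetched and indexed -->
  </body>
</html>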

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Tue, Jul 14, 2009 at 8:32 AM, Beats tarun_agrawal...@yahoo.com wrote:

 hi,

 Actually, what I want is to crawl a web page, say 'page A', and all its
 outlinks.
 I want to index all the content gathered by crawling the outlinks, but not
 'page A' itself.
 Is there any way to do it in a single run?

 with Regards

 Beats
 be...@yahoo.com



 SunGod wrote:

 1. Create the work dir test first

 2. Insert URLs
 ../bin/nutch inject test -urlfile urls

 3. Create a fetchlist
 ../bin/nutch generate test test/segments

 4. Fetch URLs
 s1=`ls -d crawl/segments/2* | tail -1`
 echo $s1
 ../bin/nutch fetch test/segments/20090628160619

 5. Update the crawldb
 ../bin/nutch updatedb test test/segments/20090628160619

 Loop steps 3-5; writing a bash script to run this is best (see the sketch below)!

 Next time, please use a Google search first.
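
 A minimal bash sketch of that generate/fetch/update loop, reusing the
 commands and the "ls -d ... | tail -1" idiom from the steps above (the
 iteration count and the "test" directory layout are just the example
 values used here):

 #!/bin/sh
 # Sketch: repeat generate -> fetch -> updatedb against the "test" crawl dir
 for i in 1 2 3; do
     ../bin/nutch generate test test/segments
     # pick the newest segment that generate just created
     s1=`ls -d test/segments/2* | tail -1`
     echo "Fetching segment $s1"
     ../bin/nutch fetch $s1
     ../bin/nutch updatedb test $s1
 done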

 2009/7/13 Beats tarun_agrawal...@yahoo.com


 Can anyone help me on this?

 I am using Solr to index the Nutch docs,
 so I think the prune tool will not work.

 I do not want to index the documents taken from a particular set of sites.

 with regards Beats
 --
 View this message in context:
 http://www.nabble.com/how-to-crawl-a-page-but-not-index-it-tp24437901p24459435.html
  Sent from the Nutch - User mailing list archive at Nabble.com.





 --
 View this message in context: 
 http://www.nabble.com/how-to-crawl-a-page-but-not-index-it-tp24437901p24478530.html
 Sent from the Nutch - User mailing list archive at Nabble.com.




Re: Job failed help

2009-07-15 Thread Jake Jacobson
Did this with the same results.

In my home directory a directory named linkdb-1292468754 was
created, which caused the process to run out of disk space.

In hadoop-site.xml I have this set up:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/webroot/oscrawlers/nutch/tmp/</value>
    <description>A base for other temporary
    directories.</description>
  </property>
</configuration>

I am using the following command line options to run Nutch 1.0:

/webroot/oscrawlers/nutch/bin/nutch crawl
/webroot/oscrawlers/nutch/urls/seed.txt -dir
/webroot/oscrawlers/nutch/crawl -depth 10 >
/webroot/oscrawlers/nutch/logs/crawl_log.txt

In my log file I see this error message:

LinkDb: adding segment:
file:/webroot/oscrawlers/nutch/crawl/segments/20090714095100
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:147)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:129)

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Mon, Jul 13, 2009 at 9:00 AM, SunGod sun...@cheemer.org wrote:
 If you use Hadoop to run Nutch,

 please add

 <property>
  <name>hadoop.tmp.dir</name>
  <value>/youtempfs/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
 </property>

 to your hadoop-site.xml

 2009/7/13 Jake Jacobson jakecjacob...@gmail.com

 Hi,

 I have tried to run Nutch 1.0 several times and it fails due to lack
 of disk space.  I have defined the crawl to place all files on a disk
 that has plenty of space, but when it starts building the linkdb it
 wants to put temp files in the home dir, which doesn't have enough
 space.  How can I force Nutch not to do this?

 Jake Jacobson

 http://www.linkedin.com/in/jakejacobson
 http://www.facebook.com/jakecjacobson
 http://twitter.com/jakejacobson

 Our greatest fear should not be of failure,
 but of succeeding at something that doesn't really matter.
   -- ANONYMOUS




Re: Nutch Tutorial 1.0 based off of the French Version

2009-07-14 Thread Jake Jacobson
I did attach it.

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Mon, Jul 13, 2009 at 9:04 PM, alx...@aim.com wrote:




  Hi,

 Is it available on the internet? If not, could you please attach it.

 Thanks.
 A.




 -Original Message-
 From: Jake Jacobson jakecjacob...@gmail.com
 To: nutch-user@lucene.apache.org
 Sent: Mon, Jul 13, 2009 1:26 pm
 Subject: Nutch Tutorial 1.0 based off of the French Version

 Hi,

 Not finding any other Nutch 1.0 tutorial, I took the one
 b.bouzid.moha...@gmail.com posted a few days ago and ran it through
 the Google translation page.  I have not had time to go over the steps
 and I don't think I will for a few weeks but wanted to send this out
 to the community.  Hope it helps someone and we can add to it.

 Jake Jacobson

 http://www.linkedin.com/in/jakejacobson
 http://www.facebook.com/jakecjacobson
 http://twitter.com/jakejacobson

 Our greatest fear should not be of failure,
 but of succeeding at something that doesn't really matter.
   -- ANONYMOUS








Re: Nutch Tutorial 1.0 based off of the French Version

2009-07-14 Thread Jake Jacobson
Posted it to my blog,

http://jakecjacobson.blogspot.com/2009/07/nutch10installationguide.html

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Mon, Jul 13, 2009 at 4:26 PM, Jake Jacobson jakecjacob...@gmail.com wrote:
 Hi,

 Not finding any other Nutch 1.0 tutorial, I took the one
 b.bouzid.moha...@gmail.com posted a few days ago and ran it through
 the Google translation page.  I have not had time to go over the steps
 and I don't think I will for a few weeks but wanted to send this out
 to the community.  Hope it helps someone and we can add to it.

 Jake Jacobson

 http://www.linkedin.com/in/jakejacobson
 http://www.facebook.com/jakecjacobson
 http://twitter.com/jakejacobson

 Our greatest fear should not be of failure,
 but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



Job failed help

2009-07-13 Thread Jake Jacobson
Hi,

I have tried to run Nutch 1.0 several times and it fails due to lack
of disk space.  I have defined the crawl to place all files on a disk
that has plenty of space, but when it starts building the linkdb it
wants to put temp files in the home dir, which doesn't have enough
space.  How can I force Nutch not to do this?

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS


Nutch Tutorial 1.0 based off of the French Version

2009-07-13 Thread Jake Jacobson
Hi,

Not finding any other Nutch 1.0 tutorial, I took the one
b.bouzid.moha...@gmail.com posted a few days ago and ran it through
the Google translation page.  I have not had time to go over the steps
and I don't think I will for a few weeks but wanted to send this out
to the community.  Hope it helps someone and we can add to it.

Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/jakecjacobson
http://twitter.com/jakejacobson

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS


Script to crawl web

2009-07-09 Thread Jake Jacobson
Hi,

I was wondering if anyone has a simple script using Nutch 1.0 to crawl
an Intranet site with multiple webservers.  I can use
"/webroot/oscrawlers/nutch/bin/nutch crawl
/webroot/oscrawlers/nutch/urls/seed.txt -dir
/webroot/oscrawlers/nutch/crawl -depth 8 -topN 1000" and get a big
chunk of the files.  I then tried to follow the steps outlined in the
Nutch Tutorial, http://wiki.apache.org/nutch/NutchTutorial, on crawling
the "Whole-web", and nothing new seems to get into the index.  It seems
to be crawling the same URLs.  When I run the "-stats" command against
the database I get the same stats output.
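
For reference, the "-stats" check above presumably refers to Nutch's
readdb report; a minimal sketch using the same crawldb path as the
script below:

# prints TOTAL urls and per-status counts for the crawldb
/webroot/oscrawlers/nutch/bin/nutch readdb /webroot/oscrawlers/nutch/crawl/crawldb -stats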

Here is my script

#!/bin/sh

# nutch_crawler.sh

echo "  Set UMASK ...";
umask 002;
echo ""

# Set Variables
LIMIT=1 # Max loops to execute
A=0
NUTCHBINARY='/webroot/oscrawlers/nutch/bin/nutch'
NUTCHDB='/webroot/oscrawlers/nutch/crawl/crawldb'
NUTCHSEGMENTS='/webroot/oscrawlers/nutch/crawl/segments'
NUTCHINDEXES='/webroot/oscrawlers/nutch/crawl/indexes'
NUTCHLINKDB='/webroot/oscrawlers/nutch/crawl/linkdb'

# Inject starting URLs into the database
#echo "  Injecting Starting URLs ..."
#echo ""
#$NUTCHBINARY inject $NUTCHDB /webroot/oscrawlers/nutch/urls/seed.txt
#sleep 30

while [ $A -le $LIMIT ]
do
# Generate a fetch list
echo "  Generating fetch list ..."
$NUTCHBINARY generate $NUTCHDB $NUTCHSEGMENTS -topN 1000

# Find the newest created segment
echo ""
echo "  Get segment ..."
s1=`ls -d /webroot/oscrawlers/nutch/crawl/segments/2* | tail -1`
echo ""
echo "  Segment is: $s1 ..."

# Fetch this segment
$NUTCHBINARY fetch $s1

# Add one to A and continue looping until LIMIT is reached
A=$(($A+1))
sleep 60
done

# Invert links
echo ""
echo "  Building inverted links ..."
$NUTCHBINARY invertlinks $NUTCHLINKDB -dir $NUTCHSEGMENTS

# Before I can do this, I need to delete the current indexes.  Doesn't
# seem to affect the current searches
echo ""
echo "  Remove old indexes ..."
rm -rf $NUTCHINDEXES

# Index Segments
echo ""
echo "  Build new indexes ..."
$NUTCHBINARY index $NUTCHINDEXES $NUTCHDB $NUTCHLINKDB $NUTCHSEGMENTS/*
echo ""
echo "  Done ...";
###
Jake Jacobson

http://www.linkedin.com/in/jakejacobson
http://www.new.facebook.com/people/Jake_Jacobson/622727274

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS


Running Nutch on VMs

2009-07-08 Thread Jake Jacobson
Hi,

Has anyone had experience running a large-scale Nutch installation on VMs
running RedHat Linux?  I would like to set up a test bed that would index
80 million documents and support up to 5 searches per second.  If so, can
you give me any guidance on how much RAM and disk space, and how many
processors, are needed for that configuration?

Does Nutch get any performance boost from running on a 64-bit versus a 32-bit OS?

Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.new.facebook.com/people/Jake_Jacobson/622727274

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS


Re: Can Nutch crawler Impersonate user-agent?

2009-06-02 Thread Jake Jacobson
Hi,

Well, I found the problem.  In nutch-default.xml there is this setting:

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

I copied this to my nutch-site.xml file, edited it with my user-agent
string, and the magic worked.  I would suggest that this block of code
be added to the nutch-site.xml file by default.
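
Presumably the edited property ends up looking something like this, with
the crawler's own agent name (the http.agent.name value) listed first and
the default * kept at the end:

<property>
  <name>http.robots.agents</name>
  <!-- agent name first, matching http.agent.name; keep the default * last -->
  <value>imo-robot-intelink,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>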

Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.new.facebook.com/people/Jake_Jacobson/622727274

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Mon, Jun 1, 2009 at 2:23 PM, Jake Jacobson jakecjacob...@gmail.com wrote:
 Hi,

 I am testing out Nutch 1.0 and it doesn't seem to be able to crawl my
 website that has the following robots.txt file:

 User-agent: imo-robot-intelink
 Disallow: /App_Themes/
 Disallow: /app_themes/
 Disallow: /Archive/
 Disallow: /archive/
 Disallow: /Bin/
 Disallow: /bin/

 I have the nutch-site.xml defined as:
 <configuration>
        <property>
                <name>http.agent.name</name>
                <value>imo-robot-intelink</value>
                <description>ICES Robots Name</description>
        </property>

        <property>
                <name>http.agent.version</name>
                <value></value>
                <description></description>
        </property>

        <property>
                <name>http.agent.description</name>
                <value>ICES Open Source Web Crawler using Nutch 1.0</value>
                <description></description>
        </property>

        <property>
                <name>http.agent.url</name>
                <value>http://www.xxx.gov/search/</value>
                <description></description>
        </property>

        <property>
                <name>http.agent.email</name>
                <value></value>
                <description></description>
        </property>
 </configuration>

 When I run the following "./nutch crawl ../urls -dir ../crawl/ -depth
 3 -topN 50" from the command line I get:
 crawl started in: ../crawl
 rootUrlDir = ../urls
 threads = 10
 depth = 3
 topN = 50
 Injector: starting
 Injector: crawlDb: ../crawl/crawldb
 Injector: urlDir: ../urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: done
 Generator: Selecting best-scoring urls due for fetch.
 Generator: starting
 Generator: segment: ../crawl/segments/20090601180745
 Generator: filtering: true
 Generator: topN: 50
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls by host, for politeness.
 Generator: done.
 Fetcher: Your 'http.agent.name' value should be listed first in
 'http.robots.agents' property.
 Fetcher: starting
 Fetcher: segment: ../crawl/segments/20090601180745
 Fetcher: threads: 10
 QueueFeeder finished: total 3 records.
 fetching http://www.intelink.gov/
 fetching http://www.intelink.gov/blogs/
 fetching http://www.intelink.gov/wiki/Main_Page
 -finishing thread FetcherThread, activeThreads=9
 -finishing thread FetcherThread, activeThreads=8
 -finishing thread FetcherThread, activeThreads=7
 -finishing thread FetcherThread, activeThreads=6
 -finishing thread FetcherThread, activeThreads=5
 -finishing thread FetcherThread, activeThreads=4
 -finishing thread FetcherThread, activeThreads=3
 -finishing thread FetcherThread, activeThreads=2
 -finishing thread FetcherThread, activeThreads=1
 -finishing thread FetcherThread, activeThreads=0
 -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 -activeThreads=0
 Fetcher: done
 CrawlDb update: starting
 CrawlDb update: db: ../crawl/crawldb
 CrawlDb update: segments: [../crawl/segments/20090601180745]
 CrawlDb update: additions allowed: true
 CrawlDb update: URL normalizing: true
 CrawlDb update: URL filtering: true
 CrawlDb update: Merging segment data into db.
 CrawlDb update: done
 Generator: Selecting best-scoring urls due for fetch.
 Generator: starting
 Generator: segment: ../crawl/segments/20090601180757
 Generator: filtering: true
 Generator: topN: 50
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: 0 records selected for fetching, exiting ...
 Stopping at depth=1 - no more URLs to fetch.
 LinkDb: starting
 LinkDb: linkdb: ../crawl/linkdb
 LinkDb: URL normalize: true
 LinkDb: URL filter: true
 LinkDb: adding segment:
 file:/linsearchtools1o/oscrawlers/nutch-1.0/crawl/segments/20090601180745
 LinkDb: done
 Indexer: starting
 Indexer: done
 Dedup: starting
 Dedup: adding indexes in: ../crawl/indexes
 Dedup: done
 merging indexes to: ../crawl/index
 Adding file:/linsearchtools1o/oscrawlers/nutch-1.0/crawl/indexes/part-0
 done merging
 crawl finished: ../crawl

 I have a tail on my webserver log files and I see

Can Nutch crawler Impersonate user-agent?

2009-06-01 Thread Jake Jacobson
Hi,

I am testing out Nutch 1.0 and it doesn't seem to be able to crawl my
website that has the following robots.txt file:

User-agent: imo-robot-intelink
Disallow: /App_Themes/
Disallow: /app_themes/
Disallow: /Archive/
Disallow: /archive/
Disallow: /Bin/
Disallow: /bin/

I have the nutch-site.xml defined as:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>imo-robot-intelink</value>
    <description>ICES Robots Name</description>
  </property>

  <property>
    <name>http.agent.version</name>
    <value></value>
    <description></description>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>ICES Open Source Web Crawler using Nutch 1.0</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://www.xxx.gov/search/</value>
    <description></description>
  </property>

  <property>
    <name>http.agent.email</name>
    <value></value>
    <description></description>
  </property>
</configuration>

When I run the following "./nutch crawl ../urls -dir ../crawl/ -depth
3 -topN 50" from the command line I get:
crawl started in: ../crawl
rootUrlDir = ../urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: ../crawl/crawldb
Injector: urlDir: ../urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: ../crawl/segments/20090601180745
Generator: filtering: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting
Fetcher: segment: ../crawl/segments/20090601180745
Fetcher: threads: 10
QueueFeeder finished: total 3 records.
fetching http://www.intelink.gov/
fetching http://www.intelink.gov/blogs/
fetching http://www.intelink.gov/wiki/Main_Page
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: ../crawl/crawldb
CrawlDb update: segments: [../crawl/segments/20090601180745]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: ../crawl/segments/20090601180757
Generator: filtering: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: ../crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment:
file:/linsearchtools1o/oscrawlers/nutch-1.0/crawl/segments/20090601180745
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: ../crawl/indexes
Dedup: done
merging indexes to: ../crawl/index
Adding file:/linsearchtools1o/oscrawlers/nutch-1.0/crawl/indexes/part-0
done merging
crawl finished: ../crawl

I have a tail on my webserver log files and I see the robots.txt file
requested with a 200, but nothing gets into the index.  I see the error
message "Fetcher: Your 'http.agent.name' value should be listed first
in 'http.robots.agents' property." even though it is listed first.  Any
help would be most appreciated.

Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/people/Jake_Jacobson/622727274

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS


Re: Can Nutch crawler Impersonate user-agent?

2009-06-01 Thread Jake Jacobson
The Allow directive in robots.txt is optional.  If you don't have an
explicit Disallow statement, it means that directory or file is
available for indexing.
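
For illustration, a trimmed version of the robots.txt quoted below
behaves the same with or without an explicit Allow line:

User-agent: imo-robot-intelink
Disallow: /Archive/
Disallow: /Bin/
# Everything not matched by a Disallow line above is implicitly allowed,
# so adding "Allow: /" at the end does not change the crawler's behavior.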

Jake Jacobson
http://www.linkedin.com/in/jakejacobson
http://www.facebook.com/people/Jake_Jacobson/622727274

Our greatest fear should not be of failure,
but of succeeding at something that doesn't really matter.
   -- ANONYMOUS



On Mon, Jun 1, 2009 at 2:46 PM, David M. Cole d...@colegroup.com wrote:
 At 2:23 PM -0400 6/1/09, Jake Jacobson wrote:

 User-agent: imo-robot-intelink
 Disallow: /App_Themes/
 Disallow: /app_themes/
 Disallow: /Archive/
 Disallow: /archive/
 Disallow: /Bin/
 Disallow: /bin/

 Jake:

 I think you need to add one more line after the last line:

 Allow: /

 \dmc

 --
 *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
   David M. Cole                                            ...@colegroup.com
   Editor & Publisher, NewsInc. http://newsinc.net        V: (650) 557-2993
   Consultant: The Cole Group http://colegroup.com/       F: (650) 475-8479
 *+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+