Re: How to get all the crawled pages for a particular domain

2009-12-10 Thread Yves Petinot

Hi Bhavin,

other nutch users may comment on this, but it seems to me that working 
on top of the nutchbase branch might allow you to perform that type of 
processing quite easily.


-y

bhavin pandya wrote:

Hi,

I have set up Nutch 1.0 on a cluster of 3 nodes.

We are running two applications.

1. A Nutch-based search application.
We have successfully crawled approx. 25M pages on the 3 nodes.
It's working as expected.

2. An application which needs to extract some information for a
particular domain.
To date this application has used a Heritrix-based crawler: it crawls
the given domain, then extraction algorithms go through the pages and
pull out the required information.

As we are already crawling with Nutch in distributed mode, we don't
want to recrawl with another tool like Heritrix for the 2nd application.
I want to reuse the same crawled data for the 2nd application as well.

But the extraction algorithms require all the crawled pages for a
particular domain, to extract all relevant information about that
domain.

I thought that if, by writing some plugin for Nutch, I could somehow
feed the Nutch-crawled data to the 2nd application, it would really
save us work, money and effort by not recrawling.

But how do I get all the crawled pages for a particular domain in my
plugin? Where should I look in the Nutch code?

Any pointer or idea in this direction will really help.

Thanks.
Bhavin
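
One concrete place to look in stock Nutch 1.0, without waiting for
nutchbase: the fetched pages live under each segment's content/
directory as Hadoop map files keyed by URL, so a small standalone
program can scan them and keep only one domain's pages. A minimal
sketch, assuming the default crawl/segments/<timestamp>/content/part-00000/data
layout; the class name, paths and domain test below are placeholders:

import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class DomainPageDump {
  public static void main(String[] args) throws Exception {
    Path data = new Path(args[0]);   // e.g. crawl/segments/20091210.../content/part-00000/data
    String domain = args[1];         // e.g. example.com
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // a map file's "data" file is a plain SequenceFile of <Text url, Content page>
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    Text url = new Text();
    Content page = new Content();
    while (reader.next(url, page)) {
      String host = new URL(url.toString()).getHost();
      if (host.equals(domain) || host.endsWith("." + domain)) {
        // hand the raw fetched bytes to the extraction application here
        System.out.println(url + "\t" + page.getContent().length + " bytes");
      }
    }
    reader.close();
  }
}

Run it once per content part file, or wrap the same loop in a
map-reduce job for the fully distributed case.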

Re: How to get all the crawled pages for a particular domain

2009-12-10 Thread Dennis Kubes

There is a domain-url filter.  Is that what you were looking for?

Dennis
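
For reference, that filter (the urlfilter-domain plugin) works from a
plain text file with one domain, host, or suffix per line; URLs that
match nothing in the list are dropped. A sketch of the file, with
placeholder domains (the plugin itself also has to be added to
plugin.includes):

# domain-urlfilter.txt
example.com
www.example.org

Note this restricts what gets crawled up front; to split an
already-crawled dataset by domain, scanning the segments directly (as
sketched above) may be the more useful route.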

Yves Petinot wrote:

...Snip...

NOINDEX, NOFOLLOW

2009-12-10 Thread BELLINI ADAM

hi,

I have a page with <meta name="robots" content="noindex,nofollow" />. Now I
know that Nutch obeys this tag, because I don't find the content and the title
in my index, but I was expecting that the document would not be present in the
index at all. Why does it keep the document in my index with no title and no
content??

I'm using the index-basic and index-more plugins, and I want to understand why
Nutch still fills in the url, date, boost, etc., since it didn't do that for
the title and content.

I was thinking that if Nutch obeys nofollow and noindex then it would skip the
whole document!

Or maybe I misunderstood something. Can you please explain this behavior to me?

best regards.

  
_
Windows Live: Make it easier for your friends to see what you’re up to on 
Facebook.
http://go.microsoft.com/?linkid=9691816

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM

hi,
Check the fetch time in your crawldb. You can dump the whole crawldb like this:

./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db

entries will look like this:

http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 0.0014977538
Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
Metadata: _pst_: success(1), lastModified=0


As you can see, the next time the page will be fetched is given by the fetch
time field: 'Fetch time: Thu Dec 10 09:19:18 EST 2009'.
Also check the retry interval: it should be your 3600.

Hope it will help.
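
P.S. The dump is plain text, one block of fields per URL, so you can
also pull a single site's entries straight out of it; a sketch, with a
placeholder URL (-A 8 prints the field lines that follow each matching
URL line):

grep -A 8 'http://www.example.com/' whole_db/part-00000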


 Subject: RE: how to force nutch to do a recrawl
 Date: Wed, 9 Dec 2009 16:06:58 -0500
 From: vijaya_pet...@sra.com
 To: nutch-user@lucene.apache.org
 
 Okay.  I'll dig a little deeper.  I saw a few scripts that people had
 created, but I couldn't get them to work.
 
 Thanks much.
 
 Vijaya Peters
 SRA International, Inc.
 4350 Fair Lakes Court North
 Room 4004
 Fairfax, VA  22033
 Tel:  703-502-1184
 
 www.sra.com
 Named to FORTUNE's 100 Best Companies to Work For list for 10
 consecutive years
 P Please consider the environment before printing this e-mail
 This electronic message transmission contains information from SRA
 International, Inc. which may be confidential, privileged or
 proprietary.  The information is intended for the use of the individual
 or entity named above.  If you are not the intended recipient, be aware
 that any disclosure, copying, distribution, or use of the contents of
 this information is strictly prohibited.  If you have received this
 electronic information in error, please notify us immediately by
 telephone at 866-584-2143.
 
 -Original Message-
 From: MilleBii [mailto:mille...@gmail.com] 
 Sent: Wednesday, December 09, 2009 4:05 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: how to force nutch to do a recrawl
 
 I don't think that you can use the nutch crawl command to do that; it
 is a one-stop-shop command.
 You probably want to use the individual commands.
 Type 'nutch generate' to get the help and you will see the option
 -adddays, and read this page on the wiki to get a feel for how you
 should do it:
 http://wiki.apache.org/nutch/Crawl
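
 A concrete version of that sequence might look like this (the paths
 assume the usual crawl/ layout; adjust to yours):

 bin/nutch generate crawl/crawldb crawl/segments -adddays 31
 s=`ls -d crawl/segments/* | tail -1`   # the newly generated segment
 bin/nutch fetch $s
 bin/nutch updatedb crawl/crawldb $s

 Passing an -adddays value at least as large as the fetch interval makes
 every page look due for re-fetch, so the generator will select it again.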
 
 2009/12/9 Peters, Vijaya vijaya_pet...@sra.com
 
  I didn't see a setting to override in crawl-urlfilter.  How do I set
  numberDays? I have regular expressions to include/exclude certain
 extensions
  and certain urls, but that's all I have in there.
 
  Please send me an example and I'll give it a try.
 
  Thanks!
 
 
  -Original Message-
  From: xiao yang [mailto:yangxiao9...@gmail.com]
  Sent: Wednesday, December 09, 2009 1:41 PM
  To: nutch-user@lucene.apache.org
  Subject: Re: how to force nutch to do a recrawl
 
  What about the configuration in crawl-urlfilter.txt?
 
  On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya
 vijaya_pet...@sra.com
  wrote:
   I tried that too.
   In nutch-site.xml I added the below, but it had no effect.
  
   <property>
     <name>db.default.fetch.interval</name>
     <value>0</value>
     <description>(DEPRECATED) The default number of days between re-fetches
     of a page. value was 30</description>
   </property>

   <property>
     <name>db.fetch.interval.default</name>
     <value>3600</value>
     <description>The default number of seconds between re-fetches of a page
     (30 days). value was 2592000 (30 days)</description>
   </property>

   <property>
     <name>db.fetch.interval.max</name>
     <value>3600</value>
     <description>The maximum number of seconds between re-fetches of a page
     (90 days). After this period every page in the db will be re-tried, no
     matter what is its status. value was 7776000</description>
   </property>
  

domain vs www.domain?

2009-12-10 Thread Jesse Hires
I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically, I am seeing www.domain.com and domain.com
being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting? If
not, what would you recommend doing to prevent this?


Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com


RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam,
I tried running that command and get the following (it created a
whole_db directory, but it's not dumping out the contents to the
console):

$ bin/nutch readdb crawl/crawldb/ -dump whole_db
CrawlDb dump: starting
CrawlDb db: crawl/crawldb/
CrawlDb dump: done

-Original Message-
From: BELLINI ADAM [mailto:mbel...@msn.com] 
Sent: Thursday, December 10, 2009 1:40 PM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl


...Snip...

Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Kirby Bohling
On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM mbel...@msn.com wrote:

...Snip...


My guess is that the page is recorded to note that the page shouldn't
be fetched; I'm guessing the status is one of the magic values. It
probably re-fetches the page periodically to ensure it has the list.
So the URL and the date make sense to me as to why they populate them.
I don't know why it is computing the boost, other than the fact that
it might be part of the OPIC scoring algorithm. If the scoring
algorithm ever uses the scores/boost of the pages that you point at as
a contributing factor, it would make total sense. So even though it
doesn't index "http://example/foo/bar", knowing which pages point
there, and what their scores are, could contribute to the scores of
pages that you do index that contain an outlink to that page.

Kirby


RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM

It will not dump to the console!
whole_db is a folder, and you have to open the file you will find in that
folder.



 Subject: RE: how to force nutch to do a recrawl
 Date: Thu, 10 Dec 2009 14:26:30 -0500
 From: vijaya_pet...@sra.com
 To: nutch-user@lucene.apache.org
 
 Adam,
 I tried running that command and get the following (it created a
 whole_db directory, but it's not dumping out the contents to the
 console):
 
 $ bin/nutch readdb crawl/crawldb/ -dump whole_db
 CrawlDb dump: starting
 CrawlDb db: crawl/crawldb/
 CrawlDb dump: done
 
...Snip...

RE: NOINDEX, NOFOLLOW

2009-12-10 Thread BELLINI ADAM

hi,

Thanks for this information, but since I'm using the Solr index, when I make a
search I get a blank result...
For example, if I have 10 documents as a search result, 9 will be OK (because
I display the title and the first 4 lines of content), but I get one blank
result because of this page (with no content and no title)! I don't understand
why it is in the index, since it was set as noindex!?

Here is an example:

searching for word1:

results: 

1- title 1 : content1
2- title 1 : content2
3- title 1 : content3
4- title 1 : content4
5- title 1 : content5
6- title 1 : content6
7- title 1 : content7
8- title 1 : content8
9-BLANK..
10- title 1 : content10





 From: kirby.bohl...@gmail.com
 Date: Thu, 10 Dec 2009 13:33:18 -0600
 Subject: Re: NOINDEX, NOFOLLOW
 To: nutch-user@lucene.apache.org
 
...Snip...
  
_
Windows Live: Keep your friends up to date with what you do online.
http://go.microsoft.com/?linkid=9691815

Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Andrzej Bialecki

On 2009-12-10 20:33, Kirby Bohling wrote:

...Snip...


Very good explanation; that's exactly the reason why Nutch never
discards such pages. If you really want to ignore certain pages, then
use URLFilters and/or ScoringFilters.
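
For example, a single rule near the top of regex-urlfilter.txt drops
such a page before it is ever fetched or indexed (the URL here is a
placeholder):

-^http://www\.example\.com/some/noindex/page
+.

Rules are applied top-down and the first match wins, so the minus rule
must come before the final catch-all plus rule.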



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam,
What do I use to open a CRC file? I tried QuickSFV.  Thanks in advance!


-Original Message-
From: BELLINI ADAM [mailto:mbel...@msn.com] 
Sent: Thursday, December 10, 2009 3:48 PM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl


 It will not dump to the console!
 whole_db is a folder, and you have to open the file you will find in that
 folder.



...Snip...

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM

Just use vi or vim. I use vi to open the file.





 Subject: RE: how to force nutch to do a recrawl
 Date: Thu, 10 Dec 2009 15:58:24 -0500
 From: vijaya_pet...@sra.com
 To: nutch-user@lucene.apache.org
 
 Adam,
 What do I use to open a CRC file? I tried QuickSFV.  Thanks in advance!
 
...Snip...

Re: domain vs www.domain?

2009-12-10 Thread Andrzej Bialecki

On 2009-12-10 19:59, Jesse Hires wrote:

I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically, I am seeing www.domain.com and domain.com
being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting? If
not, what would you recommend doing to prevent this?


This is a surprisingly difficult problem to solve in the general case,
because it's not always true that 'www.domain' equals 'domain'. If you
do know this is true in your particular case, you can add a rule to
regex-urlnormalizer that changes the matching urls to e.g. always lose
the 'www.' part.
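
For example, with the urlnormalizer-regex plugin enabled, a rule along
these lines in regex-normalize.xml folds the two hostnames together
(placeholder domain):

<regex>
  <pattern>^(http://)www\.example\.com/</pattern>
  <substitution>$1example.com/</substitution>
</regex>

Once both variants normalize to the same URL, they collapse into a
single crawldb entry.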



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: NOINDEX, NOFOLLOW

2009-12-10 Thread Kirby Bohling
On Thu, Dec 10, 2009 at 2:55 PM, BELLINI ADAM mbel...@msn.com wrote:

...Snip...


I've never used the Solr integration, so I'm unable to help you.  This
sounds like a bug to me, but I'm not sure.  Hopefully one of the Solr
users will help us out and let you know what they think.

Thanks,
   Kirby
...Snip...


RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam,
I'm on Windows, unfortunately!! I'm using cygdrive, but it doesn't
recognize vi. Any idea for opening it in Windows? Notepad didn't work
either.


-Original Message-
From: BELLINI ADAM [mailto:mbel...@msn.com] 
Sent: Thursday, December 10, 2009 4:01 PM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl


Just use vi or vim. I use vi to open the file.





...Snip...

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM


But how are you running the sh scripts? You have to use Cygwin to be
able to open these files. (The .crc files are just Hadoop checksum
sidecars; the actual dump is the part-00000 file, which is plain text.)




 Subject: RE: how to force nutch to do a recrawl
 Date: Thu, 10 Dec 2009 16:09:13 -0500
 From: vijaya_pet...@sra.com
 To: nutch-user@lucene.apache.org
 
 Adam,
 I'm on Windows, unfortunately!! I'm using cygdrive, but it doesn't
 recognize vi. Any idea for opening it in Windows? Notepad didn't work
 either.
 
...Snip...

Re: domain vs www.domain?

2009-12-10 Thread Jesse Hires
For the specific case I was running into (on a single known domain) using
regex-urlnormalizer did the trick. Thanks!



Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
// Guaranteed to be random
} // xkcd.com



On Thu, Dec 10, 2009 at 1:01 PM, Andrzej Bialecki a...@getopt.org wrote:

...Snip...