Re: nutch's design document

2009-12-14 Thread MilleBii
Welcome !!!

Nutch is different from anything else I have seen before. It is great,
but also difficult, so expect to spend some time on it.

The best way to learn is to practice and see what it does.

1. Front-End (search): a web site which wraps a Lucene-based index. If
you are not familiar with Lucene you can buy yourself the book Lucene
in Action, but it is not really necessary. You can also use Solr as a
more sophisticated front end.

2. Back-End (crawling to indexing): crawling is done in a number of
steps (read the wiki) and uses two critical databases, crawldb and
linkdb, to maintain a graph of where the engine has gone. It will
fetch, parse, and index pages.

3. Cluster / Cloud computing: based on Hadoop, it uses the map/reduce
parallel processing technique for the different steps. There is a
Hadoop book you can buy.
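
As a rough illustration of the back-end steps in item 2, here is a minimal Python sketch that only builds the usual sequence of bin/nutch calls for one crawl cycle. The sub-commands are the ones I recall from the Nutch 1.0 tutorial (inject, generate, fetch, updatedb, invertlinks, index); the directory layout, the seed directory name and the segment name are assumptions made for the example, so check the usage printed by bin/nutch before relying on it.

```python
import posixpath


def crawl_cycle_commands(crawl_dir, seed_dir, segment_name):
    """Return the bin/nutch calls for one crawl cycle, in order.

    crawl_dir, seed_dir and segment_name are hypothetical inputs; in a real
    run the segment name is the timestamped directory created by `generate`.
    """
    crawldb = posixpath.join(crawl_dir, "crawldb")
    linkdb = posixpath.join(crawl_dir, "linkdb")
    segments = posixpath.join(crawl_dir, "segments")
    indexes = posixpath.join(crawl_dir, "indexes")
    segment = posixpath.join(segments, segment_name)
    nutch = "bin/nutch"
    return [
        # Seed the crawldb with the start URLs.
        [nutch, "inject", crawldb, seed_dir],
        # Select the URLs that are due and write them to a new segment.
        [nutch, "generate", crawldb, segments],
        # Download the pages of that segment.
        [nutch, "fetch", segment],
        # Feed fetch results and newly found links back into the crawldb.
        [nutch, "updatedb", crawldb, segment],
        # Build the linkdb (incoming links) from the segments.
        [nutch, "invertlinks", linkdb, "-dir", segments],
        # Build the Lucene index from crawldb, linkdb and the segment.
        [nutch, "index", indexes, crawldb, linkdb, segment],
    ]


if __name__ == "__main__":
    for cmd in crawl_cycle_commands("crawl", "urls", "20091214120000"):
        print(" ".join(cmd))
```

To actually run a cycle, each list could be passed to subprocess.check_call from the Nutch install directory, once the real segment name is known.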

Good luck and see you on the mailing list.

2009/12/11, mengel men...@163.com:
 Hello,
 I am new to Nutch. I want to learn Nutch, but I can't find a design
 document covering things such as the architecture. Can you give me some
 advice on how to learn Nutch? Thank you very much.

  Mengel




-- 
-MilleBii-


Optimization in crawling and indexing

2009-12-14 Thread Rupesh Mankar
I want to see if there is any possible bandwidth optimization while using Nutch.


a) Crawling: After the initial crawl, is it possible to fetch ONLY updated
documents? Running the re-crawl command every 6 hours crawls and fetches all
documents ['db.fetch.interval.default' is 6 hours]. It should bring back only
the updated documents.

Does Nutch internally use a HEAD request to check whether a document (HTML,
PDFs and Docs) has changed or not?

b) Indexing: Can I find out, based on a timestamp, how many documents have
changed since the last re-crawl?


Thanks,
Rupesh



Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Adilson Oliveira Cruz
 Hi all,

 Has anyone successfully used Nutch to index Office 2007 documents? I know that
this question has already been asked, but considering the number of e-mails
asking the same question, it looks like Nutch does not support Office 2007
documents.

 Best,

 Adilson

On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell joe.b...@prodeasystems.com wrote:

 Hi,



 I'm also curious as to whether anyone has had success with Nutch and
 parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
 errors as seen here -
 http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949

 Is a separate plugin required to parse these documents (i.e.,
 parse-msexcel, parse-mspowerpoint, etc. will *not* work)?

 I noticed the comment on the above thread - "docx should be parsed,A
 plugin can be used to Parsed docx file. you get some
 help info from parse-html plugin and so on." - but didn't find it really
 helpful.



 Regards,

 Joe






Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche
Hi,

There is a Tika plugin in JIRA (
https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page,
support for Office 2007 was imminent in POI (which Tika uses
internally). The plan for Nutch is to progressively delegate parsing to
Tika; NUTCH-766 has been implemented for this. I haven't checked whether
Tika currently supports Office 2007, but I suggest that you try parsing docs
in this format with Tika. If it works, you'll get that automatically
via NUTCH-766.

Makes sense?

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com




Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Adilson Oliveira Cruz
 Hi,

 Thanks for the reply. I will try to use Tika with Nutch to parse the
documents. My current Nutch setup is working quite nicely and I don't want to
configure another Nutch instance.

 If I manage to get it to work I will write a mini how-to here.

 Best,

 Adilson




Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche

  If I manage to put it to work I will write here a mini how-to.


The Nutch Wiki would be the right place for that. It would be nice to
have a page there listing how the capabilities of the Tika plugin differ
from those of the existing Nutch parsing plugins (support for Office 2007
being potentially one of the differences).

Note that the Tika plugin is VERY beta.

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com




Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche
I have created a page, http://wiki.apache.org/nutch/TikaPlugin; feel free to
use it for your how-to.

J.

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


Re: Distributed Search problem

2009-12-14 Thread Dennis Kubes
Yes, the index and segments are the minimum.  You only need the segments for 
the indexes that you are serving on the local box.


Dennis

MilleBii wrote:

OK, I don't per se need distributed search.
I was trying to avoid a copy to the local file system, to save
resources by working off HDFS.

What is the minimum to copy over: index and segments? Not crawldb?
All data in segments?

2009/12/13, Dennis Kubes ku...@apache.org:

The assumption is wrong.  Distributed search is done from indexes on
local file systems not HDFS.

It doesn't return because Lucene is trying to search across the indexes
in HDFS in real time, which doesn't work because of network overhead.
Depending on the size of the indexes it may actually return after some
time, but I have seen it time out even for small indexes.

The short of it is: move the indexes and segments to a local file system,
then point the distributed search server at their parent directory.
Something like this:

bin/nutch server 8100 /full/path/to/parent/of/local/indexes

It technically doesn't have to be a full path.  Then point searcher.dir
at a directory containing search-servers.txt, as you have done.
Your search-servers.txt is fine as you have it.
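
The setup described above could look roughly like the following Python sketch. It only assembles the commands and the search-servers.txt content; nothing is executed. The HDFS path is the one from MilleBii's message, while the local directory /data/nutch/crawl and the sub-directory names indexes and segments are assumptions for illustration.

```python
import posixpath


def local_search_setup(hdfs_crawl, local_crawl, port=8100):
    """Build the commands to serve a crawl from local disk instead of HDFS."""
    commands = []
    # Copy only what the search server needs: the index and its segments.
    for part in ("indexes", "segments"):
        commands.append([
            "bin/hadoop", "fs", "-copyToLocal",
            posixpath.join(hdfs_crawl, part),
            posixpath.join(local_crawl, part),
        ])
    # Start the search server on the parent directory of the local indexes.
    commands.append(["bin/nutch", "server", str(port), local_crawl])
    # One "host port" line per search server goes into search-servers.txt.
    servers_txt = "localhost %d\n" % port
    return commands, servers_txt


if __name__ == "__main__":
    cmds, servers = local_search_setup(
        "hdfs://localhost:9000/user/nutch/crawl", "/data/nutch/crawl")
    for cmd in cmds:
        print(" ".join(cmd))
    print(servers, end="")
```

The crawldb is deliberately left out of the copy, in line with the advice that index and segments are the minimum.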

Dennis

MilleBii wrote:

I'm trying to search directly from the index in HDFS, so in distributed
mode.

What do I have wrong?

Created nutch/conf/search-servers.txt with
 localhost 8100

Pointed search.dir in nutch-site.xml to nutch/conf

Tried to start the search server with either:
 + nutch server 8100  crawl
 + nutch server 8100 hdfs://localhost:9000/user/nutch/crawl

The nutch server command doesn't return to the prompt.
Is this normal? Should I wait?

And of course if I try a search it doesn't work.






Re: OR support

2009-12-14 Thread BrunoWL

Nobody?
Please, any answer would be good.



Re: OR support

2009-12-14 Thread Andrzej Bialecki

On 2009-12-14 16:05, BrunoWL wrote:


Nobody?
Please, any answer would be good.


Please check this issue:

https://issues.apache.org/jira/browse/NUTCH-479

That's the current status, i.e. this functionality is available only as 
a patch.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Adam,
I finally got the command to work on another server (see below).  To
change the retry interval, should I just add the two properties below to
nutch-site.xml (though I tried this before and it didn't work)?

http://mysite/  Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jan 08 15:42:33 EST 2010  
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)  
Score: 1.0
Signature: e04ab1ac06075fc273dbe1334a6c6dc5
Metadata: _pst_: success(1), lastModified=0


<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>The default number of seconds between re-fetches of
  a page (30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>3600</value>
  <description>The maximum number of seconds between re-fetches of
  a page (90 days). After this period every page in the db will be
  re-tried, no matter what is its status.
  </description>
</property>


Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA  22033
Tel:  703-502-1184

www.sra.com

-Original Message-
From: BELLINI ADAM [mailto:mbel...@msn.com] 
Sent: Friday, December 11, 2009 3:11 PM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl


Hi,

You shouldn't open the crc file; you have to open the other one, which is
part-0.
Use vi to edit part-.
If you can't find this file, then your dump failed. Just check the
logs/hadoop.log file.






 Subject: RE: how to force nutch to do a recrawl
 Date: Fri, 11 Dec 2009 09:14:26 -0500
 From: vijaya_pet...@sra.com
 To: nutch-user@lucene.apache.org
 
 Adam,
 I'm using cygwin to run the scripts.  I use EditPlus to edit the
files.  But EditPlus won't allow me to edit the crc file.  I'll see if I
can ftp the file to a unix machine.
 
 
 Vijaya Peters
 SRA International, Inc.
 12500 Fair Lakes Circle
 Room 3507
 Fairfax, VA 22033
 Tel:  703-222-9207
 
 www.sra.com
 
 
 
 -Original Message-
 From: BELLINI ADAM [mailto:mbel...@msn.com]
 Sent: Thu 12/10/2009 6:43 PM
 To: nutch-user@lucene.apache.org
 Subject: RE: how to force nutch to do a recrawl
  
 
 
 But how are you running sh scripts?
 You have to use Cygwin to be able to edit Linux files.
 
 
 
 
  Subject: RE: how to force nutch to do a recrawl
  Date: Thu, 10 Dec 2009 16:09:13 -0500
  From: vijaya_pet...@sra.com
  To: nutch-user@lucene.apache.org
  
  Adam,
  I'm on Windows, unfortunately! I'm using cygdrive, but it doesn't
  recognize vi.  Any idea for opening it in Windows?  Notepad didn't work
  either.
  
  
  -Original Message-
  From: BELLINI ADAM [mailto:mbel...@msn.com] 
  Sent: Thursday, December 10, 2009 4:01 PM
  To: nutch-user@lucene.apache.org
  Subject: RE: how to force nutch to do a recrawl
  
  
  Just use vi or vim. I use vi to edit the file.
  
  
  
  
  
   Subject: RE: how to force nutch to do a recrawl
   Date: Thu, 10 Dec 2009 15:58:24 -0500
   From: vijaya_pet...@sra.com
   To: nutch-user@lucene.apache.org
   
   Adam,
   What do I use to open a 

RE: how to force nutch to do a recrawl

2009-12-14 Thread BELLINI ADAM

Yes, just add those settings to nutch-site.xml and it should work. But are
you going to recrawl every hour? I see 3600 seconds!

Another thing: you have to make an initial clean crawl with the new
fetch time, because the fetch time in the crawldb will not change
automatically. (In my case it didn't change; I just deleted the crawldb and
made a clean crawl, and it works.)
Maybe someone can tell you how to change the fetch time in the crawldb without
deleting it for an initial clean crawl.
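
To see why the old schedule sticks, it can help to do the arithmetic on the crawldb record Vijaya posted. This is only a sketch: it assumes (my reading of the record, not something confirmed in this thread) that the stored fetch time is the time of the last fetch plus the stored retry interval, and it ignores the time zone.

```python
from datetime import datetime, timedelta

# Values copied from the crawldb record posted in this thread.
next_fetch = datetime(2010, 1, 8, 15, 42, 33)   # "Fetch time" (EST, kept naive)
stored_interval = timedelta(seconds=2592000)     # "Retry interval"

# Assumption: stored fetch time = last fetch + stored interval.
last_fetch = next_fetch - stored_interval

# When the page would be due if the record carried the new 3600-second value.
new_interval = timedelta(seconds=3600)
due_with_new_interval = last_fetch + new_interval

print("stored interval in days:", stored_interval.days)
print("last fetch (derived):   ", last_fetch)
print("due with 3600 seconds:  ", due_with_new_interval)
```

Under that assumption the record still carries the 30-day interval, so the page is not due until January 8 even though nutch-site.xml now says 3600. That matches what I saw before I deleted the crawldb.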

thx



RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Thanks.
I'm on a development system, so every hour is okay.  
I guess that's why it had no effect the last time I changed the properties
file (because the crawldb won't change the fetch time automatically).

I'll give this a try - thanks much.

Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA  22033
Tel:  703-502-1184

www.sra.com
Named to FORTUNE's 100 Best Companies to Work For list for 10
consecutive years
P Please consider the environment before printing this e-mail
This electronic message transmission contains information from SRA
International, Inc. which may be confidential, privileged or
proprietary.  The information is intended for the use of the individual
or entity named above.  If you are not the intended recipient, be aware
that any disclosure, copying, distribution, or use of the contents of
this information is strictly prohibited.  If you have received this
electronic information in error, please notify us immediately by
telephone at 866-584-2143.

-Original Message-
From: BELLINI ADAM [mailto:mbel...@msn.com] 
Sent: Monday, December 14, 2009 11:38 AM
To: nutch-user@lucene.apache.org
Subject: RE: how to force nutch to do a recrawl


Yes, just add those properties to nutch-site.xml and it should work.
But are you going to recrawl every hour? I see 3600 seconds!

Another thing: you have to do an initial clean crawl with the new
fetch interval, because the fetch time already stored in the crawldb
will not change automatically. (In my case it didn't change; I just
deleted the crawldb, did a clean crawl, and it worked.)
Maybe someone can tell you how to change the fetch time in the crawldb
without deleting it and doing an initial clean crawl.

Thanks


 Subject: RE: how to force nutch to do a recrawl
 Date: Mon, 14 Dec 2009 11:26:31 -0500
 From: vijaya_pet...@sra.com
 To: nutch-user@lucene.apache.org
 
 Adam,
 I finally got the command to work on another server (see below).  To
 change the retry interval, should I just add the two properties to
 nutch-site.xml (though I tried this before and it didn't work)?
 
 http://mysite/  Version: 7
 Status: 2 (db_fetched)
 Fetch time: Fri Jan 08 15:42:33 EST 2010  
 Modified time: Wed Dec 31 19:00:00 EST 1969
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)  
 Score: 1.0
 Signature: e04ab1ac06075fc273dbe1334a6c6dc5
 Metadata: _pst_: success(1), lastModified=0
 
 
 <property>
   <name>db.fetch.interval.default</name>
   <value>3600</value>
   <description>The default number of seconds between re-fetches of
   a page (30 days).
   </description>
 </property>
 
 <property>
   <name>db.fetch.interval.max</name>
   <value>3600</value>
   <description>The maximum number of seconds between re-fetches of
   a page (90 days). After this period every page in the db will be
   re-tried, no matter what is its status.</description>
 </property>
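
As a quick sanity check on the numbers, the following Python sketch reads the two fetch-interval properties and prints them in days and hours. It is only an illustration and not part of Nutch: the helper names are made up, and the snippet is wrapped in a <configuration> root element on the assumption that it sits in nutch-site.xml that way.

```python
import xml.etree.ElementTree as ET

# Assumed layout: the two properties inside a <configuration> root element.
SNIPPET = """
<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <value>3600</value>
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <value>3600</value>
  </property>
</configuration>
"""


def read_intervals(xml_text):
    """Return {property name: seconds} for every db.fetch.interval.* property."""
    root = ET.fromstring(xml_text.strip())
    intervals = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name and value and name.startswith("db.fetch.interval."):
            intervals[name] = int(value)
    return intervals


def describe(seconds):
    """Format a number of seconds as days, hours and leftover seconds."""
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    return f"{days} d {hours} h {rest} s"


for name, seconds in read_intervals(SNIPPET).items():
    print(name, "=", seconds, "s =", describe(seconds))

# The retry interval still stored in the crawldb entry shown above:
print("crawldb entry:", describe(2592000))
```

With these values both properties come out at one hour, while the retry interval recorded in the crawldb entry is still the 30-day one.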
 
 
 Vijaya Peters
 
 -Original Message-
 From: BELLINI ADAM [mailto:mbel...@msn.com] 
 Sent: Friday, December 11, 2009 3:11 PM
 To: nutch-user@lucene.apache.org
 Subject: RE: how to force nutch to do a recrawl
 
 
 hi,
 
 You shouldn't open the crc file; you have to open the other one, which
 is part-0.
 Use vi to edit part-.
 If you cannot find this file, your dump failed; just check the
 logs/hadoop.log file.
 
 
 
 
 
 
  Subject: RE: how to force nutch to do a recrawl
  Date: Fri, 11 Dec 2009 09:14:26 -0500
  From: vijaya_pet...@sra.com
  To: nutch-user@lucene.apache.org
  
  Adam,
  I'm using cygwin to run the scripts.  I use EditPlus to edit the
  files.  But EditPlus won't allow me to edit the crc file.  I'll see if I
  can ftp the file to a unix machine.
  
  
  Vijaya Peters
  SRA International, Inc.
  12500 Fair Lakes Circle
  Room 3507
  Fairfax, VA 22033
  Tel:  703-222-9207
  
  www.sra.com

RE: how to force nutch to do a recrawl

2009-12-14 Thread BELLINI ADAM

But think about one thing: if you are recrawling too many URLs and the
crawl takes more than 1 hour, your crawl will never finish, because
every time it comes across a URL it will find that its fetch time is due
and fetch it again.
To set your fetch interval properly, you have to crawl a first time and
see how long the crawl takes to finish.
Let us say it takes 3 hours: then you should set the fetch interval to
about 5 hours, leaving 2 hours of margin for pages that time out and that
Nutch will retry.


I have run into this problem: my crawl took about 24 hours because the
fetch interval was too small (smaller than the crawl time).
Thanks
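
The rule of thumb above (measured crawl time plus a safety margin) can be written as a small calculation. This is only a sketch in Python, not anything Nutch provides; the function name is made up, and the 2-hour default margin is simply the figure from the example in this message.

```python
def fetch_interval_seconds(crawl_hours, margin_hours=2.0):
    """Return a fetch interval in seconds that is longer than one full crawl.

    crawl_hours:  measured duration of a first complete crawl.
    margin_hours: extra time for pages that time out and get retried
                  (assumed value; adjust it for your own site).
    """
    if crawl_hours <= 0 or margin_hours < 0:
        raise ValueError("crawl_hours must be > 0 and margin_hours >= 0")
    return int(round((crawl_hours + margin_hours) * 3600))


# Figures from the message: a 3-hour crawl plus 2 hours of margin.
interval = fetch_interval_seconds(3, 2)
print(interval)  # candidate value for db.fetch.interval.default
```

For the 3-hour crawl this gives 18000 seconds (5 hours), which is the value that would go into db.fetch.interval.default.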




RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Okay.  Our fetch finishes in less than 10 minutes (just intranet).  But,
I'll set it to 2 hours. 

Vijaya Peters


converting nutch crawl output to human readable content

2009-12-14 Thread Ted Yu
Hi,
I used crawl command of bin/nutch and obtained the following:

 ls crawl/crawldb/current/part-0/
data   .data.crc   index   .index.crc

How do I convert the output to human readable format ?

Thanks


Why readdb and readseg shows different figures?

2009-12-14 Thread bhavin pandya
Hi,

I am using Nutch 1.0.

As a simple exercise I crawled a single domain, and after that I
tried both the readdb and readseg commands.
They show different figures. Which one should I consider? Did
something go wrong while crawling?

Here is the output of both commands.

OUTPUT FROM READDB:

CrawlDb statistics start: crawled/crawldb
Statistics for CrawlDb: crawled/crawldb
TOTAL urls:     84178
retry 0:        84175
retry 1:        3
min score:      0.0
avg score:      7.1693314E-5
max score:      1.2
status 1 (db_unfetched):    80475
status 2 (db_fetched):      3634
status 3 (db_gone):         8
status 4 (db_redir_temp):   29
status 5 (db_redir_perm):   32
CrawlDb statistics: done


OUTPUT FROM READSEG:
---
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20091212212627  1          2009-12-12T21:28:29  2009-12-12T21:28:29  1        1
20091212212951  81         2009-12-12T21:32:20  2009-12-12T21:32:54  105      80
20091212213347  3691       2009-12-12T21:36:13  2009-12-12T22:16:39  3738     3621
2009121210      84178      2009-12-12T22:24:30  2009-12-13T11:08:28  85189    81806
20091213151344  84178      2009-12-13T15:16:37  2009-12-14T05:50:45  85195    81824


Thanks.
Bhavin


Re: Why readdb and readseg shows different figures?

2009-12-14 Thread MilleBii
Everything seems right.
Both sets of stats are useful; which one to use depends on what you are
looking for.

readdb gives you global stats for the whole crawldb, whereas readseg
reports on each segment, i.e. each fetch/parse run.
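
To illustrate the "global stats" side, here is a small Python sketch that parses the readdb figures reported in this thread and checks that the per-status counts add up to TOTAL urls. The crawldb holds one entry per URL, whereas readseg counts per segment, so a URL fetched in several segments is counted once in each. The parser is a hypothetical helper, not a Nutch tool, and it assumes the line layout shown in the original message.

```python
import re

# readdb statistics as posted in this thread (score and retry lines omitted).
READDB_STATS = """\
TOTAL urls: 84178
status 1 (db_unfetched):80475
status 2 (db_fetched):  3634
status 3 (db_gone): 8
status 4 (db_redir_temp):   29
status 5 (db_redir_perm):   32
"""


def parse_readdb_stats(text):
    """Return (total_urls, {status_name: count}) from readdb stats text."""
    total = None
    statuses = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"TOTAL urls:\s*(\d+)$", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.match(r"status \d+ \((\w+)\):\s*(\d+)$", line)
        if m:
            statuses[m.group(1)] = int(m.group(2))
    return total, statuses


total, statuses = parse_readdb_stats(READDB_STATS)
print("TOTAL urls:     ", total)
print("sum of statuses:", sum(statuses.values()))
```

Both numbers come out as 84178, so the crawldb figures are consistent with each other; the readseg table answers a different question (what happened in each run).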



-- 
-MilleBii-