Re: nutch's design document
Welcome! Nutch is different from anything else I have seen before; it is great but also difficult, so expect to spend some time on it. The best way to learn is practice, to understand what it does.

1. Front-end (search): a web site which wraps a Lucene-based index. If you are not familiar with Lucene you can buy yourself the book "Lucene in Action", but it is not strictly necessary. You can also use Solr as a more sophisticated front end.

2. Back-end (crawling to indexing): crawling is done in a number of steps (read the wiki) and uses two critical databases, crawldb and linkdb, to maintain a graph of where the engine has gone. It will fetch, parse, and index pages.

3. Cluster / cloud computing: based on Hadoop, Nutch uses the map/reduce parallel-processing technique for the different steps. There is a Hadoop book you can buy.

Good luck, and see you on the mailing list.

2009/12/11, mengel men...@163.com: Hello: I am new to Nutch. I want to learn Nutch, but I can't find a design document, such as an architecture overview. Can you give me some advice on how to learn Nutch? Thank you very much. Mengel

-- -MilleBii-
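The back-end steps listed above map onto the Nutch 1.0 command-line tools. A minimal crawl cycle can be sketched roughly like this (directory names are placeholders, and this assumes a Nutch 1.0 installation):

```shell
bin/nutch inject crawl/crawldb urls              # seed the crawldb with start URLs
bin/nutch generate crawl/crawldb crawl/segments  # create a fetch list (a new segment)
s=`ls -d crawl/segments/* | tail -1`             # pick the newest segment
bin/nutch fetch $s                               # download the pages
bin/nutch parse $s                               # extract text and outlinks
bin/nutch updatedb crawl/crawldb $s              # merge results back into the crawldb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # build the link graph
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*  # build the Lucene index
```

The all-in-one `bin/nutch crawl` command runs essentially this loop for you; running the steps by hand is the best way to see what the crawldb and linkdb actually do.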
Optimization in crawling and indexing
I want to see if any bandwidth optimization is possible while using Nutch.

a) Crawling: after the initial crawl, can Nutch fetch ONLY updated documents? A re-crawl command every 6 hours will crawl and fetch all documents ('db.fetch.interval.default' is 6 hours), but it should bring back only updated documents. Does Nutch internally use a HEAD request to check whether a document (HTML, PDF, DOC) has changed or not?

b) Indexing: can I find out, based on a timestamp, how many documents have changed since the last re-crawl?

Thanks, Rupesh

DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
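For question (b), one way to inspect per-URL timestamps is to dump the crawldb and look at the "Fetch time:" lines in the records; a sketch, assuming a local crawl directory named `crawl` (the dump directory name and grep pattern are illustrative):

```shell
# Dump the crawldb to plain text; Hadoop writes the records into part-00000
bin/nutch readdb crawl/crawldb -dump dump
# Each record carries "Fetch time:" and "Modified time:" lines;
# count the entries fetched on a given day, for example:
grep -c "Fetch time: .* Dec 2009" dump/part-00000
# Aggregate counts by status are also available:
bin/nutch readdb crawl/crawldb -stats
```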
Re: Nutch 1.0 and Office 2007 documents
Hi all,

Has anyone successfully used Nutch to index Office 2007 documents? I know that this question has already been asked, but considering the number of e-mails asking the same question, it looks like Nutch does not support Office 2007 documents.

Best, Adilson

On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell joe.b...@prodeasystems.com wrote: Hi, I'm also curious as to whether anyone has had success with Nutch and parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same errors as seen here: http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949 Is a separate plugin required to parse these documents (i.e., parse-msexcel, parse-mspowerpoint, etc. will *not* work)? I noticed the comment on the above thread - "docx should be parsed; a plugin can be used to parse docx files; you can get some help info from the parse-html plugin and so on" - but didn't find it really helpful. Regards, Joe

This message is confidential to Prodea Systems, Inc unless otherwise indicated or apparent from its nature. This message is directed to the intended recipient only, who may be readily determined by the sender of this message and its contents. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient: (a) any dissemination or copying of this message is strictly prohibited; and (b) immediately notify the sender by return message and destroy any copies of this message in any form (electronic, paper or otherwise) that you have. The delivery of this message and its information is neither intended to be nor constitutes a disclosure or waiver of any trade secrets, intellectual property, attorney work product, or attorney-client communications. The authority of the individual sending this message to legally bind Prodea Systems is neither apparent nor implied, and must be independently verified.
Re: Nutch 1.0 and Office 2007 documents
Hi,

There is a Tika plugin in JIRA (https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page, support for Office 2007 was imminent in POI (which Tika uses internally). The plan for Nutch is to progressively delegate parsing to Tika; NUTCH-766 has been implemented for this. I haven't checked whether Tika currently supports Office 2007, but I suggest that you try parsing documents in that format with Tika directly; if it works, then you'll get that automatically via NUTCH-766.

Makes sense?

Julien -- DigitalPebble Ltd http://www.digitalpebble.com
Re: Nutch 1.0 and Office 2007 documents
Hi,

Thanks for the reply. I will try to use Tika with Nutch to parse the documents. My current Nutch setup is working quite nicely and I don't want to configure another Nutch instance. If I manage to put it to work I will write a mini how-to here.

Best, Adilson
Re: Nutch 1.0 and Office 2007 documents
"If I manage to put it to work I will write here a mini how-to."

The Nutch Wiki would be the right place for doing that. It would also be nice to have a page there listing the differences between the capabilities of the Tika plugin and the existing Nutch parsing plugins, as there might be differences between them (support for Office 2007 potentially being one of them).

Note that the Tika plugin is VERY beta.

Julien -- DigitalPebble Ltd http://www.digitalpebble.com
Re: Nutch 1.0 and Office 2007 documents
I have created a page at http://wiki.apache.org/nutch/TikaPlugin; feel free to use it for your how-to.

J.

-- DigitalPebble Ltd http://www.digitalpebble.com
Re: Distributed Search problem
Index and segments is the minimum, yes. You only need the segments for the indexes that you are serving on the local box. Dennis

MilleBii wrote: Ok, I don't per se need distributed search. I was trying to avoid a copy to the local file system, to optimize resources by working off HDFS. What is the minimum to copy over - index and segments? Not crawldb? All data in segments?

2009/12/13, Dennis Kubes ku...@apache.org: The assumption is wrong. Distributed search is done from indexes on local file systems, not HDFS. It doesn't return because Lucene is trying to search across the indexes in HDFS in real time, which doesn't work because of network overhead. Depending on the size of the indexes it may actually return after some time, but I have seen it time out even for small indexes. The short of it is: move the indexes and segments to a local file system, then point the distributed search server at their parent directory. Something like this:

bin/nutch server 8100 /full/path/to/parent/of/local/indexes

It technically doesn't have to be a full path. Then point searcher.dir to a directory with search-servers.txt, as you have done. The search-servers.txt entries point as you have them. Dennis

MilleBii wrote: I'm trying to search directly from the index in HDFS, i.e. in distributed mode. What do I have wrong? I created nutch/conf/search-servers.txt with "localhost 8100", pointed searcher.dir in nutch-site.xml to nutch/conf, and tried to start the search server with either:

+ nutch server 8100 crawl
+ nutch server 8100 hdfs://localhost:9000/user/nutch/crawl

The nutch server command doesn't return to the prompt - is this normal, should I wait? And of course if I try a search it doesn't work.
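Dennis's recipe can be sketched as follows; the local paths are placeholders, and this assumes a running Hadoop cluster plus a Nutch 1.0 install:

```shell
# Copy the index and segments out of HDFS to local disk
bin/hadoop fs -copyToLocal crawl/indexes /data/search/indexes
bin/hadoop fs -copyToLocal crawl/segments /data/search/segments

# searcher.dir in nutch-site.xml points at a directory containing
# search-servers.txt, which lists one "host port" pair per server
echo "localhost 8100" > conf/search-servers.txt

# Start the distributed search server against the *local* parent directory
bin/nutch server 8100 /data/search
```

Note that `bin/nutch server` runs in the foreground, which is why it does not return to the prompt; that is normal.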
Re: OR support
Nobody? Please, any answer would be good. -- View this message in context: http://old.nabble.com/OR-support-tp26680899p26779229.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: OR support
On 2009-12-14 16:05, BrunoWL wrote: Nobody? Please, any answer would be good.

Please check this issue: https://issues.apache.org/jira/browse/NUTCH-479

That's the current status, i.e. this functionality is available only as a patch.

-- Best regards, Andrzej Bialecki - Information Retrieval, Semantic Web, Embedded Unix, System Integration - http://www.sigram.com Contact: info at sigram dot com
RE: how to force nutch to do a recrawl
Adam,

I finally got the command to work on another server (see below). To change the retry interval, should I just add the two properties below into nutch-site.xml? (I tried this before and it didn't work.)

http://mysite/
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jan 08 15:42:33 EST 2010
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: e04ab1ac06075fc273dbe1334a6c6dc5
Metadata: _pst_: success(1), lastModified=0

<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>The default number of seconds between re-fetches of a page (30 days).</description>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>3600</value>
  <description>The maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what is its status.</description>
</property>

Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004, Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com Named to FORTUNE's 100 Best Companies to Work For list for 10 consecutive years. P Please consider the environment before printing this e-mail

This electronic message transmission contains information from SRA International, Inc. which may be confidential, privileged or proprietary. The information is intended for the use of the individual or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this information is strictly prohibited. If you have received this electronic information in error, please notify us immediately by telephone at 866-584-2143.

-----Original Message----- From: BELLINI ADAM [mailto:mbel...@msn.com] Sent: Friday, December 11, 2009 3:11 PM To: nutch-user@lucene.apache.org Subject: RE: how to force nutch to do a recrawl

Hi, you shouldn't open the crc file; you have to open the other one, which is part-00000. Use vi to edit part-00000. If you can't find this file, your dump failed - just check the logs/hadoop.log file.

Subject: RE: how to force nutch to do a recrawl Date: Fri, 11 Dec 2009 09:14:26 -0500 From: vijaya_pet...@sra.com To: nutch-user@lucene.apache.org

Adam, I'm using cygwin to run the scripts. I use EditPlus to edit the files, but EditPlus won't allow me to edit the crc file. I'll see if I can ftp the file to a unix machine. Vijaya

-----Original Message----- From: BELLINI ADAM [mailto:mbel...@msn.com] Sent: Thu 12/10/2009 6:43 PM To: nutch-user@lucene.apache.org Subject: RE: how to force nutch to do a recrawl

But how are you running the sh scripts? You have to use cygwin to be able to edit linux files.

Subject: RE: how to force nutch to do a recrawl Date: Thu, 10 Dec 2009 16:09:13 -0500 From: vijaya_pet...@sra.com To: nutch-user@lucene.apache.org

Adam, I'm on windows, unfortunately! I'm using cygdrive, but it doesn't recognize vi. Any idea for opening it in windows? Notepad didn't work either. Vijaya

-----Original Message----- From: BELLINI ADAM [mailto:mbel...@msn.com] Sent: Thursday, December 10, 2009 4:01 PM To: nutch-user@lucene.apache.org Subject: RE: how to force nutch to do a recrawl

Just use vi or vim; I use vi to edit the file.

Subject: RE: how to force nutch to do a recrawl Date: Thu, 10 Dec 2009 15:58:24 -0500 From: vijaya_pet...@sra.com To: nutch-user@lucene.apache.org

Adam, What do I use to open a
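The dump-and-edit workflow Adam describes can be sketched as follows (the crawl and dump directory names are placeholders):

```shell
# Dump the crawldb to text; Hadoop writes the records into part-00000
# plus a hidden .part-00000.crc checksum file - edit only the former
bin/nutch readdb crawl/crawldb -dump dumpdir
vi dumpdir/part-00000
# If part-00000 is missing, the dump failed; check logs/hadoop.log
```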
RE: how to force nutch to do a recrawl
Yes, just add those config properties in nutch-site.xml and it should work. But are you going to recrawl every hour? I see 3600 seconds!

Another thing: you have to do an initial clean crawl with the new fetch time, because the crawldb will not change the fetch time automatically. (In my case it didn't change; I just deleted the crawldb, made a clean crawl, and it worked.) Maybe someone can tell you how to change the fetch time in the crawldb without deleting it for an initial clean crawl.

Thx
RE: how to force nutch to do a recrawl
Thanks. I'm on a development system, so every hour is okay. I guess that's why the last time I changed the properties file it didn't take any effect (because crawldb won't change the fetch time automatically). I'll give this a try - thanks much. Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004 Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com Named to FORTUNE's 100 Best Companies to Work For list for 10 consecutive years P Please consider the environment before printing this e-mail This electronic message transmission contains information from SRA International, Inc. which may be confidential, privileged or proprietary. The information is intended for the use of the individual or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this information is strictly prohibited. If you have received this electronic information in error, please notify us immediately by telephone at 866-584-2143. -Original Message- From: BELLINI ADAM [mailto:mbel...@msn.com] Sent: Monday, December 14, 2009 11:38 AM To: nutch-user@lucene.apache.org Subject: RE: how to force nutch to do a recrawl yes just add those config in the nutch-site.xml and it should work. but are you going to recrawl every hour ??? i see 3600 secondes !! another thing is you have to make an initial clean crawl with the new fetchtime , because in the crawldb it will not change the fetch time automaticly . (in my case it didnt change, i just deleted the crawldb and made a clean crawl and it works) mabe someone can tell you how to change the fecthtime in the crawldb without deleting it for an intial clean crawl. thx Subject: RE: how to force nutch to do a recrawl Date: Mon, 14 Dec 2009 11:26:31 -0500 From: vijaya_pet...@sra.com To: nutch-user@lucene.apache.org Adam, I finally go the command to work on another server (see below). 
to change the retry interval, should I just add the two properties into nutch-site.xml (though I tried this before and it didn't work): http://mysite/Version: 7 Status: 2 (db_fetched) Fetch time: Fri Jan 08 15:42:33 EST 2010 Modified time: Wed Dec 31 19:00:00 EST 1969 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0 Signature: e04ab1ac06075fc273dbe1334a6c6dc5 Metadata: _pst_: success(1), lastModified=0 property namedb.fetch.interval.default/name value3600/value descriptionThe default number of seconds between re-fetches of a page 30 days). /description /property property namedb.fetch.interval.max/name value3600/value descriptionThe maximum number of seconds between re-fetches of a page(90 days). After this period every page in the db will be re-tried, no matter what is its status. /description /property Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004 Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com Named to FORTUNE's 100 Best Companies to Work For list for 10 consecutive years P Please consider the environment before printing this e-mail This electronic message transmission contains information from SRA International, Inc. which may be confidential, privileged or proprietary. The information is intended for the use of the individual or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this information is strictly prohibited. If you have received this electronic information in error, please notify us immediately by telephone at 866-584-2143. -Original Message- From: BELLINI ADAM [mailto:mbel...@msn.com] Sent: Friday, December 11, 2009 3:11 PM To: nutch-user@lucene.apache.org Subject: RE: how to force nutch to do a recrawl hi, you shouldnt open the crc file you have to open the other one, which is part-0. use vi top edit part-. 
If you can't find that file, your dump failed; just check logs/hadoop.log.

Subject: RE: how to force nutch to do a recrawl
Date: Fri, 11 Dec 2009 09:14:26 -0500
From: vijaya_pet...@sra.com
To: nutch-user@lucene.apache.org

Adam, I'm using cygwin to run the scripts. I use EditPlus to edit the files, but EditPlus won't let me edit the crc file. I'll see if I can ftp the file to a unix machine.
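As an aside, the plain-text records that a crawldb dump produces (like the one quoted above) are easy to post-process. Here is a minimal sketch in Python; it only assumes the "Field: value" line layout shown in this thread, not any Nutch API:

```python
# Parse one plain-text crawldb record (as shown in the readdb dump quoted
# in this thread) into a dict. Each line is "Field: value".

def parse_crawldb_record(text):
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if ": " in line:
            field, value = line.split(": ", 1)
            record[field] = value
    return record

# Sample record copied from the message above.
sample = """Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jan 08 15:42:33 EST 2010
Retry interval: 2592000 seconds (30 days)
Score: 1.0"""

rec = parse_crawldb_record(sample)
print(rec["Status"])          # → 2 (db_fetched)
print(rec["Retry interval"])  # → 2592000 seconds (30 days)
```

This kind of one-off filter is handy for answering questions like "how many records have retry interval X" without walking the dump by hand.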
RE: how to force nutch to do a recrawl
But think about one thing: if you are recrawling too many URLs and the crawl takes more than one hour, your crawl will never finish, because every time it finds a URL whose fetch time has come due, it will fetch it again. To set your fetch interval well, crawl a first time and see how long the crawl takes to finish. Say it takes 3 hours; then set the fetch interval to something like 5 hours, giving it 2 hours of slack for pages that time out and that Nutch will retry. I hit this problem myself, and my crawl took about 24 hours because of a fetch interval smaller than the crawl time.

thx
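The sizing rule above can be written down explicitly. A small sketch (the helper names are made up for illustration, this is not Nutch code): the refetch interval has to exceed the measured crawl duration plus some slack for retries, otherwise pages come due again while the crawl is still running:

```python
# Sketch of the rule from this thread: pick a refetch interval larger than
# the observed crawl duration, plus slack for timed-out pages that get
# retried. Helper names are hypothetical, for illustration only.

def safe_fetch_interval(crawl_duration_hours, slack_hours=2):
    """Smallest sensible refetch interval, in seconds."""
    return int((crawl_duration_hours + slack_hours) * 3600)

def interval_too_small(interval_seconds, crawl_duration_hours):
    """True when pages would come due again mid-crawl."""
    return interval_seconds < crawl_duration_hours * 3600

# The example from the thread: a 3-hour crawl wants a ~5-hour interval.
print(safe_fetch_interval(3))        # → 18000 seconds (5 hours)
print(interval_too_small(3600, 3))   # → True: 1h interval, 3h crawl
```

The second check is exactly the failure mode described above: a 3600-second interval against a multi-hour crawl re-queues everything mid-run.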
RE: how to force nutch to do a recrawl
Okay. Our fetch finishes in less than 10 minutes (it's just our intranet), but I'll set the interval to 2 hours.

Vijaya Peters
SRA International, Inc.
converting nutch crawl output to human readable content
Hi, I used the crawl command of bin/nutch and obtained the following:

ls crawl/crawldb/current/part-0/
data .data.crc index .index.crc

How do I convert the output to a human-readable format? Thanks
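For what it's worth, the usual way to get a human-readable view is not to open the part-0 data files directly but to let Nutch dump them as text. A sketch, assuming a Nutch 1.x install with bin/nutch available (paths are illustrative):

```shell
# Dump the crawldb to plain text, one record per URL (the same record
# format quoted elsewhere in this digest). The .crc files are Hadoop
# checksum files and can be ignored.
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# The text lands in part-* files under the output directory:
cat crawldb_dump/part-*

# For aggregate numbers only, -stats avoids the full dump:
bin/nutch readdb crawl/crawldb -stats
```

The same pattern applies to segments via readseg (e.g. a -dump option), but readdb -dump is the direct answer for the crawldb files listed above.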
Why readdb and readseg shows different figures?
Hi, I am using Nutch 1.0. As a simple exercise I crawled one single domain, and afterwards I tried both the readdb and readseg commands. They show different figures. Which one should I consider? Did something go wrong while crawling? Here is the output of both commands.

OUTPUT FROM READDB:
CrawlDb statistics start: crawled/crawldb
Statistics for CrawlDb: crawled/crawldb
TOTAL urls: 84178
retry 0: 84175
retry 1: 3
min score: 0.0
avg score: 7.1693314E-5
max score: 1.2
status 1 (db_unfetched): 80475
status 2 (db_fetched): 3634
status 3 (db_gone): 8
status 4 (db_redir_temp): 29
status 5 (db_redir_perm): 32
CrawlDb statistics: done

OUTPUT FROM READSEG:
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20091212212627  1          2009-12-12T21:28:29  2009-12-12T21:28:29  1        1
20091212212951  81         2009-12-12T21:32:20  2009-12-12T21:32:54  105      80
20091212213347  3691       2009-12-12T21:36:13  2009-12-12T22:16:39  3738     3621
2009121210      84178      2009-12-12T22:24:30  2009-12-13T11:08:28  85189    81806
20091213151344  84178      2009-12-13T15:16:37  2009-12-14T05:50:45  85195    81824

Thanks.
Bhavin
Re: Why readdb and readseg shows different figures?
Everything seems right. Both sets of stats are useful; it all depends on what you are looking for. readdb gives you global stats, whereas readseg reports on each segment, i.e. each fetch/parse run.

2009/12/15, bhavin pandya bvnpan...@gmail.com:

--
-MilleBii-
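One concrete reason the two views disagree: readseg counts are per run, so a URL fetched in several segments is counted once per segment, while the crawldb keeps a single record per URL. A toy illustration (made-up URLs, not Nutch code):

```python
# Per-segment counts vs. a global, deduplicated count. A URL refetched in
# a later run is counted again by segment-level stats, but the crawldb
# holds one record per URL.

segments = [
    {"http://a/", "http://b/"},                # run 1 fetched these
    {"http://a/", "http://b/", "http://c/"},   # run 2 refetched a and b
]

# What summing readseg-style FETCHED columns gives:
per_segment_total = sum(len(s) for s in segments)

# What a crawldb-style global count gives:
distinct_urls = len(set().union(*segments))

print(per_segment_total)  # → 5
print(distinct_urls)      # → 3
```

This is why, in the output above, summing the FETCHED column across segments can dwarf the crawldb's db_fetched figure without anything being wrong.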