Re: How to get all the crawled pages for particular domain
Hi Bhavin, other Nutch users may comment on this, but it seems to me that working on top of the nutchbase branch might allow you to perform that type of processing quite easily. -y

bhavin pandya wrote: Hi, I have set up Nutch 1.0 on a cluster of 3 nodes. We are running two applications. 1. A Nutch-based search application. We have successfully crawled approx. 25M pages on 3 nodes. It's working as expected. 2. An application which needs to extract some information for a particular domain. To date this application uses a Heritrix-based crawler which crawls the given domain; the algorithms then go through the pages and extract the required information. Since we are already crawling with Nutch in distributed mode, we don't want to recrawl with another tool like Heritrix for the 2nd application. I want to reuse the same crawled data for the 2nd application as well. But the extraction algorithms require all the crawled pages for a particular domain, to extract all relevant information about that domain. I thought that if, by writing some plugin in Nutch, I could feed Nutch's crawled data to the 2nd application, it would really save us work, money and effort by not recrawling. But how do I get all the crawled pages for a particular domain in my plugin? Where should I look in the Nutch code? Any pointer / idea in this direction will really help. Thanks. Bhavin
Re: How to get all the crawled pages for particular domain
There is a domain-url filter. Is that what you were looking for? Dennis

Yves Petinot wrote: Hi Bhavin, other Nutch users may comment on this, but it seems to me that working on top of the nutchbase branch might allow you to perform that type of processing quite easily. ...Snip...
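A sketch of the two pieces being discussed, assuming Nutch 1.0 defaults; the domain name, segment name, and file locations below are placeholders, and the exact plugin/file names should be checked against your conf directory. The domain URL filter keeps a crawl inside listed domains, and readseg -dump writes already-fetched pages out as text that a second application can filter by host:

  # conf/domain-urlfilter.txt -- one domain per line; URLs outside these
  # domains are rejected (the urlfilter-domain plugin must be listed in
  # plugin.includes in nutch-site.xml)
  example.com

  # Dump an existing segment as text, then pull out one domain's pages.
  # The segment name is a placeholder; the output file name ("dump")
  # is Nutch's default for SegmentReader.
  bin/nutch readseg -dump crawl/segments/20091210120000 seg_dump
  grep -A 20 'example.com' seg_dump/dump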
NOINDEX, NOFOLLOW
hi, I have a page with <meta name="robots" content="noindex,nofollow" />. Now, I know that Nutch obeys this tag, because I don't find the content and the title in my index, but I expected that the document would not be present in the index at all. Why does it keep the document in my index with no title and no content?? I'm using the index-basic and index-more plugins, and I want to understand why Nutch still fills in the url, date, boost, etc. when it didn't do so for title and content. I was thinking that if Nutch obeys nofollow and noindex, it would skip the whole document! Or maybe I misunderstood something; can you please explain this behavior to me? best regards.
RE: how to force nutch to do a recrawl
hi, check the fetch time in your crawldb... you can dump the whole crawldb like this:

  ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db

entries will look like this:

  http://www.YOUR_URL_TO_FETCH
  Status: 2 (db_fetched)
  Fetch time: Thu Dec 10 09:19:18 EST 2009
  Modified time: Wed Dec 31 19:00:00 EST 1969
  Retries since fetch: 0
  Retry interval: 18000 seconds (0 days)
  Score: 0.0014977538
  Signature: 15cdb66f15eb4dd3e1b16581b78XcfC5c
  Metadata: _pst_: success(1), lastModified=0

as you see, the next time the page will be fetched is the fetch time: 'Fetch time: Thu Dec 10 09:19:18 EST 2009'. And check the retry interval: it should be your 3600. hope it will help

Subject: RE: how to force nutch to do a recrawl Date: Wed, 9 Dec 2009 16:06:58 -0500 From: vijaya_pet...@sra.com  Okay. I'll dig a little deeper. I saw a few scripts that people had created, but I couldn't get them to work. Thanks much. Vijaya

-Original Message- From: MilleBii Sent: Wednesday, December 09, 2009 4:05 PM Subject: Re: how to force nutch to do a recrawl  I don't think you can use the nutch crawl command to do that; it is a one-stop-shop command. You probably want to use the individual commands. Type nutch generate to get the help and you will see the option -adddays; read this page on the wiki to get a feel for how you should do it: http://wiki.apache.org/nutch/Crawl

2009/12/9 Peters, Vijaya: I didn't see a setting to override in crawl-urlfilter. How do I set numberDays? I have regular expressions to include/exclude certain extensions and certain urls, but that's all I have in there. Please send me an example and I'll give it a try. Thanks! Vijaya

-Original Message- From: xiao yang Sent: Wednesday, December 09, 2009 1:41 PM Subject: Re: how to force nutch to do a recrawl  What about the configuration in crawl-urlfilter.txt?

On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya wrote: I tried that too. In nutch-site.xml, I added the below, but it had no effect:

  <property>
    <name>db.default.fetch.interval</name>
    <value>0</value>
    <description>(DEPRECATED) The default number of days between
    re-fetches of a page. value was 30</description>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>3600</value>
    <description>The default number of seconds between re-fetches of a
    page (30 days). value was 2592000 (30 days)</description>
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <value>3600</value>
    <description>The maximum number of seconds between re-fetches of a
    page (90 days). After this period every page in the db will be
    re-tried, no matter what is its status. value was 7776000</description>
  </property>

Vijaya
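A sketch of the individual-command loop MilleBii is pointing at, assuming Nutch 1.0 and a local crawl directory named 'crawl' (both placeholders); -adddays shifts the generator's clock forward so entries whose fetch time has not yet arrived become due now. Indexing steps are omitted:

  # treat pages due within the next 31 days as due now
  bin/nutch generate crawl/crawldb crawl/segments -adddays 31
  # pick up the segment the generator just created (newest by name)
  segment=crawl/segments/$(ls crawl/segments | sort | tail -1)
  bin/nutch fetch $segment
  # fold the new fetch results back into the crawldb
  bin/nutch updatedb crawl/crawldb $segment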
domain vs www.domain?
I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically, I am seeing www.domain.com and domain.com being recognized as two different sites. I imagine there is a setting to prevent this. If so, what is the setting? If not, what would you recommend doing to prevent this? Jesse

int GetRandomNumber() { return 4; // Chosen by fair roll of dice // Guaranteed to be random } // xkcd.com
RE: how to force nutch to do a recrawl
Adam, I tried running that command and got the following (it created a whole_db directory, but it's not dumping the contents to the console):

  $ bin/nutch readdb crawl/crawldb/ -dump whole_db
  CrawlDb dump: starting
  CrawlDb db: crawl/crawldb/
  CrawlDb dump: done

Vijaya

-Original Message- From: BELLINI ADAM Sent: Thursday, December 10, 2009 1:40 PM Subject: RE: how to force nutch to do a recrawl  hi, check the fetch time in your crawldb... you can dump the whole crawldb like this: ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db ...Snip...
Re: NOINDEX, NOFOLLOW
On Thu, Dec 10, 2009 at 12:22 PM, BELLINI ADAM wrote: hi, I have a page with <meta name="robots" content="noindex,nofollow" /> ...Snip...

My guess is that the page is recorded to note that it shouldn't be indexed; I'm guessing its status is one of the magic values. It probably re-fetches the page periodically to ensure it has the latest version. So it makes sense to me why the URL and the date are populated. I don't know why it computes the boost, other than the fact that it might be part of the OPIC scoring algorithm. If the scoring algorithm ever uses the scores/boost of the pages that you point at as a contributing factor, it would make total sense. So even though it doesn't index "http://example/foo/bar", knowing which pages point there, and what their scores are, could contribute to the scores of pages that you do index that contain an outlink to that page. Kirby
RE: how to force nutch to do a recrawl
It will not dump to the console! whole_db is a folder, and you have to open the file you will find in that folder.

Subject: RE: how to force nutch to do a recrawl Date: Thu, 10 Dec 2009 14:26:30 -0500 From: vijaya_pet...@sra.com  Adam, I tried running that command and got the following (it created a whole_db directory, but it's not dumping the contents to the console) ...Snip...
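Concretely, and assuming a local (single-reducer) run with Hadoop's default file layout, the text Adam is referring to lands in a part file inside the output folder:

  # the dump is plain text; the part file name follows Hadoop's convention
  less whole_db/part-00000
  # or pull out a single URL's entry directly
  grep -A 8 'http://www.example.com/' whole_db/part-00000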
RE: NOINDEX, NOFOLLOW
hi, thanks for this information. But since I'm indexing into Solr, when I make a search I get a blank result... For example, if I have 10 documents as a search result, 9 will be OK (because I display the title and the first 4 lines of content), but I get one blank result because of this page (with no content and no title)! I don't understand why it is in the index at all, since it was set to noindex!? Here is an example, searching for word1:

  results:
  1- title 1 : content1
  2- title 1 : content2
  3- title 1 : content3
  4- title 1 : content4
  5- title 1 : content5
  6- title 1 : content6
  7- title 1 : content7
  8- title 1 : content8
  9- BLANK..
  10- title 1 : content10

From: kirby.bohl...@gmail.com Date: Thu, 10 Dec 2009 13:33:18 -0600 Subject: Re: NOINDEX, NOFOLLOW  My guess is that the page is recorded to note that it shouldn't be indexed ...Snip...
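One possible query-side workaround, a sketch rather than a fix for the underlying indexing behavior, and assuming only the blocked pages lack a title field: add a Solr filter query so that only documents with any title value are returned (the open range [* TO *] matches documents where the field exists):

  # '+' encodes the spaces in [* TO *]; host, port and core are placeholders
  http://localhost:8983/solr/select?q=word1&fq=title:[*+TO+*]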
Re: NOINDEX, NOFOLLOW
On 2009-12-10 20:33, Kirby Bohling wrote: My guess is that the page is recorded to note that it shouldn't be indexed ...Snip...

Very good explanation, that's exactly the reason why Nutch never discards such pages. If you really want to ignore certain pages, then use URLFilters and/or ScoringFilters.

-- Best regards, Andrzej Bialecki, http://www.sigram.com
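A sketch of the URLFilter route, assuming the stock urlfilter-regex plugin is enabled; the pattern is a placeholder for whatever identifies the pages you want dropped before they ever reach the fetcher or the index. Rules in conf/regex-urlfilter.txt are applied top-down and the first match wins:

  # reject the offending pages outright
  -^http://www\.example\.com/noindexed-section/
  # accept anything else (the stock catch-all rule)
  +.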
RE: how to force nutch to do a recrawl
Adam, What do I use to open a CRC file? I tried QuickSFV. Thanks in advance! Vijaya

-Original Message- From: BELLINI ADAM Sent: Thursday, December 10, 2009 3:48 PM Subject: RE: how to force nutch to do a recrawl  It will not dump to the console! whole_db is a folder, and you have to open the file you will find in that folder ...Snip...
RE: how to force nutch to do a recrawl
Just use vi or vim; I use vi to edit the file.

Subject: RE: how to force nutch to do a recrawl Date: Thu, 10 Dec 2009 15:58:24 -0500 From: vijaya_pet...@sra.com  Adam, What do I use to open a CRC file? I tried QuickSFV. ...Snip...
Re: domain vs www.domain?
On 2009-12-10 19:59, Jesse Hires wrote: I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically, I am seeing www.domain.com and domain.com being recognized as two different sites. ...Snip...

This is a surprisingly difficult problem to solve in the general case, because it's not always true that 'www.domain' equals 'domain'. If you do know this is true in your particular case, you can add a rule to regex-urlnormalizer that changes the matching urls to e.g. always lose the 'www.' part.

-- Best regards, Andrzej Bialecki, http://www.sigram.com
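A sketch of such a rule, added inside the <regex-normalize> element of conf/regex-normalize.xml (read by the urlnormalizer-regex plugin); the domain is a placeholder, and this assumes you have verified that both hosts really serve the same content:

  <!-- strip a leading 'www.' from this one known domain -->
  <regex>
    <pattern>^(https?://)www\.(example\.com)</pattern>
    <substitution>$1$2</substitution>
  </regex>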
Re: NOINDEX, NOFOLLOW
On Thu, Dec 10, 2009 at 2:55 PM, BELLINI ADAM wrote: hi, thanks for this information. But since I'm indexing into Solr, when I make a search I get a blank result ...Snip...

I've never used the Solr integration, so I'm unable to help you. This sounds like a bug to me, but I'm not sure. Hopefully one of the Solr users will help us out and let you know what they think. Thanks, Kirby
RE: how to force nutch to do a recrawl
Adam, I'm on Windows, unfortunately!! I'm using cygdrive, but it doesn't recognize vi. Any idea for opening it in Windows? Notepad didn't work either. Vijaya

-Original Message- From: BELLINI ADAM Sent: Thursday, December 10, 2009 4:01 PM Subject: RE: how to force nutch to do a recrawl  Just use vi or vim; I use vi to edit the file. ...Snip...
RE: how to force nutch to do a recrawl
But how are you running the sh scripts, then? You have to use Cygwin to be able to edit Linux files.

Subject: RE: how to force nutch to do a recrawl Date: Thu, 10 Dec 2009 16:09:13 -0500 From: vijaya_pet...@sra.com  Adam, I'm on Windows, unfortunately!! I'm using cygdrive, but it doesn't recognize vi. Any idea for opening it in Windows? Notepad didn't work either. ...Snip...
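A likely source of the CRC confusion here: when Nutch writes to the local filesystem, Hadoop leaves a hidden .crc checksum file next to each output file. The .crc file is not meant to be opened; the readable dump is the part file beside it, which any plain-text editor can display. A sketch, assuming a default local run (the exact listing is illustrative):

  $ ls -a whole_db
  .  ..  .part-00000.crc  part-00000
  # .part-00000.crc is a Hadoop checksum, not the data;
  # the dump itself is part-00000
  $ less whole_db/part-00000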
Re: domain vs www.domain?
For the specific case I was running into (on a single known domain), using regex-urlnormalizer did the trick. Thanks! Jesse

On Thu, Dec 10, 2009 at 1:01 PM, Andrzej Bialecki a...@getopt.org wrote: This is a surprisingly difficult problem to solve in the general case ...Snip...