http keep alive
Hi.

Is there a way to use HTTP keep-alive with Nutch? Does protocol-http or protocol-httpclient support keep-alive? I can't find any use of HTTP keep-alive in the code or in the configuration files.

Thanks,
Marko
Re: http keep alive
Marko Bauhardt wrote:
> Hi. Is there a way to use HTTP keep-alive with Nutch? Does protocol-http or
> protocol-httpclient support keep-alive? I can't find any use of HTTP
> keep-alive in the code or in the configuration files.

protocol-httpclient can support keep-alive. However, I think that it won't help you much. Please consider that the Fetcher needs to wait some time between requests to the same site, and in the meantime it will issue requests to other sites. This means that if you use keep-alive connections, the number of open connections will climb quickly, depending on the number of unique sites on your fetchlist, until you run out of available sockets.

On the other hand, if the number of unique sites is small, then most of the time the Fetcher will be waiting anyway, so the benefit from keep-alives (for you as a client) will be small, though there will still be some benefit for the server side.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
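The socket growth described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Nutch code: the class and method names are hypothetical, the fetcher is assumed to visit hosts round-robin with one request per tick, and a kept-alive connection is assumed to close once it has been idle for more than `idleTimeoutTicks`.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, not part of Nutch): counts how many kept-alive
// connections are open at once when a fetcher cycles through many hosts.
final class KeepAliveSocketSketch {

    /**
     * Simulates 'requests' fetches, one per tick, visiting hosts round-robin.
     * A connection stays open until it has been idle for more than
     * idleTimeoutTicks. Returns the peak number of simultaneously open connections.
     */
    static int peakOpenConnections(int uniqueHosts, int requests, int idleTimeoutTicks) {
        Map<Integer, Integer> lastUsed = new HashMap<>(); // host id -> tick of last request
        int peak = 0;
        for (int tick = 0; tick < requests; tick++) {
            final int now = tick;
            // Close connections that have been idle longer than the timeout.
            lastUsed.values().removeIf(t -> now - t > idleTimeoutTicks);
            // Fetch from the next host, opening or reusing its connection.
            lastUsed.put(tick % uniqueHosts, tick);
            peak = Math.max(peak, lastUsed.size());
        }
        return peak;
    }

    public static void main(String[] args) {
        // Many unique hosts and a long idle timeout: one open socket per host.
        System.out.println(peakOpenConnections(1000, 5000, 10000));
        // Few unique hosts: the count stays small, but so does the benefit.
        System.out.println(peakOpenConnections(5, 5000, 99));
    }
}
```

In this model the peak is bounded by the number of unique hosts and by how many hosts are visited within one idle timeout, which matches the trade-off described in the reply.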
Re: Recrawling Nutch
Nutch doesn't do a good job of storing or testing the Last-Modified time of pages it has crawled. I made the following changes, which seem to help a lot:

snowbird:~/src/nutch/trunk> svn diff
Index: src/java/org/apache/nutch/fetcher/Fetcher.java
===================================================================
--- src/java/org/apache/nutch/fetcher/Fetcher.java (revision 817382)
+++ src/java/org/apache/nutch/fetcher/Fetcher.java (working copy)
@@ -21,6 +21,7 @@
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.net.UnknownHostException;
+import java.text.ParseException;
 import java.util.*;
 import java.util.Map.Entry;
 import java.util.concurrent.atomic.AtomicInteger;
@@ -42,6 +43,7 @@
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.metadata.Nutch;
 import org.apache.nutch.net.*;
+import org.apache.nutch.net.protocols.HttpDateFormat;
 import org.apache.nutch.protocol.*;
 import org.apache.nutch.parse.*;
 import org.apache.nutch.scoring.ScoringFilters;
@@ -742,6 +744,23 @@
       datum.setStatus(status);
       datum.setFetchTime(System.currentTimeMillis());
+      LOG.debug("metadata = " + (content != null ? content.getMetadata() : "content-null"));
+      LOG.debug("modified? = " + ((content != null && content.getMetadata() != null) ? content.getMetadata().get("Last-Modified") : "content-null"));
+      if (content != null && content.getMetadata() != null && content.getMetadata().get("Last-Modified") != null)
+      {
+        String lastModifiedStr = content.getMetadata().get("Last-Modified");
+
+        try
+        {
+          long lastModifiedDate = HttpDateFormat.toLong(lastModifiedStr);
+          LOG.debug("last modified = " + lastModifiedStr + " = " + lastModifiedDate);
+          datum.setModifiedTime(lastModifiedDate);
+        }
+        catch (ParseException e)
+        {
+          LOG.error("unable to parse " + lastModifiedStr, e);
+        }
+      }
       if (pstatus != null) datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);
       ParseResult parseResult = null;
Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===================================================================
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java (revision 817382)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java (working copy)
@@ -84,8 +84,10 @@
       if (CrawlDatum.hasDbStatus(datum))
         dbDatum = datum;
       else if (CrawlDatum.hasFetchStatus(datum)) {
-        // don't index unmodified (empty) pages
-        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+        /*
+         * Where did this person get the idea that unmodified pages are empty?
+        // don't index unmodified (empty) pages
+        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
           fetchDatum = datum;
       } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                  CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
     }
     if (!parseData.getStatus().isSuccess() ||
-        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+        (fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
       return;
     }
Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (revision 817382)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (working copy)
@@ -124,11 +124,14 @@
         reqStr.append("\r\n");
       }
-      reqStr.append("\r\n");
       if (datum.getModifiedTime() > 0) {
-        reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
+        String httpDate =
+          HttpDateFormat.toString(datum.getModifiedTime());
+        Http.LOG.debug("modified time: " + httpDate);
+        reqStr.append("If-Modified-Since: " + httpDate);
         reqStr.append("\r\n");
       }
+      reqStr.append("\r\n");
       byte[] reqBytes= reqStr.toString().getBytes();

On Wed, Oct 14, 2009 at 9:40 AM, sprabhu_PN <shreekanth.pra...@pinakilabs.com> wrote:
> We are looking at picking up updates in a recrawl. How do I get the fetcher
> to read the recently built segment, get to the URL, and decide whether to
> fetch the content based on whether the URL has been updated since?
>
> Shreekanth Prabhu
> --
> View this message in context: http://www.nabble.com/Recrawling--Nutch-tp25891294p25891294.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
http://www.linkedin.com/in/paultomblin
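The HttpResponse.java hunk in the patch above moves the blank line that terminates the request headers so that it comes after If-Modified-Since rather than before it. A minimal, self-contained sketch of that ordering follows. It is not the Nutch implementation: the class and method names are hypothetical, and it uses the JDK's java.time classes in place of Nutch's HttpDateFormat.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch of a conditional GET; stands in for, but is not,
// the patched Nutch code.
final class ConditionalGetSketch {

    // Fixed-width HTTP date, e.g. "Wed, 14 Oct 2009 09:40:00 GMT".
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
            .withZone(ZoneOffset.UTC);

    /** Parses a Last-Modified value to epoch milliseconds (throws DateTimeParseException on bad input). */
    static long parseHttpDate(String headerValue) {
        return ZonedDateTime.parse(headerValue, HTTP_DATE).toInstant().toEpochMilli();
    }

    /** Formats epoch milliseconds as an HTTP date for If-Modified-Since. */
    static String formatHttpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    /** Builds a request; modifiedTime <= 0 means no stored Last-Modified time. */
    static String buildRequest(String host, String path, long modifiedTime) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        if (modifiedTime > 0) {
            req.append("If-Modified-Since: ").append(formatHttpDate(modifiedTime)).append("\r\n");
        }
        // The blank line ends the header section, so it must come last.
        // Appending it earlier would leave If-Modified-Since outside the headers.
        req.append("\r\n");
        return req.toString();
    }

    public static void main(String[] args) {
        long stored = parseHttpDate("Wed, 14 Oct 2009 09:40:00 GMT");
        System.out.print(buildRequest("example.org", "/index.html", stored));
    }
}
```

A server that honors the header can then answer 304 Not Modified with no body, which is the case the IndexerMapReduce.java change in the patch is meant to accommodate.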
RE: http keep alive
I'd like to add: Keep-Alive is not polite. It ties up a dedicated listener on the server side. Establishing a TCP connection requires a handshake, which takes time; that is why web servers offer Keep-Alive, to improve the performance of subsequent requests. However, the server then allocates a dedicated listener to that specific remote client (IP and port). What will happen with the classic setting of 150 processes in HTTPD 1.3 if 150 robots try to use the Keep-Alive feature?

==
http://www.linkedin.com/in/liferay

> protocol-httpclient can support keep-alive. However, I think that it won't
> help you much. Please consider that Fetcher needs to wait some time between
> requests, and in the meantime it will issue requests to other sites. This
> means that if you want to use keep-alive connections then the number of open
> connections will climb up quickly, depending on the number of unique sites
> on your fetchlist, until you run out of available sockets. On the other
> hand, if the number of unique sites is small, then most of the time the
> Fetcher will wait anyway, so the benefit from keep-alives (for you as a
> client) will be small - though there will be still some benefit for the
> server side.
>
> --
> Best regards,
> Andrzej Bialecki
Nutch-based Application for Windows - New Release
All,

Version 2.1 has now been released. This version adds the following:

1. Updated Nutch from 1.0-dev (build 2008-10-28) to 1.1-dev (build 2009-09-09).
2. Updated Tomcat from 6.0.16 to 6.0.20.
3. Fixed bugs related to running in non-English locales.
4. Fixed a bug in the uninstaller (improved handling of situations where the application is still running while an uninstall is attempted).
5. Added support for scheduled crawls*.
6. Added support for automatic startup on reboot (scheduled restart*).

The following page describes the application in detail and contains links to the download sites:
http://www.whelanlabs.com/content/SearchEngineManager.htm

Enjoy!

Regards,
John

* Scheduled Tasks has been successfully tested on XP. This feature is not yet supported on Vista.

--
View this message in context: http://www.nabble.com/Nutch-based-Application-for-Windows---New-Release-tp25902543p25902543.html
NUTCH_CRAWLING
Hi,

I used the following command to crawl:

bin/nutch crawl urls -dir crawl_NEW1 -depth 3 -topN 50

I am getting the following error:

Dedup: adding indexes in: crawl_NEW1/indexes
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Can anyone help me resolve this problem? Thank you in advance.

--
View this message in context: http://www.nabble.com/NUTCH_CRAWLING-tp25903220p25903220.html