http keep alive

2009-10-14 Thread Marko Bauhardt

Hi,
is there a way to use HTTP keep-alive with Nutch?
Does protocol-http or protocol-httpclient support keep-alive?

I can't find any use of keep-alive in the code or in the
configuration files.


thanks
marko



Re: http keep alive

2009-10-14 Thread Andrzej Bialecki

Marko Bauhardt wrote:

Hi,
is there a way to use HTTP keep-alive with Nutch?
Does protocol-http or protocol-httpclient support keep-alive?

I can't find any use of keep-alive in the code or in the
configuration files.


protocol-httpclient can support keep-alive. However, I think that it 
won't help you much. Please consider that Fetcher needs to wait some 
time between requests, and in the meantime it will issue requests to 
other sites. This means that if you want to use keep-alive connections 
then the number of open connections will climb up quickly, depending on 
the number of unique sites on your fetchlist, until you run out of 
available sockets. On the other hand, if the number of unique sites is 
small, then most of the time the Fetcher will wait anyway, so the 
benefit from keep-alives (for you as a client) will be small - though 
there will still be some benefit for the server side.
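
[Editor's illustration, not from the thread: a minimal sketch of the tradeoff described above using commons-httpclient 3.x, the library protocol-httpclient is built on. The pool sizes are made-up illustrative values, not Nutch defaults.]

// Hedged sketch: keep-alive connections are reused, but the pool must be
// capped or one persistent connection per unique host will exhaust sockets.
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.methods.GetMethod;

public class KeepAliveSketch {
  public static void main(String[] args) throws Exception {
    MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
    mgr.getParams().setMaxTotalConnections(100);        // illustrative cap
    mgr.getParams().setDefaultMaxConnectionsPerHost(1); // one polite connection per site
    HttpClient client = new HttpClient(mgr);

    GetMethod get = new GetMethod("http://example.com/");
    try {
      client.executeMethod(get);      // socket stays open after the response...
      get.getResponseBodyAsString();
    } finally {
      get.releaseConnection();        // ...and goes back to the pool for reuse
    }
  }
}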




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Recrawling Nutch

2009-10-14 Thread Paul Tomblin
Nutch doesn't do a good job of storing or testing the Last-Modified
time of pages it has crawled.  I made the following changes, which seem
to help a lot:

snowbird:~/src/nutch/trunk svn diff
Index: src/java/org/apache/nutch/fetcher/Fetcher.java
===
--- src/java/org/apache/nutch/fetcher/Fetcher.java  (revision 817382)
+++ src/java/org/apache/nutch/fetcher/Fetcher.java  (working copy)
@@ -21,6 +21,7 @@
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.net.UnknownHostException;
+import java.text.ParseException;
 import java.util.*;
 import java.util.Map.Entry;
 import java.util.concurrent.atomic.AtomicInteger;
@@ -42,6 +43,7 @@
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.metadata.Nutch;
 import org.apache.nutch.net.*;
+import org.apache.nutch.net.protocols.HttpDateFormat;
 import org.apache.nutch.protocol.*;
 import org.apache.nutch.parse.*;
 import org.apache.nutch.scoring.ScoringFilters;
@@ -742,6 +744,23 @@

   datum.setStatus(status);
   datum.setFetchTime(System.currentTimeMillis());
+  LOG.debug("metadata = " + (content != null ? content.getMetadata() : "content-null"));
+  LOG.debug("modified? = " + ((content != null && content.getMetadata() != null) ? content.getMetadata().get("Last-Modified") : "content-null"));
+  if (content != null && content.getMetadata() != null && content.getMetadata().get("Last-Modified") != null)
+  {
+  String lastModifiedStr = content.getMetadata().get("Last-Modified");
+
+  try
+  {
+  long lastModifiedDate = HttpDateFormat.toLong(lastModifiedStr);
+  LOG.debug("last modified = " + lastModifiedStr + " = " + lastModifiedDate);
+  datum.setModifiedTime(lastModifiedDate);
+  }
+  catch (ParseException e)
+  {
+  LOG.error("unable to parse " + lastModifiedStr, e);
+  }
+  }
   if (pstatus != null)
datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);

   ParseResult parseResult = null;
Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java (revision 817382)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java (working copy)
@@ -84,8 +84,10 @@
 if (CrawlDatum.hasDbStatus(datum))
   dbDatum = datum;
 else if (CrawlDatum.hasFetchStatus(datum)) {
-  // don't index unmodified (empty) pages
-  if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+  /*
+   * Where did this person get the idea that unmodified pages are empty?
+   // don't index unmodified (empty) pages
+  if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
 fetchDatum = datum;
 } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
 }

 if (!parseData.getStatus().isSuccess() ||
-fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+(fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
   return;
 }

Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java  (revision 817382)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java  (working copy)
@@ -124,11 +124,14 @@
 reqStr.append("\r\n");
   }

-  reqStr.append("\r\n");
   if (datum.getModifiedTime() > 0) {
-reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
+   String httpDate =
+ HttpDateFormat.toString(datum.getModifiedTime());
+   Http.LOG.debug("modified time: " + httpDate);
+reqStr.append("If-Modified-Since: " + httpDate);
 reqStr.append("\r\n");
   }
+  reqStr.append("\r\n");

   byte[] reqBytes= reqStr.toString().getBytes();
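
[Editor's illustration, not part of the patch: a small standalone sketch, assuming Nutch's org.apache.nutch.net.protocols.HttpDateFormat is on the classpath, of the Last-Modified round trip the patch relies on - parse the response header with toLong() when fetching, then format the stored time back with toString() for the If-Modified-Since request header on the next cycle.]

import java.text.ParseException;
import org.apache.nutch.net.protocols.HttpDateFormat;

public class LastModifiedRoundTrip {
  public static void main(String[] args) throws ParseException {
    // A typical Last-Modified value as returned by a server.
    String lastModified = "Wed, 14 Oct 2009 12:00:00 GMT";

    // Fetcher side of the patch: parse the header into epoch millis
    // (stored in the CrawlDatum via datum.setModifiedTime()).
    long modifiedTime = HttpDateFormat.toLong(lastModified);

    // HttpResponse side of the patch: turn the stored time back into an
    // HTTP date for the If-Modified-Since request header.
    String ifModifiedSince = HttpDateFormat.toString(modifiedTime);

    System.out.println("stored millis      = " + modifiedTime);
    System.out.println("If-Modified-Since: " + ifModifiedSince);
  }
}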



On Wed, Oct 14, 2009 at 9:40 AM, sprabhu_PN
shreekanth.pra...@pinakilabs.com wrote:

 We are looking at picking up updates in a recrawl - how do I get the
 fetcher to read the recently built segment, get to the URL and decide
 whether to fetch the content based on whether the URL has been updated since?
 

 Shreekanth Prabhu
 --
 View this message in context: 
 http://www.nabble.com/Recrawling--Nutch-tp25891294p25891294.html
 Sent from the Nutch - User mailing list archive at Nabble.com.





-- 
http://www.linkedin.com/in/paultomblin


RE: http keep alive

2009-10-14 Thread Fuad Efendi
I'd like to add:

Keep-Alive is not polite. It ties up a dedicated listener on the server side. 
Establishing a TCP socket via the TCP handshake takes time, which is why 
Keep-Alive exists for web servers - to improve the performance of subsequent 
requests. However, it allocates a dedicated listener for each remote client 
connection...

What will happen with the classic setting of 150 worker processes in HTTPD 1.3 
when 150 robots try to use the Keep-Alive feature?
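
[Editor's illustration, not from the thread: the httpd.conf knobs behind that question, with the stock Apache 1.3 defaults shown for illustration. Each keep-alive client can pin one worker process for up to KeepAliveTimeout seconds between requests, so 150 persistent robots can occupy all 150 workers.]

KeepAlive            On    # allow persistent connections
MaxKeepAliveRequests 100   # requests served over one persistent connection
KeepAliveTimeout     15    # seconds an idle keep-alive connection holds a worker
MaxClients           150   # total worker processes available to all clients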

==
http://www.linkedin.com/in/liferay


 
 protocol-httpclient can support keep-alive. However, I think that it
 won't help you much. Please consider that Fetcher needs to wait some
 time between requests, and in the meantime it will issue requests to
 other sites. This means that if you want to use keep-alive connections
 then the number of open connections will climb up quickly, depending on
 the number of unique sites on your fetchlist, until you run out of
 available sockets. On the other hand, if the number of unique sites is
 small, then most of the time the Fetcher will wait anyway, so the
 benefit from keep-alives (for you as a client) will be small - though
 there will still be some benefit for the server side.
 
 
 
 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





Nutch-based Application for Windows - New Release

2009-10-14 Thread John Whelan

All,

Version 2.1 has now been released. This version adds the following:

1. Updated Nutch from 1.0-dev (build 2008-10-28) to 1.1-dev (build
2009-09-09)
2. Updated Tomcat from 6.0.16 to 6.0.20.
3. Fixed bugs related to running in non-English locales.
4. Fixed bug in uninstaller. (Improved handling of situations where the
application is still running while attempting to uninstall.)
5. Added support for scheduled crawls*.
6. Added support for automatic startup on reboot (Scheduled restart*).

The following page describes the application in detail, and contains links
to the download sites:
http://www.whelanlabs.com/content/SearchEngineManager.htm

Enjoy!

Regards,
John

* Scheduled Tasks has been successfully tested on XP. This feature is not yet
supported on Vista.
-- 
View this message in context: 
http://www.nabble.com/Nutch-based-Application-for-Windows---New-Release-tp25902543p25902543.html
Sent from the Nutch - User mailing list archive at Nabble.com.



NUTCH_CRAWLING

2009-10-14 Thread meh

Hi, 

bin/nutch crawl urls -dir crawl_NEW1 -depth 3 -topN 50 

I have used the above command to crawl. 

I am getting the following error. 

Dedup: adding indexes in: crawl_NEW1/indexes 
Exception in thread "main" java.io.IOException: Job failed! 
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) 
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439) 
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) 


Can anyone help me resolve this problem? 

Thank you in advance. 

-- 
View this message in context: 
http://www.nabble.com/NUTCH_CRAWLING-tp25903220p25903220.html
Sent from the Nutch - User mailing list archive at Nabble.com.