java.net.URL synchronization

2009-12-09 Thread Otis Gospodnetic
Hello,

Has anyone seen this:
http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck ?

Is this something that needs to be addressed in Nutch (and thus in Bixo, and 
thus in the common crawler project)?


Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



RE: java.net.URL synchronization

2009-12-09 Thread Fuad Efendi
I checked java.net.URL; yes, Nutch and BIXO implicitly use a synchronized
Hashtable:


    public URL(String protocol, String host, int port, String file,
               URLStreamHandler handler) throws MalformedURLException {

        ...
        if (handler == null &&
            (handler = getURLStreamHandler(protocol)) == null) {
            throw new MalformedURLException("unknown protocol: " +
                                            protocol);
        }

...


However, I don't think it hurts, because both architectures (at least BIXO)
run a single thread in a single JVM to process, for instance, Outlinks. Only
the Fetch part is multithreaded, but it doesn't use the URL class.


I'm not sure about Nutch and how the fetch list is generated... if that is
multithreaded, then a RegexUrlNormalizer shared between threads is an even
bigger problem...
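
To illustrate the concern about a normalizer shared between threads, here is a
minimal sketch of a regex normalizer that can be shared without a lock. It is
not Nutch's RegexUrlNormalizer: the class name, the method, and the jsessionid
rule are all hypothetical. It relies only on the fact that a compiled
java.util.regex.Pattern is immutable, while each Matcher is created per call.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-based normalizer; not Nutch code.
final class SharedPatternNormalizerSketch {

    // Compiled once. Pattern instances are immutable and thread-safe.
    // The rule itself (strip a jsessionid path parameter) is an assumption.
    private static final Pattern SESSION_ID =
            Pattern.compile(";jsessionid=[0-9A-Za-z]+");

    // Every call builds its own Matcher, so no mutable state is shared
    // between threads and no synchronization is needed.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(
                normalize("http://example.com/a;jsessionid=ABC123?x=1"));
    }
}
```

Whether this applies depends on how the real normalizer keeps its rules; if it
holds mutable per-instance state, one instance per thread may be the simpler
option.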


Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca/
Data Mining, Vertical Search







RE: java.net.URL synchronization

2009-12-09 Thread Fuad Efendi
Tomcat uses its own, slightly different version of the URL class:

http://tomcat.apache.org/tomcat-5.5-doc/catalina/docs/api/index.html
URL is designed to provide public APIs for parsing and synthesizing Uniform
Resource Locators as similar as possible to the APIs of java.net.URL, but
without the ability to open a stream or connection. One of the consequences
of this is that you can construct URLs for protocols for which a
URLStreamHandler is not available (such as an https URL when JSSE is not
installed).



The synchronized stuff in java.net.URL is URLStreamHandler-related.
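
If only parsing is needed, one possible workaround is java.net.URI, which
parses purely syntactically and never looks up a URLStreamHandler. The sketch
below is an illustration under that assumption; the class and method names are
hypothetical and are not part of Nutch, Bixo, or Tomcat.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration only.
final class UriParseSketch {

    // Returns the lower-cased host of a URL string, or null when the
    // string has no host component (for example an opaque mailto: URI).
    // Throws IllegalArgumentException if the string is not a valid URI.
    static String hostOf(String url) {
        URI uri = URI.create(url);   // syntactic parse, no handler lookup
        String host = uri.getHost();
        return host == null ? null : host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://Lucene.Apache.org/nutch/"));
        // "foo" has no stream handler, so java.net.URL would reject it
        // with "unknown protocol"; URI parses it anyway.
        System.out.println(hostOf("foo://example.com/path"));
    }
}
```

Note that URI is stricter than URL about illegal characters such as unescaped
spaces, so URLs taken from crawled pages may need escaping before they parse.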




Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay





Build failed in Hudson: Nutch-trunk #1007

2009-12-09 Thread Apache Hudson Server
See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1007/changes

Changes:

[kubes] Remove old jetty jars that should have been removed with NUTCH-768, 
upgrade to Hadoop 0.20.1

--
[...truncated 4727 lines...]
jar:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

compile-test:

compile:
 [echo] Compiling plugin: urlfilter-regex
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-regex/urlfilter-regex.jar

deps-test:

init:

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: lib-regex-filter

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-regex

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/classes
[javac] Note: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/test

init-plugin:

deps-jar:

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/classes

jar:
  [jar] Building jar: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
[mkdir] Created dir: 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] Copying 1 file to 
http://hudson.zones.apache.org/hudson/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-basic

init:
[mkdir] Created dir: