[jira] Closed: (NUTCH-239) I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl

2006-03-25 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-239?page=all ]
 
Piotr Kosiorowski closed NUTCH-239:
---

Fix Version: 0.7.2-dev
 Resolution: Fixed
  Assign To: Piotr Kosiorowski

Applied with JavaDoc changes. Thanks.

 I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl
 

  Key: NUTCH-239
  URL: http://issues.apache.org/jira/browse/NUTCH-239
  Project: Nutch
 Type: Improvement
   Components: fetcher
 Versions: 0.7.2-dev
  Environment: RedHat Enterprise Linux
 Reporter: Jake Vanderdray
 Assignee: Piotr Kosiorowski
 Priority: Trivial
  Fix For: 0.7.2-dev


 I made the following changes in order to get the dependency on com.sun.net.ssl 
 out of the 0.7 branch.  The same changes have already been applied to the 0.8 
 branch (Revision 379215) thanks to ab.  There is still a dependency on using 
 the Sun JRE.  In order to get it to work with the IBM JRE I had to change 
 SunX509 to IbmX509, but I didn't include that change in this patch.  
 Thanks,
 Jake.
 Index: DummySSLProtocolSocketFactory.java
 ===
 --- DummySSLProtocolSocketFactory.java  (revision 388638)
 +++ DummySSLProtocolSocketFactory.java  (working copy)
 @@ -22,8 +22,8 @@
  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  
 -import com.sun.net.ssl.SSLContext;
 -import com.sun.net.ssl.TrustManager;
 +import javax.net.ssl.SSLContext;
 +import javax.net.ssl.TrustManager;
  
  public class DummySSLProtocolSocketFactory implements ProtocolSocketFactory {
  
 Index: DummyX509TrustManager.java
 ===
 --- DummyX509TrustManager.java  (revision 388638)
 +++ DummyX509TrustManager.java  (working copy)
 @@ -10,9 +10,9 @@
  import java.security.cert.CertificateException;
  import java.security.cert.X509Certificate;
  
 -import com.sun.net.ssl.TrustManagerFactory;
 -import com.sun.net.ssl.TrustManager;
 -import com.sun.net.ssl.X509TrustManager;
 +import javax.net.ssl.TrustManagerFactory;
 +import javax.net.ssl.TrustManager;
 +import javax.net.ssl.X509TrustManager;
  import org.apache.commons.logging.Log; 
  import org.apache.commons.logging.LogFactory;
  
 @@ -57,4 +57,12 @@
  public X509Certificate[] getAcceptedIssuers() {
  return this.standardTrustManager.getAcceptedIssuers();
  }
 +   
 +public void checkClientTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
 +   // do nothing
 +}
 +
 +public void checkServerTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
 +   // do nothing
 +}
  }
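
The SunX509/IbmX509 difference mentioned in the description can probably be avoided without a vendor-specific patch by asking the JRE for its default trust manager algorithm instead of hard-coding a name. The following is a minimal sketch, not part of the applied patch; the class name PortableTrustManagers and the method defaultX509TrustManager are hypothetical, and only standard javax.net.ssl and java.security calls are used.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public class PortableTrustManagers {

    /**
     * Returns the JRE's standard X.509 trust manager without naming a
     * vendor-specific algorithm such as "SunX509" or "IbmX509".
     */
    static X509TrustManager defaultX509TrustManager() throws GeneralSecurityException {
        // The default algorithm is chosen by the running JRE's security configuration.
        String algorithm = TrustManagerFactory.getDefaultAlgorithm();
        TrustManagerFactory factory = TrustManagerFactory.getInstance(algorithm);
        // A null KeyStore asks the factory to use the JRE's default trust store.
        factory.init((KeyStore) null);
        for (TrustManager tm : factory.getTrustManagers()) {
            if (tm instanceof X509TrustManager) {
                return (X509TrustManager) tm;
            }
        }
        throw new NoSuchAlgorithmException("no X509TrustManager for " + algorithm);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("default algorithm: " + TrustManagerFactory.getDefaultAlgorithm());
        System.out.println("trust manager: " + defaultX509TrustManager().getClass().getName());
    }
}
```

A wrapper such as DummyX509TrustManager could delegate to the manager returned here; whether that fits the 0.7 branch code is untested.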

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-94) MapFile.Writer throwing 'File exists error'.

2006-03-25 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-94?page=all ]
 
Piotr Kosiorowski closed NUTCH-94:
--

Fix Version: 0.7.2-dev
 Resolution: Duplicate
  Assign To: Piotr Kosiorowski

Duplicate of NUTCH-117.

 MapFile.Writer throwing 'File exists error'.
 

  Key: NUTCH-94
  URL: http://issues.apache.org/jira/browse/NUTCH-94
  Project: Nutch
 Type: Bug
   Components: fetcher
 Versions: 0.6
  Environment: Server 2003, Resin, 1.4.2_05
 Reporter: Michael Couck
 Assignee: Piotr Kosiorowski
  Fix For: 0.7.2-dev


 When Nutch runs inside a server JVM, or multiple times in the same JVM, 
 MapFile.Writer is not collected or closed by the WebDBWriter, and the 
 associated files and directories are not deleted. Consequently the 
 constructor of MapFile.Writer throws a 'File exists' error.
 This portion of code seems to be very heavily integrated into Nutch, and I 
 am hesitant to look for a solution personally, as a retrofit would be necessary 
 with every release.
 Has anyone got any ideas, had the same issue, or found any solutions?
 Regards
 Michael
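
The failure described above (a writer whose constructor refuses an existing directory, combined with leftovers from an earlier run in the same JVM) can be illustrated with a small sketch. This is not the Nutch code and not the fix applied for NUTCH-117; DirWriter, deleteRecursively and writeFresh are hypothetical names, and only java.nio.file is used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StaleOutputDirSketch {

    /** Hypothetical writer that, like the reported behaviour, refuses an existing directory. */
    static final class DirWriter implements AutoCloseable {
        private final Path dir;

        DirWriter(Path dir) throws IOException {
            if (Files.exists(dir)) {
                throw new IOException("already exists: " + dir);
            }
            this.dir = Files.createDirectories(dir);
        }

        void append(String key, String value) throws IOException {
            byte[] line = (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
            Files.write(dir.resolve("data"), line,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        @Override
        public void close() {
            // Nothing is held open between appends in this sketch.
        }
    }

    /** Deletes a directory tree, children first; does nothing if it is absent. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> childrenFirst;
        try (Stream<Path> paths = Files.walk(dir)) {
            childrenFirst = paths.sorted(Comparator.reverseOrder())
                    .collect(Collectors.toList());
        }
        for (Path p : childrenFirst) {
            Files.delete(p);
        }
    }

    /** Removes leftovers from an earlier run in the same JVM, then writes. */
    static void writeFresh(Path dir) throws IOException {
        deleteRecursively(dir);
        try (DirWriter writer = new DirWriter(dir)) {
            writer.append("http://example.com/", "page");
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("sketch").resolve("webdb.new");
        writeFresh(out);
        writeFresh(out); // a second run in the same JVM does not hit "already exists"
        System.out.println("wrote " + out);
    }
}
```

Deleting stale output before constructing the writer is only one possible approach; closing and cleaning up in the code that owns the writer may be the better place for it.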




[jira] Updated: (NUTCH-210) Context.xml file for Nutch web application

2006-03-25 Thread Jerome Charron (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-210?page=all ]

Jerome Charron updated NUTCH-210:
-

Attachment: NUTCH-210.060325.patch

Hi Chris,

I made some minor changes to your patch (see my attached patch 
NUTCH-210.060325.patch):
* Refactored the xsl code and added query.* properties to the nutch.xml
* Removed the JspUtil class and moved the code to a 
NutchConfiguration.get(ServletContext) method.

I used this patch: very useful, I like it.
If there are no objections, I will commit it in the next few days.

Thanks Chris

Jérôme

 Context.xml file for Nutch web application
 --

  Key: NUTCH-210
  URL: http://issues.apache.org/jira/browse/NUTCH-210
  Project: Nutch
 Type: Improvement
   Components: web gui
 Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
  Environment: iMAC G5 2.3 Ghz, Mac OS X Tiger (10.4.3), 1.5 GB RAM, although 
 improvement is independent of environment
 Reporter: Chris A. Mattmann
 Assignee: Chris A. Mattmann
 Priority: Minor
  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1
  Attachments: NUTCH-210.060325.patch, NUTCH-210.Mattmann.patch.txt

 Currently the nutch web gui references a few parameters that are highly 
 dynamic, e.g., searcher.dir. These dynamic properties are read from the 
 configuration files, such as nutch-default.xml. One problem I'm noticing 
 however is that in order to change the parameter in the built webapp (the WAR 
 file), I am required to change the parameter first in the checked out Nutch 
 source tree, rebuild the webapp, then redeploy. Or, if I'm feeling really 
 gutsy, I can go poke around in the unpackaged WAR file if the servlet 
 container exposes it to me, and try to modify the nutch-default.xml file 
 that way. However, I think that it would be really nice (and highly useful 
 for that matter) to factor out some of the more dynamic parameters of the web 
 application into a separate deliverable Context.xml file that would accompany 
 the webapp. The Context.xml file would be deployed in the webapps directory, 
 as opposed to the WAR file itself, and the parameters could be updated 
 there, and changed as many times as necessary, without rebuilding the WAR 
 file. 
 Of course this will involve making minor modifications in the web GUI to 
 where some of the dynamic parameters are read from (i.e., making it read them 
 from the Context.xml file, most likely using application.getParameter). Right 
 now the only one I can think of is searcher.dir, but I'm sure that there are 
 others (in particular the searcher.dir one is the most annoying for me). 
 The timeframe on this patch will be within the next month.
 Thanks,
   Chris
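
The lookup order proposed above (a value from the deployed context first, the packaged configuration file as fallback) can be sketched as follows. This is an illustration under assumptions, not the attached patch: ContextOverrideSketch and resolve are hypothetical names, and the Function argument stands in for a context-parameter lookup so that the sketch runs without a servlet container. In a JSP, the `application` object is a javax.servlet.ServletContext, and its getInitParameter(String) method returns context-level parameters. The paths are made-up values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ContextOverrideSketch {

    /**
     * Returns the context-level value for a property if one is set and
     * non-blank, otherwise the value from the packaged defaults.
     */
    static String resolve(String name,
                          Function<String, String> contextParams,
                          Map<String, String> fileDefaults) {
        String override = contextParams.apply(name);
        if (override != null && !override.trim().isEmpty()) {
            return override.trim();
        }
        return fileDefaults.get(name);
    }

    public static void main(String[] args) {
        // Stand-in for values read from a file such as nutch-default.xml.
        Map<String, String> fileDefaults = new HashMap<>();
        fileDefaults.put("searcher.dir", "crawl");

        // Stand-in for parameters supplied by the deployed context file.
        Map<String, String> contextParams = new HashMap<>();
        contextParams.put("searcher.dir", "/data/crawl");

        System.out.println(resolve("searcher.dir", contextParams::get, fileDefaults));
        System.out.println(resolve("searcher.dir", name -> null, fileDefaults));
    }
}
```

With this ordering, changing searcher.dir only requires editing the context file and reloading the webapp, which is the workflow the description asks for.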




[jira] Commented: (NUTCH-210) Context.xml file for Nutch web application

2006-03-25 Thread Chris A. Mattmann (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-210?page=comments#action_12371849 ] 

Chris A. Mattmann commented on NUTCH-210:
-

Hi Jerome,

 The updates look fine. No objections from my end. I hope people find the patch 
useful.

Cheers,
  Chris





[jira] Closed: (NUTCH-14) NullPointerException NutchBean.getSummary

2006-03-25 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-14?page=all ]
 
Piotr Kosiorowski closed NUTCH-14:
--

Resolution: Cannot Reproduce

Closed according to Stefan's suggestion.

 NullPointerException NutchBean.getSummary
 -

  Key: NUTCH-14
  URL: http://issues.apache.org/jira/browse/NUTCH-14
  Project: Nutch
 Type: Bug
   Components: searcher
 Reporter: Stefan Groschupf
 Priority: Minor


 In heavy-load scenarios this may happen when a connection breaks.
 java.lang.NullPointerException
     at java.util.Hashtable.get(Hashtable.java:333)
     at net.nutch.ipc.Client.getConnection(Client.java:276)
     at net.nutch.ipc.Client.call(Client.java:251)
     at net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418)
     at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236)
     at org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396)
     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
     at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325)
     at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825)
     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738)
     at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526)
     at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
     at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
     at java.lang.Thread.run(Thread.java:552)
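
The top frame of the trace is Hashtable.get, which throws a NullPointerException when it is passed a null key. Whether a null key is what actually reached Client.getConnection here is an assumption; the issue was closed as Cannot Reproduce. The sketch below only demonstrates that failure mode and a defensive lookup, using hypothetical names (NullKeyLookupSketch, lookup, rawGetThrows).

```java
import java.util.Hashtable;

public class NullKeyLookupSketch {

    /** Looks up a value but treats a missing (null) key as "not found". */
    static String lookup(Hashtable<String, String> connections, String address) {
        if (address == null) {
            return null; // Hashtable.get(null) would throw instead
        }
        return connections.get(address);
    }

    /** Shows that an unguarded get with a null key throws. */
    static boolean rawGetThrows(Hashtable<String, String> connections) {
        try {
            connections.get(null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        Hashtable<String, String> connections = new Hashtable<>();
        connections.put("host:9999", "connection");
        System.out.println("raw get(null) throws: " + rawGetThrows(connections));
        System.out.println("guarded lookup(null): " + lookup(connections, null));
    }
}
```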




[jira] Closed: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL

2006-03-25 Thread Piotr Kosiorowski (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-117?page=all ]
 
Piotr Kosiorowski closed NUTCH-117:
---

Fix Version: 0.7.2-dev
 Resolution: Fixed
  Assign To: Piotr Kosiorowski

Applied fix by Mike. Also reported off-list by Michal Karwanski.

 Crawl crashes with java.io.IOException: already exists: 
 C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
 -

  Key: NUTCH-117
  URL: http://issues.apache.org/jira/browse/NUTCH-117
  Project: Nutch
 Type: Bug
 Versions: 0.7.1, 0.7, 0.6
  Environment: Windows 2000  P4 1.70GHz 512MB RAM
 Java 1.5.0_05
 Reporter: Stephen Cross
 Assignee: Piotr Kosiorowski
 Priority: Critical
  Fix For: 0.7.2-dev


 I started a crawl from the command line using nutch 0.7.1:
 nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20
 After crawling for over 15 hours the crawl crashed with the following 
 exception:
 051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 bytes, 48020 ms
 051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page
 051019 050544 Updating C:\nutch\crawl.intranet\oct18\db
 051019 050544 Updating for C:\nutch\crawl.intranet\oct18\segments\20051019050438
 051019 050544 Processing document 0
 051019 050544 Finishing update
 051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds.
 051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second
 Exception in thread "main" java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
     at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
     at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
     at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
     at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
     at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
     at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
 This was on the 14th segment from the requested depth of 20. Doing a quick 
 Google on the exception brings up a few previous posts with the same error 
 but no definitive answer; it seems to have been occurring since nutch 0.6.




[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException

2006-03-25 Thread Richard Braman (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12371887 ] 

Richard Braman commented on NUTCH-220:
--

Here is an example of the error from my log file.  It seems it was fixed with 
the latest PDFBox, per Ben Litchfield, developer of PDFBox.


060325 212856 fetch of http://www.state.sd.us/drr2/reg/bank/Trust%20Fee%20Calculation.pdf failed with: java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:180)
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)
    at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:245)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:185)
060325 212856 SEVERE fetcher caught:java.lang.NullPointerException

 PDF Box can't parse document: java.lang.NullPointerException
 

  Key: NUTCH-220
  URL: http://issues.apache.org/jira/browse/NUTCH-220
  Project: Nutch
 Type: Bug
  Environment: PDFBox 0.7.2
 Reporter: Richard Braman


 This error was fixed in the latest build of PDFBox, which should be tested 
 with nutch.
  060228 160354 fetch okay, but can't parse
  http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
  failed(2,0): Can't be handled as pdf document. 
  java.lang.NullPointerException
 Yes, the NPE should be fixed.
  Ben
 Richard Braman wrote:
  Hi Ben,
 
  We actually got to the bottom of all of them except for 1... The 
  content truncation was due to an inconsistency bug in the nutch config.
  The "no permission to extract text" error is actually true: the author, the NC
  Department of Revenue, put this restriction on all of their files (I have
  asked them to remove it as it hampers public accessibility).  The null
  pointer exception is the only one left to deal with, and it may be due to the
  parsing bug.  Is this the one that you are referring to?
 
  -Original Message-
  From: Ben Litchfield [mailto:[EMAIL PROTECTED]
  Sent: Thursday, March 02, 2006 4:07 PM
  To: Richard Braman
  Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org;
  [EMAIL PROTECTED]
  Subject: Re: [PDFBox-user] PDF Parse Error
 
 
 
  I believe these errors are due to a parsing bug in PDFBox that has 
  been fixed since the 0.7.2 release.  Please give the nightly 
  build (should be a drop-in replacement) a try from 
  http://www.pdfbox.org/dist and let me know if you are still having 
  issues.
 
  Ben
