[jira] Closed: (NUTCH-239) I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl
[ http://issues.apache.org/jira/browse/NUTCH-239?page=all ]

Piotr Kosiorowski closed NUTCH-239:
-----------------------------------

Fix Version: 0.7.2-dev
Resolution: Fixed
Assign To: Piotr Kosiorowski

Applied with JavaDoc changes. Thanks.

I changed httpclient to use javax.net.ssl instead of com.sun.net.ssl
--------------------------------------------------------------------

Key: NUTCH-239
URL: http://issues.apache.org/jira/browse/NUTCH-239
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.7.2-dev
Environment: RedHat Enterprise Linux
Reporter: Jake Vanderdray
Assignee: Piotr Kosiorowski
Priority: Trivial
Fix For: 0.7.2-dev

I made the following changes in order to remove the dependency on com.sun.net.ssl from the 0.7 branch. The same changes have already been applied to the 0.8 branch (Revision 379215) thanks to ab. There is still a dependency on the Sun JRE: to get it to work with the IBM JRE I had to change SunX509 to IbmX509, but I didn't include that change in this patch.

Thanks,
Jake

Index: DummySSLProtocolSocketFactory.java
===
--- DummySSLProtocolSocketFactory.java (revision 388638)
+++ DummySSLProtocolSocketFactory.java (working copy)
@@ -22,8 +22,8 @@
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;

-import com.sun.net.ssl.SSLContext;
-import com.sun.net.ssl.TrustManager;
+import javax.net.ssl.SSLContext;
+import javax.net.ssl.TrustManager;

 public class DummySSLProtocolSocketFactory implements ProtocolSocketFactory {

Index: DummyX509TrustManager.java
===
--- DummyX509TrustManager.java (revision 388638)
+++ DummyX509TrustManager.java (working copy)
@@ -10,9 +10,9 @@
 import java.security.cert.CertificateException;
 import java.security.cert.X509Certificate;

-import com.sun.net.ssl.TrustManagerFactory;
-import com.sun.net.ssl.TrustManager;
-import com.sun.net.ssl.X509TrustManager;
+import javax.net.ssl.TrustManagerFactory;
+import javax.net.ssl.TrustManager;
+import javax.net.ssl.X509TrustManager;
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;

@@ -57,4 +57,12 @@
 public X509Certificate[] getAcceptedIssuers() {
 return this.standardTrustManager.getAcceptedIssuers();
 }
+
+public void checkClientTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
+ // do nothing
+}
+
+public void checkServerTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
+ // do nothing
+}
 }

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
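The remark about SunX509 versus IbmX509 suggests that the algorithm name is hard-coded somewhere in the plugin. A minimal sketch of one way to avoid a vendor-specific name, assuming the goal is only to obtain the JRE's standard X.509 trust manager: `TrustManagerFactory.getDefaultAlgorithm()` is a standard javax.net.ssl call, while the class name `DefaultTrustManagerLookup` is hypothetical and is not part of Nutch or of the patch above.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper for illustration; not part of Nutch or the attached patch.
class DefaultTrustManagerLookup {

    // Returns the first X509TrustManager offered by the JRE's default algorithm.
    static X509TrustManager lookup() {
        // Ask the JRE for its default algorithm instead of hard-coding a vendor name.
        String algorithm = TrustManagerFactory.getDefaultAlgorithm();
        try {
            TrustManagerFactory factory = TrustManagerFactory.getInstance(algorithm);
            // A null KeyStore makes the factory fall back to the JRE's default trust store.
            factory.init((KeyStore) null);
            for (TrustManager tm : factory.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    return (X509TrustManager) tm;
                }
            }
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("cannot initialise " + algorithm, e);
        }
        throw new IllegalStateException("no X509TrustManager for " + algorithm);
    }

    public static void main(String[] args) {
        X509TrustManager tm = lookup();
        System.out.println("algorithm: " + TrustManagerFactory.getDefaultAlgorithm());
        System.out.println("accepted issuers: " + tm.getAcceptedIssuers().length);
    }
}
```

Whether the default algorithm behaves identically on the IBM JRE has not been checked here; the sketch only shows the lookup pattern.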
[jira] Closed: (NUTCH-94) MapFile.Writer throwing 'File exists error'.
[ http://issues.apache.org/jira/browse/NUTCH-94?page=all ]

Piotr Kosiorowski closed NUTCH-94:
----------------------------------

Fix Version: 0.7.2-dev
Resolution: Duplicate
Assign To: Piotr Kosiorowski

Duplicate of NUTCH-117.

MapFile.Writer throwing 'File exists error'.
--------------------------------------------

Key: NUTCH-94
URL: http://issues.apache.org/jira/browse/NUTCH-94
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.6
Environment: Server 2003, Resin, 1.4.2_05
Reporter: Michael Couck
Assignee: Piotr Kosiorowski
Fix For: 0.7.2-dev

When Nutch runs inside a server JVM, or several times in the same JVM, MapFile.Writer is not collected or closed by the WebDBWriter, and the associated files and directories are not deleted. As a result, the constructor of MapFile.Writer throws a 'File exists' error on the next run.

This portion of the code seems to be very heavily integrated into Nutch, and I am hesitant to look for a solution myself, as a retrofit would be necessary with every release.

Has anyone got any ideas, had the same issue, or found a solution?

Regards
Michael
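The failure described in this report can be reproduced in miniature. The sketch below is not Nutch code: `ExclusiveDirWriter` is a hypothetical stand-in for a writer that refuses to open when its directory already exists, and it assumes the directory is a temporary working area that may be removed on close. It shows how try-with-resources lets the same job run repeatedly in one JVM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

class WriterCleanupDemo {

    // Hypothetical stand-in for a writer that refuses to reuse an existing directory.
    static final class ExclusiveDirWriter implements AutoCloseable {
        private final Path dir;

        ExclusiveDirWriter(Path dir) throws IOException {
            // createDirectory throws FileAlreadyExistsException if the directory is still there.
            this.dir = Files.createDirectory(dir);
        }

        void append(String name, String value) throws IOException {
            Files.write(dir.resolve(name), value.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() throws IOException {
            // Delete the entries first, then the now-empty directory.
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    Files.delete(entry);
                }
            }
            Files.delete(dir);
        }
    }

    // Runs the same job several times in one JVM and returns how many runs completed.
    static int runRepeatedly(int runs) {
        try {
            Path parent = Files.createTempDirectory("writer-demo");
            Path target = parent.resolve("pagesByURL");
            int completed = 0;
            for (int i = 0; i < runs; i++) {
                // try-with-resources closes the writer even if append throws.
                try (ExclusiveDirWriter writer = new ExclusiveDirWriter(target)) {
                    writer.append("page-" + i, "payload");
                }
                completed++;
            }
            Files.delete(parent);
            return completed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("completed runs: " + runRepeatedly(3));
    }
}
```

If the `close()` call is skipped, the second iteration fails in the constructor, which mirrors the behaviour reported above.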
[jira] Updated: (NUTCH-210) Context.xml file for Nutch web application
[ http://issues.apache.org/jira/browse/NUTCH-210?page=all ]

Jerome Charron updated NUTCH-210:
---------------------------------

Attachment: NUTCH-210.060325.patch

Hi Chris,

I made some minor changes to your patch (see my attached patch NUTCH-210.060325.patch):
* Refactored the xsl code and added the query.* properties to nutch.xml
* Removed the JspUtil class and moved the code to a NutchConfiguration.get(ServletContext) method.

I used this patch: very useful, I like it.
If there are no objections, I will commit it in the next few days.

Thanks Chris

Jérôme

Context.xml file for Nutch web application
------------------------------------------

Key: NUTCH-210
URL: http://issues.apache.org/jira/browse/NUTCH-210
Project: Nutch
Type: Improvement
Components: web gui
Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
Environment: iMac G5 2.3 GHz, Mac OS X Tiger (10.4.3), 1.5 GB RAM, although improvement is independent of environment
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 0.7.2-dev, 0.8-dev, 0.7.1
Attachments: NUTCH-210.060325.patch, NUTCH-210.Mattmann.patch.txt

Currently the Nutch web GUI references a few parameters that are highly dynamic, e.g., searcher.dir. These dynamic properties are read from the configuration files, such as nutch-default.xml. One problem I'm noticing, however, is that in order to change a parameter in the built webapp (the WAR file), I am required to change the parameter first in the checked-out Nutch source tree, rebuild the webapp, then redeploy. Or, if I'm feeling really gutsy, I can go poke around in the unpackaged WAR file, if the servlet container exposes it to me, and try to modify the nutch-default.xml file that way.

However, I think that it would be really nice (and highly useful for that matter) to factor out some of the more dynamic parameters of the web application into a separate deliverable Context.xml file that would accompany the webapp. The Context.xml file would be deployed in the webapps directory, as opposed to inside the WAR file itself, and the parameters could be updated there, and changed as many times as necessary, without rebuilding the WAR file.

Of course this will involve minor modifications in the web GUI to change where some of the dynamic parameters are read from (i.e., make it read them from the Context.xml file, most likely using application.getParameter). Right now the only one I can think of is searcher.dir, but I'm sure that there are others (searcher.dir in particular is the most annoying one for me).

The timeframe on this patch will be within the next month.

Thanks,
Chris
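The lookup order proposed in this issue (container-supplied value first, packaged default second) can be sketched without the servlet API. In the snippet below everything is hypothetical and none of it is taken from either attached patch: the class `ContextParamLookup`, the sample path `/data/crawl` and the default `crawl`. In a real webapp the `contextParams` function would presumably be `servletContext::getInitParameter`, since `ServletContext.getInitParameter(String)` returns `null` for an undefined parameter.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of "context parameter overrides packaged default".
class ContextParamLookup {

    // Returns the container-supplied value if present, otherwise the packaged default.
    static String resolve(String name,
                          Function<String, String> contextParams,
                          Map<String, String> packagedDefaults) {
        String override = contextParams.apply(name);
        if (override != null && !override.trim().isEmpty()) {
            return override.trim();
        }
        return packagedDefaults.get(name);
    }

    public static void main(String[] args) {
        // Stands in for the values shipped inside the WAR (e.g. nutch-default.xml).
        Map<String, String> defaults = new HashMap<>();
        defaults.put("searcher.dir", "crawl");

        // Stands in for parameters defined outside the WAR by the container.
        Map<String, String> context = new HashMap<>();
        context.put("searcher.dir", "/data/crawl");

        System.out.println(resolve("searcher.dir", context::get, defaults));
        System.out.println(resolve("searcher.dir", n -> null, defaults));
    }
}
```

Keeping the fallback to the packaged default means a deployment without a Context.xml file continues to behave as before.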
[jira] Commented: (NUTCH-210) Context.xml file for Nutch web application
[ http://issues.apache.org/jira/browse/NUTCH-210?page=comments#action_12371849 ]

Chris A. Mattmann commented on NUTCH-210:
-----------------------------------------

Hi Jerome,

The updates look fine. No objections from my end. I hope people find the patch useful.

Cheers,
Chris

Context.xml file for Nutch web application
------------------------------------------

Key: NUTCH-210
URL: http://issues.apache.org/jira/browse/NUTCH-210
[jira] Closed: (NUTCH-14) NullPointerException NutchBean.getSummary
[ http://issues.apache.org/jira/browse/NUTCH-14?page=all ]

Piotr Kosiorowski closed NUTCH-14:
----------------------------------

Resolution: Cannot Reproduce

Closed according to Stefan's suggestion.

NullPointerException NutchBean.getSummary
-----------------------------------------

Key: NUTCH-14
URL: http://issues.apache.org/jira/browse/NUTCH-14
Project: Nutch
Type: Bug
Components: searcher
Reporter: Stefan Groschupf
Priority: Minor

Under heavy load this may happen when a connection breaks.

java.lang.NullPointerException
at java.util.Hashtable.get(Hashtable.java:333)
at net.nutch.ipc.Client.getConnection(Client.java:276)
at net.nutch.ipc.Client.call(Client.java:251)
at net.nutch.searcher.DistributedSearch$Client.getSummary(DistributedSearch.java:418)
at net.nutch.searcher.NutchBean.getSummary(NutchBean.java:236)
at org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:396)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:99)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:325)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:825)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:738)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:526)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
at java.lang.Thread.run(Thread.java:552)
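The top frame of the trace is `java.util.Hashtable.get`, and `Hashtable` is documented to throw a `NullPointerException` when the key is null. Whether a null address was really the cause in `Client.getConnection` is an assumption, since the issue was closed as not reproducible. The sketch below only demonstrates the `Hashtable` behaviour and a simple guard; the class `NullKeyLookup` is hypothetical.

```java
import java.util.Hashtable;

// Hypothetical illustration; not taken from net.nutch.ipc.Client.
class NullKeyLookup {

    // Returns the cached value, or null when the key is null or not present.
    static String safeGet(Hashtable<String, String> table, String key) {
        if (key == null) {
            // Hashtable.get(null) would throw, so treat a null key as a miss.
            return null;
        }
        return table.get(key);
    }

    // Confirms that an unguarded lookup with a null key throws.
    static boolean rawGetThrowsOnNull() {
        try {
            new Hashtable<String, String>().get(null);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        Hashtable<String, String> connections = new Hashtable<>();
        connections.put("host-a:9000", "connection-a");
        System.out.println("guarded lookup: " + safeGet(connections, null));
        System.out.println("raw lookup throws: " + rawGetThrowsOnNull());
    }
}
```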
[jira] Closed: (NUTCH-117) Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
[ http://issues.apache.org/jira/browse/NUTCH-117?page=all ]

Piotr Kosiorowski closed NUTCH-117:
-----------------------------------

Fix Version: 0.7.2-dev
Resolution: Fixed
Assign To: Piotr Kosiorowski

Applied fix by Mike. Also reported offlist by Michal Karwanski.

Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
-------------------------------------------------------------------------------------------------------------

Key: NUTCH-117
URL: http://issues.apache.org/jira/browse/NUTCH-117
Project: Nutch
Type: Bug
Versions: 0.7.1, 0.7, 0.6
Environment: Windows 2000, P4 1.70GHz, 512MB RAM, Java 1.5.0_05
Reporter: Stephen Cross
Assignee: Piotr Kosiorowski
Priority: Critical
Fix For: 0.7.2-dev

I started a crawl from the command line using Nutch 0.7.1:

nutch-daemon.sh start crawl urls.txt -dir oct18 -threads 4 -depth 20

After crawling for over 15 hours the crawl crashed with the following exception:

051019 050543 status: segment 20051019050438, 30 pages, 0 errors, 1589818 bytes, 48020 ms
051019 050543 status: 0.6247397 pages/s, 258.65167 kb/s, 52993.934 bytes/page
051019 050544 Updating C:\nutch\crawl.intranet\oct18\db
051019 050544 Updating for C:\nutch\crawl.intranet\oct18\segments\20051019050438
051019 050544 Processing document 0
051019 050544 Finishing update
051019 050544 Processing pagesByURL: Sorted 47 instructions in 0.02 seconds.
051019 050544 Processing pagesByURL: Sorted 2350.0 instructions/second
Exception in thread "main" java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)

This was on the 14th segment of the requested depth of 20.

A quick Google search on the exception brings up a few previous posts with the same error but no definitive answer. It seems to have been occurring since Nutch 0.6.
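The two status lines in the log are consistent with each other, which helps when reading the units. The sketch below recomputes the second line from the first, assuming that "kb/s" means kilobits per second with 1 kb = 1024 bits. That interpretation is inferred from the numbers in this log, not from the Nutch source, and the class `FetchStats` is hypothetical.

```java
// Hypothetical helper that recomputes the fetcher status figures from the log above.
class FetchStats {

    // Pages fetched per second of wall-clock time.
    static double pagesPerSecond(long pages, long elapsedMs) {
        return pages * 1000.0 / elapsedMs;
    }

    // Throughput in kilobits per second, assuming 1 kb = 1024 bits.
    static double kilobitsPerSecond(long bytes, long elapsedMs) {
        double kilobits = bytes * 8.0 / 1024.0;
        return kilobits / (elapsedMs / 1000.0);
    }

    // Average size of a fetched page in bytes.
    static double bytesPerPage(long bytes, long pages) {
        return (double) bytes / pages;
    }

    public static void main(String[] args) {
        long pages = 30;        // "30 pages"
        long bytes = 1589818;   // "1589818 bytes"
        long elapsedMs = 48020; // "48020 ms"
        System.out.println("pages/s:    " + pagesPerSecond(pages, elapsedMs));
        System.out.println("kb/s:       " + kilobitsPerSecond(bytes, elapsedMs));
        System.out.println("bytes/page: " + bytesPerPage(bytes, pages));
    }
}
```

Under that assumption the computed values agree with the logged 0.6247397 pages/s, 258.65167 kb/s and 52993.934 bytes/page to within rounding.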
[jira] Commented: (NUTCH-220) PDF Box can't parse document: java.lang.NullPointerException
[ http://issues.apache.org/jira/browse/NUTCH-220?page=comments#action_12371887 ]

Richard Braman commented on NUTCH-220:
--------------------------------------

Here is an example of the error from my log file. It seems it was fixed with the latest PDFBox, per Ben Litchfield, developer of PDFBox.

060325 212856 fetch of http://www.state.sd.us/drr2/reg/bank/Trust%20Fee%20Calculation.pdf failed with: java.lang.NullPointerException
java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:180)
at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:171)
at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:245)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:185)
060325 212856 SEVERE fetcher caught:java.lang.NullPointerException

PDF Box can't parse document: java.lang.NullPointerException
------------------------------------------------------------

Key: NUTCH-220
URL: http://issues.apache.org/jira/browse/NUTCH-220
Project: Nutch
Type: Bug
Environment: PDFBox 0.7.2
Reporter: Richard Braman

This error was fixed in the latest build of PDFBox, which should be tested with Nutch.

060228 160354 fetch okay, but can't parse http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason: failed(2,0): Can't be handled as pdf document. java.lang.NullPointerException

Yes, the NPE should be fixed.

Ben

Richard Braman wrote:

Hi Ben,

We actually got to the bottom of all of them except for one. The content truncation was due to an inconsistency bug in the Nutch config. The "no permission to extract text" error is actually correct: the author, the NC Department of Revenue, put this restriction on all of their files (I have asked them to remove it, as it hampers public accessibility). The NullPointerException is the only one left to deal with, and it may be due to the parsing bug. Is this the one that you are referring to?

-----Original Message-----
From: Ben Litchfield [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 02, 2006 4:07 PM
To: Richard Braman
Cc: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: [PDFBox-user] PDF Parse Error

I believe these errors are due to a parsing bug in PDFBox that has been fixed since the 0.7.2 release. Please give the nightly build (should be a drop-in replacement) a try from http://www.pdfbox.org/dist and let me know if you are still having issues.

Ben