[jira] Commented: (NUTCH-616) Reset Fetch Retry counter when fetch is successful
[ https://issues.apache.org/jira/browse/NUTCH-616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578705#action_12578705 ] Andrzej Bialecki commented on NUTCH-616: - I'm considering a different approach to this patch. There are already 2 Fetcher implementations, and in the future we may want to go even more modular, so patching this issue in every fetching tool doesn't seem appropriate. IMHO this should be handled in the CrawlDb maintenance tools (i.e. CrawlDbReducer). Patch is forthcoming. Reset Fetch Retry counter when fetch is successful -- Key: NUTCH-616 URL: https://issues.apache.org/jira/browse/NUTCH-616 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter: Emmanuel Joke Fix For: 1.0.0 Attachments: NUTCH-616.patch We maintain a counter to track how many times a URL has consecutively been in state RETRY after trouble fetching the page. Here is a sample of the code: case ProtocolStatus.RETRY: // retry fit.datum.setRetriesSinceFetch(fit.datum.getRetriesSinceFetch()+1); However, I notice that we don't reset this counter to 0 on a successful fetch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
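The reducer-side reset Andrzej proposes can be sketched as follows. This is a hypothetical simplification, not the Nutch code: `Datum` and `updateRetryCounter` stand in for CrawlDatum and the CrawlDbReducer status merge, with the field name taken from the snippet in the report.

```java
// Hypothetical sketch of the NUTCH-616 fix in the CrawlDb update path:
// a RETRY fetch increments the consecutive-retry counter (existing
// behaviour), while a successful fetch clears it (the missing reset).
public class RetryCounterReset {

    /** Minimal stand-in for the relevant CrawlDatum state. */
    static class Datum {
        int retriesSinceFetch;
        int status;

        static final int STATUS_FETCH_SUCCESS = 1;
        static final int STATUS_FETCH_RETRY = 2;
    }

    /** Fold one fetch result into the stored datum, as a reducer would. */
    static void updateRetryCounter(Datum stored, int fetchStatus) {
        if (fetchStatus == Datum.STATUS_FETCH_RETRY) {
            stored.retriesSinceFetch++;       // existing behaviour
        } else if (fetchStatus == Datum.STATUS_FETCH_SUCCESS) {
            stored.retriesSinceFetch = 0;     // the reset this issue asks for
        }
        stored.status = fetchStatus;
    }

    public static void main(String[] args) {
        Datum d = new Datum();
        updateRetryCounter(d, Datum.STATUS_FETCH_RETRY);
        updateRetryCounter(d, Datum.STATUS_FETCH_RETRY);
        if (d.retriesSinceFetch != 2) throw new AssertionError("expected 2");
        updateRetryCounter(d, Datum.STATUS_FETCH_SUCCESS);
        if (d.retriesSinceFetch != 0) throw new AssertionError("expected reset to 0");
        System.out.println("retry counter reset ok");
    }
}
```

Doing this in the CrawlDb merge rather than in each Fetcher keeps the fix in one place, which is exactly the argument made in the comment above.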
[jira] Updated: (NUTCH-616) Reset Fetch Retry counter when fetch is successful
[ https://issues.apache.org/jira/browse/NUTCH-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-616: Attachment: NUTCH-616-v2.patch This patch uses FetchSchedule to maintain the counter.
[jira] Assigned: (NUTCH-616) Reset Fetch Retry counter when fetch is successful
[ https://issues.apache.org/jira/browse/NUTCH-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-616: --- Assignee: Andrzej Bialecki
[jira] Closed: (NUTCH-613) Empty Summaries and Cached Pages
[ https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-613. --- Resolution: Fixed Fix Version/s: (was: 0.9.0) Assignee: Andrzej Bialecki (was: Dennis Kubes) Empty Summaries and Cached Pages Key: NUTCH-613 URL: https://issues.apache.org/jira/browse/NUTCH-613 Project: Nutch Issue Type: Bug Components: fetcher, searcher, web gui Affects Versions: 0.9.0 Environment: All Reporter: Dennis Kubes Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: NUTCH-613-1-20080219.patch There is a bug where some search results have no summaries, and viewing their cached pages causes a NullPointerException. This bug is due to redirects getting stored under the new URL while the getURL method of FetchedSegments returns the wrong (old) URL, which is stored in the crawldb but has no content or parse objects.
[jira] Commented: (NUTCH-613) Empty Summaries and Cached Pages
[ https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578754#action_12578754 ] Andrzej Bialecki commented on NUTCH-613: - Patch committed to trunk. Thank you!
[jira] Commented: (NUTCH-615) Redirected URL are fetched without setting any FetchInterval
[ https://issues.apache.org/jira/browse/NUTCH-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578742#action_12578742 ] Andrzej Bialecki commented on NUTCH-615: - I think the code in ParseOutputFormat doesn't matter that much. Any CrawlDatum-s created with LINKED status will be used only as a source of metadata in CrawlDbReducer, and if it defines a truly new URL then the FetchSchedule will be initialized in CrawlDbReducer anyway. So, I think we could apply the parts of the patch in Fetcher-s, and skip the ParseOutputFormat part. Redirected URL are fetched without setting any FetchInterval Key: NUTCH-615 URL: https://issues.apache.org/jira/browse/NUTCH-615 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Emmanuel Joke Fix For: 1.0.0 Attachments: NUTCH-615.patch, NUTCH-615_v2.patch A URL which is redirected results in a new URL. We create a new CrawlDatum for the new URL within the Fetcher, but its FetchInterval was not initialized. The new URL was recorded in the DB with a FetchInterval of 0, and its FetchTime is never correctly updated so that it would be fetched later in the future. Thus we keep crawling those URLs at each generation. This patch fixes the issue.
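The failure mode described above can be sketched as follows. This is an illustrative simplification, not the actual Fetcher code: the class, field names, and the default interval value are assumptions; the point is only that a redirect-created datum needs a non-zero interval so its next fetch time moves into the future.

```java
// Hypothetical sketch of the NUTCH-615 fix: a CrawlDatum created for a
// redirect target must get a non-zero fetch interval, otherwise its next
// fetch time never advances and the URL is re-generated on every cycle.
public class RedirectDatumFix {

    static class CrawlDatum {
        int fetchInterval;   // seconds; 0 means "never initialized" (the bug)
        long fetchTime;      // epoch millis of the next scheduled fetch
    }

    static final int DEFAULT_INTERVAL_SEC = 30 * 24 * 3600; // assumed default

    /** Create the datum for a redirect target URL, with the missing init. */
    static CrawlDatum newRedirectDatum(long now) {
        CrawlDatum d = new CrawlDatum();
        d.fetchInterval = DEFAULT_INTERVAL_SEC;        // was left at 0 before
        d.fetchTime = now + d.fetchInterval * 1000L;   // schedule in the future
        return d;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        CrawlDatum d = newRedirectDatum(now);
        if (d.fetchInterval == 0) throw new AssertionError("interval not set");
        if (d.fetchTime <= now) throw new AssertionError("fetch time not in future");
        System.out.println("redirect datum initialized");
    }
}
```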
[jira] Commented: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl
[ https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578770#action_12578770 ] Andrzej Bialecki commented on NUTCH-612: - Patch committed to trunk rev. 637114. Thank you! URL filtering is always disabled in Generator when invoked by Crawl --- Key: NUTCH-612 URL: https://issues.apache.org/jira/browse/NUTCH-612 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.0.0 Reporter: Susam Pal Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: NUTCH-612v0.1.patch When a crawl is done using the 'bin/nutch crawl' command, no filtering is done in Generator even if 'crawl.generate.filter' is set to true in the configuration file. The problem is that in the Generator's generate method, the following code unconditionally sets the filter value of the job to whatever is passed to it: {code}job.setBoolean(CRAWL_GENERATE_FILTER, filter);{code} The code in Crawl.java always passes this as false. This has been fixed by exposing an overloaded generate method which takes only the 5 arguments that Crawl needs to set. This overloaded method reads the configuration and sets the filter value appropriately.
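The shape of the fix can be sketched like this. Names and signatures are illustrative, not the exact Generator API, and `Conf` is a toy stand-in for a Hadoop Configuration: the short overload reads `crawl.generate.filter` from the configuration instead of trusting a hard-coded `false` from the caller.

```java
// Illustrative sketch of the NUTCH-612 fix: an overload that derives the
// filter flag from the configuration, so Crawl's hard-coded 'false' no
// longer disables URL filtering unconditionally.
public class GenerateOverload {

    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";

    /** Toy stand-in for a Hadoop Configuration. */
    static class Conf {
        private final java.util.Map<String, String> map = new java.util.HashMap<>();
        void set(String k, String v) { map.put(k, v); }
        boolean getBoolean(String k, boolean dflt) {
            String v = map.get(k);
            return v == null ? dflt : Boolean.parseBoolean(v);
        }
    }

    static boolean lastFilterFlag; // records what the "job" was told

    /** Full method: sets the job flag to whatever it is passed (the buggy path). */
    static void generate(Conf conf, String segments, int topN, boolean filter) {
        lastFilterFlag = filter; // stands for job.setBoolean(CRAWL_GENERATE_FILTER, filter)
    }

    /** Overload for callers like Crawl: reads the flag from the configuration. */
    static void generate(Conf conf, String segments, int topN) {
        generate(conf, segments, topN, conf.getBoolean(CRAWL_GENERATE_FILTER, true));
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set(CRAWL_GENERATE_FILTER, "true");
        generate(conf, "crawl/segments", 50);
        if (!lastFilterFlag) throw new AssertionError("filter flag lost");
        System.out.println("filter flag honored");
    }
}
```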
[jira] Closed: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl
[ https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-612. --- Resolution: Fixed Assignee: Andrzej Bialecki
[jira] Closed: (NUTCH-601) Recrawling on existing crawl directory using force option
[ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-601. --- Resolution: Fixed Fix Version/s: 1.0.0 Assignee: Andrzej Bialecki Recrawling on existing crawl directory using force option - Key: NUTCH-601 URL: https://issues.apache.org/jira/browse/NUTCH-601 Project: Nutch Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Susam Pal Assignee: Andrzej Bialecki Priority: Minor Fix For: 1.0.0 Attachments: NUTCH-601v0.1.patch, NUTCH-601v0.2.patch, NUTCH-601v0.3.patch, NUTCH-601v1.0.patch Added a '-force' option to the 'bin/nutch crawl' command line. With this option, one can crawl and recrawl in the following manner: {code} bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force {code} This option can be used for the first crawl too: {code} bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force {code} If one tries to crawl without the -force option when the crawl directory already exists, he/she finds a small warning along with the error message: {code} # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 Exception in thread "main" java.lang.RuntimeException: crawl already exists. Add -force option to recrawl. at org.apache.nutch.crawl.Crawl.main(Crawl.java:89) {code}
[jira] Commented: (NUTCH-601) Recrawling on existing crawl directory using force option
[ https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578781#action_12578781 ] Andrzej Bialecki commented on NUTCH-601: - Patch v. 1.0 applied to trunk in rev. 637122. Thank you!
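The check behind the error message above can be sketched as follows. This is an illustrative sketch, not the actual Crawl.java code: refuse to reuse an existing crawl directory unless -force is present.

```java
// Illustrative sketch of the -force behaviour from NUTCH-601: an existing
// crawl directory is refused (with a hint to add -force) unless the flag
// was given on the command line.
import java.io.File;

public class ForceCheck {

    static void checkCrawlDir(File dir, boolean force) {
        if (dir.exists() && !force) {
            throw new RuntimeException(dir + " already exists. Add -force option to recrawl.");
        }
    }

    public static void main(String[] args) throws Exception {
        File dir = File.createTempFile("crawl", ".tmp"); // stands in for an existing crawl dir
        boolean refused = false;
        try {
            checkCrawlDir(dir, false);                   // no -force: must be refused
        } catch (RuntimeException e) {
            refused = true;
        }
        if (!refused) throw new AssertionError("existing dir should be refused");
        checkCrawlDir(dir, true);                        // -force: proceeds
        checkCrawlDir(new File(dir, "does-not-exist"), false); // fresh dir: proceeds
        dir.delete();
        System.out.println("force check ok");
    }
}
```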
[jira] Closed: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
[ https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-592. --- Resolution: Duplicate Assignee: Andrzej Bialecki (was: Emmanuel Joke) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED - Key: NUTCH-592 URL: https://issues.apache.org/jira/browse/NUTCH-592 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter: Emmanuel Joke Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: patch.txt I get an NPE for pages with status ProtocolStatus.TEMP_MOVED. It seems the handleRedirect function can return null in a few cases, and this is not handled in the caller the way it is for the ProtocolStatus.SUCCESS case.
[jira] Closed: (NUTCH-590) Index multiple docs per call using IndexingFilter extension point
[ https://issues.apache.org/jira/browse/NUTCH-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-590. --- Resolution: Won't Fix Assignee: Andrzej Bialecki Index multiple docs per call using IndexingFilter extension point - Key: NUTCH-590 URL: https://issues.apache.org/jira/browse/NUTCH-590 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.0.0 Reporter: Nathaniel Powell Assignee: Andrzej Bialecki Fix For: 1.0.0 There are many applications where extracting and indexing multiple documents from a single HTML web file or other object would be useful. Therefore, it would help a lot if the IndexingFilter extension point were modified to pass in a list of documents as an argument and return a list (or collection) of documents.
[jira] Commented: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
[ https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578786#action_12578786 ] Andrzej Bialecki commented on NUTCH-592: - Duplicate of NUTCH-597 and NUTCH-615.
[jira] Commented: (NUTCH-590) Index multiple docs per call using IndexingFilter extension point
[ https://issues.apache.org/jira/browse/NUTCH-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578788#action_12578788 ] Andrzej Bialecki commented on NUTCH-590: - No further comments or patches provided.
[jira] Commented: (NUTCH-610) Can't Update or modify an index while web gui is running
[ https://issues.apache.org/jira/browse/NUTCH-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578773#action_12578773 ] Andrzej Bialecki commented on NUTCH-610: - If there are no objections I would like to close this issue as Invalid. Can't Update or modify an index while web gui is running Key: NUTCH-610 URL: https://issues.apache.org/jira/browse/NUTCH-610 Project: Nutch Issue Type: Improvement Components: searcher, web gui Affects Versions: 0.9.0 Reporter: Ciminera Frederic Attachments: NutchBeanNoLock.patch When the search web application is started, a NutchBean is created and initializes its searcher on the index files (and also a FetchedSegments on the segments). This index searcher (and also FetchedSegments) holds a lock on the files on disk that prevents the index from being updated or modified. It would be nice to be able to update an index without having to restart the web server.
[jira] Commented: (NUTCH-575) NPE in OpenSearchServlet when summary is null
[ https://issues.apache.org/jira/browse/NUTCH-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578795#action_12578795 ] Andrzej Bialecki commented on NUTCH-575: - I applied the remaining patch (oss-npe_1.patch) to trunk, rev. 637127. Thank you! NPE in OpenSearchServlet when summary is null - Key: NUTCH-575 URL: https://issues.apache.org/jira/browse/NUTCH-575 Project: Nutch Issue Type: Bug Components: searcher Affects Versions: 0.9.0, 1.0.0 Reporter: John H. Lee Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: oss-npe.patch, sagar-search.patch summaries[i].toHtml() is called without checking whether summaries[i] is null, causing an unhandled NullPointerException and a failed OpenSearchServlet query.
[jira] Closed: (NUTCH-575) NPE in OpenSearchServlet when summary is null
[ https://issues.apache.org/jira/browse/NUTCH-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki closed NUTCH-575. --- Resolution: Fixed Assignee: Andrzej Bialecki
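The guard the patch applies can be sketched as a one-line null check. This is a simplified illustration, not the servlet code: `Summary` and `renderSummary` are stand-ins for the Nutch classes; the point is simply to skip null summaries instead of calling toHtml() on them.

```java
// Minimal sketch of the NUTCH-575 guard: treat a null summary as an
// empty snippet rather than dereferencing it.
public class SummaryGuard {

    /** Toy stand-in for org.apache.nutch.searcher.Summary. */
    static class Summary {
        String toHtml() { return "<b>snippet</b>"; }
    }

    static String renderSummary(Summary s) {
        return s == null ? "" : s.toHtml();   // the null check that was missing
    }

    public static void main(String[] args) {
        Summary[] summaries = { new Summary(), null, new Summary() };
        StringBuilder out = new StringBuilder();
        for (Summary s : summaries) {
            out.append(renderSummary(s));     // no NullPointerException on the null entry
        }
        System.out.println(out);
    }
}
```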
Re: [jira] Commented: (NUTCH-575) NPE in OpenSearchServlet when summary is null
Please, I want to leave this mailing list about Nutch. I already sent an e-mail to get off this mailing list, but I'm still receiving many e-mails from it, with FROM: nutch-dev@lucene.apache.org. Please let me know how to STOP receiving these e-mails. Thanks so much. -- Jesiel A.S. Trevisan
Re: [jira] Commented: (NUTCH-575) NPE in OpenSearchServlet when summary is null
Jesiel Trevisan wrote: Please, I want to leave this mail list about nutch. I already sent a e-mail to keep of this mail list, but, I'm still receving many e-mail about it, with FROM: nutch-dev@lucene.apache.org Hi, Have you sent the email as described here http://lucene.apache.org/nutch/mailing_lists.html to the correct -unsubscribe address? -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Problem in running Nutch where proxy authentication is required.
Hi All, I am facing a problem running Nutch where proxy authentication is required to crawl a site (e.g. google.com, yahoo.com). I am able to crawl sites which do not require proxy authentication from our domain (e.g. abc.com); a crawl folder and 5 subfolders are created successfully. I have put all the values in conf/nutch-site.xml and conf/nutch-default.xml as given. I have listed below all the entries which I have modified to run Nutch (e.g. settings in urls/urls.txt, conf/crawl-urlfilter.txt, conf/nutch-site.xml, conf/nutch-default.xml), and I have also included the crawl.log text for your reference. While crawling through Cygwin, it throws an exception. Please help me out with what I have to do to run Nutch successfully (where do I have to put an entry to pass through proxy authentication?): Dedup: starting Dedup: adding indexes in: crawl/indexes Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439) at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) = ===crawl.log crawl started in: crawl rootUrlDir = urls threads = 10 depth = 3 topN = 50 Injector: starting Injector: crawlDb: crawl/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. Injector: Merging injected urls into crawl db. Injector: done Generator: Selecting best-scoring urls due for fetch. Generator: starting Generator: segment: crawl/segments/20080109122052 Generator: filtering: false Generator: topN: 50 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls by host, for politeness. Generator: done. 
Fetcher: starting Fetcher: segment: crawl/segments/20080109122052 Fetcher: threads: 10 fetching http://www.yahoo.com/ fetch of http://www.yahoo.com/ http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/ Fetcher: done CrawlDb update: starting CrawlDb update: db: crawl/crawldb CrawlDb update: segments: [crawl/segments/20080109122052] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: done Generator: Selecting best-scoring urls due for fetch. Generator: starting Generator: segment: crawl/segments/20080109122101 Generator: filtering: false Generator: topN: 50 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls by host, for politeness. Generator: done. Fetcher: starting Fetcher: segment: crawl/segments/20080109122101 Fetcher: threads: 10 fetching http://www.yahoo.com/ fetch of http://www.yahoo.com/ http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/ Fetcher: done CrawlDb update: starting CrawlDb update: db: crawl/crawldb CrawlDb update: segments: [crawl/segments/20080109122101] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: done Generator: Selecting best-scoring urls due for fetch. Generator: starting Generator: segment: crawl/segments/20080109122110 Generator: filtering: false Generator: topN: 50 Generator: jobtracker is 'local', generating exactly one partition. Generator: Partitioning selected urls by host, for politeness. Generator: done. 
Fetcher: starting Fetcher: segment: crawl/segments/20080109122110 Fetcher: threads: 10 fetching http://www.yahoo.com/ fetch of http://www.yahoo.com/ http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/ Fetcher: done CrawlDb update: starting CrawlDb update: db: crawl/crawldb CrawlDb update: segments: [crawl/segments/20080109122110] CrawlDb update: additions allowed: true CrawlDb update: URL normalizing: true CrawlDb update: URL filtering: true CrawlDb update: Merging segment data into db. CrawlDb update: done LinkDb: starting LinkDb: linkdb: crawl/linkdb LinkDb: URL normalize: true LinkDb: URL filter: true LinkDb: adding segment: crawl/segments/20080109122052 LinkDb: adding segment: crawl/segments/20080109122101 LinkDb: adding segment: crawl/segments/20080109122110 LinkDb: done Indexer: starting Indexer: linkdb: crawl/linkdb Indexer: adding segment: crawl/segments/20080109122052 Indexer: adding segment: crawl/segments/20080109122101 Indexer: adding segment: crawl/segments/20080109122110 Optimizing index. Indexer: done Dedup: starting Dedup: adding indexes in: crawl/indexes Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:43 9) at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) = urls/urls.txt
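The repeated "Http code=407" in the log means the proxy itself is rejecting each fetch for lack of credentials, so Nutch never gets any page content. A sketch of the relevant conf/nutch-site.xml entries follows; the host, port, and plugin list values are placeholders. http.proxy.host and http.proxy.port are standard Nutch properties, while answering a 407 challenge needs the protocol-httpclient plugin, whose credential configuration (e.g. httpclient-auth.xml) varies by Nutch version, so check the documentation for the version in use.

```xml
<!-- Assumed fragment for conf/nutch-site.xml: point Nutch at the proxy
     and switch from protocol-http to protocol-httpclient, which supports
     authentication. Values below are placeholders. -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>plugin.includes</name>
  <!-- take your existing plugin.includes value and replace protocol-http
       with protocol-httpclient; this value is only an example -->
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```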
Re: Problem in running Nutch where proxy authentication is required.
I still can't see any DEBUG logs in your log file. Did you go through my earlier mail? Regards, Susam Pal
[jira] Commented: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs
[ https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12578957#action_12578957 ] Andrzej Bialecki commented on NUTCH-566: - I agree that this should be put into a utility class. We already have one in trunk, org.apache.nutch.util.URLUtil. Could any of you provide an updated patch, relative to the current trunk? Sun's URL class has bug in creation of relative query URLs -- Key: NUTCH-566 URL: https://issues.apache.org/jira/browse/NUTCH-566 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 0.8, 0.8.1, 0.9.0 Environment: MacOS X and Linux (CentOS 4.5) both Reporter: Doug Cook Priority: Minor Attachments: RelativeURL.java I'm using 0.8.1, but this will affect all other versions as well. Relative links of the form ?blah are resolved incorrectly. For example, with a base URL of http://www.fleurie.org/entreprise.asp and a relative link of ?id_entrep=111, Nutch will resolve this pair to http://www.fleurie.org/?id_entrep=111. No such URL exists, and all browsers I tried resolve the pair to http://www.fleurie.org/entreprise.asp?id_entrep=111. I tracked this down to what could be called a bug in Sun's URL class. According to Sun's spec, they parse the relative URL according to RFC 2396. But the original RFC for relative links was RFC 1808, and the two RFCs differ in how they handle relative links beginning with ?. Most browsers (Netscape/Mozilla, IE, Safari) implemented RFC 1808 and stuck with it (for compatibility, and also because the behavior makes more sense). Apparently even the people who wrote RFC 2396 recognized that this was a mistake, and the specified behavior was changed in RFC 3986 to match what browsers do. For a discussion of this, see http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query Sun's URL implementation, however, still implements RFC 2396 as far as I can tell, and is out of step with the rest of the world.
This breaks link extraction on a number of sites. I implemented a simple workaround, which I'm attaching. It is a static method to create URLs which behaves exactly like new URL(URL base, String relativePath), and I use it as a drop-in replacement for that in DOMContentUtils, JavaScript link extraction, etc. Obviously, it really only matters wherever links are extracted. I haven't included the calling code from DOMContentUtils, etc., because my local versions are largely rewritten, but it should be pretty obvious. I put it in the org.apache.nutch.net package, but feel free to move it to another place if you feel it belongs elsewhere! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
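The attached RelativeURL.java is not reproduced in this message, but the idea it describes can be sketched as follows. This is an illustrative sketch only (class and method names are hypothetical, not the attachment's): query-only relative references are resolved per RFC 3986, keeping the base path instead of dropping its last segment, while everything else falls through to Sun's URL resolution.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of the workaround described in NUTCH-566, not the
// attached RelativeURL.java. A "?query" relative reference keeps the base
// path (RFC 3986 behavior, matching browsers) rather than replacing the
// last path segment (RFC 2396 behavior, as in Sun's URL class).
class RelativeUrlSketch {
    static URL resolve(URL base, String rel) throws MalformedURLException {
        if (rel.startsWith("?")) {
            // Reuse the base path verbatim and append the new query string.
            String path = base.getPath().isEmpty() ? "/" : base.getPath();
            return new URL(base.getProtocol(), base.getHost(), base.getPort(), path + rel);
        }
        // All other relative forms: defer to the standard resolution.
        return new URL(base, rel);
    }
}
```

With the example from the report, resolving ?id_entrep=111 against http://www.fleurie.org/entreprise.asp yields http://www.fleurie.org/entreprise.asp?id_entrep=111, matching what browsers do.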
[jira] Commented: (NUTCH-126) Fetching via https does not work with a proxy (patch)
[ https://issues.apache.org/jira/browse/NUTCH-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12578969#action_12578969 ] Andrzej Bialecki commented on NUTCH-126: - Patch applied to trunk, rev. 637308. Thank you! Fetching via https does not work with a proxy (patch) - Key: NUTCH-126 URL: https://issues.apache.org/jira/browse/NUTCH-126 Project: Nutch Issue Type: Bug Environment: Any Reporter: Fritz Elfert Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: nutch-sslproxy.patch Trying to fetch content from an SSL-Server using a proxy does not work due to a bug in the protocol-httpclient plugin. The attached patch fixes this problem. Ciao -Fritz -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-157) Problem during parsing msword document. It fetching properly but parsing is not working. Please show me the way how can i parse it
[ https://issues.apache.org/jira/browse/NUTCH-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12578972#action_12578972 ] Andrzej Bialecki commented on NUTCH-157: - This branch is in End Of Life status. Problem during parsing msword document. It fetching properly but parsing is not working. Please show me the way how can i parse it --- Key: NUTCH-157 URL: https://issues.apache.org/jira/browse/NUTCH-157 Project: Nutch Issue Type: Bug Affects Versions: 0.7 Environment: windows Reporter: karamjit Ms word document not parsing. Error messages:
Page from url Path in fetch file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc
060301 173204 fetching file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc
060301 173204 Parsing [file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc] with [EMAIL PROTECTED]
060301 173204 fetch of file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc failed with: java.lang.NoSuchMethodError: org.apache.poi.hpsf.SummaryInformation.getEditTime()J
060301 173204 Could not clean the content-type [], Reason is [org.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty]. Using its raw version...
060301 173204 Parsing [file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc] with [EMAIL PROTECTED]
060301 173205 status: segment 20060301173203, 1 pages, 1 errors, 35840 bytes, 1000 ms
060301 173205 status: 1.0 pages/s, 280.0 kb/s, 35840.0 bytes/page
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl
[ https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12579003#action_12579003 ] Hudson commented on NUTCH-612: -- Integrated in Nutch-trunk #390 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/390/]) URL filtering is always disabled in Generator when invoked by Crawl --- Key: NUTCH-612 URL: https://issues.apache.org/jira/browse/NUTCH-612 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.0.0 Reporter: Susam Pal Assignee: Andrzej Bialecki Fix For: 1.0.0 Attachments: NUTCH-612v0.1.patch When a crawl is done using the 'bin/nutch crawl' command, no filtering is done in Generator even if 'crawl.generate.filter' is set to true in the configuration file. The problem is that in the Generator's generate method, the following code unconditionally sets the job's filter value to whatever is passed in:
{code}
job.setBoolean(CRAWL_GENERATE_FILTER, filter);
{code}
The code in Crawl.java always passes this as false. This has been fixed by exposing an overloaded generate method which takes only the 5 arguments that Crawl needs to set. This overloaded method reads the configuration and sets the filter value appropriately. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
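The overload pattern the NUTCH-612 fix describes can be sketched in a self-contained way. This is an illustrative analogue, not Nutch's actual code: the Hadoop Configuration/JobConf types are stood in for by a plain Map, and the class and method signatures are hypothetical. The point is the shape of the fix: the full-signature method still takes an explicit filter flag, while the new shorter overload reads 'crawl.generate.filter' from the configuration instead of letting a caller like Crawl.java silently pass false.

```java
import java.util.Map;

// Standalone analogue of the NUTCH-612 fix. A Map replaces Hadoop's
// Configuration so the sketch runs without Hadoop; names are illustrative.
class GeneratorSketch {
    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";
    private final Map<String, String> conf;

    GeneratorSketch(Map<String, String> conf) {
        this.conf = conf;
    }

    // Full-signature method: the caller decides whether to filter.
    // Before the fix, Crawl always passed filter=false here, which
    // overrode the configuration unconditionally.
    boolean generate(String segment, int topN, boolean filter) {
        // Stands in for: job.setBoolean(CRAWL_GENERATE_FILTER, filter);
        return filter;
    }

    // Overload added by the fix: reads the flag from the configuration,
    // so 'crawl.generate.filter' set by the user is honored.
    boolean generate(String segment, int topN) {
        boolean filter = Boolean.parseBoolean(
                conf.getOrDefault(CRAWL_GENERATE_FILTER, "true"));
        return generate(segment, topN, filter);
    }
}
```

With 'crawl.generate.filter' set to true in the configuration, a caller using the short overload gets filtering enabled, while the old explicit-flag path still loses it.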