[jira] Commented: (NUTCH-616) Reset Fetch Retry counter when fetch is successful

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578705#action_12578705
 ] 

Andrzej Bialecki  commented on NUTCH-616:
-

I'm considering a different approach to this patch. There are already 2 Fetcher 
implementations, and in the future we may want to go even more modular, so 
patching this issue in every fetching tool doesn't seem appropriate. IMHO this 
should be handled in the CrawlDb maintenance tools (i.e. CrawlDbReducer). Patch 
is forthcoming.
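
For illustration, a reset on the CrawlDb side could look roughly like this (a 
sketch only: the status constants are from the CrawlDatum API, but the 
surrounding reduce() logic is elided and the variable names are hypothetical):

{code}
// Hypothetical fragment of CrawlDbReducer.reduce(): once a fetch
// succeeds, the consecutive-retry counter no longer applies.
switch (fetch.getStatus()) {
case CrawlDatum.STATUS_FETCH_SUCCESS:
  if (result.getRetriesSinceFetch() > 0) {
    result.setRetriesSinceFetch(0);     // clear the counter on success
  }
  break;
case CrawlDatum.STATUS_FETCH_RETRY:
  result.setRetriesSinceFetch(result.getRetriesSinceFetch() + 1);
  break;
}
{code}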

 Reset Fetch Retry counter when fetch is successful
 --

 Key: NUTCH-616
 URL: https://issues.apache.org/jira/browse/NUTCH-616
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
 Fix For: 1.0.0

 Attachments: NUTCH-616.patch


 We manage a counter that tracks how many times the URL has been consecutively 
 in state Retry following some trouble getting the page.
 Here is a sample of the code:
 case ProtocolStatus.RETRY:  // retry
  
 fit.datum.setRetriesSinceFetch(fit.datum.getRetriesSinceFetch()+1);
  
  However, I notice that we don't reinitialize this counter to 0 in the case 
 of a successful fetch.
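 The attached patch is not reproduced in this digest, but the fix the report 
 implies is a reset next to the RETRY case, along these lines (a sketch; the 
 surrounding switch in the Fetcher is elided):
 {code}
 case ProtocolStatus.SUCCESS:           // fetch succeeded
   if (fit.datum.getRetriesSinceFetch() > 0) {
     fit.datum.setRetriesSinceFetch(0); // clear the consecutive-retry counter
   }
   break;
 {code}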

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-616) Reset Fetch Retry counter when fetch is successful

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-616:


Attachment: NUTCH-616-v2.patch

This patch uses FetchSchedule to maintain the counter.
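
The delegation could look roughly like this (a sketch with the signature 
approximated from the Nutch 1.x FetchSchedule interface, not the literal 
patch):

{code}
// Sketch: setFetchSchedule() runs for successfully fetched pages during
// the CrawlDb update, so it covers every Fetcher implementation at once.
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
    long prevFetchTime, long prevModifiedTime,
    long fetchTime, long modifiedTime, int state) {
  datum.setRetriesSinceFetch(0);        // fetch succeeded, reset retries
  return datum;
}
{code}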

 Reset Fetch Retry counter when fetch is successful
 --

 Key: NUTCH-616
 URL: https://issues.apache.org/jira/browse/NUTCH-616
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
 Fix For: 1.0.0

 Attachments: NUTCH-616-v2.patch, NUTCH-616.patch


 We manage a counter that tracks how many times the URL has been consecutively 
 in state Retry following some trouble getting the page.
 Here is a sample of the code:
 case ProtocolStatus.RETRY:  // retry
  
 fit.datum.setRetriesSinceFetch(fit.datum.getRetriesSinceFetch()+1);
  
  However, I notice that we don't reinitialize this counter to 0 in the case 
 of a successful fetch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (NUTCH-616) Reset Fetch Retry counter when fetch is successful

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-616:
---

Assignee: Andrzej Bialecki 

 Reset Fetch Retry counter when fetch is successful
 --

 Key: NUTCH-616
 URL: https://issues.apache.org/jira/browse/NUTCH-616
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-616-v2.patch, NUTCH-616.patch


 We manage a counter that tracks how many times the URL has been consecutively 
 in state Retry following some trouble getting the page.
 Here is a sample of the code:
 case ProtocolStatus.RETRY:  // retry
  
 fit.datum.setRetriesSinceFetch(fit.datum.getRetriesSinceFetch()+1);
  
  However, I notice that we don't reinitialize this counter to 0 in the case 
 of a successful fetch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-613) Empty Summaries and Cached Pages

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-613.
---

   Resolution: Fixed
Fix Version/s: (was: 0.9.0)
 Assignee: Andrzej Bialecki   (was: Dennis Kubes)

 Empty Summaries and Cached Pages
 

 Key: NUTCH-613
 URL: https://issues.apache.org/jira/browse/NUTCH-613
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, searcher, web gui
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-613-1-20080219.patch


 There is a bug where some search results do not have summaries, and viewing 
 their cached pages causes a NullPointerException.  This bug is due to 
 redirects being stored under the new URL while the getURL method of 
 FetchedSegments gets the wrong (old) URL, which is stored in the CrawlDb but 
 has no content or parse objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-613) Empty Summaries and Cached Pages

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578754#action_12578754
 ] 

Andrzej Bialecki  commented on NUTCH-613:
-

Patch committed to trunk. Thank you!

 Empty Summaries and Cached Pages
 

 Key: NUTCH-613
 URL: https://issues.apache.org/jira/browse/NUTCH-613
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, searcher, web gui
Affects Versions: 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-613-1-20080219.patch


 There is a bug where some search results do not have summaries, and viewing 
 their cached pages causes a NullPointerException.  This bug is due to 
 redirects being stored under the new URL while the getURL method of 
 FetchedSegments gets the wrong (old) URL, which is stored in the CrawlDb but 
 has no content or parse objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-615) Redirected URLs are fetched without setting any FetchInterval

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578742#action_12578742
 ] 

Andrzej Bialecki  commented on NUTCH-615:
-

I think the code in ParseOutputFormat doesn't matter that much. Any 
CrawlDatum-s created with LINKED status will be used only as a source of 
metadata in CrawlDbReducer, and if it defines a truly new URL then the 
FetchSchedule will be initialized in CrawlDbReducer anyway.

So, I think we could apply the parts of the patch in Fetcher-s, and skip the 
ParseOutputFormat part.
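
To make the reasoning concrete (a sketch of the CrawlDbReducer flow this 
relies on; initializeSchedule() is the real FetchSchedule method, the local 
names are hypothetical):

{code}
// Sketch: a truly new URL (e.g. a redirect target) has no prior CrawlDb
// entry, so CrawlDbReducer initializes its schedule itself; a LINKED
// datum with fetchInterval=0 only contributes metadata.
if (old == null) {                          // no existing CrawlDb entry
  result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
  schedule.initializeSchedule(url, result); // sets a sane fetch interval
}
{code}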

 Redirected URLs are fetched without setting any FetchInterval
 

 Key: NUTCH-615
 URL: https://issues.apache.org/jira/browse/NUTCH-615
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
 Fix For: 1.0.0

 Attachments: NUTCH-615.patch, NUTCH-615_v2.patch


 A URL which is redirected results in a new URL. We create a new CrawlDatum 
 for the new URL within the Fetcher, but its FetchInterval was not initialized.
 The new URL was recorded in the DB with FetchInterval = 0, and its FetchTime 
 was never correctly updated so that the page would be fetched later in the 
 future. Thus we kept crawling those URLs at each generation.
 This patch fixes the issue.
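 The patch is attached to the issue rather than inlined here; the kind of fix 
 described would look roughly like this (a sketch, with names approximated 
 from the Fetcher's redirect handling):
 {code}
 // Sketch: when creating the CrawlDatum for a redirect target, carry
 // over a sane fetch interval instead of leaving it at 0.
 CrawlDatum newDatum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED,
     fit.datum.getFetchInterval(), fit.datum.getScore());
 {code}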

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578770#action_12578770
 ] 

Andrzej Bialecki  commented on NUTCH-612:
-

Patch committed to trunk rev. 637114. Thank you!

 URL filtering is always disabled in Generator when invoked by Crawl
 ---

 Key: NUTCH-612
 URL: https://issues.apache.org/jira/browse/NUTCH-612
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
Reporter: Susam Pal
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-612v0.1.patch


 When a crawl is done using the 'bin/nutch crawl' command, no filtering is 
 done in Generator even if 'crawl.generate.filter' is set to true in the 
 configuration file.
 The problem is that in the Generator's generate method, the following code 
 unconditionally sets the filter value of the job to whatever is passed to it:
 {code}job.setBoolean(CRAWL_GENERATE_FILTER, filter);{code}
 The code in Crawl.java always passes this as false. 
 This has been fixed by exposing an overloaded generate method which takes 
 only the 5 arguments that Crawl needs to set. This overloaded method reads 
 the configuration and sets the filter value appropriately.
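 The overload described above would look roughly like this (a sketch against 
 the Generator of that era, where CRAWL_GENERATE_FILTER is the constant for 
 crawl.generate.filter):
 {code}
 // Sketch: the 5-argument overload reads crawl.generate.filter from the
 // configuration instead of trusting the caller's hard-coded flag.
 public Path generate(Path dbDir, Path segments, int numLists,
     long topN, long curTime) throws IOException {
   boolean filter = getConf().getBoolean(CRAWL_GENERATE_FILTER, true);
   return generate(dbDir, segments, numLists, topN, curTime, filter, false);
 }
 {code}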

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-612.
---

Resolution: Fixed
  Assignee: Andrzej Bialecki 

 URL filtering is always disabled in Generator when invoked by Crawl
 ---

 Key: NUTCH-612
 URL: https://issues.apache.org/jira/browse/NUTCH-612
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
Reporter: Susam Pal
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-612v0.1.patch


 When a crawl is done using the 'bin/nutch crawl' command, no filtering is 
 done in Generator even if 'crawl.generate.filter' is set to true in the 
 configuration file.
 The problem is that in the Generator's generate method, the following code 
 unconditionally sets the filter value of the job to whatever is passed to it:
 {code}job.setBoolean(CRAWL_GENERATE_FILTER, filter);{code}
 The code in Crawl.java always passes this as false. 
 This has been fixed by exposing an overloaded generate method which takes 
 only the 5 arguments that Crawl needs to set. This overloaded method reads 
 the configuration and sets the filter value appropriately.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-601) Recrawling on existing crawl directory using force option

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-601.
---

   Resolution: Fixed
Fix Version/s: 1.0.0
 Assignee: Andrzej Bialecki 

 Recrawling on existing crawl directory using force option
 -

 Key: NUTCH-601
 URL: https://issues.apache.org/jira/browse/NUTCH-601
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Susam Pal
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-601v0.1.patch, NUTCH-601v0.2.patch, 
 NUTCH-601v0.3.patch, NUTCH-601v1.0.patch


 Added a '-force' option to the 'bin/nutch crawl' command line. With this 
 option, one can crawl and recrawl in the following manner:
 {code}
 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
 {code}
 This option can be used for the first crawl too:
 {code}
 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
 {code}
 If one tries to crawl without the -force option when the crawl directory 
 already exists, he/she finds a small warning along with the error message:
 {code}
 # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
 Exception in thread "main" java.lang.RuntimeException: crawl already
 exists. Add -force option to recrawl.
at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
 {code}
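 The guard that produces this message could be as simple as the following 
 sketch (names approximated; fs is the Hadoop FileSystem and force the parsed 
 command-line flag):
 {code}
 // Sketch: refuse to reuse an existing crawl directory unless -force is given.
 if (fs.exists(dir) && !force) {
   throw new RuntimeException(dir + " already exists. Add -force option to recrawl.");
 }
 {code}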

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-601) Recrawling on existing crawl directory using force option

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12578781#action_12578781
 ] 

Andrzej Bialecki  commented on NUTCH-601:
-

Patch v. 1.0 applied to trunk in rev. 637122. Thank you!

 Recrawling on existing crawl directory using force option
 -

 Key: NUTCH-601
 URL: https://issues.apache.org/jira/browse/NUTCH-601
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Susam Pal
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-601v0.1.patch, NUTCH-601v0.2.patch, 
 NUTCH-601v0.3.patch, NUTCH-601v1.0.patch


 Added a '-force' option to the 'bin/nutch crawl' command line. With this 
 option, one can crawl and recrawl in the following manner:
 {code}
 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
 {code}
 This option can be used for the first crawl too:
 {code}
 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
 bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5 -force
 {code}
 If one tries to crawl without the -force option when the crawl directory 
 already exists, he/she finds a small warning along with the error message:
 {code}
 # bin/nutch crawl urls -dir crawl -depth 2 -topN 10 -threads 5
 Exception in thread "main" java.lang.RuntimeException: crawl already
 exists. Add -force option to recrawl.
at org.apache.nutch.crawl.Crawl.main(Crawl.java:89)
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-592.
---

Resolution: Duplicate
  Assignee: Andrzej Bialecki   (was: Emmanuel Joke)

 Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
 -

 Key: NUTCH-592
 URL: https://issues.apache.org/jira/browse/NUTCH-592
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: patch.txt


 I get an NPE for a page with status ProtocolStatus.TEMP_MOVED. It seems the 
 handleRedirect function can return null in a few cases, and this is not 
 handled in the function as it is for the case ProtocolStatus.SUCCESS.
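 A defensive check along these lines is what the report implies (a sketch; 
 the handleRedirect() arguments are approximated from Fetcher2 and the helper 
 name is hypothetical):
 {code}
 // Sketch: handleRedirect() may return null, so guard before using it.
 Text redirUrl = handleRedirect(fit.url, fit.datum,
     urlString, newUrl, temp, "temp_moved"); // arguments approximated
 if (redirUrl != null) {
   queueRedirect(redirUrl, fit);             // hypothetical helper
 }
 {code}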

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-590) Index multiple docs per call using IndexingFilter extension point

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-590.
---

Resolution: Won't Fix
  Assignee: Andrzej Bialecki 

 Index multiple docs per call using IndexingFilter extension point
 -

 Key: NUTCH-590
 URL: https://issues.apache.org/jira/browse/NUTCH-590
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.0.0
Reporter: Nathaniel Powell
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0


 There are many applications where extracting and indexing multiple documents 
 from a single HTML web file or other object would be useful. Therefore, it 
 would help a lot if the IndexingFilter extension point were modified to pass 
 in a list of documents as an argument and return a list (or collection) of 
 documents.
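 As a sketch, the proposed extension-point change would amount to a signature 
 like this (approximated from the Lucene-Document-based API of that era; the 
 issue was closed as Won't Fix, so this never landed):
 {code}
 // Proposed multi-document form of IndexingFilter.filter():
 List<Document> filter(List<Document> docs, Parse parse, Text url,
     CrawlDatum datum, Inlinks inlinks) throws IndexingException;
 {code}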

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-592) Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578786#action_12578786
 ] 

Andrzej Bialecki  commented on NUTCH-592:
-

Duplicate of NUTCH-597 and NUTCH-615.

 Fetcher2 : NPE for page with status ProtocolStatus.TEMP_MOVED
 -

 Key: NUTCH-592
 URL: https://issues.apache.org/jira/browse/NUTCH-592
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Emmanuel Joke
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: patch.txt


 I get an NPE for a page with status ProtocolStatus.TEMP_MOVED. It seems the 
 handleRedirect function can return null in a few cases, and this is not 
 handled in the function as it is for the case ProtocolStatus.SUCCESS.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-590) Index multiple docs per call using IndexingFilter extension point

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578788#action_12578788
 ] 

Andrzej Bialecki  commented on NUTCH-590:
-

No further comments or patches provided.

 Index multiple docs per call using IndexingFilter extension point
 -

 Key: NUTCH-590
 URL: https://issues.apache.org/jira/browse/NUTCH-590
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.0.0
Reporter: Nathaniel Powell
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0


 There are many applications where extracting and indexing multiple documents 
 from a single HTML web file or other object would be useful. Therefore, it 
 would help a lot if the IndexingFilter extension point were modified to pass 
 in a list of documents as an argument and return a list (or collection) of 
 documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-610) Can't Update or modify an index while web gui is running

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578773#action_12578773
 ] 

Andrzej Bialecki  commented on NUTCH-610:
-

If there are no objections I would like to close this issue as Invalid.

 Can't Update or modify an index while web gui is running
 

 Key: NUTCH-610
 URL: https://issues.apache.org/jira/browse/NUTCH-610
 Project: Nutch
  Issue Type: Improvement
  Components: searcher, web gui
Affects Versions: 0.9.0
Reporter: Ciminera Frederic
 Attachments: NutchBeanNoLock.patch


 When the search web application is started, a NutchBean is created and 
 initializes its searcher on the index files (and also a FetchedSegments on 
 segments).
 This index searcher (and also FetchedSegments) holds a lock on the files 
 on disk that prevents the index from being updated or modified.
 It would be nice to be able to update the index without having to restart the 
 web server.
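 What is being asked for amounts to releasing and reopening the searcher when 
 the index changes, roughly (a sketch using the Lucene API of that era; 
 indexDir and the surrounding lifecycle handling are hypothetical):
 {code}
 // Sketch: instead of holding one IndexSearcher for the web app's whole
 // lifetime, close it (releasing the index files) and reopen on demand.
 searcher.close();                                  // release the lock
 searcher = new IndexSearcher(indexDir.toString()); // pick up the new index
 {code}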

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-575) NPE in OpenSearchServlet when summary is null

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578795#action_12578795
 ] 

Andrzej Bialecki  commented on NUTCH-575:
-

I applied the remaining patch (oss-npe_1.patch) to trunk, rev. 637127. Thank 
you!

 NPE in OpenSearchServlet when summary is null
 -

 Key: NUTCH-575
 URL: https://issues.apache.org/jira/browse/NUTCH-575
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0, 1.0.0
Reporter: John H. Lee
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: oss-npe.patch, sagar-search.patch


 summaries[i].toHtml() is called without checking whether summaries[i] is null, 
 causing an unhandled NullPointerException and a failed OpenSearchServlet 
 query.
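 The fix the report implies is a null guard before rendering, e.g. (a sketch; 
 the toHtml() argument is approximated):
 {code}
 // Sketch: skip rendering when no summary was produced for a hit.
 String html = (summaries[i] != null) ? summaries[i].toHtml(true) : "";
 {code}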

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (NUTCH-575) NPE in OpenSearchServlet when summary is null

2008-03-14 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  closed NUTCH-575.
---

Resolution: Fixed
  Assignee: Andrzej Bialecki 

 NPE in OpenSearchServlet when summary is null
 -

 Key: NUTCH-575
 URL: https://issues.apache.org/jira/browse/NUTCH-575
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0, 1.0.0
Reporter: John H. Lee
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: oss-npe.patch, sagar-search.patch


 summaries[i].toHtml() is called without checking whether summaries[i] is null, 
 causing an unhandled NullPointerException and a failed OpenSearchServlet 
 query.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: [jira] Commented: (NUTCH-575) NPE in OpenSearchServlet when summary is null

2008-03-14 Thread Jesiel Trevisan
Please,

I want to leave this Nutch mailing list.

I already sent an e-mail to get off this mailing list, but I'm still receiving
many e-mails from it, with FROM: nutch-dev@lucene.apache.org

Please let me know how to STOP receiving these emails.

Thanks so much.

On Fri, Mar 14, 2008 at 12:10 PM, Andrzej Bialecki (JIRA) [EMAIL PROTECTED]
wrote:


[
 https://issues.apache.org/jira/browse/NUTCH-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578795#action_12578795]

 Andrzej Bialecki  commented on NUTCH-575:
 -

 I applied the remaining patch (oss-npe_1.patch) to trunk, rev. 637127.
 Thank you!

  NPE in OpenSearchServlet when summary is null
  -
 
  Key: NUTCH-575
  URL: https://issues.apache.org/jira/browse/NUTCH-575
  Project: Nutch
   Issue Type: Bug
   Components: searcher
 Affects Versions: 0.9.0, 1.0.0
 Reporter: John H. Lee
 Assignee: Andrzej Bialecki
  Fix For: 1.0.0
 
  Attachments: oss-npe.patch, sagar-search.patch
 
 
  summaries[i].toHtml() is called without checking if summaries[i] is not
 null, causing an unhandled NullPointerException and a failed
 OpenSearchServlet query.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




-- 
___
Jesiel A.S. Trevisan
Email: [EMAIL PROTECTED]
MSN: [EMAIL PROTECTED]
Skype & AIM: jesieltrevisan
YahooMessager: jesiel.trevisan
ICQ: 46527510
___


Re: [jira] Commented: (NUTCH-575) NPE in OpenSearchServlet when summary is null

2008-03-14 Thread Andrzej Bialecki

Jesiel Trevisan wrote:

Please,

I want to leave this Nutch mailing list.

I already sent an e-mail to get off this mailing list, but I'm still receiving
many e-mails from it, with FROM: nutch-dev@lucene.apache.org



Hi,

Have you sent the email as described here 
http://lucene.apache.org/nutch/mailing_lists.html to the correct 
-unsubscribe address?



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Problem in running Nutch where proxy authentication is required.

2008-03-14 Thread naveen.goswami
Hi All,

I am facing a problem running Nutch where proxy authentication is
required to crawl the site (e.g. google.com, yahoo.com).
I am able to crawl sites which do not require proxy authentication
from our domain (e.g. abc.com); it successfully creates a crawl folder
and 5 subfolders.
I have put all the values in conf/nutch-site.xml &
conf/nutch-default.xml as given.
I have listed below all the entries which I modified to run
Nutch (e.g. settings in urls/urls.txt, conf/crawl-urlfilter.txt,
conf/nutch-site.xml, conf/nutch-default.xml).
I have also included the crawl.log text for your reference.

While crawling through Cygwin, it throws an exception. Please help me
out with what I have to do to run Nutch successfully (and where I have
to put an entry to pass through proxy authentication):

Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

=



===crawl.log

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122052
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122052
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with:
Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122052]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122101
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122101
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with:
Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122101]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122110
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122110
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with:
Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122110]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080109122052
LinkDb: adding segment: crawl/segments/20080109122101
LinkDb: adding segment: crawl/segments/20080109122110
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080109122052
Indexer: adding segment: crawl/segments/20080109122101
Indexer: adding segment: crawl/segments/20080109122110
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
 at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

=



urls/urls.txt
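
For what it's worth, the proxy host and port are configured in
conf/nutch-site.xml; a minimal sketch (these two property names exist in
nutch-default.xml, the values are placeholders, and an HTTP 407 means the
proxy additionally demands credentials, which depends on the protocol plugin
in use):

{code}
<!-- sketch for conf/nutch-site.xml; values are placeholders -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
{code}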

Re: Problem in running Nutch where proxy authentication is required.

2008-03-14 Thread Susam Pal
I still can't see any DEBUG logs in your log file. Did you go through
my earlier mail?

Regards,
Susam Pal

On Wed, Mar 12, 2008 at 9:39 PM,  [EMAIL PROTECTED] wrote:

 Hi All,

  I am facing a problem running Nutch where proxy authentication is
  required to crawl the site (e.g. google.com, yahoo.com).
  I am able to crawl sites which do not require proxy authentication
  from our domain (e.g. abc.com); it successfully creates a crawl folder
  and 5 subfolders.

  While crawling through Cygwin, it throws an exception:

  Dedup: starting
  Dedup: adding indexes in: crawl/indexes
  Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

[jira] Commented: (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578957#action_12578957
 ] 

Andrzej Bialecki  commented on NUTCH-566:
-

I agree that this should be put into a utility class. We already have one in 
trunk, org.apache.nutch.util.URLUtil. Could any of you provide an updated 
patch, relative to the current trunk?

 Sun's URL class has bug in creation of relative query URLs
 --

 Key: NUTCH-566
 URL: https://issues.apache.org/jira/browse/NUTCH-566
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
 Attachments: RelativeURL.java


 I'm using 0.8.1, but this will affect all other versions as well.
 Relative links of the form "?blah" are resolved incorrectly. For example, 
 with a base URL of http://www.fleurie.org/entreprise.asp and a relative link 
 of "?id_entrep=111", Nutch will resolve this pair to the link
 "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers 
 I tried will resolve the pair to 
 "http://www.fleurie.org/entreprise.asp?id_entrep=111".
 I tracked this down to what could be called a bug in Sun's URL class. 
 According to Sun's spec, they parse the relative URL according to RFC 2396. 
 But the original RFC for relative links was RFC 1808, and the two RFCs differ 
 in how they handle relative links beginning with "?". Most browsers 
 (Netscape/Mozilla, IE, Safari) implemented RFC 1808 and stuck with it (for 
 compatibility, and also because the behavior makes more sense). Apparently 
 even the people who wrote RFC 2396 recognized that this was a mistake, and 
 the specified behavior was changed in RFC 3986 to match what browsers do. 
 For a discussion of this, see  
 http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
 Sun's URL implementation, however, still implements RFC 2396 as far as I can 
 tell, and is out of step with the rest of the world.
 This breaks link extraction on a number of sites.
 I implemented a simple workaround, which I'm attaching. It is a static method 
 to create URLs which behaves exactly like new URL(URL base, String 
 relativePath), and I use it as a drop-in replacement for that in 
 DOMContentUtils, JavaScript link extraction, etc. Obviously, it really only 
 matters wherever links are extracted. I haven't included the calling code 
 from DOMContentUtils, etc. because my local versions are largely rewritten, 
 but it should be pretty obvious.
 I put it in the org.apache.nutch.net package, but obviously feel free to 
 move it somewhere else if you feel it belongs elsewhere!
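 The attached RelativeURL.java is not reproduced in this digest; the RFC 3986 
 behavior it implements could look like this (a sketch, not the attachment 
 itself):
 {code}
 // Sketch: a relative reference starting with '?' keeps the base path
 // and replaces only the query, as browsers (and RFC 3986) do.
 public static URL resolve(URL base, String target)
     throws MalformedURLException {
   if (target.startsWith("?")) {
     String file = base.getPath() + target; // keep path, swap query
     return new URL(base.getProtocol(), base.getHost(),
                    base.getPort(), file);
   }
   return new URL(base, target);            // defer to java.net.URL
 }
 {code}
 With base http://www.fleurie.org/entreprise.asp and target "?id_entrep=111" 
 this yields http://www.fleurie.org/entreprise.asp?id_entrep=111, matching 
 the browsers.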

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-126) Fetching via https does not work with a proxy (patch)

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578969#action_12578969
 ] 

Andrzej Bialecki  commented on NUTCH-126:
-

Patch applied to trunk, rev. 637308. Thank you!

 Fetching via https does not work with a proxy (patch)
 -

 Key: NUTCH-126
 URL: https://issues.apache.org/jira/browse/NUTCH-126
 Project: Nutch
  Issue Type: Bug
 Environment: Any
Reporter: Fritz Elfert
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: nutch-sslproxy.patch


 Trying to fetch content from an SSL-Server using a proxy does not work due to 
 a bug in the protocol-httpclient plugin.
 The attached patch fixes this problem.
 Ciao
  -Fritz

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-157) Problem parsing MS Word document. It fetches properly but parsing is not working. Please show me how I can parse it

2008-03-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578972#action_12578972
 ] 

Andrzej Bialecki  commented on NUTCH-157:
-

This branch is in End Of Life status.

 Problem parsing MS Word document. It fetches properly but parsing is not 
 working. Please show me how I can parse it
 ---

 Key: NUTCH-157
 URL: https://issues.apache.org/jira/browse/NUTCH-157
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.7
 Environment: windows 
Reporter: karamjit

 MS Word document not parsing.
 Error messages:
 Page from url Path in fetch 
 file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc
 060301 173204 fetching  
 file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc
 060301 173204 Parsing 
 [file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc] with [EMAIL 
 PROTECTED]
 060301 173204 fetch of 
 file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc failed with: 
 java.lang.NoSuchMethodError: 
 org.apache.poi.hpsf.SummaryInformation.getEditTime()J
 060301 173204 Could not clean the content-type [], Reason is 
 [org.apache.nutch.util.mime.MimeTypeException: The type can not be null or 
 empty]. Using its raw version...
 060301 173204 Parsing 
 [file:/D:/karam/Atlantis_Tools/Crawl_Files/compareFVAJ.doc] with [EMAIL 
 PROTECTED]
 060301 173205 status: segment 20060301173203, 1 pages, 1 errors, 35840 bytes, 
 1000 ms
 060301 173205 status: 1.0 pages/s, 280.0 kb/s, 35840.0 bytes/page

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-612) URL filtering is always disabled in Generator when invoked by Crawl

2008-03-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579003#action_12579003
 ] 

Hudson commented on NUTCH-612:
--

Integrated in Nutch-trunk #390 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/390/])

 URL filtering is always disabled in Generator when invoked by Crawl
 ---

 Key: NUTCH-612
 URL: https://issues.apache.org/jira/browse/NUTCH-612
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
Reporter: Susam Pal
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: NUTCH-612v0.1.patch


 When a crawl is done using the 'bin/nutch crawl' command, no filtering is 
 done in Generator even if 'crawl.generate.filter' is set to true in the 
 configuration file.
 The problem is that in the Generator's generate method, the following code 
 unconditionally sets the filter value of the job to whatever is passed to it:
 {code}job.setBoolean(CRAWL_GENERATE_FILTER, filter);{code}
 The code in Crawl.java always passes this as false. 
 This has been fixed by exposing an overloaded generate method which takes 
 only the 5 arguments that Crawl needs to set. This overloaded method reads 
 the configuration and sets the filter value appropriately.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.