[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread Vishal Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507144
 ] 

Vishal Shah commented on NUTCH-503:
---

Hi Emmanuel,

   Can you please use the readdb command to dump the contents of your crawldb 
after injecting your urls? Were these urls injected into the db in the first 
place? It could be that your urlfilters are filtering them out, or there may be 
some other problem (especially since the third test you did works). It would be 
good to know the contents of the crawldb after inject and before generate in 
each case.


 Generator exits incorrectly for small fetchlists 
 -

 Key: NUTCH-503
 URL: https://issues.apache.org/jira/browse/NUTCH-503
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: Fedora Core 2, JDK 1.6
Reporter: Vishal Shah
 Fix For: 0.8.2

 Attachments: emptyfetchlist.patch, emptyfetchlist.patch


I think I found the reason why the generator returns an empty fetchlist 
for small fetch sizes. 

After the first job finishes running, the generator checks the following 
condition to see if it got an empty list:

 if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {

The third condition is incorrect. In some cases, especially for small 
fetchlists, the first partition might be empty while other partitions 
contain urls. The Generator then incorrectly assumes that all partitions are 
empty after looking only at the first. This problem can also occur when all 
URLs in the fetchlist are from the same host (or from a very small number of 
hosts, or from a number of hosts that all map to a small number of 
partitions).
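
The difference between checking only the first partition and checking all of them can be sketched as follows. This is a minimal illustration, not the contents of emptyfetchlist.patch: PartitionReader and EmptyFetchlistCheck are hypothetical names, and hasRecord() merely stands in for the readers[i].next(new FloatWritable()) call quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a per-partition reader; hasRecord() plays the
// role of readers[i].next(new FloatWritable()) in the quoted condition.
interface PartitionReader {
    boolean hasRecord();
}

final class EmptyFetchlistCheck {
    // Variant quoted in the issue: only the first partition is inspected.
    static boolean isEmptyFirstOnly(List<PartitionReader> readers) {
        return readers == null || readers.isEmpty() || !readers.get(0).hasRecord();
    }

    // Variant the description argues for: the fetchlist is empty only if
    // no partition at all contains a record.
    static boolean isEmptyAllPartitions(List<PartitionReader> readers) {
        if (readers == null) {
            return true;
        }
        for (PartitionReader reader : readers) {
            if (reader.hasRecord()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PartitionReader empty = () -> false;
        PartitionReader nonEmpty = () -> true;
        // First partition empty, second one holds urls.
        List<PartitionReader> readers = Arrays.asList(empty, nonEmpty);
        System.out.println(isEmptyFirstOnly(readers));     // true (wrongly reports empty)
        System.out.println(isEmptyAllPartitions(readers)); // false
    }
}
```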

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-06-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507145
 ] 

Hudson commented on NUTCH-471:
--

Integrated in Nutch-Nightly #125 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/125/])

 Fix synchronization in NutchBean creation
 -

 Key: NUTCH-471
 URL: https://issues.apache.org/jira/browse/NUTCH-471
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: NutchBeanCreationSync_v1.patch, 
 NutchBeanCreationSync_v2.patch


 NutchBean is created and then cached in the servlet context. But 
 NutchBean.get(ServletContext app, Configuration conf) is not synchronized, 
 so more than one instance of the bean (and of DistributedSearch$Client) can 
 be created if the servlet container is accessed rapidly during startup. 
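
A minimal sketch of a synchronized get-or-create, assuming a plain Map as a stand-in for the servlet context's attribute store. CachedBean, the "nutchBean" key and the locking choice are all hypothetical; this is not taken from the attached patches.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

final class CachedBean {
    // Counts constructions so the example can show that only one bean is built.
    static final AtomicInteger CREATED = new AtomicInteger();

    private static final String KEY = "nutchBean"; // hypothetical attribute name

    CachedBean() {
        CREATED.incrementAndGet();
    }

    // Stand-in for NutchBean.get(ServletContext app, Configuration conf):
    // the lookup and the creation happen under one lock, so two callers
    // racing at startup cannot both see "no bean yet" and both create one.
    static CachedBean get(Map<String, Object> app) {
        synchronized (app) {
            CachedBean bean = (CachedBean) app.get(KEY);
            if (bean == null) {
                bean = new CachedBean();
                app.put(KEY, bean);
            }
            return bean;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Object> app = new HashMap<>();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> get(app));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(CREATED.get()); // 1
    }
}
```

Without the synchronized block, two threads could both read null for the key and each construct a bean, which is the situation the issue describes.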




where to put hadoop native lib in tomcat?

2007-06-22 Thread qi wu
Where should I put the hadoop native lib file (e.g. libhadoop.so) so that the 
search webapp can use it? I have tried putting it in a directory like:
/data/apache-tomcat-5.5.23/webapps/ROOT/WEB-INF/lib/native..
but this doesn't work.
Thanks!


[jira] Created: (NUTCH-504) NUTCH-443 broke parsing during fetching

2007-06-22 Thread JIRA
NUTCH-443 broke parsing during fetching
---

 Key: NUTCH-504
 URL: https://issues.apache.org/jira/browse/NUTCH-504
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Doğacan Güney
 Fix For: 1.0.0


After NUTCH-443, if one is parsing during fetching and parsing for a url fails, 
that url doesn't get the segment name or similar properties in its metadata. 
Because of this, the indexer fails (the indexer expects to see a segment name 
for all parses, even those that failed).




[jira] Updated: (NUTCH-504) NUTCH-443 broke parsing during fetching

2007-06-22 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-504:


Attachment: parse_in_fetchers.patch

Patch for the problem. I think it would be nice to add a test case for this, 
but I am not sure how we can force a parse to fail so that we can test it 
properly (comments are welcome :). 






[jira] Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching

2007-06-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507162
 ] 

Doğacan Güney commented on NUTCH-504:
-

Also, should we actually index documents even if their parses have failed? 
Since we replace a url's parse with an empty parse when parsing fails anyway, 
it may be a good idea to skip such documents.
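
The idea of skipping such documents at indexing time could look roughly like this. It is a minimal sketch under stated assumptions: ParsedDoc and SkipFailedParses are hypothetical types made up for illustration, not Nutch's own parse or indexer classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical record of one fetched url and whether its parse succeeded.
final class ParsedDoc {
    final String url;
    final boolean parseSucceeded;

    ParsedDoc(String url, boolean parseSucceeded) {
        this.url = url;
        this.parseSucceeded = parseSucceeded;
    }
}

final class SkipFailedParses {
    // Keeps only documents whose parse succeeded; failed parses carry no
    // usable text, so they are dropped before indexing.
    static List<ParsedDoc> indexable(List<ParsedDoc> docs) {
        List<ParsedDoc> kept = new ArrayList<>();
        for (ParsedDoc doc : docs) {
            if (doc.parseSucceeded) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ParsedDoc> docs = Arrays.asList(
                new ParsedDoc("http://example.com/ok", true),
                new ParsedDoc("http://example.com/broken", false));
        for (ParsedDoc doc : indexable(docs)) {
            System.out.println(doc.url); // http://example.com/ok
        }
    }
}
```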




[jira] Commented: (NUTCH-465) I download nutch 0.9 used tar zxvf nutch-0.9.tar.gz at last A lone zero block

2007-06-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507167
 ] 

Doğacan Güney commented on NUTCH-465:
-

Which mirror did you download it from?

 I download nutch 0.9 used tar zxvf nutch-0.9.tar.gz   at last  A lone zero 
 block
 

 Key: NUTCH-465
 URL: https://issues.apache.org/jira/browse/NUTCH-465
 Project: Nutch
  Issue Type: Test
 Environment: win 2003 jdk 1.6 
Reporter: qiuwenbin

 I download nutch 0.9 used tar zxvf nutch-0.9.tar.gz   at last  A lone zero 
 block




[jira] Commented: (NUTCH-504) NUTCH-443 broke parsing during fetching

2007-06-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507168
 ] 

Andrzej Bialecki  commented on NUTCH-504:
-

+1 - we should skip documents that failed to parse properly; in such cases we 
have no usable text anyway.




[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507169
 ] 

Doğacan Güney commented on NUTCH-503:
-

Also, how many machines are there on your cluster and which version of nutch 
are you using?




[jira] Issue Comment Edited: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507169
 ] 

Doğacan Güney edited comment on NUTCH-503 at 6/22/07 1:58 AM:
--

Also, how many machines are there on your cluster, how many partitions does 
the generator try to create, and which version of nutch are you using?


 was:
Also, how many machines are there on your cluster and which version of nutch 
are you using?




[jira] Updated: (NUTCH-504) NUTCH-443 broke parsing during fetching

2007-06-22 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-504:


Attachment: NUTCH-504_v2.patch

New version.

* Includes older patch.
* Indexer filters unsuccessful parses.
* Updated the TestFetcher unit test; TestFetcher now fails without this patch.
* Also added an http.robots.agents property to src/test/crawl-tests.xml. Without 
this, TestFetcher logs a FATAL RobotRuleParser error (which doesn't cause 
TestFetcher to fail but is still annoying).




[jira] Commented: (NUTCH-479) Support for OR queries

2007-06-22 Thread Rob Young (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507221
 ] 

Rob Young commented on NUTCH-479:
-

How would this work in the following case?

search phrase category:cat1 OR category:cat2

would it end up as

(search phrase AND category:cat1) OR category:cat2

or as

search phrase AND (category:cat1 OR category:cat2)

 Support for OR queries
 --

 Key: NUTCH-479
 URL: https://issues.apache.org/jira/browse/NUTCH-479
 Project: Nutch
  Issue Type: Improvement
  Components: searcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.0.0

 Attachments: or.patch


 There have been many requests from users to extend Nutch query syntax to add 
 support for OR queries, in addition to the implicit AND and NOT queries 
 supported now.




[jira] Commented: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-06-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507366
 ] 

Doğacan Güney commented on NUTCH-468:
-

Latest patch still applies to current trunk. If no one has objections I am 
going to commit this in a few days.

 Scoring filter should distribute score to all outlinks at once
 --

 Key: NUTCH-468
 URL: https://issues.apache.org/jira/browse/NUTCH-468
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0

 Attachments: scoring-v2.patch, scoring.patch


 Currently ScoringFilter.distributeScoreToOutlink, as its name implies, takes 
 only a single outlink and works on that. I would suggest that we change it to 
 distributeScoreToOutlink_s_ so that it would take all the outlinks of a page 
 at once. This has several advantages:
 1) A ScoringFilter plugin returns a single adjust datum to set its score 
 instead of returning several.
 2) A ScoringFilter plugin can change the score of the original page (via 
 adjust datum) even if there are no outlinks. This is useful if you have a 
 ScoringFilter plugin that, say, scores pages based on content instead of 
 outlinks.
 3) Since the ScoringFilter plugin receives all outlinks at once, it can make 
 better decisions on how to distribute the score. For example, right now it is 
 not possible to create a plugin that always distributes exactly a page's 
 'cash' to its outlinks (that is, if a page has score 5, it will always 
 distribute exactly 5 points to its outlinks no matter what the 
 internal/external score factors are) if those factors are not 1.
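
Point 3 can be illustrated with a small sketch. The signature below is a hypothetical, simplified shape for a distributeScoreToOutlinks call (it is not the ScoringFilter interface from the attached patches): because the method sees every outlink at once, it can split the page's score so that the shares add up to exactly that score.

```java
import java.util.Arrays;
import java.util.List;

final class OutlinkScoreSketch {
    // Hypothetical, simplified signature: returns one share per outlink,
    // in the same order as the input list. An equal split is used here
    // purely as an example policy.
    static double[] distributeScoreToOutlinks(double pageScore, List<String> outlinks) {
        double[] shares = new double[outlinks.size()];
        if (shares.length == 0) {
            return shares; // nothing to distribute to
        }
        Arrays.fill(shares, pageScore / shares.length);
        return shares;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.com/a", "http://example.com/b",
                "http://example.com/c", "http://example.com/d");
        double[] shares = distributeScoreToOutlinks(5.0, outlinks);
        System.out.println(Arrays.toString(shares)); // [1.25, 1.25, 1.25, 1.25]
    }
}
```

With a per-outlink method the plugin never learns how many outlinks the page has, so it cannot compute a share like this.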




[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread Emmanuel Joke (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507469
 ] 

Emmanuel Joke commented on NUTCH-503:
-

Sorry, my mistake.

My compiled jar was not correctly included in my classpath. I confirm that it 
does work with your patch. 

Thanks for your help.




[jira] Commented: (NUTCH-479) Support for OR queries

2007-06-22 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507473
 ] 

Doug Cutting commented on NUTCH-479:


Neither.  It would end up as the Lucene query:

+search phrase +category:cat1 category:cat2

where category:cat2 is a non-required clause that just impacts ranking, not the 
set of documents returned.

As for nested queries, parsing is only half the problem.  The query filter 
plugins would need to be extended to handle such things, as they presently 
expect flat queries.

The query foo bar currently expands to a Lucene query that looks something 
like:

+(anchor:foo title:foo content:foo)
+(anchor:bar title:bar content:bar)
anchor:foo bar~10
title:foo bar~1000
content:foo bar~1000

(The latter three boost scores when terms are nearer.  Anchor proximity is 
limited, to keep from matching anchors from other documents.)

So, how should (foo AND (bar OR baz)) expand?  Probably something like:

+(anchor:foo title:foo content:foo)
+((anchor:bar title:bar content:bar)
(anchor:baz title:baz content:baz))
... proximity boosting clauses?...

And (foo OR (bar AND baz)) might expand to:

(anchor:foo title:foo content:foo)
(+(anchor:bar title:bar content:bar)
 +(anchor:baz title:baz content:baz))
... proximity boosting clauses?...

This expansion is done by the query-basic plugin.
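
For the flat case only, the expansion described above can be sketched as a small string builder. This is an illustration, not the query-basic plugin's code: FlatQueryExpansion is a hypothetical name, the field names and slop values are copied from the example in this comment, and wrapping the phrase in double quotes is an assumption based on Lucene's sloppy-phrase syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FlatQueryExpansion {
    // Fields and slop values as they appear in the example above.
    private static final String[] FIELDS = {"anchor", "title", "content"};
    private static final int[] SLOP = {10, 1000, 1000};

    // Expands a flat list of required terms into Lucene-style clauses:
    // one required clause per term, then one optional sloppy-phrase
    // clause per field to boost documents where the terms are close.
    static List<String> expand(List<String> terms) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            StringBuilder clause = new StringBuilder("+(");
            for (int i = 0; i < FIELDS.length; i++) {
                if (i > 0) {
                    clause.append(' ');
                }
                clause.append(FIELDS[i]).append(':').append(term);
            }
            clauses.add(clause.append(')').toString());
        }
        if (terms.size() > 1) {
            String phrase = String.join(" ", terms);
            for (int i = 0; i < FIELDS.length; i++) {
                // Quoting the phrase is an assumption (see lead-in).
                clauses.add(FIELDS[i] + ":\"" + phrase + "\"~" + SLOP[i]);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        for (String clause : expand(Arrays.asList("foo", "bar"))) {
            System.out.println(clause);
        }
    }
}
```

Nested queries such as (foo AND (bar OR baz)) are deliberately left out; as noted above, they would need a tree-shaped query rather than a flat term list.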





[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

2007-06-22 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507535
 ] 

Doğacan Güney commented on NUTCH-503:
-

Nice to hear, Emmanuel.

I believe this is ready for committing, but, Vishal, can you add a test case 
for this? (Though, I am not sure how we can add a test case since this bug only 
occurs in distributed setups).
