Re: Creating a new scoring filter.

2007-02-27 Thread Nicolás Lichtmaier



Yeah, but there I don't have the parse data for those new pages. What I
would like to do is override passScoreAfterParsing() and not pass
anything: just analyze the parsed data and decide a score. The problem
is that that function doesn't get passed the CrawlDatum... it seems I'll
need to modify Nutch itself =(

Can you be a bit more specific about your problem?


I'm indexing a fixed set of URLs that I think are a specific type of 
document. I don't care about links (I'm using -noAdditions to prevent 
adding links to crawldb, I've backported that to 0.8.x and it's waiting 
for somebody to commit it =) 
https://issues.apache.org/jira/browse/NUTCH-438 ).


I just want to replace the scoring algorithm with one which tests whether that 
URL really is that specific type of document. I want to use the parse 
data of a document to calculate its relevance.



Anyway, without the details, here is my guess on how you can do it:
1) In passScoreAfterParsing(), analyze the content and parse text and
put the relevant score information in parse data's metadata.
2) In distributeScoreToOutlink() ignore the outlinks (just give them
initialScore()),
but check your parse data and return an adjust datum with the status
STATUS_LINKED and score extracted from parse data. This adjust datum
will update the score of the original datum in updatedb.

Does this work for you?
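
As a rough illustration of the two steps suggested above, here is a minimal Java sketch. It does not use the real Nutch classes: SketchParseData, SketchDatum, the SCORE_KEY metadata key and the "invoice" keyword test are all assumptions made up for the example, and the method signatures are simplified. The second step still depends on the page having at least one outlink.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in types so the sketch compiles on its own; they are NOT the Nutch classes.
class SketchParseData {
    final Map<String, String> meta = new HashMap<>();
    final String text;
    SketchParseData(String text) { this.text = text; }
}

class SketchDatum {
    static final int STATUS_LINKED = 1; // placeholder value, not Nutch's constant
    int status;
    float score;
}

class ContentScoreSketch {
    static final String SCORE_KEY = "sketch.content.score"; // hypothetical metadata key

    // Step 1: analyse the parse text once and remember the result in the metadata.
    static void passScoreAfterParsing(SketchParseData parseData) {
        // Placeholder relevance test; a real filter would inspect the document properly.
        float score = parseData.text.toLowerCase().contains("invoice") ? 1.0f : 0.0f;
        parseData.meta.put(SCORE_KEY, Float.toString(score));
    }

    // Step 2: ignore the outlink and fill the adjust datum from the stored score.
    static SketchDatum distributeScoreToOutlink(SketchParseData parseData, SketchDatum adjust) {
        String stored = parseData.meta.get(SCORE_KEY);
        if (stored == null) {
            return null; // nothing recorded, so nothing to adjust
        }
        adjust.status = SketchDatum.STATUS_LINKED;
        adjust.score = Float.parseFloat(stored);
        return adjust;
    }
}
```

Storing the score as a string in the metadata means the analysis runs once per page, however many times the second method is called.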


That doesn't seem like a good way to do it. What if there are no outlinks? This 
method won't be called at all. And anyway, it would be called once per 
outlink, which would multiply the work.


Thanks!



Re: Creating a new scoring filter.

2007-02-27 Thread Doğacan Güney

Hi,

On 2/27/07, Nicolás Lichtmaier [EMAIL PROTECTED] wrote:

[snip]



That doesn't seem like a good way to do it. What if there are no outlinks? This
method won't be called at all. And anyway, it would be called once per
outlink, which would multiply the work.


Multiplication is easy to solve but you are right that it won't work
if there are no outlinks.

Maybe the scoring filter API should change? A distributeScoreToOutlinks
method, which would be called even if there are no outlinks, may be more
useful than the current one:

CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String> toUrlList,
    List<CrawlDatum> datumList, ParseData parseData, CrawlDatum adjust)

This method gives more control to the plugin: knowing all the
outlinks, the plugin can make more informed decisions. Right now, for example,
there is no way a scoring filter can be sure that it has distributed
all its cash (e.g. if db.score.internal.link is 0.5 and
db.score.external.link is 1.0, the filter will almost always distribute
less than its cash).
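
To make the cash argument concrete, here is a small Java sketch of what a filter could do once it sees all outlinks at once: weight each outlink by the internal or external factor, then normalise so the shares add up to the full cash. The class and method names are hypothetical and this is not Nutch code; only the 0.5 and 1.0 factors come from the example above.

```java
import java.util.List;

class CashDistributionSketch {
    // Share of `cash` for each outlink; internal.get(i) says whether outlink i stays on the same host.
    static float[] distribute(float cash, List<Boolean> internal,
                              float internalFactor, float externalFactor) {
        float totalWeight = 0f;
        for (boolean isInternal : internal) {
            totalWeight += isInternal ? internalFactor : externalFactor;
        }
        float[] shares = new float[internal.size()];
        if (totalWeight == 0f) {
            return shares; // no outlinks (or all weights zero): nothing to hand out
        }
        for (int i = 0; i < shares.length; i++) {
            float weight = internal.get(i) ? internalFactor : externalFactor;
            // Normalising by the total weight makes the shares sum to the full cash.
            shares[i] = cash * weight / totalWeight;
        }
        return shares;
    }
}
```

With two internal outlinks, one external outlink and a cash of 1.0, the weights are 0.5, 0.5 and 1.0, so the shares come out as 0.25, 0.25 and 0.5. A per-outlink call cannot do this normalisation because it never sees the total weight.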

This will also work for your case, since you will just ignore the
outlinks and return the adjust datum based on information in parse
metadata.

What do you (and others) think?



Thanks!





--
Doğacan Güney


[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476212
 ] 

Doğacan Güney commented on NUTCH-445:
-

Has anyone looked at this? Google seems to do site: searches like this too. A 
site:apache.org  search doesn't only return results under apache.org but also 
returns results under all sub-hosts (or whatever they are called) of 
apache.org. 

Also, IMO, this shouldn't be a separate domain field in the index but rather 
should replace the site field. So (again IMO) this should not be a separate 
plugin but should be a patch to index-basic and query-site. 

(PS: Enis and I know each other so you may want to take this with a grain of 
salt.)

 Domain İndexing / Query Filter
 --

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
 Attachments: index_query_domain_v1.0.patch, 
 index_query_domain_v1.1.patch, TranslatingRawFieldQueryFilter_v1.0.patch


 Hostnames contain information about the domain of the host, and all of the 
 subdomains. Indexing and searching the domains is important for intuitive 
 behavior. 
 From the DomainIndexingFilter javadoc: 
 Adds the domain (hostname) and all super domains to the index. 
 For http://lucene.apache.org/nutch/ the following will be added to the index:
   * lucene.apache.org
   * apache
   * org
 All hostnames are domain names, but not all domain names are hostnames. In 
 the above example, hostname lucene is a subdomain of apache.org, which is 
 itself a subdomain of org.

 Currently the basic indexing filter indexes the hostname in the site field, and 
 the query-site plugin allows searching the site field. However, site:apache.org 
 will not return http://lucene.apache.org. 
 By indexing the domain, we are able to search domains. Unlike searching the 
 site field (indexed by BasicIndexingFilter), searching the domain field lets 
 the query apache.org retrieve lucene.apache.org. 
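
 The "hostname plus all super domains" idea can be sketched in a few lines of Java. This is not the code from the attached patches: the class and method names are hypothetical, and it assumes a super domain is simply each successive dot-separated suffix of the hostname, so the middle entry comes out as apache.org rather than the "apache" listed in the javadoc excerpt.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class DomainSketch {
    // Returns the host followed by each parent domain, e.g. a.b.c -> [a.b.c, b.c, c].
    static List<String> hostAndSuperDomains(String url) {
        String host = URI.create(url).getHost();
        List<String> domains = new ArrayList<>();
        if (host == null) {
            return domains; // URL without a host part
        }
        String current = host.toLowerCase();
        while (!current.isEmpty()) {
            domains.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break; // reached the top-level label
            }
            current = current.substring(dot + 1); // strip the leftmost label
        }
        return domains;
    }
}
```

 Indexing every element of that list as a value of one field is what would let a query for apache.org match a page on lucene.apache.org.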
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Creating a new scoring filter.

2007-02-27 Thread Nicolás Lichtmaier



That doesn't seem like a good way to do it. What if there are no outlinks? This
method won't be called at all. And anyway, it would be called once per
outlink, which would multiply the work.


Multiplication is easy to solve but you are right that it won't work
if there are no outlinks.

Maybe the scoring filter API should change? A distributeScoreToOutlinks
method, which would be called even if there are no outlinks, may be more
useful than the current one:

CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String> toUrlList,
    List<CrawlDatum> datumList, ParseData parseData, CrawlDatum adjust)

This method gives more control to the plugin: knowing all the
outlinks, the plugin can make more informed decisions. Right now, for example,
there is no way a scoring filter can be sure that it has distributed
all its cash (e.g. if db.score.internal.link is 0.5 and
db.score.external.link is 1.0, the filter will almost always distribute
less than its cash).

This will also work for your case, since you will just ignore the
outlinks and return the adjust datum based on information in parse
metadata.

What do you (and others) think?


I think that good API design here means not assuming so many things 
about the plugin behaviour. You are right about this 
distributeScoreToOutlinks(), but IMO it should be called something 
like assignScores(). Then you could add an abstract class 
DistributingScorePlugin (implementing the interface) which overrides 
assignScores() and calls an abstract protected method called 
distributeScoreToOutlink(). So the code for traversing the outlinks 
would be in DistributingScorePlugin.


I would need another class, called ContentBasedScorePlugin. That class 
could call an abstract protected method called calculateScore() which 
would receive the parsed data and return the score.
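
A minimal Java sketch of that layering, to show how the pieces would fit together. Everything here is hypothetical: PageData stands in for the parse output, the signatures are simplified to plain strings and floats, and KeywordScorePlugin with its "recipe" test is only a placeholder for a real relevance check.

```java
import java.util.List;

// Stand-in for the parse output; not the Nutch ParseData class.
class PageData {
    final String text;
    final List<String> outlinks;
    PageData(String text, List<String> outlinks) { this.text = text; this.outlinks = outlinks; }
}

// The single entry point the framework would call once per page, outlinks or not.
interface ScorePlugin {
    float assignScores(String fromUrl, PageData page, float currentScore);
}

// Owns the outlink traversal; subclasses only decide what one outlink receives.
abstract class DistributingScorePlugin implements ScorePlugin {
    @Override
    public float assignScores(String fromUrl, PageData page, float currentScore) {
        for (String toUrl : page.outlinks) {
            distributeScoreToOutlink(fromUrl, toUrl, page, currentScore);
        }
        return currentScore; // the page's own score is left untouched
    }

    protected abstract void distributeScoreToOutlink(String fromUrl, String toUrl,
                                                     PageData page, float currentScore);
}

// Ignores outlinks; the page's score comes from its parsed content alone.
abstract class ContentBasedScorePlugin implements ScorePlugin {
    @Override
    public float assignScores(String fromUrl, PageData page, float currentScore) {
        return calculateScore(page);
    }

    protected abstract float calculateScore(PageData page);
}

// Example subclass with a placeholder relevance test.
class KeywordScorePlugin extends ContentBasedScorePlugin {
    @Override
    protected float calculateScore(PageData page) {
        return page.text.contains("recipe") ? 1.0f : 0.0f;
    }
}
```

The interface makes no assumption about outlinks, so a content-based plugin still gets called for a page that has none.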


What do you think?



[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-27 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476243
 ] 

Doug Cutting commented on NUTCH-445:


Note that the site field is also used for search-time deduplication, and that 
assumes that each document has only one value for the field (returned from a 
Lucene FieldCache with raw hits, for performance).  So this feature should 
perhaps use a separate field.

That said, I think this should replace the current site-search feature, as it 
is an improvement and the industry-standard semantics.  So perhaps a site: 
query should search the domain: field?




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476297
 ] 

Andrzej Bialecki  commented on NUTCH-443:
-

Overall the idea of this improvement looks very useful, but I'm -1 on this 
patch as it looks right now (I only now had a chance to review it in detail). 
I'd like to see the following issues addressed in this patch before it's 
committed:

* in my opinion it's easier to add missing CrawlDatum's (with correctly set 
fetch time) for the new urls to the output rather than work-around this by 
passing around the fetch time in metadata, and then again compensating in 
Indexer and CrawlDbReducer for the lack of these fetchDatum-s ..

* in Fetcher / Fetcher2 you don't pass the signature in case when there is no 
valid Parse output, but in the current versions of Fetchers the signature is 
still calculated and passed in datum.setSignature() (which ends up in 
crawl_fetch).

* using a generic Map<String, Parse> is IMHO inappropriate, as I indicated 
earlier, especially since this Map requires special post-processing in 
ParseUtil.processParseMap - and what would happen if I didn't use ParseUtil? I 
think this calls for a special-purpose class (ParseResult?), which would 
encapsulate this behavior without exposing it to its users (or even worse - 
allowing users to bypass it). This class would also help us to avoid somewhat 
ugly convenience methods in ParseStatus and ParseImpl - these details would 
be hidden in one of the constructors of ParseResult.

* I'm also not sure why we use Map<String, Parse> and not Map<Text, Parse>, 
since in all further processing we need to create Text objects ...

* the new section in HtmlParseFilters breaks the loop on encountering the first 
error, and leaves the parse results incompletely filtered. It should simply 
continue - the result is an aggregation of more or less independent documents 
that are parsed on their own.

* the comment about redirects in Parser.java is misplaced - I think this 
contract should be both defined and enforced in the Fetcher.
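
To illustrate the special-purpose class suggested in the list above, here is a minimal Java sketch of a container that keeps the map private and does its own post-processing. It is only a guess at the shape such a class could take: ParseSketch and ParseResultSketch are hypothetical names, String keys stand in for Text, and dropping failed parses in filter() is an assumption chosen to show the "continue past errors" behaviour.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in for a single parse; not the Nutch Parse interface.
class ParseSketch {
    final String text;
    final boolean success;
    ParseSketch(String text, boolean success) { this.text = text; this.success = success; }
}

// Hypothetical container: one entry per sub-document, keyed by URL.
class ParseResultSketch implements Iterable<Map.Entry<String, ParseSketch>> {
    // The map is private, so callers cannot bypass the post-processing below.
    private final Map<String, ParseSketch> parses = new LinkedHashMap<>();

    void put(String url, ParseSketch parse) { parses.put(url, parse); }
    ParseSketch get(String url) { return parses.get(url); }
    int size() { return parses.size(); }

    // Drops failed parses but keeps going, so one bad entry does not hide the rest.
    void filter() {
        parses.values().removeIf(p -> !p.success);
    }

    @Override
    public Iterator<Map.Entry<String, ParseSketch>> iterator() {
        return parses.entrySet().iterator();
    }
}
```

Callers would iterate over the result instead of touching a raw map, which is the encapsulation the comment asks for.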


And finally, I think this is a significant change in the way content 
parsers work with the rest of the framework, so we should hold this patch 
until after the 0.9 release - and we should push 0.9 out of the door really soon ...

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
 NUTCH-443.022507.patch.txt, parse-map-core-draft-v1.patch, 
 parse-map-core-untested.patch, parsers.diff


 allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple parse objects, that will all be indexed separately. 
 Advantage: no need to fetch all feed-items separately.
 see the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html




[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-27 Thread nutch.newbie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476361
 ] 

nutch.newbie commented on NUTCH-443:


Hi:

We were really counting on this patch making it to trunk, as our site 
launch depends on it. This patch lets us complete NUTCH-444. However, I don't 
have enough knowledge of the inner workings of the patch to comment. I can 
only say that I tried it on a large set of seeds and it works without error. 

Regarding the 0.9 release: it's been months since it was discussed on the list, 
and it is not possible to predict when the release will take place. What I 
worry about is that, like many other patches, this one will also die out, which 
is sad. I tend not to use code that is not in trunk, so it's a big loss 
for me because my site needs to be launched. Anyway, that's my headache :-(

Regards






Re: Creating a new scoring filter.

2007-02-27 Thread Nicolás Lichtmaier



I didn't understand the point of creating abstract base classes for
plugins. I am not strictly opposing it or anything, I just don't see
why it would make things simpler/more flexible. AFAICS, there is not
much an abstract base class can do but to pass the arguments of
assignScores to calculateScore/distributeScoreToOutlinks. I mean, here
is how I envision a ContentBasedScoringFilter class(or a
DistributingScoringFilter):

abstract class ContentBasedScoringFilter implements ScoringFilter {
  assignScores(args) { return calculateScore(args);  }
  protected abstract calculateScore(args);
}

Or do you have something else in mind?


Yes, something like that. But I also thought that if you don't want to 
repeat the logic of traversing through links (with all the logic which 
is now in ParseOutputFormat), that logic could be in an abstract class 
which would just traverse them and call an abstract function for each one.