Re: Creating a new scoring filter.
Yeah, but there I don't have the parse data for those new pages. What I would like to do is override passScoreAfterParsing() and not pass anything: just analyze the parsed data and decide a score. The problem is that that function doesn't get passed the CrawlDatum... it seems I'll need to modify Nutch itself =(

Can you be a bit more specific about your problem?

I'm indexing a fixed set of URLs that I think are a specific type of document. I don't care about links (I'm using -noAdditions to prevent adding links to crawldb; I've backported that to 0.8.x and it's waiting for somebody to commit it =) https://issues.apache.org/jira/browse/NUTCH-438 ). I just want to replace the scoring algorithm with one which tests whether that URL really is that specific type of document. I want to use the parse data of a document to calculate its relevance.

Anyway, without the details, here is my guess on how you can do it:

1) In passScoreAfterParsing(), analyze the content and parse text and put the relevant score information in parse data's metadata.
2) In distributeScoreToOutlink(), ignore the outlinks (just give them initialScore()), but check your parse data and return an adjust datum with the status STATUS_LINKED and the score extracted from parse data. This adjust datum will update the score of the original datum in updatedb.

Does this work for you?

It doesn't seem a good way to do it. What if there are no outlinks? This method won't be called at all. And anyway, it would be called once for each outlink, which would multiply the work.

Thanks!
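To make the two-step suggestion and the objection to it concrete, here is a minimal sketch. It does not use the real Nutch classes: the metadata is a plain Map, the score is a bare Float instead of an adjust CrawlDatum, and the key name, the toy relevance test and the process() driver loop are all hypothetical stand-ins for illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a scoring filter; NOT the real Nutch ScoringFilter API.
final class MetadataScoreSketch {
    // Hypothetical metadata key used to carry the score between the two steps.
    static final String SCORE_KEY = "x-content-score";

    private MetadataScoreSketch() {}

    // Step 1: analyze the parse text and store the score in the parse metadata.
    static void passScoreAfterParsing(String parseText, Map<String, String> parseMeta) {
        float score = parseText.contains("invoice") ? 1.0f : 0.0f; // toy relevance test
        parseMeta.put(SCORE_KEY, Float.toString(score));
    }

    // Step 2: called once per outlink; ignores the outlink and returns the
    // adjusted score for the source page, read back from the metadata.
    static Float distributeScoreToOutlink(String toUrl, Map<String, String> parseMeta) {
        String stored = parseMeta.get(SCORE_KEY);
        return stored == null ? null : Float.valueOf(stored);
    }

    // Assumed shape of the framework's per-outlink loop (a stand-in, not Nutch code).
    static Float process(String parseText, List<String> outlinks) {
        Map<String, String> meta = new HashMap<>();
        passScoreAfterParsing(parseText, meta);
        Float adjust = null;
        for (String toUrl : outlinks) {
            adjust = distributeScoreToOutlink(toUrl, meta); // repeated for every outlink
        }
        return adjust; // stays null when the page has no outlinks
    }

    public static void main(String[] args) {
        System.out.println(process("invoice no. 42",
                Arrays.asList("http://example.com/a", "http://example.com/b")));
        System.out.println(process("invoice no. 42", Collections.<String>emptyList()));
    }
}
```

In this sketch a page with outlinks gets its score back, while a page without outlinks never produces an adjustment, which is the case the objection is about.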
Re: Creating a new scoring filter.
Hi,

On 2/27/07, Nicolás Lichtmaier wrote:
[snip]

It doesn't seem a good way to do it. What if there are no outlinks? This method won't be called at all. And anyway, it would be called once for each outlink, which would multiply the work.

Multiplication is easy to solve, but you are right that it won't work if there are no outlinks. Maybe the scoring filter API should change? A distributeScoreToOutlinks method (which would be called even if there are no outlinks) may be more useful than the current one:

    CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String> toUrlList, List<CrawlDatum> datumList, ParseData parseData, CrawlDatum adjust)

This method gives more control to the plugin: knowing all the outlinks, the plugin can make more informed decisions. For example, right now there is no way a scoring filter can be sure that it has distributed all its cash (e.g. if db.score.internal.link is 0.5 and db.score.external.link is 1.0, the filter will almost always distribute less than its cash). This will also work for your case, since you will just ignore the outlinks and return the adjust datum based on information in parse metadata.

What do you (and others) think?

Thanks!

--
Doğacan Güney
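The "distributes less than its cash" remark can be checked with a little arithmetic. The sketch below assumes a filter that splits a page's cash evenly over its outlinks and then scales each share by the internal or external link factor. That splitting rule is an assumption made for illustration, and the class and method names are hypothetical; only the two factor values come from the message.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative arithmetic only; not taken from any Nutch scoring plugin.
final class CashDistributionDemo {
    static final float INTERNAL_FACTOR = 0.5f; // db.score.internal.link in the example
    static final float EXTERNAL_FACTOR = 1.0f; // db.score.external.link in the example

    private CashDistributionDemo() {}

    // Total handed out when the cash is split evenly and each share is scaled
    // by the factor for its link type (assumed rule, see lead-in).
    static float distributed(float cash, List<Boolean> outlinkIsInternal) {
        if (outlinkIsInternal.isEmpty()) {
            return 0f;
        }
        float share = cash / outlinkIsInternal.size();
        float total = 0f;
        for (boolean internal : outlinkIsInternal) {
            total += share * (internal ? INTERNAL_FACTOR : EXTERNAL_FACTOR);
        }
        return total;
    }

    public static void main(String[] args) {
        // Three internal outlinks and one external outlink, cash of 1.0.
        System.out.println(distributed(1.0f, Arrays.asList(true, true, true, false)));
    }
}
```

With three internal links and one external link, each share is 0.25, so the page hands out 3 × 0.125 + 0.25 = 0.625 of its 1.0 cash. A method that sees all outlinks at once could renormalize the shares; a per-outlink method cannot.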
[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476212 ]

Doğacan Güney commented on NUTCH-445:

Has anyone looked at this? Google seems to do site: searches like this too. A site:apache.org search doesn't only return results under apache.org but also returns results under all sub-hosts (or whatever they are called) of apache.org.

Also, IMO, this shouldn't be a separate domain field in the index but rather should replace the site field. So (again IMO) this should not be a separate plugin but should be a patch to index-basic and query-site.

(PS: Enis and I know each other so you may want to take this with a grain of salt.)

Domain İndexing / Query Filter

Key: NUTCH-445
URL: https://issues.apache.org/jira/browse/NUTCH-445
Project: Nutch
Issue Type: New Feature
Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar
Attachments: index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, TranslatingRawFieldQueryFilter_v1.0.patch

Hostnames contain information about the domain of the host, and all of the subdomains. Indexing and searching the domains are important for intuitive behavior.

From DomainIndexingFilter javadoc:

Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index:

* lucene.apache.org
* apache
* org

All hostnames are domain names, but not all the domain names are hostnames. In the above example hostname lucene is a subdomain of apache.org, which is itself a subdomain of org.

Currently the Basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However, site:apache.org will not return http://lucene.apache.org. By indexing the domain, we are able to search domains. Unlike a search on the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
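The super-domain expansion described in the NUTCH-445 javadoc can be sketched in a few lines. This is not the code from the attached patches. The class name is hypothetical, and the sketch assumes that the middle entry for the example URL is meant to be apache.org, since the description says a query for apache.org should retrieve lucene.apache.org.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the NUTCH-445 patch.
final class SuperDomains {
    private SuperDomains() {}

    // Returns the host of the URL followed by each of its super domains,
    // e.g. host, then the host without its first label, and so on.
    static List<String> of(String url) {
        List<String> domains = new ArrayList<>();
        String host = URI.create(url).getHost();
        if (host == null) {
            return domains; // no host component, nothing to index
        }
        host = host.toLowerCase(Locale.ROOT);
        domains.add(host);
        int dot = host.indexOf('.');
        while (dot >= 0) {
            host = host.substring(dot + 1); // strip the leftmost label
            if (!host.isEmpty()) {
                domains.add(host);
            }
            dot = host.indexOf('.');
        }
        return domains;
    }

    public static void main(String[] args) {
        System.out.println(of("http://lucene.apache.org/nutch/"));
    }
}
```

Each returned value would be added to the domain field, so an exact-match query on apache.org or org can find the document.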
Re: Creating a new scoring filter.
It doesn't seem a good way to do it. What if there are no outlinks? This method won't be called at all. And anyway, it would be called once for each outlink, which would multiply the work.

Multiplication is easy to solve, but you are right that it won't work if there are no outlinks. Maybe the scoring filter API should change? A distributeScoreToOutlinks method (which would be called even if there are no outlinks) may be more useful than the current one:

    CrawlDatum distributeScoreToOutlinks(Text fromUrl, List<String> toUrlList, List<CrawlDatum> datumList, ParseData parseData, CrawlDatum adjust)

This method gives more control to the plugin: knowing all the outlinks, the plugin can make more informed decisions. For example, right now there is no way a scoring filter can be sure that it has distributed all its cash (e.g. if db.score.internal.link is 0.5 and db.score.external.link is 1.0, the filter will almost always distribute less than its cash). This will also work for your case, since you will just ignore the outlinks and return the adjust datum based on information in parse metadata.

What do you (and others) think?

I think that good API design here means not assuming so many things about the plugin behaviour. You are right about this distributeScoreToOutlinks(), but IMO it should be called something like assignScores(). Then you could add an abstract class DistributingScorePlugin (implementing the interface) which overrides assignScores() and calls an abstract protected method called distributeScoreToOutlink(). So the code for traversing the outlinks would be in DistributingScorePlugin.

I would need another class, called ContentBasedScorePlugin. That class could call an abstract protected method called calculateScore() which would receive the parsed data and return the score.

What do you think?
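A minimal sketch of the hierarchy proposed above might look like this. It is only an illustration under several assumptions: ParsedPage stands in for Nutch's ParseData, scores are plain floats rather than CrawlDatum objects, and the two concrete plugins (KeywordScorePlugin, EvenSplitPlugin) are hypothetical examples. Only the names assignScores, DistributingScorePlugin, ContentBasedScorePlugin, distributeScoreToOutlink and calculateScore come from the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for ParseData: just the parsed text and the outlink URLs.
final class ParsedPage {
    final String text;
    final List<String> outlinks;
    ParsedPage(String text, List<String> outlinks) {
        this.text = text;
        this.outlinks = outlinks;
    }
}

// Hypothetical interface: called once per page, even when there are no outlinks.
interface ScorePlugin {
    // Returns the adjusted score of the source page; may append one score per outlink.
    float assignScores(String fromUrl, ParsedPage page, float currentScore, List<Float> outlinkScores);
}

// Holds the outlink traversal so subclasses only score a single link.
abstract class DistributingScorePlugin implements ScorePlugin {
    @Override
    public final float assignScores(String fromUrl, ParsedPage page, float currentScore, List<Float> outlinkScores) {
        for (String toUrl : page.outlinks) {
            outlinkScores.add(distributeScoreToOutlink(fromUrl, toUrl, page, currentScore));
        }
        return currentScore; // source page score left unchanged in this sketch
    }
    protected abstract float distributeScoreToOutlink(String fromUrl, String toUrl, ParsedPage page, float currentScore);
}

// Ignores the outlinks and scores the page from its parsed content.
abstract class ContentBasedScorePlugin implements ScorePlugin {
    @Override
    public final float assignScores(String fromUrl, ParsedPage page, float currentScore, List<Float> outlinkScores) {
        return calculateScore(page); // outlinkScores is deliberately left untouched
    }
    protected abstract float calculateScore(ParsedPage page);
}

// Toy content-based plugin: 1.0 if the text contains the keyword, else 0.0.
final class KeywordScorePlugin extends ContentBasedScorePlugin {
    private final String keyword;
    KeywordScorePlugin(String keyword) { this.keyword = keyword; }
    @Override
    protected float calculateScore(ParsedPage page) {
        return page.text.contains(keyword) ? 1.0f : 0.0f;
    }
}

// Toy distributing plugin: splits the current score evenly over the outlinks.
final class EvenSplitPlugin extends DistributingScorePlugin {
    @Override
    protected float distributeScoreToOutlink(String fromUrl, String toUrl, ParsedPage page, float currentScore) {
        return currentScore / page.outlinks.size();
    }
}

final class ScorePluginDemo {
    public static void main(String[] args) {
        ParsedPage page = new ParsedPage("an invoice for services", new ArrayList<String>());
        System.out.println(new KeywordScorePlugin("invoice")
                .assignScores("http://example.com/a", page, 0.2f, new ArrayList<Float>()));
    }
}
```

Because assignScores() is invoked once per page in this design, the content-based plugin still gets a chance to score a page that has no outlinks.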
[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476243 ]

Doug Cutting commented on NUTCH-445:

Note that the site field is also used for search-time deduplication, and that assumes that each document has only one value for the field (returned from a Lucene FieldCache with raw hits, for performance). So this feature should perhaps use a separate field. That said, I think this should replace the current site-search feature, as it is an improvement and the industry-standard semantics. So perhaps a site: query should search the domain: field?
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476297 ]

Andrzej Bialecki commented on NUTCH-443:

Overall the idea of this improvement looks very useful, but I'm -1 on this patch as it looks right now (only now I had a chance to review it in detail). I'd like to see the following issues addressed in this patch before it's committed:

* in my opinion it's easier to add missing CrawlDatum's (with correctly set fetch time) for the new urls to the output rather than work around this by passing around the fetch time in metadata, and then again compensating in Indexer and CrawlDbReducer for the lack of these fetchDatum-s.

* in Fetcher / Fetcher2 you don't pass the signature in the case when there is no valid Parse output, but in the current versions of Fetchers the signature is still calculated and passed in datum.setSignature() (which ends up in crawl_fetch).

* using a generic Map<String, Parse> is IMHO inappropriate, as I indicated earlier, especially since this Map requires special post-processing in ParseUtil.processParseMap - and what would happen if I didn't use ParseUtil? I think this calls for a special-purpose class (ParseResult?), which would encapsulate this behavior without exposing it to its users (or even worse - allowing users to bypass it). This class would also help us to avoid somewhat ugly convenience methods in ParseStatus and ParseImpl - these details would be hidden in one of the constructors of ParseResult.

* I'm also not sure why we use Map<String, Parse> and not Map<Text, Parse>, since in all further processing we need to create Text objects ...

* the new section in HtmlParseFilters breaks the loop on encountering the first error, and leaves the parse results incompletely filtered. It should simply continue - the result is an aggregation of more or less independent documents that are parsed on their own.

* the comment about redirects in Parser.java is misplaced - I think this contract should be both defined and enforced in the Fetcher.

And finally, I think this is a significant change in the way content parsers work with the rest of the framework, so we should hold this patch until after the 0.9 release - and we should push 0.9 out of the door really soon ...

allow parsers to return multiple Parse object, this will speed up the rss parser

Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Assigned To: Chris A. Mattmann
Priority: Minor
Fix For: 0.9.0
Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff

allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.

see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
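As a rough illustration of two points from the review (a special-purpose ParseResult class instead of a bare Map, and a filter loop that continues past a failing entry), here is a small sketch. Everything in it is hypothetical: ParseStub stands in for Nutch's Parse, String keys are used instead of Hadoop Text only to keep the example dependency-free, and the method names are invented for the sketch rather than taken from any patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Stand-in for a Parse object; only carries text in this sketch.
final class ParseStub {
    final String text;
    ParseStub(String text) { this.text = text; }
}

// Hypothetical container that hides the underlying map from its users.
final class ParseResultSketch {
    private final String originalUrl;
    private final Map<String, ParseStub> parses = new LinkedHashMap<>();

    ParseResultSketch(String originalUrl) { this.originalUrl = originalUrl; }

    // Convenience factory for parsers that produce exactly one document.
    static ParseResultSketch single(String url, ParseStub parse) {
        ParseResultSketch result = new ParseResultSketch(url);
        result.put(url, parse);
        return result;
    }

    void put(String url, ParseStub parse) { parses.put(url, parse); }
    ParseStub get(String url) { return parses.get(url); }
    int size() { return parses.size(); }
    String originalUrl() { return originalUrl; }

    // Applies the filter to every entry. A failing entry keeps its unfiltered
    // value and the loop continues with the remaining entries.
    // Returns the number of entries whose filter threw.
    int filterEach(UnaryOperator<ParseStub> filter) {
        int failures = 0;
        for (Map.Entry<String, ParseStub> entry : parses.entrySet()) {
            try {
                entry.setValue(filter.apply(entry.getValue()));
            } catch (RuntimeException e) {
                failures++; // do not break: the entries are independent documents
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        ParseResultSketch feed = new ParseResultSketch("http://example.com/feed");
        feed.put("http://example.com/item1", new ParseStub("one"));
        feed.put("http://example.com/item2", new ParseStub("two"));
        int failed = feed.filterEach(p -> new ParseStub(p.text.toUpperCase()));
        System.out.println(feed.size() + " entries, " + failed + " failures");
    }
}
```

Whether a failed entry should be kept unfiltered, dropped, or marked with a failure status is a design decision the review leaves open; the sketch simply keeps it.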
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476361 ]

nutch.newbie commented on NUTCH-443:

Hi: We were really counting on this patch that it will make it to trunk as our site launch depends on it. This patch let us to complete Nutch-444. However I don't have enough knowledge about the inner workings of the patch to comment. I can only say that I tried it on a large set of seeds and it works without error. Regarding 0.9 release .. its been months since it was discussed on the list ... and it is not possible to predict when 0.9 release will take place what I worry about is like many other patch this patch will also die out .. which is sad. I tend not to use code that are not in the trunk... so its a big loss for me cos my site needs to be launched...anyway thats my headache :-( Regards
Re: Creating a new scoring filter.
I didn't understand the point of creating abstract base classes for plugins. I am not strictly opposing it or anything, I just don't see why it would make things simpler/more flexible. AFAICS, there is not much an abstract base class can do but to pass the arguments of assignScores to calculateScore/distributeScoreToOutlinks. I mean, here is how I envision a ContentBasedScoringFilter class (or a DistributingScoringFilter):

    abstract class ContentBasedScoringFilter implements ScoringFilter {
      assignScores(args) { return calculateScore(args); }
      protected abstract calculateScore(args);
    }

Or do you have something else in mind?

Yes, something like that. But I also thought that if you don't want to repeat the logic of traversing through links (with all the logic which is now in ParseOutputFormat), that logic could be in an abstract class which would just traverse them and call an abstract function for each one.