Re: Injector checking for other than STATUS_INJECTED
Ahhh, now I get it :)

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> Sorry. I am still not getting this. I understand the reason but I am not seeing how it works.
>
> Ah, because apparently it doesn't ... :( You were right, the first job consists only of new records. Now that I checked the code again, InjectReducer should be set on the second job, and not on the first one ... I'll fix it right away.
Re: Injector checking for other than STATUS_INJECTED
Dennis Kubes wrote:
> Sorry. I am still not getting this. I understand the reason but I am not seeing how it works.

Ah, because apparently it doesn't ... :( You were right, the first job consists only of new records. Now that I checked the code again, InjectReducer should be set on the second job, and not on the first one ... I'll fix it right away.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
[jira] Updated: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-446:
--------------------------------

    Attachment: crawl-delay.patch

> RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-446
>                 URL: https://issues.apache.org/jira/browse/NUTCH-446
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Doğacan Güney
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: crawl-delay.patch
>
> RobotRulesParser doesn't check addRules when reading the crawl-delay value, so the nutch bot will get another robot's crawl-delay value from robots.txt. Let me try to be more clear:
>
> User-agent: foobot
> Crawl-delay: 3600
>
> User-agent: *
> Disallow: /baz
>
> In such a robots.txt file, the nutch bot will get 3600 as its crawl-delay value, no matter what the nutch bot's name actually is.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt
RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt
-----------------------------------------------------------------------------

                 Key: NUTCH-446
                 URL: https://issues.apache.org/jira/browse/NUTCH-446
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
            Reporter: Doğacan Güney
            Priority: Minor
             Fix For: 0.9.0
         Attachments: crawl-delay.patch

RobotRulesParser doesn't check addRules when reading the crawl-delay value, so the nutch bot will get another robot's crawl-delay value from robots.txt. Let me try to be more clear:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow: /baz

In such a robots.txt file, the nutch bot will get 3600 as its crawl-delay value, no matter what the nutch bot's name actually is.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Injector checking for other than STATUS_INJECTED
Sorry. I am still not getting this. I understand the reason but I am not seeing how it works.

We inject a url directory which uses TextInputFormat and breaks the urls into lines. Those urls are then filtered and scored. If they pass filtering, they are injected with STATUS_INJECTED and collected by the mapper. As far as I can tell, the only input to the reduce function is the mapped CrawlDatums, which in my mind means there can't be any old (not STATUS_INJECTED) CrawlDatums at that point. The reducer loops through the datums, replacing STATUS_INJECTED with STATUS_DB_UNFETCHED, or using the old datum if it is not STATUS_INJECTED. Again, where do the old datums come from? I can understand the merge logic taking care of this to make sure it doesn't overwrite something already fetched, etc. with a STATUS_DB_UNFETCHED, but I am not getting where the older datums come from in the reducer.

Dennis Kubes

Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi Andrzej, Does it mean that when you inject a URL existing (in crawldb) it changes its status to STATUS_DB_UNFETCHED?
>
> With the current version of Injector - it won't. With previous versions - it might, depending on the order of values received in reduce().
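The merge behavior the thread is circling around can be sketched in isolation. The following is a hypothetical, self-contained reduction over status codes — not the actual Nutch InjectReducer, and the status constants are stand-ins — assuming only the invariant under discussion: a record already in the crawldb must win over a freshly injected one, regardless of the order reduce() sees them.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the merge step: when injected records are merged
// with an existing crawldb, reduce() may receive both a freshly injected
// datum and an older datum for the same URL. The injected one must not
// clobber the old one, whatever order the values arrive in.
public class InjectMergeSketch {

  static final int STATUS_INJECTED = 0;
  static final int STATUS_DB_UNFETCHED = 1;
  static final int STATUS_DB_FETCHED = 2;

  // Returns the status the merged record should carry.
  static int merge(List<Integer> statuses) {
    Integer old = null;
    boolean injected = false;
    for (int s : statuses) {
      if (s == STATUS_INJECTED) injected = true;
      else old = s;                           // a record already in the crawldb
    }
    if (old != null) return old;              // existing datum wins
    if (injected) return STATUS_DB_UNFETCHED; // brand-new URL
    throw new IllegalArgumentException("empty input");
  }

  public static void main(String[] args) {
    // New URL: only the injected record exists.
    System.out.println(merge(Arrays.asList(STATUS_INJECTED)));
    // Already-fetched URL: the old datum wins regardless of value order.
    System.out.println(merge(Arrays.asList(STATUS_INJECTED, STATUS_DB_FETCHED)));
    System.out.println(merge(Arrays.asList(STATUS_DB_FETCHED, STATUS_INJECTED)));
  }
}
```

With this order-independent rule, Gal's question above answers itself: re-injecting a URL that is already in the crawldb leaves its old status intact.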
Re: lib-http crawl-delay problem
Hi,

I think the robots.txt example you used was invalid (no path for that last Disallow rule). Small patch indeed, but sticking it in JIRA would still make sense because:
- it leaves a good record of the bug + fix
- it could be used for release notes/changelog

Not trying to be picky, just pointing this out.

Otis
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message -----
From: Doğacan Güney <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thursday, February 15, 2007 9:12:28 PM
Subject: Re: lib-http crawl-delay problem

rubdabadub wrote:
> Hi:
> I am unable to get the attached patch via mail. It's better if you create a JIRA issue and attach the patch there.
> Thank you.

I don't know, this bug seems too minor to require its own JIRA issue. So I put the patch at http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473383 ]

Doğacan Güney commented on NUTCH-443:
-------------------------------------

> Regarding the ObjectWritable: since in this case all data is composed of Writables I think it's still better to use GenericWritable, because it saves some bytes on intermediate data.

Don't get me wrong, I agree with you that GenericWritable is better. The problem is that the fetcher may output a Parse object (thus a ParseData object), so it needs a wrapper that can inject configuration. Once Nutch has such a mechanism, I'll be happy to provide a patch that removes ObjectWritable usage here and in Indexer.

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
> allow Parser#parse to return a Map. This way, the RSS parser can return multiple parse objects that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
> see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
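The byte savings mentioned above come from how the two wrappers tag each serialized value: Hadoop's ObjectWritable writes a type declaration (the class name) ahead of the payload, while a GenericWritable subclass writes only a small index into a fixed table of types. This is a self-contained illustration of that difference — not Hadoop code, and the class name and payload are made up for the comparison:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrates why a GenericWritable-style one-byte type tag is cheaper on
// intermediate data than prefixing every value with its full class name.
public class TypeTagDemo {

  // ObjectWritable-style framing: class name string, then payload.
  static byte[] withClassName(String className, byte[] payload) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bos);
      out.writeUTF(className);   // 2-byte length + UTF-8 bytes of the name
      out.write(payload);
      return bos.toByteArray();
    } catch (IOException e) {
      throw new RuntimeException(e); // cannot happen for an in-memory stream
    }
  }

  // GenericWritable-style framing: one-byte index into an agreed type table,
  // then payload.
  static byte[] withTypeTag(byte typeIndex, byte[] payload) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bos);
      out.writeByte(typeIndex);
      out.write(payload);
      return bos.toByteArray();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    byte[] payload = {1, 2, 3, 4};
    int a = withClassName("org.apache.nutch.parse.ParseData", payload).length;
    int b = withTypeTag((byte) 0, payload).length;
    // The per-record overhead difference is roughly the class-name length.
    System.out.println(a + " bytes vs " + b + " bytes");
  }
}
```

Multiplied over every intermediate record of a fetch, that per-record overhead is the saving Andrzej refers to.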
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473380 ]

Andrzej Bialecki commented on NUTCH-443:
----------------------------------------

> Why does fetcher need to synchronize? Why does the order fetcher outputs pairs matter?

You are right, I've been spending too much time with the 0.7 branch lately ... I can't see any need for that either. Regarding the ObjectWritable: since in this case all data is composed of Writables I think it's still better to use GenericWritable, because it saves some bytes on intermediate data.

> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-445:
--------------------------------

    Attachment: index_query_domain_v1.1.patch

This patch fixes the raw field name bug in v1.0 and adds the forgotten NutchDocumentAnalyzer modifications (using WhitespaceAnalyzer on the domain field). This patch obsoletes v1.0 (index_query_domain_v1.0.patch) and should be used with TranslatingRawFieldQueryFilter_v1.0.patch. Note that query-site should not be included with query-domain, since it may cause some strange behavior.

> Domain İndexing / Query Filter
> ------------------------------
>
>                 Key: NUTCH-445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-445
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, TranslatingRawFieldQueryFilter_v1.0.patch
>
> Hostnames contain information about the domain of the host and all of its subdomains. Indexing and searching the domains is important for intuitive behavior.
> From the DomainIndexingFilter javadoc:
> Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index:
>   lucene.apache.org
>   apache
>   org
> All hostnames are domain names, but not all domain names are hostnames. In the above example the hostname lucene is a subdomain of apache.org, which is itself a subdomain of org.
> Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However site:apache.org will not return http://lucene.apache.org. By indexing the domain, we will be able to search domains. Unlike a search on the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: lib-http crawl-delay problem
Thanks for the link!

On 2/15/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> rubdabadub wrote:
>> Hi:
>> I am unable to get the attached patch via mail. It's better if you create a JIRA issue and attach the patch there.
>> Thank you.
>
> I don't know, this bug seems too minor to require its own JIRA issue. So I put the patch at http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch
Re: lib-http crawl-delay problem
rubdabadub wrote:
> Hi:
> I am unable to get the attached patch via mail. It's better if you create a JIRA issue and attach the patch there.
> Thank you.

I don't know, this bug seems too minor to require its own JIRA issue. So I put the patch at http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch
Re: lib-http crawl-delay problem
Hi:

I am unable to get the attached patch via mail. It's better if you create a JIRA issue and attach the patch there.

Thank you.

On 2/15/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> Hi,
> There seems to be two small bugs in lib-http's RobotRulesParser.
> First is about reading crawl-delay. The code doesn't check addRules, so the nutch bot will get another robot's crawl-delay value from robots.txt. Let me try to be more clear:
>
> User-agent: foobot
> Crawl-delay: 3600
>
> User-agent: *
> Disallow:
>
> In such a robots.txt file, the nutch bot will get 3600 as its crawl-delay value, no matter what the nutch bot's name actually is.
> Second is about the main method. RobotRulesParser.main advertises its usage as " +" but if you give it more than one agent name it refuses it.
> Trivial patch attached.
>
> --
> Doğacan Güney
lib-http crawl-delay problem
Hi,

There seems to be two small bugs in lib-http's RobotRulesParser.

First is about reading crawl-delay. The code doesn't check addRules, so the nutch bot will get another robot's crawl-delay value from robots.txt. Let me try to be more clear:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow:

In such a robots.txt file, the nutch bot will get 3600 as its crawl-delay value, no matter what the nutch bot's name actually is.

Second is about the main method. RobotRulesParser.main advertises its usage as " +" but if you give it more than one agent name it refuses it.

Trivial patch attached.

--
Doğacan Güney

Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
===================================================================
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java (revision 507852)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java (working copy)
@@ -389,15 +389,17 @@
       } else if ( (line.length() >= 12) &&
                   (line.substring(0, 12).equalsIgnoreCase("Crawl-Delay:"))) {
         doneAgents = true;
-        long crawlDelay = -1;
-        String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
-        if (delay.length() > 0) {
-          try {
-            crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
-          } catch (Exception e) {
-            LOG.info("can not parse Crawl-Delay:" + e.toString());
+        if (addRules) {
+          long crawlDelay = -1;
+          String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
+          if (delay.length() > 0) {
+            try {
+              crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
+            } catch (Exception e) {
+              LOG.info("can not parse Crawl-Delay:" + e.toString());
+            }
+            currentRules.setCrawlDelay(crawlDelay);
           }
-          currentRules.setCrawlDelay(crawlDelay);
         }
       }
@@ -500,7 +502,7 @@
   /** command-line main for testing */
   public static void main(String[] argv) {
-    if (argv.length != 3) {
+    if (argv.length < 3) {
       System.out.println("Usage:");
       System.out.println("  java +");
       System.out.println("");
@@ -513,7 +515,7 @@
     try {
       FileInputStream robotsIn = new FileInputStream(argv[0]);
       LineNumberReader testsIn = new LineNumberReader(new FileReader(argv[1]));
-      String[] robotNames = new String[argv.length - 1];
+      String[] robotNames = new String[argv.length - 2];
       for (int i = 0; i < argv.length - 2; i++)
         robotNames[i] = argv[i+2];
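The first hunk of the patch can be illustrated with a self-contained sketch — not the actual Nutch RobotRulesParser, just a minimal parser showing the role the addRules flag plays: a Crawl-delay line is only honored when the preceding User-agent block applies to our agent. (Real robots.txt grouping allows several consecutive User-agent lines per block; this sketch simplifies that away.)

```java
import java.util.Locale;

// Minimal sketch of agent-scoped crawl-delay parsing: a Crawl-delay line
// applies only if we are currently inside a User-agent block that matches
// our agent name (or the wildcard "*").
public class CrawlDelaySketch {

  // Returns our agent's crawl delay in milliseconds, or -1 if none applies.
  static long crawlDelayFor(String robotsTxt, String agent) {
    boolean addRules = false;   // are we inside a block that applies to us?
    long delayMs = -1;
    for (String raw : robotsTxt.split("\n")) {
      String line = raw.trim();
      String lower = line.toLowerCase(Locale.ROOT);
      if (lower.startsWith("user-agent:")) {
        String name = line.substring("user-agent:".length()).trim();
        addRules = name.equals("*") || name.equalsIgnoreCase(agent);
      } else if (lower.startsWith("crawl-delay:") && addRules) {
        try {
          delayMs = Long.parseLong(line.substring("crawl-delay:".length()).trim()) * 1000;
        } catch (NumberFormatException e) {
          // unparsable value: ignore it, keep any previous delay
        }
      }
    }
    return delayMs;
  }

  public static void main(String[] args) {
    String robots = "User-agent: foobot\nCrawl-delay: 3600\n\nUser-agent: *\nDisallow: /baz\n";
    // nutch-bot does not match foobot, so foobot's delay must not apply.
    System.out.println(crawlDelayFor(robots, "nutch-bot")); // -1
    System.out.println(crawlDelayFor(robots, "foobot"));    // 3600000
  }
}
```

Without the addRules guard (the pre-patch behavior), the 3600 from foobot's block would be picked up for every agent, which is exactly the bug reported above.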
[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-445:
--------------------------------

    Attachment: TranslatingRawFieldQueryFilter_v1.0.patch

This patch complements index_query_domain_v1.0.patch. However, the class TranslatingRawFieldQueryFilter can be used independently, so I have put it in a separate file. The javadoc reads:

Similar to {@link RawFieldQueryFilter} except that the index and query field names can be different. This class can be extended by QueryFilters to allow searching a field in the index, but using another field name in the search. For example, index field names can be kept in English, such as "content", "lang", "title", ..., while query filters can be built in other languages.

> Domain İndexing / Query Filter
> ------------------------------
>
>                 Key: NUTCH-445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-445

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-445:
--------------------------------

    Attachment: index_query_domain_v1.0.patch

Patch for the index-domain and query-domain plugins.

> Domain İndexing / Query Filter
> ------------------------------
>
>                 Key: NUTCH-445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-445

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-445) Domain İndexing / Query Filter
Domain İndexing / Query Filter
------------------------------

                 Key: NUTCH-445
                 URL: https://issues.apache.org/jira/browse/NUTCH-445
             Project: Nutch
          Issue Type: New Feature
          Components: indexer, searcher
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar

Hostnames contain information about the domain of the host and all of its subdomains. Indexing and searching the domains is important for intuitive behavior.

From the DomainIndexingFilter javadoc:

Adds the domain (hostname) and all super domains to the index. For http://lucene.apache.org/nutch/ the following will be added to the index:

  lucene.apache.org
  apache
  org

All hostnames are domain names, but not all domain names are hostnames. In the above example the hostname lucene is a subdomain of apache.org, which is itself a subdomain of org.

Currently the basic indexing filter indexes the hostname in the site field, and the query-site plugin allows searching in the site field. However site:apache.org will not return http://lucene.apache.org. By indexing the domain, we will be able to search domains. Unlike a search on the site field (indexed by BasicIndexingFilter), searching the domain field allows us to retrieve lucene.apache.org for the query apache.org.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
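The super-domain expansion described in the javadoc can be sketched as follows. This is not the actual DomainIndexingFilter from the patch: it simply emits every trailing dot-separated suffix of the hostname, so it produces "apache.org" where the quoted javadoc lists the bare label "apache" — the real filter may choose its cut-off points differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of emitting a hostname plus its super-domains for indexing, so a
// query for apache.org can match a document hosted at lucene.apache.org.
public class SuperDomains {

  // Every trailing dot-separated suffix of the host, longest first.
  static List<String> superDomains(String host) {
    List<String> out = new ArrayList<>();
    String[] parts = host.split("\\.");
    for (int i = 0; i < parts.length; i++) {
      out.add(String.join(".", Arrays.copyOfRange(parts, i, parts.length)));
    }
    return out;
  }

  public static void main(String[] args) {
    // Index all of these terms in the domain field for the document; the
    // query-domain filter then only needs an exact term match.
    System.out.println(superDomains("lucene.apache.org"));
  }
}
```

Indexing all suffixes trades a few extra terms per document for exact-match lookups at query time, which is what lets domain:apache.org retrieve lucene.apache.org where site:apache.org cannot.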