Re: Injector checking for other than STATUS_INJECTED

2007-02-15 Thread Dennis Kubes

Ahhh... Now I get it :)

Andrzej Bialecki wrote:

Dennis Kubes wrote:
Sorry.  I am still not getting this.  I understand the reason but I am 
not seeing how it works.


Ah, because apparently it doesn't ... :( You were right, the first job 
consists only of new records. Now that I checked the code again, 
InjectReducer should be set on the second job, and not on the first one 
... I'll fix it right away.




Re: Injector checking for other than STATUS_INJECTED

2007-02-15 Thread Andrzej Bialecki

Dennis Kubes wrote:
Sorry.  I am still not getting this.  I understand the reason but I am 
not seeing how it works.


Ah, because apparently it doesn't ... :( You were right, the first job 
consists only of new records. Now that I checked the code again, 
InjectReducer should be set on the second job, and not on the first one 
... I'll fix it right away.


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Updated: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-02-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-446:


Attachment: crawl-delay.patch

> RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt
> -
>
> Key: NUTCH-446
> URL: https://issues.apache.org/jira/browse/NUTCH-446
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Doğacan Güney
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: crawl-delay.patch
>
>
> RobotRulesParser doesn't check addRules when reading the Crawl-delay 
> value, so the nutch bot will pick up another robot's Crawl-delay value 
> from robots.txt. 
> Let me try to be more clear:
> User-agent: foobot
> Crawl-delay: 3600
> User-agent: *
> Disallow: /baz
> In such a robots.txt file, nutch bot will get 3600 as its crawl-delay
> value, no matter what nutch bot's name actually is.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-02-15 Thread JIRA
RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt
-

 Key: NUTCH-446
 URL: https://issues.apache.org/jira/browse/NUTCH-446
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Doğacan Güney
Priority: Minor
 Fix For: 0.9.0
 Attachments: crawl-delay.patch

RobotRulesParser doesn't check addRules when reading the Crawl-delay value, 
so the nutch bot will pick up another robot's Crawl-delay value from 
robots.txt. 

Let me try to be more clear:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow: /baz

In such a robots.txt file, nutch bot will get 3600 as its crawl-delay
value, no matter what nutch bot's name actually is.
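In code, the fix amounts to honoring Crawl-delay only while inside a User-agent block that matches our robot. The following is a minimal standalone sketch of that idea (the class and method are hypothetical, not the actual Nutch parser, and it does not handle groups formed by consecutive User-agent lines):

```java
import java.util.Locale;

// Sketch: return the Crawl-delay (in ms) that applies to ourAgent,
// ignoring delays declared for other robots. Hypothetical helper class.
public class CrawlDelaySketch {
  public static long crawlDelayFor(String robotsTxt, String ourAgent) {
    boolean addRules = false;  // inside a User-agent block that matches us?
    long delayMs = -1;
    for (String raw : robotsTxt.split("\n")) {
      String line = raw.trim();
      String lower = line.toLowerCase(Locale.ROOT);
      if (lower.startsWith("user-agent:")) {
        String agent = line.substring("User-agent:".length()).trim();
        addRules = agent.equals("*") || agent.equalsIgnoreCase(ourAgent);
      } else if (lower.startsWith("crawl-delay:") && addRules) {
        // only read the delay when the current block applies to us
        try {
          delayMs = Long.parseLong(
              line.substring("Crawl-delay:".length()).trim()) * 1000;
        } catch (NumberFormatException e) {
          // unparsable value: leave the delay unset
        }
      }
    }
    return delayMs;
  }

  public static void main(String[] args) {
    String robots = "User-agent: foobot\nCrawl-delay: 3600\n\n"
                  + "User-agent: *\nDisallow: /baz\n";
    // foobot's delay must not apply to nutchbot; prints -1
    System.out.println(crawlDelayFor(robots, "nutchbot"));
  }
}
```

With the buggy behavior described above, the foobot delay would leak into every robot's configuration; gating on the matching-agent flag is exactly what the attached patch does inside the real parser.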

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Injector checking for other than STATUS_INJECTED

2007-02-15 Thread Dennis Kubes
Sorry.  I am still not getting this.  I understand the reason but I am 
not seeing how it works.


We inject a url directory which uses TextInputFormat and breaks the urls 
into lines.  Those urls are then filtered and scored.  If they pass 
filtering, they are injected with STATUS_INJECTED and collected by 
the mapper.  As far as I can tell, the only input to the reduce 
function is the mapped CrawlDatums, which in my mind means there can't be 
any old (non-STATUS_INJECTED) CrawlDatums at that point.


The Reducer loops through the Datums, replacing STATUS_INJECTED with 
STATUS_DB_UNFETCHED, or keeping the old Datum if it is not STATUS_INJECTED. 
Again, where do the old Datums come from?


I can understand the merge logic taking care of this to make sure it 
doesn't overwrite something already fetched, etc. with a 
STATUS_DB_UNFETCHED, but I am not getting where the older Datums come 
from in the Reducer.
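The merge step under discussion, where newly injected entries meet existing crawldb entries in reduce() during the second job, boils down to logic like the following. This is a simplified sketch using plain int statuses, not the actual InjectReducer:

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch of the inject merge, not the actual InjectReducer.
// In the second job, reduce() can see both an old crawldb entry and a
// freshly injected entry for the same url.
public class InjectMergeSketch {
  static final int STATUS_INJECTED = 0;
  static final int STATUS_DB_UNFETCHED = 1;
  static final int STATUS_DB_FETCHED = 2;

  // Returns the status to keep for one url given all statuses seen in reduce().
  public static int reduce(List<Integer> statuses) {
    Integer old = null;
    for (int s : statuses) {
      if (s != STATUS_INJECTED) {
        old = s;                  // an entry already in the crawldb
      }
    }
    if (old != null) {
      return old;                 // never clobber an existing record
    }
    return STATUS_DB_UNFETCHED;   // brand-new url: injected becomes unfetched
  }

  public static void main(String[] args) {
    // an already-fetched url keeps its old status even if re-injected; prints 2
    System.out.println(reduce(Arrays.asList(STATUS_INJECTED, STATUS_DB_FETCHED)));
    // a brand-new url ends up unfetched; prints 1
    System.out.println(reduce(Arrays.asList(STATUS_INJECTED)));
  }
}
```

The old Datums only appear once the reducer runs over the union of the new segment and the existing crawldb, which is why attaching this reducer to the first (new-records-only) job has no effect.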


Dennis Kubes



Andrzej Bialecki wrote:

Gal Nitzan wrote:

Hi Andrzej,

Does it mean that when you inject a URL that already exists in the 
crawldb, it changes its status to STATUS_DB_UNFETCHED?


With the current version of Injector - it won't. With previous versions 
- it might, depending on the order of values received in reduce().




Re: lib-http crawl-delay problem

2007-02-15 Thread ogjunk-nutch
HI,

I think the robots.txt example you used was invalid (no path for that last 
Disallow rule).
Small patch indeed, but sticking it in JIRA would still make sense because:
- it leaves a good record of the bug + fix
- it could be used for release notes/changelog

Not trying to be picky, just pointing this out.

Otis 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Doğacan Güney <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Thursday, February 15, 2007 9:12:28 PM
Subject: Re: lib-http crawl-delay problem

rubdabadub wrote:
> Hi:
>
> I am unable to get the attached patch via mail. It's better if you
> create a JIRA issue and attach the patch there.
>
> Thank you.
>

I don't know, this bug seems too minor to require its own JIRA issue.
So I put the patch to
http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch 






[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473383
 ] 

Doğacan Güney commented on NUTCH-443:
-

> Regarding the ObjectWritable: since in this case all data is composed of 
> Writables I think it's still better to use GenericWritable, > because it 
> saves some bytes on intermediate data.

Don't get me wrong, I agree with you that GenericWritable is better. The 
problem is that the fetcher may output a Parse object (and thus a ParseData 
object), so it needs a wrapper that can inject configuration. Once Nutch has 
such a mechanism, I'll be happy to provide a patch that removes the 
ObjectWritable usage here and in Indexer.

> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> 
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473380
 ] 

Andrzej Bialecki  commented on NUTCH-443:
-

> Why does the fetcher need to synchronize? Why does the order in which the 
> fetcher outputs pairs matter?

You are right, I've been spending too much time with 0.7 branch lately ... I 
can't see any need for that either.

Regarding the ObjectWritable: since in this case all data is composed of 
Writables I think it's still better to use GenericWritable, because it saves 
some bytes on intermediate data.

> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> 
>
> Key: NUTCH-443
> URL: https://issues.apache.org/jira/browse/NUTCH-443
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
> Assigned To: Chris A. Mattmann
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: index_query_domain_v1.1.patch

This patch fixes the raw field name bug in v1.0 and adds the forgotten 
NutchDocumentAnalyzer modifications (using WhiteSpaceAnalyzer on the domain 
field).

This patch obsoletes v1.0 (index_query_domain_v1.0.patch), and should be used 
with TranslatingRawFieldQueryFilter_v1.0.patch

Note that query-site should not be included with query-domain, since it may 
cause some strange behavior. 



> Domain İndexing / Query Filter
> --
>
> Key: NUTCH-445
> URL: https://issues.apache.org/jira/browse/NUTCH-445
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
>Affects Versions: 0.9.0
>Reporter: Enis Soztutar
> Attachments: index_query_domain_v1.0.patch, 
> index_query_domain_v1.1.patch, TranslatingRawFieldQueryFilter_v1.0.patch
>
>
> Hostnames contain information about the domain of the host, and all of the 
> subdomains. Indexing and searching domains is important for intuitive 
> behavior. 
> From DomainIndexingFilter javadoc: 
> Adds the domain (hostname) and all super domains to the index. 
> For http://lucene.apache.org/nutch/ the following will be added to the 
> index: 
>  * lucene.apache.org 
>  * apache.org 
>  * org 
> All hostnames are domain names, but not all domain names are 
> hostnames. In the above example the hostname lucene.apache.org is a 
> subdomain of apache.org, which is itself a subdomain of org. 
> Currently the basic indexing filter indexes the hostname in the site field, 
> and the query-site plugin allows searching in the site field. However, 
> site:apache.org will not return http://lucene.apache.org.
> By indexing the domain, we will be able to search domains. Unlike the site 
> field (indexed by BasicIndexingFilter) search, searching the domain field 
> allows us to retrieve lucene.apache.org for the query apache.org.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: lib-http crawl-delay problem

2007-02-15 Thread rubdabadub

Thanks for the link!



On 2/15/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:

rubdabadub wrote:
> Hi:
>
> I am unable to get the attached patch via mail. It's better if you
> create a JIRA issue and attach the patch there.
>
> Thank you.
>

I don't know, this bug seems too minor to require its own JIRA issue.
So I put the patch to
http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch




Re: lib-http crawl-delay problem

2007-02-15 Thread Doğacan Güney
rubdabadub wrote:
> Hi:
>
> I am unable to get the attached patch via mail. It's better if you
> create a JIRA issue and attach the patch there.
>
> Thank you.
>

I don't know, this bug seems too minor to require its own JIRA issue.
So I put the patch to
http://www.ceng.metu.edu.tr/~e1345172/crawl-delay.patch 



Re: lib-http crawl-delay problem

2007-02-15 Thread rubdabadub

Hi:

I am unable to get the attached patch via mail. It's better if you
create a JIRA issue and attach the patch there.

Thank you.

On 2/15/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:

Hi,

There seems to be two small bugs in lib-http's RobotRulesParser.

First is about reading crawl-delay. The code doesn't check addRules, so
the nutch bot will pick up another robot's Crawl-delay value from
robots.txt. Let me try to be more clear:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow:


In such a robots.txt file, nutch bot will get 3600 as its crawl-delay
value, no matter what nutch bot's name actually is.

Second is about the main method. RobotRulesParser.main advertises its usage
as "<robots-file> <url-file> <agent-name>+" but if you give it more than
one agent name it refuses them.

Trivial patch attached.

--
Doğacan Güney




lib-http crawl-delay problem

2007-02-15 Thread Doğacan Güney
Hi,

There seems to be two small bugs in lib-http's RobotRulesParser.

First is about reading crawl-delay. The code doesn't check addRules, so
the nutch bot will pick up another robot's Crawl-delay value from
robots.txt. Let me try to be more clear:

User-agent: foobot
Crawl-delay: 3600

User-agent: *
Disallow:


In such a robots.txt file, nutch bot will get 3600 as its crawl-delay
value, no matter what nutch bot's name actually is.

Second is about the main method. RobotRulesParser.main advertises its usage
as "<robots-file> <url-file> <agent-name>+" but if you give it more than
one agent name it refuses them.

Trivial patch attached.

--
Doğacan Güney
Index: src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java
===
--- src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java	(revision 507852)
+++ src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/RobotRulesParser.java	(working copy)
@@ -389,15 +389,17 @@
   } else if ( (line.length() >= 12)
   && (line.substring(0, 12).equalsIgnoreCase("Crawl-Delay:"))) {
 doneAgents = true;
-long crawlDelay = -1;
-String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
-if (delay.length() > 0) {
-  try {
-crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
-  } catch (Exception e) {
-LOG.info("can not parse Crawl-Delay:" + e.toString());
+if (addRules) {
+  long crawlDelay = -1;
+  String delay = line.substring("Crawl-Delay:".length(), line.length()).trim();
+  if (delay.length() > 0) {
+try {
+  crawlDelay = Long.parseLong(delay) * 1000; // sec to millisec
+} catch (Exception e) {
+  LOG.info("can not parse Crawl-Delay:" + e.toString());
+}
+currentRules.setCrawlDelay(crawlDelay);
   }
-  currentRules.setCrawlDelay(crawlDelay);
 }
   }
 }
@@ -500,7 +502,7 @@
 
   /** command-line main for testing */
   public static void main(String[] argv) {
-if (argv.length != 3) {
+if (argv.length < 3) {
   System.out.println("Usage:");
   System.out.println("   java <robots-file> <url-file> <agent-name>+");
   System.out.println("");
@@ -513,7 +515,7 @@
 try { 
   FileInputStream robotsIn= new FileInputStream(argv[0]);
   LineNumberReader testsIn= new LineNumberReader(new FileReader(argv[1]));
-  String[] robotNames= new String[argv.length - 1];
+  String[] robotNames= new String[argv.length - 2];
 
   for (int i= 0; i < argv.length - 2; i++) 
 robotNames[i]= argv[i+2];


[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: TranslatingRawFieldQueryFilter_v1.0.patch

This patch complements index_query_domain_v1.0.patch. 

However, the class TranslatingRawFieldQueryFilter can be used independently, so 
I have put it in a separate file. The javadoc reads:

 * Similar to {@link RawFieldQueryFilter} except that the index 
 * and query field names can be different. 
 * 
 * This class can be extended by QueryFilters to allow 
 * searching a field in the index, but using another field name in the 
 * search. 
 * 
 * For example, index field names can be kept in English, such as "content", 
 * "lang", "title", ..., while query filters can be built in other languages
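The translation described in that javadoc boils down to rewriting the query-side field name to the index-side one. A tiny hypothetical sketch of the idea (the class, method, and the Turkish field name "alan" are made up for illustration, not the actual Nutch filter):

```java
// Sketch of the idea behind TranslatingRawFieldQueryFilter: queries use one
// field name, the index stores another. Hypothetical class, not Nutch code.
public class TranslatingFilterSketch {
  private final String queryField;  // field name users type in queries
  private final String indexField;  // field name actually stored in the index

  public TranslatingFilterSketch(String queryField, String indexField) {
    this.queryField = queryField;
    this.indexField = indexField;
  }

  // Rewrite e.g. "alan:apache.org" (a localized field name) to "domain:apache.org".
  public String translate(String clause) {
    String prefix = queryField + ":";
    if (clause.startsWith(prefix)) {
      return indexField + ":" + clause.substring(prefix.length());
    }
    return clause;  // other clauses pass through untouched
  }

  public static void main(String[] args) {
    TranslatingFilterSketch f = new TranslatingFilterSketch("alan", "domain");
    // prints domain:apache.org
    System.out.println(f.translate("alan:apache.org"));
  }
}
```

The real filter does this at the Lucene query level rather than on raw strings, but the field-name indirection is the same.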

> Domain İndexing / Query Filter
> --
>
> Key: NUTCH-445
> URL: https://issues.apache.org/jira/browse/NUTCH-445
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
>Affects Versions: 0.9.0
>Reporter: Enis Soztutar
> Attachments: index_query_domain_v1.0.patch, 
> TranslatingRawFieldQueryFilter_v1.0.patch
>
>
> Hostnames contain information about the domain of the host, and all of the 
> subdomains. Indexing and searching domains is important for intuitive 
> behavior. 
> From DomainIndexingFilter javadoc: 
> Adds the domain (hostname) and all super domains to the index. 
> For http://lucene.apache.org/nutch/ the following will be added to the 
> index: 
>  * lucene.apache.org 
>  * apache.org 
>  * org 
> All hostnames are domain names, but not all domain names are 
> hostnames. In the above example the hostname lucene.apache.org is a 
> subdomain of apache.org, which is itself a subdomain of org. 
> Currently the basic indexing filter indexes the hostname in the site field, 
> and the query-site plugin allows searching in the site field. However, 
> site:apache.org will not return http://lucene.apache.org.
> By indexing the domain, we will be able to search domains. Unlike the site 
> field (indexed by BasicIndexingFilter) search, searching the domain field 
> allows us to retrieve lucene.apache.org for the query apache.org.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: index_query_domain_v1.0.patch

Patch for index-domain and query-domain plugins. 

> Domain İndexing / Query Filter
> --
>
> Key: NUTCH-445
> URL: https://issues.apache.org/jira/browse/NUTCH-445
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
>Affects Versions: 0.9.0
>Reporter: Enis Soztutar
> Attachments: index_query_domain_v1.0.patch
>
>
> Hostnames contain information about the domain of the host, and all of the 
> subdomains. Indexing and searching domains is important for intuitive 
> behavior. 
> From DomainIndexingFilter javadoc: 
> Adds the domain (hostname) and all super domains to the index. 
> For http://lucene.apache.org/nutch/ the following will be added to the 
> index: 
>  * lucene.apache.org 
>  * apache.org 
>  * org 
> All hostnames are domain names, but not all domain names are 
> hostnames. In the above example the hostname lucene.apache.org is a 
> subdomain of apache.org, which is itself a subdomain of org. 
> Currently the basic indexing filter indexes the hostname in the site field, 
> and the query-site plugin allows searching in the site field. However, 
> site:apache.org will not return http://lucene.apache.org.
> By indexing the domain, we will be able to search domains. Unlike the site 
> field (indexed by BasicIndexingFilter) search, searching the domain field 
> allows us to retrieve lucene.apache.org for the query apache.org.
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)
Domain İndexing / Query Filter
--

 Key: NUTCH-445
 URL: https://issues.apache.org/jira/browse/NUTCH-445
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 0.9.0
Reporter: Enis Soztutar


Hostnames contain information about the domain of the host, and all of the 
subdomains. Indexing and searching domains is important for intuitive 
behavior. 

From DomainIndexingFilter javadoc: 
Adds the domain (hostname) and all super domains to the index. 
For http://lucene.apache.org/nutch/ the following will be added to the 
index: 
 * lucene.apache.org 
 * apache.org 
 * org 
All hostnames are domain names, but not all domain names are 
hostnames. In the above example the hostname lucene.apache.org is a 
subdomain of apache.org, which is itself a subdomain of org. 
Currently the basic indexing filter indexes the hostname in the site field, and 
the query-site plugin allows searching in the site field. However, 
site:apache.org will not return http://lucene.apache.org.

By indexing the domain, we will be able to search domains. Unlike the site 
field (indexed by BasicIndexingFilter) search, searching the domain field 
allows us to retrieve lucene.apache.org for the query apache.org.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.