[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-28 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: index_query_domain_v1.2.patch

This patch updates the previous three patches. 
The patch: 
1. contains TranslatingRawFieldQueryFilter, an abstract implementation for 
searching a field in the index under a different query field name. 
2. changes index-basic to index the domain and all "super domains" in the 
domain field. 
3. changes query-site so that site: searches the domain field. 

With this patch we can search site:apache.org and get results from 
http://issues.apache.org, etc., or search site:com to retrieve all .com 
domains. 
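
The expansion in item 2 can be illustrated with a minimal sketch. It assumes 
that the hostname itself and every suffix obtained by dropping the leftmost 
label are indexed, which is what the site:apache.org and site:com searches 
above require. The class SuperDomains and its method are hypothetical names 
for illustration only and are not taken from the patch. 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of the patch. */
class SuperDomains {

    /** Returns the hostname followed by each super domain, most specific first. */
    static List<String> of(String hostname) {
        List<String> domains = new ArrayList<>();
        // Locale.ROOT keeps lowercasing independent of the default locale.
        String current = hostname.toLowerCase(Locale.ROOT);
        while (!current.isEmpty()) {
            domains.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break; // reached the top-level domain
            }
            current = current.substring(dot + 1); // drop the leftmost label
        }
        return domains;
    }

    public static void main(String[] args) {
        // prints [issues.apache.org, apache.org, org]
        System.out.println(SuperDomains.of("issues.apache.org"));
    }
}
```

If each returned value is indexed as a term of the domain field, a query for 
apache.org matches both issues.apache.org and lucene.apache.org. 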


> Domain İndexing / Query Filter
> --
>
> Key: NUTCH-445
> URL: https://issues.apache.org/jira/browse/NUTCH-445
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Attachments: index_query_domain_v1.0.patch, 
> index_query_domain_v1.1.patch, index_query_domain_v1.2.patch, 
> TranslatingRawFieldQueryFilter_v1.0.patch
>
>
> Hostnames contain information about the domain of the host, and all of the 
> subdomains. Indexing and searching domains is important for intuitive 
> behavior. 
> From the DomainIndexingFilter javadoc: 
> Adds the domain (hostname) and all super domains to the index. 
>  * For http://lucene.apache.org/nutch/ the 
>  * following will be added to the index: 
>  * 
>  * lucene.apache.org 
>  * apache
>  * org 
>  * 
>  * All hostnames are domain names, but not all domain names are 
>  * hostnames. In the above example the hostname lucene is a 
>  * subdomain of apache.org, which is itself a subdomain of 
>  * org. 
>  * 
>  
> Currently the basic indexing filter indexes the hostname in the site field, 
> and the query-site plugin allows searching the site field. However, 
> site:apache.org will not return http://lucene.apache.org. 
>  By indexing the domain, we can search by domain. Unlike searching the 
>  site field (indexed by BasicIndexingFilter), searching the 
>  domain field lets the query apache.org retrieve lucene.apache.org. 
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: index_query_domain_v1.1.patch

This patch fixes the raw field name bug in v1.0 and adds the forgotten 
NutchDocumentAnalyzer modifications (using WhitespaceAnalyzer for the domain 
field). 
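
Why whitespace tokenization matters for the domain field can be shown with a 
small sketch. It assumes, for illustration only, that the domain field value 
is a whitespace-separated list of domain names; the patch may store the values 
differently. DomainFieldTokens is a hypothetical name and does not use the 
Lucene API: it only mimics splitting on whitespace, next to a tokenizer that 
also splits on punctuation. 

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not part of the patch and not the Lucene API. */
class DomainFieldTokens {

    /** Splits on whitespace only, so dots stay inside each term. */
    static List<String> whitespaceTokens(String fieldValue) {
        return split(fieldValue, "\\s+");
    }

    /** For contrast: splits on every character that is not an ASCII letter or digit. */
    static List<String> punctuationTokens(String fieldValue) {
        return split(fieldValue, "[^A-Za-z0-9]+");
    }

    /** Splits on the given regex and drops empty tokens. */
    private static List<String> split(String value, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.split(regex)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [lucene.apache.org, apache.org, org]
        System.out.println(whitespaceTokens("lucene.apache.org apache.org org"));
        // prints [lucene, apache, org]
        System.out.println(punctuationTokens("lucene.apache.org"));
    }
}
```

With whitespace-only splitting, apache.org survives as a single term, so an 
exact term query for apache.org can match it. 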

This patch obsoletes v1.0 (index_query_domain_v1.0.patch) and should be used 
with TranslatingRawFieldQueryFilter_v1.0.patch. 

Note that query-site should not be enabled together with query-domain, since 
it may cause some strange behavior. 



> Domain İndexing / Query Filter
> --
>
> Key: NUTCH-445
> URL: https://issues.apache.org/jira/browse/NUTCH-445
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Attachments: index_query_domain_v1.0.patch, 
> index_query_domain_v1.1.patch, TranslatingRawFieldQueryFilter_v1.0.patch



[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: TranslatingRawFieldQueryFilter_v1.0.patch

This patch complements index_query_domain_v1.0.patch. 

However, the class TranslatingRawFieldQueryFilter can be used independently, so 
I have put it in a separate file. The javadoc reads: 

 * Similar to {@link RawFieldQueryFilter} except that the index 
 * and query field names can be different. 
 * 
 * This class can be extended by QueryFilters to allow 
 * searching a field in the index, but using another field name in the 
 * search. 
 * 
 * For example index field names can be kept in English such as "content", 
 * "lang", "title", ..., however query filters can be built in other languages 
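
The idea of separate query and index field names can be sketched at the string 
level. This is a simplified illustration under stated assumptions: 
FieldTranslator is a hypothetical class, it handles a single field:value 
clause as plain text, and it does not reproduce the API of 
TranslatingRawFieldQueryFilter or RawFieldQueryFilter. 

```java
/** Hypothetical string-level illustration; not the Nutch query filter API. */
class FieldTranslator {
    private final String queryField; // field name the user types, e.g. "site"
    private final String indexField; // field name stored in the index, e.g. "domain"

    FieldTranslator(String queryField, String indexField) {
        this.queryField = queryField;
        this.indexField = indexField;
    }

    /** Rewrites "queryField:value" to "indexField:value"; other clauses pass through unchanged. */
    String translate(String clause) {
        String prefix = queryField + ":";
        if (clause.startsWith(prefix)) {
            return indexField + ":" + clause.substring(prefix.length());
        }
        return clause;
    }

    public static void main(String[] args) {
        FieldTranslator siteToDomain = new FieldTranslator("site", "domain");
        // prints domain:apache.org
        System.out.println(siteToDomain.translate("site:apache.org"));
    }
}
```

Keeping the two names separate means the query syntax can change, for example 
to a localized field name, without reindexing. 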

> Domain İndexing / Query Filter
> --
>
> Key: NUTCH-445
> URL: https://issues.apache.org/jira/browse/NUTCH-445
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Attachments: index_query_domain_v1.0.patch, 
> TranslatingRawFieldQueryFilter_v1.0.patch



[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter

2007-02-15 Thread Enis Soztutar (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated NUTCH-445:


Attachment: index_query_domain_v1.0.patch

Patch for index-domain and query-domain plugins. 

> Domain İndexing / Query Filter
> --
>
> Key: NUTCH-445
> URL: https://issues.apache.org/jira/browse/NUTCH-445
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
> Affects Versions: 0.9.0
> Reporter: Enis Soztutar
> Attachments: index_query_domain_v1.0.patch