[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-445:

    Attachment: index_query_domain_v1.2.patch

This patch is an update of the previous three patches. In this patch:
1. TranslatingRawFieldQueryFilter is included as an abstract implementation for searching a field in the index under a different query field name.
2. index-basic indexes the domain and all "super domains" in the domain field.
3. query-site is changed so that site: searches the domain field.

With these changes we can search site:apache.org and get results from http://issues.apache.org, etc., or we can search site:com to retrieve all .com domains.

> Domain İndexing / Query Filter
> ------------------------------
>
>                 Key: NUTCH-445
>                 URL: https://issues.apache.org/jira/browse/NUTCH-445
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>         Attachments: index_query_domain_v1.0.patch, index_query_domain_v1.1.patch, index_query_domain_v1.2.patch, TranslatingRawFieldQueryFilter_v1.0.patch
>
>
> Hostnames contain information about the domain of the host, and all of the
> subdomains. Indexing and searching the domains is important for intuitive
> behavior.
> From the DomainIndexingFilter javadoc:
>  Adds the domain (hostname) and all super domains to the index.
>  * For http://lucene.apache.org/nutch/ the
>  * following will be added to the index:
>  *
>  *   lucene.apache.org
>  *   apache.org
>  *   org
>  *
>  * All hostnames are domain names, but not all domain names are
>  * hostnames. In the above example, the hostname lucene is a
>  * subdomain of apache.org, which is itself a subdomain of
>  * org.
>  *
>
> Currently the basic indexing filter indexes the hostname in the site field, and
> the query-site plugin allows searching the site field. However, site:apache.org
> will not return http://lucene.apache.org.
> By indexing the domain, we can search by domain. Unlike a search on the site
> field (indexed by BasicIndexingFilter), a search on the domain field retrieves
> lucene.apache.org for the query apache.org.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
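The super-domain enumeration described in the javadoc can be sketched in plain Java. This is a minimal illustration and not the code from the patch: the class name SuperDomains and the method superDomains are hypothetical, and the sketch assumes each super domain is a full suffix of the hostname (apache.org rather than apache), which is what a query such as site:apache.org would need in order to match.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Nutch.
class SuperDomains {

    /** Returns the host followed by each of its super domains, most specific first. */
    static List<String> superDomains(String host) {
        List<String> domains = new ArrayList<>();
        String current = host.toLowerCase(Locale.ROOT);
        while (!current.isEmpty()) {
            domains.add(current);
            int dot = current.indexOf('.');
            if (dot < 0) {
                break; // no dot left: this was the top-level domain
            }
            current = current.substring(dot + 1); // drop the leftmost label
        }
        return domains;
    }

    public static void main(String[] args) {
        String host = URI.create("http://lucene.apache.org/nutch/").getHost();
        // prints [lucene.apache.org, apache.org, org]
        System.out.println(superDomains(host));
    }
}
```

If every value in this list is stored as a separate token in the domain field, an exact-token lookup for apache.org or org matches the page, while the site field (hostname only) would match just lucene.apache.org.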
[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-445:

    Attachment: index_query_domain_v1.1.patch

This patch fixes the raw field name bug in v1.0 and adds the forgotten NutchDocumentAnalyzer modifications (using WhitespaceAnalyzer for the domain field).

This patch obsoletes v1.0 (index_query_domain_v1.0.patch) and should be used with TranslatingRawFieldQueryFilter_v1.0.patch.

Note that query-site should not be included together with query-domain, since this may cause some strange behavior.
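The note about using WhitespaceAnalyzer for the domain field can be illustrated without Lucene. The sketch below is an assumption-laden illustration, not the patch: it assumes the domain field value is a whitespace-separated list of domains, and it contrasts whitespace splitting with a hypothetical tokenizer that splits on punctuation. The class and method names are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why whitespace-only tokenization suits a domain field.
class DomainFieldTokens {

    /** Splits only on whitespace, so "apache.org" survives as a single token. */
    static List<String> whitespaceTokens(String fieldValue) {
        return Arrays.asList(fieldValue.trim().split("\\s+"));
    }

    /** Hypothetical tokenizer that splits on any non-alphanumeric character, for contrast. */
    static List<String> punctuationTokens(String fieldValue) {
        return Arrays.asList(fieldValue.trim().split("[^A-Za-z0-9]+"));
    }

    public static void main(String[] args) {
        String field = "lucene.apache.org apache.org org";
        // prints [lucene.apache.org, apache.org, org]
        System.out.println(whitespaceTokens(field));
        // prints [lucene, apache, org, apache, org, org]
        System.out.println(punctuationTokens(field));
    }
}
```

With whitespace splitting, the token apache.org exists in the field and an exact-term query can find it; with the punctuation-splitting variant it does not.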
[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-445:

    Attachment: TranslatingRawFieldQueryFilter_v1.0.patch

This patch complements index_query_domain_v1.0.patch. However, the class TranslatingRawFieldQueryFilter can be used independently, so I have put it in a separate file.

The javadoc reads:
 * Similar to {@link RawFieldQueryFilter} except that the index
 * and query field names can be different.
 *
 * This class can be extended by QueryFilters to allow
 * searching a field in the index, but using another field name in the
 * search.
 *
 * For example, index field names can be kept in English, such as "content",
 * "lang", "title", ..., while query filters can be built in other languages.
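The idea of querying under one field name while searching a differently named index field can be sketched as a simple clause rewrite. This is a standalone illustration under stated assumptions, not the Nutch API: the class TranslatingFieldSketch and its translate method are hypothetical, and the real filter operates on Nutch query objects rather than strings.

```java
// Hypothetical sketch of query-field to index-field translation; not the Nutch class.
class TranslatingFieldSketch {
    private final String queryField; // name the user types, e.g. "site"
    private final String indexField; // name stored in the index, e.g. "domain"

    TranslatingFieldSketch(String queryField, String indexField) {
        this.queryField = queryField;
        this.indexField = indexField;
    }

    /** Rewrites "queryField:value" to "indexField:value"; other clauses pass through unchanged. */
    String translate(String clause) {
        String prefix = queryField + ":";
        if (clause.startsWith(prefix)) {
            return indexField + ":" + clause.substring(prefix.length());
        }
        return clause;
    }

    public static void main(String[] args) {
        TranslatingFieldSketch siteToDomain = new TranslatingFieldSketch("site", "domain");
        System.out.println(siteToDomain.translate("site:apache.org")); // prints domain:apache.org
        System.out.println(siteToDomain.translate("title:nutch"));     // prints title:nutch
    }
}
```

The same mapping would let a localized query field name point at an English index field such as "title", which is the use case the javadoc mentions.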
[jira] Updated: (NUTCH-445) Domain İndexing / Query Filter
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar updated NUTCH-445:

    Attachment: index_query_domain_v1.0.patch

Patch for the index-domain and query-domain plugins.