Re: How to do faceting on data indexed by Nutch
2010/4/25 Andrzej Bialecki a...@getopt.org

On 2010-04-25 15:03, KK wrote:

Hi All, I might be repeating a question asked by someone else, but googling didn't help me track down any such responses. I'm pretty much aware of Solr/Lucene and its basic architecture. I've done hit highlighting in Lucene and have an idea of the faceting support in Solr, but I've never actually tried it. I want to implement faceting on Nutch's indexed data. I already have some MBs of data indexed by Nutch, and I just want to implement faceting on it. Can someone give me pointers on how to proceed? Or is it the case that I have to query through the Solr interface and redirect all queries to the index already created by Nutch? What is the best and simplest way to achieve this? Please help in this regard.

Nutch has two indexing/searching backends - the one that is configured by default uses plain Lucene, and it does not support faceting. The other backend uses Solr, and then of course it supports faceting and all other Solr features. So in your case you need to switch to use Solr indexing (and searching).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com

Although I haven't tried it, maybe you can build a faceting plugin for nutch using http://sna-projects.com/bobo . Regards
Re: n00b question follow up
It looks like the query is well formed. The query means (the scoring part is a little more complicated): please, index, give me all the documents you have that contain the word test in the field url OR anchor OR content OR host. If you find it in the field url, use a boost factor of 4 when scoring the document; if you find it in the field anchor, use a boost of 2... Maybe it would be helpful to see your nutch-site.xml (e.g. the one you are using when you deploy nutch.war in tomcat, nutch/WEB-INF/classes/nutch-site.xml).

2007/2/12, Patrick Simon [EMAIL PROTECTED]:

All of my config stuff sits in nutch-default.xml (nutch-site.xml doesn't change anything, but I assume this should be fine). The output to my log file from what you advise below is:

Query - +(url:test^4.0 anchor:test^2.0 content:test title:test^1.5 host:test^2.0)

I'm not too sure why all the carets are appearing?

-----Original Message-----
From: Alvaro Cabrerizo [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 8 February 2007 1:25 AM
To: nutch-user@lucene.apache.org
Subject: Re: n00b question follow up

Hi: First you can check that the query plugins (query-basic, query-more, etc.) appear in your nutch-site.xml. If everything is ok, you can add a LOG line in the search method of the class org.apache.nutch.searcher.IndexSearcher in order to see how the lucene query is built. If I'm not wrong, you have to add at line 99:

LOG.info("query: " + luceneQuery.toString());

The method should look like this:

public Hits search(Query query, ...) {
  ...
  try {
    org.apache.lucene.search.BooleanQuery luceneQuery = this.queryFilters.filter(query);
    LOG.info("query: " + luceneQuery.toString());
    return ...
  ...

Recompile, and make a new query. Hope it helps.

2007/2/7, Patrick Simon [EMAIL PROTECTED]:

Hi All, this is an older post I made, with more details from the logs; hopefully it will be painfully obvious to someone out there why it's not working. It appears that I have successfully created a Nutch index via the command:

./nutch crawl ../urls -dir ../crawl.test -depth 5
I say it is successful because when I use Luke (a Lucene GUI tool that interrogates Lucene indexes) to view the index, a valid index and search results come up. The directory I point Luke to is /home/simonp/nutch-0.8/crawl.test/indexes/part-0 (the value I give for searcher.dir in nutch-default.xml is /home/simonp/nutch-0.8/crawl.test). The problem is that I cannot see any results via the command bin/nutch org.apache.nutch.searcher.NutchBean apache, or when I search for the string apache within the nutch servlet. I don't run any separate fetching or indexing, as the tutorial says not to for simple intranet searching. I am using Tomcat 5.5 and Nutch 0.8. Can anybody help with this one, please? The output from catalina.out is:

2007-02-06 09:01:27,990 INFO NutchBean - opening indexes in /home/simonp/nutch-8.0/crawl.test/indexes
2007-02-06 09:01:28,032 INFO Configuration - found resource common-terms.utf8 at file:/usr/local/tomcat/webapps/nutch-0.8/WEB-INF/classes/common-terms.utf8
2007-02-06 09:01:28,037 INFO NutchBean - opening segments in /home/simonp/nutch-8.0/crawl.test/segments
2007-02-06 09:01:28,056 INFO SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2007-02-06 09:01:28,056 INFO NutchBean - opening linkdb in /home/simonp/nutch-8.0/crawl.test/linkdb
2007-02-06 09:01:28,062 INFO NutchBean - query request from 192.168.5.173
2007-02-06 09:01:28,072 INFO NutchBean - query: ubuntu
2007-02-06 09:01:28,072 INFO NutchBean - lang: en
2007-02-06 09:01:28,101 INFO NutchBean - searching for 20 raw hits
2007-02-06 09:01:28,142 INFO NutchBean - total hits: 0
2007-02-06 09:01:30,506 INFO NutchBean - query request from 192.168.5.173
2007-02-06 09:01:30,506 INFO NutchBean - query: apache
2007-02-06 09:01:30,506 INFO NutchBean - lang: en
2007-02-06 09:01:30,507 INFO NutchBean - searching for 20 raw hits
2007-02-06 09:01:30,507 INFO NutchBean - total hits: 0
2007-02-06 09:01:51,191 INFO NutchBean - query request from 192.168.5.173
2007-02-06 09:01:51,191 INFO NutchBean - query: test
2007-02-06 09:01:51,191 INFO NutchBean - lang: en
2007-02-06 09:01:51,193 INFO NutchBean - searching for 20 raw hits
2007-02-06 09:01:51,193 INFO NutchBean - total hits: 0
2007-02-06 10:22:51,068 INFO NutchBean - query request from 192.168.5.173
2007-02-06 10:22:51,070 INFO NutchBean - query: test
2007-02-06 10:22:51,070 INFO NutchBean - lang: en
2007-02-06 10:22:51,073 INFO NutchBean - searching for 20 raw hits
2007-02-06 10:22:51,076 INFO NutchBean - total hits: 0

OAG Best Low Cost Airline Of The Year. The content of this e-mail, including any attachments, is a confidential communication between Virgin Blue, Pacific Blue or a related entity (or the sender if this email is a private communication) and the intended addressee and is for the sole use of that intended addressee. If you are not the intended addressee, any use, interference with, disclosure or copying of this material is unauthorized and prohibited.
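The caret syntax in the logged query (url:test^4.0, anchor:test^2.0, ...) marks per-field boosts. The following toy sketch is not Lucene's actual similarity formula — the document maps and boost table are invented for illustration — but it shows the idea: a term found in a boosted field contributes proportionally more to the score.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration of per-field query boosts, as in
 *  +(url:test^4.0 anchor:test^2.0 content:test title:test^1.5 host:test^2.0).
 *  NOT Lucene's real scoring; it only shows that a hit in a boosted field
 *  contributes proportionally more to the document score. */
public class BoostDemo {
    // Field boosts as printed in the logged Nutch query.
    static final Map<String, Double> BOOSTS = new LinkedHashMap<>();
    static {
        BOOSTS.put("url", 4.0);
        BOOSTS.put("anchor", 2.0);
        BOOSTS.put("content", 1.0);
        BOOSTS.put("title", 1.5);
        BOOSTS.put("host", 2.0);
    }

    /** Sum the boosts of every field whose text contains the term. */
    static double score(Map<String, String> doc, String term) {
        double score = 0.0;
        for (Map.Entry<String, String> e : doc.entrySet()) {
            Double boost = BOOSTS.get(e.getKey());
            if (boost != null && e.getValue().contains(term)) {
                score += boost;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, String> inUrl = new LinkedHashMap<>();
        inUrl.put("url", "http://host/test.html");
        inUrl.put("content", "hello world");

        Map<String, String> inContent = new LinkedHashMap<>();
        inContent.put("url", "http://host/index.html");
        inContent.put("content", "a test page");

        // A match in the url field (boost 4.0) outweighs one in content (boost 1.0).
        System.out.println(score(inUrl, "test"));      // 4.0
        System.out.println(score(inContent, "test"));  // 1.0
    }
}
```

So the carets in the log are expected: they are just how the boosted query prints, not a sign of a broken configuration.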
Re: n00b question follow up
Hi: First you can check that the query plugins (query-basic, query-more, etc.) appear in your nutch-site.xml. If everything is ok, you can add a LOG line in the search method of the class org.apache.nutch.searcher.IndexSearcher in order to see how the lucene query is built. If I'm not wrong, you have to add at line 99:

LOG.info("query: " + luceneQuery.toString());

The method should look like this:

public Hits search(Query query, ...) {
  ...
  try {
    org.apache.lucene.search.BooleanQuery luceneQuery = this.queryFilters.filter(query);
    LOG.info("query: " + luceneQuery.toString());
    return ...
  ...

Recompile, and make a new query. Hope it helps.

2007/2/7, Patrick Simon [EMAIL PROTECTED]:

Hi All, this is an older post I made, with more details from the logs; hopefully it will be painfully obvious to someone out there why it's not working. It appears that I have successfully created a Nutch index via the command:

./nutch crawl ../urls -dir ../crawl.test -depth 5

I say it is successful because when I use Luke (a Lucene GUI tool that interrogates Lucene indexes) to view the index, a valid index and search results come up. The directory I point Luke to is /home/simonp/nutch-0.8/crawl.test/indexes/part-0 (the value I give for searcher.dir in nutch-default.xml is /home/simonp/nutch-0.8/crawl.test). The problem is that I cannot see any results via the command bin/nutch org.apache.nutch.searcher.NutchBean apache, or when I search for the string apache within the nutch servlet. I don't run any separate fetching or indexing, as the tutorial says not to for simple intranet searching. I am using Tomcat 5.5 and Nutch 0.8. Can anybody help with this one, please?
The output from catalina.out is:

2007-02-06 09:01:27,990 INFO NutchBean - opening indexes in /home/simonp/nutch-8.0/crawl.test/indexes
2007-02-06 09:01:28,032 INFO Configuration - found resource common-terms.utf8 at file:/usr/local/tomcat/webapps/nutch-0.8/WEB-INF/classes/common-terms.utf8
2007-02-06 09:01:28,037 INFO NutchBean - opening segments in /home/simonp/nutch-8.0/crawl.test/segments
2007-02-06 09:01:28,056 INFO SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2007-02-06 09:01:28,056 INFO NutchBean - opening linkdb in /home/simonp/nutch-8.0/crawl.test/linkdb
2007-02-06 09:01:28,062 INFO NutchBean - query request from 192.168.5.173
2007-02-06 09:01:28,072 INFO NutchBean - query: ubuntu
2007-02-06 09:01:28,072 INFO NutchBean - lang: en
2007-02-06 09:01:28,101 INFO NutchBean - searching for 20 raw hits
2007-02-06 09:01:28,142 INFO NutchBean - total hits: 0
2007-02-06 09:01:30,506 INFO NutchBean - query request from 192.168.5.173
2007-02-06 09:01:30,506 INFO NutchBean - query: apache
2007-02-06 09:01:30,506 INFO NutchBean - lang: en
2007-02-06 09:01:30,507 INFO NutchBean - searching for 20 raw hits
2007-02-06 09:01:30,507 INFO NutchBean - total hits: 0
2007-02-06 09:01:51,191 INFO NutchBean - query request from 192.168.5.173
2007-02-06 09:01:51,191 INFO NutchBean - query: test
2007-02-06 09:01:51,191 INFO NutchBean - lang: en
2007-02-06 09:01:51,193 INFO NutchBean - searching for 20 raw hits
2007-02-06 09:01:51,193 INFO NutchBean - total hits: 0
2007-02-06 10:22:51,068 INFO NutchBean - query request from 192.168.5.173
2007-02-06 10:22:51,070 INFO NutchBean - query: test
2007-02-06 10:22:51,070 INFO NutchBean - lang: en
2007-02-06 10:22:51,073 INFO NutchBean - searching for 20 raw hits
2007-02-06 10:22:51,076 INFO NutchBean - total hits: 0
loading different indexes in tomcat
Hi: I've got two different indexes: myIndexA and myIndexB. I wanted to load both in tomcat but without merging them, so I've created a new dir, myIndexC. Under myIndexC/indexes I've deployed myIndexA/indexes/part-0, and myIndexB/indexes/part-0 renamed to part-1. Later I copied myIndexA/segments/* and myIndexB/segments/* to myIndexC/segments. I've also moved linkdb/current/part-0 from myIndexA and myIndexB to myIndexC. Summarizing, I've moved directories from indexes A and B to C. Then I started tomcat (searcher.dir points to C) and searches give me hits from both indexes A and B (based on what explain.jsp says). So, is there any difference in the results I get (e.g. page ranking) between this process and a real merge using Nutch's built-in commands? Thanks.
Re: how to use PorterStemFilter with NutchDocumentAnalyzer
Sorry, but I don't understand what you mean with "will that impact my search results?"

2007/1/23, DS jha [EMAIL PROTECTED]:

Yeah - you are right. I can implement this by extending NutchAnalyzer - however, that will bypass the current NutchDocumentAnalyzer and CommonGrams functionality - will that impact my search results? Thanks,

On 1/23/07, Alvaro Cabrerizo [EMAIL PROTECTED] wrote:

Hello: At least in nutch-0.8 you can find two plugins, analysis-es and analysis-de. The source code of these plugins is the best guide you can follow to implement your Analyzer.

2007/1/19, DS jha [EMAIL PROTECTED]:

Hello - I want to use PorterStemFilter (along with LowerCaseTokenizer) from the Lucene library, and was wondering how I would do it in Nutch. I can extend the DocumentAnalyzer class to return a PorterStemFilter and add the plugin to the config file - however, in that scenario I think NutchDocumentAnalyzer won't get called - so is there a way to invoke PorterStemFilter before/after NutchDocumentAnalyzer? Thanks, DS Jha
Re: exact matches and stemming
Maybe you could store in your index both the stemmed word and the original one, although it will increase the size of your index. Another possibility could be to develop a WildcardQuery plugin or a FuzzyQuery plugin, because lucene comes with these capabilities, and avoid the stemming task. But it is known that wildcard and fuzzy queries have poor performance. Hope it helps.

2007/1/24, Aïcha [EMAIL PROTECTED]:

Hello, I want to use the FrenchAnalyzer for stop word and stemming treatment, but I still want to be able to do exact searches. The problem is that the FrenchAnalyzer removes characters from the terms when the indexing is done, so it isn't possible to get only exact matches from an index built with the FrenchAnalyzer. Could someone help me? Thanks in advance, Aïcha
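The "store both forms" suggestion can be sketched as follows. The one-line stemmer here is a toy stand-in (it only lower-cases and strips a trailing 's') for a real stemmer such as the one inside Lucene's FrenchAnalyzer, and the two token lists stand in for two index fields; field names and helper names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Sketch of indexing both an exact field (original tokens, for exact match)
 *  and a stemmed field (for recall). The stemmer is a toy stand-in for a real
 *  one; a production analyzer does far more. */
public class DualFieldIndexer {
    /** Toy stemmer: lower-case and strip a trailing 's'. */
    static String stem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        return t.endsWith("s") && t.length() > 1 ? t.substring(0, t.length() - 1) : t;
    }

    /** Tokens for the exact-match field: original surface forms. */
    static List<String> exactTokens(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    /** Tokens for the stemmed field: variants conflated, index grows. */
    static List<String> stemmedTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : exactTokens(text)) out.add(stem(tok));
        return out;
    }

    public static void main(String[] args) {
        String text = "maisons bleues";
        System.out.println(exactTokens(text));   // [maisons, bleues]
        System.out.println(stemmedTokens(text)); // [maison, bleue]
    }
}
```

An exact query would then be run against the first field and a stemmed query against the second, at the cost of roughly doubling the indexed token count.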
Re: how to use PorterStemFilter with NutchDocumentAnalyzer
Hello: At least in nutch-0.8 you can find two plugins, analysis-es and analysis-de. The source code of these plugins is the best guide you can follow to implement your Analyzer.

2007/1/19, DS jha [EMAIL PROTECTED]:

Hello - I want to use PorterStemFilter (along with LowerCaseTokenizer) from the Lucene library, and was wondering how I would do it in Nutch. I can extend the DocumentAnalyzer class to return a PorterStemFilter and add the plugin to the config file - however, in that scenario I think NutchDocumentAnalyzer won't get called - so is there a way to invoke PorterStemFilter before/after NutchDocumentAnalyzer? Thanks, DS Jha
Re: Searcher doesn't find what expected
I recommend you check your index using Luke. With Luke you can manage (query, see the structure of...) your lucene index, in order to discover whether the problem happens during indexing or during search.

2007/1/16, kauu [EMAIL PROTECTED]:

So, you must show us the logs; and did you change the nutch-site.xml in tomcat?

On 1/16/07, Libor Štefek [EMAIL PROTECTED] wrote:

Hi, I'm using nutch 0.8.1 to index several thousand text files (source code) and I use the intranet crawling method to create an index. Everything looks fine, but when I try to search for something, it often doesn't find what it should. I'm sure that the term is in several pages, but I get results for only some of them. I tried to set limits in properties like page sizes, number of links etc., but nothing helped. There aren't any error messages in the logfile during the crawl. Is there any way to find the reason for this behavior? How can I make nutch more reliable in its results? Thanks for any hint. Libor
--
www.babatu.com
Re: problems to exclude subdirectories in a web site
Try to write your excluding patterns before the accepting patterns. If I'm not wrong, nutch follows the order of the patterns, so with your current file it first checks +^http://toto.web-site.net, which accepts all the urls you wanted to skip before your -^http://toto.web-site.net/... lines are ever reached. Your crawl-urlfilter.txt or regex-urlfilter.txt should instead look like this:

...
# URLs to exclude from indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)
# Website hostname for indexing
+^http://toto.web-site.net
# skip everything else
-.

Hope it helps.

2007/1/12, yleny @ ifrance. com [EMAIL PROTECTED]:

Hello, I want to exclude subdirectories of a website from indexing and I have not found the right parameters. I use Nutch-0.7.2 because it is impossible for me to index with Nutch-0.8.1 (it crashes). I want to exclude these subdirectories of my website:

/de/*
/en/*
/fr/mv/*

I tried the lines -^http://toto.web-site.net/de/([a-z0-9]*) and -^http://toto.web-site.net/de/* in my crawl-urlfilter.txt file, but they don't work and nutch indexes these urls, which I don't want. Any idea? I have the default regex-urlfilter.txt, and my personal crawl-urlfilter.txt is:

# The url filter file used by the crawl command.
# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/

# Website hostname for indexing
+^http://toto.web-site.net

# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)

# skip everything else
-.

*** my default regex-urlfilter.txt file is ***

# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept anything else
+.
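The first-match-wins rule described in the filter file's comments is what makes pattern order matter. The sketch below simulates that rule with java.util.regex (it mirrors the documented behaviour, not Nutch's actual RegexURLFilter code; the rule lists are the ones from this thread):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of crawl-urlfilter.txt evaluation: the first pattern whose regex
 *  matches decides ('+' accept, '-' reject); if nothing matches, the URL is
 *  ignored. Mirrors the first-match-wins rule from the file's comments. */
public class UrlFilterDemo {
    static Boolean decide(List<String> rules, String url) {
        for (String rule : rules) {
            char sign = rule.charAt(0);
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) {   // first matching pattern wins
                return sign == '+';
            }
        }
        return null; // no pattern matched: URL ignored
    }

    public static void main(String[] args) {
        // Exclusions listed BEFORE the accepting host pattern, as advised.
        List<String> fixed = Arrays.asList(
            "-^http://toto\\.web-site\\.net/de/",
            "-^http://toto\\.web-site\\.net/en/",
            "-^http://toto\\.web-site\\.net/fr/mv/",
            "+^http://toto\\.web-site\\.net",
            "-.");
        System.out.println(decide(fixed, "http://toto.web-site.net/de/page1"));    // false
        System.out.println(decide(fixed, "http://toto.web-site.net/index.html"));  // true

        // With the accept rule first, the /de/ exclusion is never reached.
        List<String> broken = Arrays.asList(
            "+^http://toto\\.web-site\\.net",
            "-^http://toto\\.web-site\\.net/de/");
        System.out.println(decide(broken, "http://toto.web-site.net/de/page1"));   // true
    }
}
```

This is why moving the -^http://toto.web-site.net/de/... lines above +^http://toto.web-site.net fixes the problem: once the accept pattern matches, the exclusions below it are dead rules.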
Re: Written a plugin: now nutch fails with an error
2006/11/21, Nicolás Lichtmaier [EMAIL PROTECTED]:

Which nutch/hadoop are you using? The conf/log4j.properties file determines the logging level. Maybe setting the logging level to INFO or DEBUG may help.

Oh, sorry =). I was trying to index some extra field I calculate from the URLs. So I followed the instructions at http://wiki.apache.org/nutch/CreateNewFilter . I've created two classes: one implementing IndexingFilter and another extending RawFieldQueryFilter. And this is the plugin.xml file:

Did you add my-plugin to the src/plugin/build.xml file? If not, add it there and recompile. And check the class files in the build dir: check for com.pierpoint/..., if they exist.

I just compiled it, made a jar and put it in an existing nutch installation. Then I added it to the regex in the plugin.includes property in nutch-site.xml.

However, it seems that your error might have been caused by another reason, since the generator runs neither IndexingFilters nor QueryFilters. If you added another plugin (like a url normalizer), I suggest you check that one.

Yes, but the real problem is not being able to see anything. Some part of nutch is doing something ugly: swallowing the real error and displaying a generic one. That's always a bad thing to do. I've tried to enable some more debug info but I've failed; I've edited Java's logging.properties and nutch's log4j.properties (replacing INFO with DEBUG), but I couldn't get nutch to be more verbose... =/

Have you tried adding your own LOG lines within the code you want to inspect (at least in your new plugin), recompiling and executing? Hope it helps
Re: Re-injecting URLS, perhaps by removing them from the CrawlDB first?
Hi: At least in nutch 0.8.X you can filter the crawldb, linkdb and segments using URL patterns (e.g. by editing regex-urlfilter.txt). Just execute this command:

nutch_home/bin/nutch mergedb

It will answer with the usage information:

CrawlDbMerger output_crawldb crawldb1 [crawldb2 crawldb3 ...] [-filter]
  output_crawldb  output CrawlDb
  crawldb1 ...    input CrawlDb-s
  -filter         use URLFilters on urls in the crawldb(s)

So, in order to use it, run a command like:

nutch_home/bin/nutch mergedb output_crawldb your_crawldb -filter

Before running it, you have to edit regex-urlfilter.txt (if you are using the urlfilter-regex plugin), adding the URL patterns you want to filter. In order to check that it works, dump your crawldb and the filtered one using the readdb nutch command, and make a diff. Hope it helps.
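The suggested sanity check (dump both crawldbs with readdb and diff them) boils down to a set difference over the URL keys. In this sketch the two URL lists stand in for the two dump files; the example URLs are invented:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the dump-and-diff check after running mergedb with -filter:
 *  the URLs present in the original dump but missing from the filtered dump
 *  are exactly the ones the URL filters removed. */
public class CrawlDbDiff {
    static Set<String> removedUrls(List<String> before, List<String> after) {
        Set<String> removed = new LinkedHashSet<>(before);
        removed.removeAll(after);   // present before filtering, gone after
        return removed;
    }

    public static void main(String[] args) {
        List<String> before = List.of(
            "http://site.net/a.html",
            "http://site.net/de/b.html",
            "http://site.net/c.html");
        List<String> after = List.of(
            "http://site.net/a.html",
            "http://site.net/c.html");
        System.out.println(removedUrls(before, after)); // [http://site.net/de/b.html]
    }
}
```

If the removed set contains URLs your patterns should not have matched (or misses ones they should), the regex-urlfilter.txt entries need adjusting before you commit to the merged db.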
Re: I can not query myplugin in field category:test
You can use the existing subcollection plugin in nutch 0.8.X and extend it to use regular expressions. Basically you have to modify the class org.apache.nutch.collection.Subcollection: change the method filter (lines 146-154) and substitute

if (urlString.indexOf(row) != -1)

with something like

if (Pattern.matches(row, urlString))

This approach lets you:
- use the existing file subcollection.xml to define your url-expression/categories
- use the package java.util.regex to define matching urls

Here is a sample of subcollection.xml, after modifying the subcollection plugin:

<subcollection>
  <name>myCategory</name>
  <id>myCategory</id>
  <whitelist>http://www.categorySite.es/sectionA(.)*sectionB(.)*</whitelist>
  <blacklist>http://www.categorySite.es/sectionA(.)*sectionB(.)*sectionC(.)*</blacklist>
</subcollection>

2006/10/14, Ernesto De Santis [EMAIL PROTECTED]:

Hi Chad, the link was a configuration example. A more explained example: http://www.misite.com/videos/.*=videos (rule A). If the fetched url matches rule A, then index a Field named 'category' with value 'videos'. Later you can search over this category field to filter your searches. I will send this plugin in another new thread. I post the plugin here, on the list; I don't know another way to share it with you. Regards, Ernesto.

[EMAIL PROTECTED] escribió:

Couldn't get the link to work, but yes, if you could share it that would be great. Chad Savage

Ernesto De Santis wrote:

I did a url-category-indexer. It works with a .properties file that maps urls written as regexps to categories. Example: http://www.misite.com/videos/.*=videos If it seems useful, I can share it. Maybe it would be better to configure it in a .xml file. Regards, Ernesto.

Stefan Neufeind escribió:

Alvaro Cabrerizo wrote: Have you included a node to describe your new searcher filter in plugin.xml?

2006/10/11, xu nutch [EMAIL PROTECTED]: I have a question about my plugin for an indexfilter and queryfilter. Can you help me?

- In MoreIndexingFilter.java, add:

doc.add(new Field("category", "test", false, true, false));

- CategoryQueryFilter.java:

package org.apache.nutch.searcher.more;

import org.apache.nutch.searcher.RawFieldQueryFilter;

/** Handles "category:" query clauses, causing them to search the field indexed
 * by BasicIndexingFilter. */
public class CategoryQueryFilter extends RawFieldQueryFilter {
  public CategoryQueryFilter() {
    super("category");
  }
}

- nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least to include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>

I use luke to query category:test and it is ok! But when I use the tomcat website to query category:test, no result is returned.

In case you get the search working: how do you plan to categorize URLs/sites? I'm looking for a solution there, since I didn't yet manage to implement something URL-prefix-filter based to map categories to URLs or so. Regards, Stefan
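The proposed one-line change in Subcollection.filter() swaps a substring test for a whole-string regex match, which is what makes the sectionA...sectionB whitelist entries possible. The sketch below contrasts the two rules in isolation (it reuses the sample patterns from the thread; it is not the plugin's actual code):

```java
import java.util.regex.Pattern;

/** Contrast of the original substring rule in Subcollection.filter()
 *  (urlString.indexOf(row) != -1) with the proposed regex rule
 *  (Pattern.matches(row, urlString)), which requires the whitelist entry,
 *  read as a regex, to match the WHOLE url. */
public class SubcollectionMatchDemo {
    static boolean substringRule(String row, String urlString) {
        return urlString.indexOf(row) != -1;
    }

    static boolean regexRule(String row, String urlString) {
        return Pattern.matches(row, urlString);
    }

    public static void main(String[] args) {
        String url = "http://www.categorySite.es/sectionA/x/sectionB/y";

        // Substring rule: only literal prefixes/fragments can be expressed.
        System.out.println(substringRule("http://www.categorySite.es/sectionA", url)); // true

        // Regex rule: one entry can require sectionA followed by sectionB.
        String row = "http://www\\.categorySite\\.es/sectionA(.)*sectionB(.)*";
        System.out.println(regexRule(row, url));                                       // true
        System.out.println(regexRule(row, "http://www.categorySite.es/sectionA/only")); // false
    }
}
```

Note that Pattern.matches() anchors at both ends, so existing literal whitelist entries that relied on substring behaviour would need a trailing (.)* after the change.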
Re: I can not query myplugin in field category:test
Have you included a node to describe your new searcher filter in plugin.xml?

2006/10/11, xu nutch [EMAIL PROTECTED]:

I have a question about my plugin for an indexfilter and queryfilter. Can you help me?

- In MoreIndexingFilter.java, add:

doc.add(new Field("category", "test", false, true, false));

- CategoryQueryFilter.java:

package org.apache.nutch.searcher.more;

import org.apache.nutch.searcher.RawFieldQueryFilter;

/** Handles "category:" query clauses, causing them to search the field indexed
 * by BasicIndexingFilter. */
public class CategoryQueryFilter extends RawFieldQueryFilter {
  public CategoryQueryFilter() {
    super("category");
  }
}

- nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least to include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>

I use luke to query category:test and it is ok! But when I use the tomcat website to query category:test, no result is returned.
Re: category string gets matched as a term
It looks like your syntax is correct (category:video searchString). Try to write a LOG.info line into org.apache.nutch.searcher.LuceneQueryOptimizer (line 178), just at the beginning of the optimize method:

public TopDocs optimize(BooleanQuery original, Searcher searcher, int numHits,
    String sortField, boolean reverse) throws IOException {
  LOG.info("Query: " + original.toString());

Recompile nutch and make a query, for example category:video funny. If your category plugin works fine, you'll get an info line within hadoop.log similar to this:

+(url:funny^0.0 anchor:funny^0.0 content:funny title:funny^0.0 host:funny^0.0) +category:video

The first part, +(url:funny^0.0 anchor:funny^0.0 content:funny title:funny^0.0 host:funny^0.0), means that funny must appear in at least one of those fields (url, anchor...). The second part filters the results to obtain only the ones tagged as video. In your case it looks like the word video is being included in the first part. Check that your plugin implementation is correct, and that plugin.xml and build.xml are correct. Your plugin.xml should look similar to this:

...
<extension id="..." name="..." point="org.apache.nutch.searcher.QueryFilter">
  <implementation id="..." class="...">
    <parameter name="raw-fields" value="category"/>
  </implementation>
</extension>

Hope it helps.

2006/10/3, Dima Gritsenko [EMAIL PROTECTED]:

Hi, I have categorized web sites during the crawl to provide filtered results, similar to Google's Video and Images tabs. But when I enter category:video MySearchString, nutch matches both video and MySearchString as terms (though it filters results correctly and displays links only to video-categorized pages), so the search is not relevant, since the string video is matched as well. How do I filter the category string off during the search? Great thanks. Dima.
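The intended split in the logged query — term clauses on the left, a +category:video filter clause on the right — can be mimicked with a toy tokenizer. This imitates the symptom, not Nutch's real query parser: if "category" is not registered as a raw field (the query plugin is missing or misconfigured), the token falls through and is scored as an ordinary term, which is exactly the behaviour Dima describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy split of a query like "category:video funny" into raw-field filter
 *  clauses and plain search terms, driven by the set of registered raw
 *  fields. Not Nutch's actual parser. */
public class RawFieldSplitDemo {
    /** field:value tokens whose field is registered become filter clauses. */
    static List<String> filters(String query, Set<String> rawFields) {
        List<String> out = new ArrayList<>();
        for (String tok : query.trim().split("\\s+")) {
            int colon = tok.indexOf(':');
            if (colon > 0 && rawFields.contains(tok.substring(0, colon))) out.add(tok);
        }
        return out;
    }

    /** Everything else is scored against url/anchor/content/title/host. */
    static List<String> terms(String query, Set<String> rawFields) {
        List<String> out = new ArrayList<>();
        for (String tok : query.trim().split("\\s+")) {
            int colon = tok.indexOf(':');
            if (!(colon > 0 && rawFields.contains(tok.substring(0, colon)))) out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        // Plugin registered: "video" only filters, "funny" is the search term.
        System.out.println(filters("category:video funny", Set.of("category"))); // [category:video]
        System.out.println(terms("category:video funny", Set.of("category")));   // [funny]
        // No raw field registered: the whole token degrades into a search term.
        System.out.println(terms("category:video funny", Set.of()));
    }
}
```

So the LOG line in the optimize method is the right diagnostic: if video shows up inside the boosted term group rather than as a separate +category:video clause, the raw-fields registration in plugin.xml is not taking effect.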
Re: focussed crawling
Although I haven't used it: after making a crawl, at least in nutch 0.8, you can run ./bin/nutch mergedb outputdb your_db -filter. Using the -filter option you can generate a new db, filtering out the links you want to remove, and use it to make a recrawl. Hope it helps.
Re: stop an index server
Hi: It works fine. Thanks again.

2006/9/28, Sami Siren [EMAIL PROTECTED]:

Hello, here's an ad hoc addition to the search server to support a shutdown command. The client calls the server like this:

bin/nutch 'org.apache.nutch.searcher.DistributedSearch$Client' -shutdown 127.0.0.1
--
Sami Siren

Alvaro Cabrerizo wrote:

2006/9/27, Sami Siren [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]:

Alvaro Cabrerizo wrote: How could I stop an index server (started with bin/nutch server port index) knowing the port? Thanks in advance.

It does not support such a feature. Can you describe a little bit more what you are trying to accomplish - something similar to Tomcat's SHUTDOWN?

Sure, that's right. If this feature doesn't exist, I'm looking for a clue to develop a SHUTDOWN and a RESTART command using the NUTCH/HADOOP api. The idea is to have a group of JAVA classes that lets people execute a command like SERVER_RESTART port, or more advanced, SERVER_RESTART port ip_address. Anyway, I can execute ps aux | grep 4 in a shell and find out the process number in order to kill it, or I can make a ^C to stop it, but this is not the solution I'm looking for. Thanks in advance.
--
Sami Siren
Re: stop an index server
2006/9/27, Sami Siren [EMAIL PROTECTED]:

Alvaro Cabrerizo wrote: How could I stop an index server (started with bin/nutch server port index) knowing the port? Thanks in advance.

It does not support such a feature. Can you describe a little bit more what you are trying to accomplish - something similar to Tomcat's SHUTDOWN?

Sure, that's right. If this feature doesn't exist, I'm looking for a clue to develop a SHUTDOWN and a RESTART command using the NUTCH/HADOOP api. The idea is to have a group of JAVA classes that lets people execute a command like SERVER_RESTART port, or more advanced, SERVER_RESTART port ip_address. Anyway, I can execute ps aux | grep 4 in a shell and find out the process number in order to kill it, or I can make a ^C to stop it, but this is not the solution I'm looking for. Thanks in advance.
--
Sami Siren
Re: stop an index server
Ok, I'll try to explain it more clearly. Imagine that you have finished crawling a group of sites and you have a well-formed index. Then you configure tomcat, create a nutch-site.xml, and add the property searcher.dir pointing to a search-servers.txt that contains this line: 127.0.0.1 4. Then you start tomcat, and an index server using the command nutch_home/bin/nutch server 4 myIndexDir. Now you can get the results of that server via tomcat, in a distributed way.

At this point I would like to know how to stop the server running on port 4. I can execute ps aux | grep 4 in a shell and find out the process number in order to kill it, or I can make a ^C to stop it, but this is not the solution I'm looking for. I've tried this piece of code (based on org.apache.nutch.searcher.DistributedSearch):

Configuration conf = NutchConfiguration.create();
InetSocketAddress[] a_InetSocketAddress = new InetSocketAddress[1];
a_InetSocketAddress[0] = new InetSocketAddress("localhost", 4);
Object[][] params = new Object[1][0];
Method get_method = org.apache.hadoop.ipc.Server.class.getMethod("get", new Class[] {});
org.apache.hadoop.ipc.Server[] servers =
    (org.apache.hadoop.ipc.Server[]) RPC.call(get_method, params, a_InetSocketAddress, conf);
servers[0].stop();

Executing this code gives me a NullPointerException, because RPC.call returns an array of nulls. If I understand it correctly, when we execute nutch_home/bin/nutch server 4 myIndexDir, we are enveloping a NutchBean in an RPC layer, which lets us access the methods of NutchBean via RPC calls, but not the org.apache.hadoop.ipc.Server methods. Summarizing, the question is how to get the instance of that server (the org.apache.hadoop.ipc.Server running on port 4) in order to make a STOP. Once I can make a stop, I can update the index and restart it. Thanks for your answer.

2006/9/26, Jim Wilson [EMAIL PROTECTED]:

Do you mean what crawl-urlfilter.txt line you'd need? I think the following would do it:

-^http://server:port/

But I'm not convinced that this is what you were asking...
--
Jim

On 9/26/06, Alvaro Cabrerizo [EMAIL PROTECTED] wrote:

How could I stop an index server (started with bin/nutch server port index) knowing the port? Thanks in advance.
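The Tomcat-style alternative discussed in this thread — have the server process itself listen for a SHUTDOWN command instead of reaching into the hadoop RPC Server instance from outside — can be sketched with plain sockets. The class and command names below are invented for illustration; a real integration would call the search server's stop logic where the flag is set.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Sketch of a Tomcat-style shutdown port: the server process listens on a
 *  side port for a SHUTDOWN line and stops itself, so no external code needs
 *  a handle on the RPC Server instance. Names are illustrative only. */
public class ShutdownListener {
    private final ServerSocket control;
    volatile boolean stopped = false;

    ShutdownListener(int port) throws Exception {
        control = new ServerSocket(port); // port 0 picks a free ephemeral port
    }

    int port() { return control.getLocalPort(); }

    /** Block until one client connects and sends SHUTDOWN. */
    void awaitShutdown() throws Exception {
        try (Socket s = control.accept();
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8))) {
            if ("SHUTDOWN".equals(in.readLine())) {
                stopped = true;   // here the real server would stop serving
            }
        }
        control.close();
    }

    /** Client side: connect to the control port and ask the server to stop. */
    static void sendShutdown(String host, int port) throws Exception {
        try (Socket s = new Socket(host, port);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            out.println("SHUTDOWN");
        }
    }

    public static void main(String[] args) throws Exception {
        ShutdownListener server = new ShutdownListener(0);
        Thread t = new Thread(() -> {
            try { server.awaitShutdown(); } catch (Exception e) { throw new RuntimeException(e); }
        });
        t.start();
        sendShutdown("127.0.0.1", server.port());
        t.join();
        System.out.println("stopped=" + server.stopped); // stopped=true
    }
}
```

This is essentially the shape of the -shutdown client command Sami added in the later message: the stop request travels over the network to the process that owns the server object, rather than trying to obtain that object remotely via RPC reflection.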
Getting subcollections
Hi, I would like to know how I could get all the subcollections, and how many documents belong to each subcollection, after making a query. The approach I took was to iterate over the results, getting the details for each one. The problem is that every query I make is limited by numHits [LuceneQueryOptimizer.optimize(... int numHits ...)], so I can iterate only over that number of hits. Although I could set numHits to a high number, this solution doesn't let me inspect every hit (i.e. when the total number of hits is bigger than numHits). Thanks in advance.
Getting subcollections
Hi, I would like to know how I could get all the subcollections, and how many documents belong to each subcollection, after making a query. The approach I took was to iterate over the results, getting the details for each one. The problem is that every query I make is limited by numHits [LuceneQueryOptimizer.optimize(... int numHits ...) and org.apache.lucene.search.Searcher], so I can iterate only over a subset of the results. Thanks in advance.
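The undercounting described above can be made concrete with a toy count: per-subcollection totals are only correct when every matching document is visited, while counting over just the top numHits results truncates them. The (subcollection, url) pairs below are invented for illustration; real faceting engines compute these counts index-side over all matches rather than by iterating fetched hits.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of why iterating only the top numHits results undercounts
 *  subcollections: the counts are correct only when computed over every
 *  matching document, not over a truncated hit list. */
public class SubcollectionCountDemo {
    /** Count hits[0..limit) by their subcollection tag (element 0 of each pair). */
    static Map<String, Integer> countBySubcollection(List<String[]> hits, int limit) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int n = Math.min(limit, hits.size());
        for (int i = 0; i < n; i++) {
            counts.merge(hits.get(i)[0], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> hits = Arrays.asList(
            new String[] {"docs", "http://site/a"},
            new String[] {"docs", "http://site/b"},
            new String[] {"wiki", "http://site/c"},
            new String[] {"docs", "http://site/d"},
            new String[] {"wiki", "http://site/e"});

        // Counting every hit gives the real totals ...
        System.out.println(countBySubcollection(hits, hits.size())); // {docs=3, wiki=2}
        // ... but with numHits=2 the counts are truncated, as described above.
        System.out.println(countBySubcollection(hits, 2));           // {docs=2}
    }
}
```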