Re: How to do faceting on data indexed by Nutch

2010-04-26 Thread Alvaro Cabrerizo
2010/4/25 Andrzej Bialecki a...@getopt.org

 On 2010-04-25 15:03, KK wrote:
  Hi All,
  I might be repeating a question asked by someone else, but googling
  didn't help me find any such mail responses.
  I'm pretty much aware of Solr/Lucene and its basic architecture. I've
  done hit highlighting in Lucene and have an idea of the faceting
  support in Solr, but I've never actually tried it. I want to implement
  faceting on Nutch's indexed data; I already have some MBs of data
  indexed by Nutch. Can someone give me pointers on how to proceed?
  Or do I have to query through the Solr interface and redirect all
  queries to the index already created by Nutch? What is the best,
  simplest way to achieve this? Please help in this regard.

 Nutch has two indexing/searching backends - the one that is configured
 by default uses plain Lucene, and it does not support faceting. The
 other backend uses Solr, and then of course it supports faceting and all
 other Solr features.

 So in your case you need to switch to use Solr indexing (and searching).

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com



Although I haven't tried it, maybe you can build a faceting plugin for nutch
using http://sna-projects.com/bobo .

Regards


Re: n00b question follow up

2007-02-13 Thread Alvaro Cabrerizo

It looks like the query is well formed. The query means (the scoring part is
a little more complicated): "please, index, give me all the documents you
have that contain the word test in the field url OR anchor OR content OR
host. If you find it in the field url, score the document with a boost
factor of 4; if you find it in the field anchor, use a boost of 2...".
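The multi-field boosted expansion described above can be sketched in plain Java (the field names and boost values are taken from the query shown later in this thread; this is an illustrative reconstruction, not Nutch's actual code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy illustration of how a single search term is expanded across
// several fields with per-field boosts (url^4.0, anchor^2.0, ...).
public class BoostedQueryDemo {
    static String expand(String term, Map<String, Float> boosts) {
        StringBuilder sb = new StringBuilder("+(");
        boolean first = true;
        for (Map.Entry<String, Float> e : boosts.entrySet()) {
            if (!first) sb.append(' ');
            first = false;
            sb.append(e.getKey()).append(':').append(term);
            // a boost of 1.0 is the default and is not printed
            if (e.getValue() != 1.0f) sb.append('^').append(e.getValue());
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        boosts.put("url", 4.0f);
        boosts.put("anchor", 2.0f);
        boosts.put("content", 1.0f);
        boosts.put("title", 1.5f);
        boosts.put("host", 2.0f);
        System.out.println(expand("test", boosts));
        // prints: +(url:test^4.0 anchor:test^2.0 content:test title:test^1.5 host:test^2.0)
    }
}
```

The output matches the query string logged later in this thread, which is how you can tell the query-basic plugin's boosts are being applied.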

Maybe it would be helpful to see your nutch-site.xml (e.g. the one you are
using when deploying nutch.war in tomcat: nutch/WEB-INF/classes/nutch-site.xml).


2007/2/12, Patrick Simon [EMAIL PROTECTED]:


All of my config stuff sits in  nutch-default.xml (nutch-site.xml
doesn't change anything but I assume this should be fine)

The output in my log file from what you advised below is

Query - +(url:test^4.0 anchor:test^2.0 content:test title:test^1.5
host:test^2.0)

I'm not too sure why all the carets are appearing?


-Original Message-
From: Alvaro Cabrerizo [mailto:[EMAIL PROTECTED]
Sent: Thursday, 8 February 2007 1:25 AM
To: nutch-user@lucene.apache.org
Subject: Re: n00b question follow up

Hi:

First, you can check that the query plugins (query-basic, more, etc.) appear
in your nutch-site.xml. If everything is OK, you can add a LOG line in
the method search of the class org.apache.nutch.searcher.IndexSearcher
in order to see how the lucene query is built. If I'm not wrong, you have
to add at line 99: LOG.info("query - " + luceneQuery.toString());
This method should look like this:

public Hits search(Query query...)
...
try {
    org.apache.lucene.search.BooleanQuery luceneQuery =
        this.queryFilters.filter(query);
    LOG.info("query - " + luceneQuery.toString());
    return ..
Recompile, and make a new query.

Hope it helps.





2007/2/7, Patrick Simon [EMAIL PROTECTED]:

 Hi All,

 This is an older post I made, with more details from logs, that will
 hopefully make it painfully obvious to someone out there why it's not
 working.

 It appears that I have successfully created a Nutch index via the
 command nutch/bin :./nutch crawl ../urls -dir ../crawl.test -depth
5.

 I say it is successful because when I use Luke (a Lucene GUI tool that
 interrogates Lucene indexes) to view the index, a valid index and
 search results come up.

 The directory I point Luke to is
 /home/simonp/nutch-0.8/crawl.test/indexes/part-0 (the value I give

 for searcher.dir in nutch-default.xml is
 /home/simonp/nutch-0.8/crawl.test)

 The problem is that I cannot see any results via the command
 bin/nutch org.apache.nutch.searcher.NutchBean apache or when I
 search for the string apache within the nutch servlet.

 I don't run any fetching or indexing as the tutorial says not to for
 simple intranet searching.

 I am using Tomcat 5.5 and Nutch 0.8.

 Can any body help with this one please?

 The output from catalina.out is

 2007-02-06 09:01:27,990 INFO  NutchBean - opening indexes in
 /home/simonp/nutch-8.0/crawl.test/indexes
 2007-02-06 09:01:28,032 INFO  Configuration - found resource
 common-terms.utf8 at
 file:/usr/local/tomcat/webapps/nutch-0.8/WEB-INF/classes/common-terms.
 ut
 f8
 2007-02-06 09:01:28,037 INFO  NutchBean - opening segments in
 /home/simonp/nutch-8.0/crawl.test/segments
 2007-02-06 09:01:28,056 INFO  SummarizerFactory - Using the first
 summarizer extension found: Basic Summarizer
 2007-02-06 09:01:28,056 INFO  NutchBean - opening linkdb in
 /home/simonp/nutch-8.0/crawl.test/linkdb
 2007-02-06 09:01:28,062 INFO  NutchBean - query request from
 192.168.5.173
 2007-02-06 09:01:28,072 INFO  NutchBean - query: ubuntu
 2007-02-06 09:01:28,072 INFO  NutchBean - lang: en
 2007-02-06 09:01:28,101 INFO  NutchBean - searching for 20 raw hits
 2007-02-06 09:01:28,142 INFO  NutchBean - total hits: 0
 2007-02-06 09:01:30,506 INFO  NutchBean - query request from
 192.168.5.173
 2007-02-06 09:01:30,506 INFO  NutchBean - query: apache
 2007-02-06 09:01:30,506 INFO  NutchBean - lang: en
 2007-02-06 09:01:30,507 INFO  NutchBean - searching for 20 raw hits
 2007-02-06 09:01:30,507 INFO  NutchBean - total hits: 0
 2007-02-06 09:01:51,191 INFO  NutchBean - query request from
 192.168.5.173
 2007-02-06 09:01:51,191 INFO  NutchBean - query: test
 2007-02-06 09:01:51,191 INFO  NutchBean - lang: en
 2007-02-06 09:01:51,193 INFO  NutchBean - searching for 20 raw hits
 2007-02-06 09:01:51,193 INFO  NutchBean - total hits: 0
 2007-02-06 10:22:51,068 INFO  NutchBean - query request from
 192.168.5.173
 2007-02-06 10:22:51,070 INFO  NutchBean - query: test
 2007-02-06 10:22:51,070 INFO  NutchBean - lang: en
 2007-02-06 10:22:51,073 INFO  NutchBean - searching for 20 raw hits
 2007-02-06 10:22:51,076 INFO  NutchBean - total hits: 0


Re: n00b question follow up

2007-02-07 Thread Alvaro Cabrerizo

Hi:

First, you can check that the query plugins (query-basic, more, etc.) appear in
your nutch-site.xml. If everything is OK,
you can add a LOG line in the method search of the class
org.apache.nutch.searcher.IndexSearcher in order to see how the lucene query
is built. If I'm not wrong, you have to add at line 99:
LOG.info("query - " + luceneQuery.toString()); This method should look like this:

public Hits search(Query query...)
...
try {
    org.apache.lucene.search.BooleanQuery luceneQuery =
        this.queryFilters.filter(query);
    LOG.info("query - " + luceneQuery.toString());
    return ..

Recompile, and make a new query.

Hope it helps.





2007/2/7, Patrick Simon [EMAIL PROTECTED]:


Hi All,

This is an older post I made, with more details from logs, that will
hopefully make it painfully obvious to someone out there why it's not
working.

It appears that I have successfully created a Nutch index via the
command nutch/bin :./nutch crawl ../urls -dir ../crawl.test -depth 5.

I say it is successful because when I use Luke (a Lucene GUI tool that
interrogates Lucene indexes) to view the index, a valid index and search
results come up.

The directory I point Luke to is
/home/simonp/nutch-0.8/crawl.test/indexes/part-0 (the value I give
for searcher.dir in nutch-default.xml is
/home/simonp/nutch-0.8/crawl.test)

The problem is that I cannot see any results via the command bin/nutch
org.apache.nutch.searcher.NutchBean apache or when I search for the
string apache within the nutch servlet.

I don't run any fetching or indexing as the tutorial says not to for
simple intranet searching.

I am using Tomcat 5.5 and Nutch 0.8.

Can any body help with this one please?

The output from catalina.out is

2007-02-06 09:01:27,990 INFO  NutchBean - opening indexes in
/home/simonp/nutch-8.0/crawl.test/indexes
2007-02-06 09:01:28,032 INFO  Configuration - found resource
common-terms.utf8 at
file:/usr/local/tomcat/webapps/nutch-0.8/WEB-INF/classes/common-terms.ut
f8
2007-02-06 09:01:28,037 INFO  NutchBean - opening segments in
/home/simonp/nutch-8.0/crawl.test/segments
2007-02-06 09:01:28,056 INFO  SummarizerFactory - Using the first
summarizer extension found: Basic Summarizer
2007-02-06 09:01:28,056 INFO  NutchBean - opening linkdb in
/home/simonp/nutch-8.0/crawl.test/linkdb
2007-02-06 09:01:28,062 INFO  NutchBean - query request from
192.168.5.173
2007-02-06 09:01:28,072 INFO  NutchBean - query: ubuntu
2007-02-06 09:01:28,072 INFO  NutchBean - lang: en
2007-02-06 09:01:28,101 INFO  NutchBean - searching for 20 raw hits
2007-02-06 09:01:28,142 INFO  NutchBean - total hits: 0
2007-02-06 09:01:30,506 INFO  NutchBean - query request from
192.168.5.173
2007-02-06 09:01:30,506 INFO  NutchBean - query: apache
2007-02-06 09:01:30,506 INFO  NutchBean - lang: en
2007-02-06 09:01:30,507 INFO  NutchBean - searching for 20 raw hits
2007-02-06 09:01:30,507 INFO  NutchBean - total hits: 0
2007-02-06 09:01:51,191 INFO  NutchBean - query request from
192.168.5.173
2007-02-06 09:01:51,191 INFO  NutchBean - query: test
2007-02-06 09:01:51,191 INFO  NutchBean - lang: en
2007-02-06 09:01:51,193 INFO  NutchBean - searching for 20 raw hits
2007-02-06 09:01:51,193 INFO  NutchBean - total hits: 0
2007-02-06 10:22:51,068 INFO  NutchBean - query request from
192.168.5.173
2007-02-06 10:22:51,070 INFO  NutchBean - query: test
2007-02-06 10:22:51,070 INFO  NutchBean - lang: en
2007-02-06 10:22:51,073 INFO  NutchBean - searching for 20 raw hits
2007-02-06 10:22:51,076 INFO  NutchBean - total hits: 0





loading different indexes in tomcat

2007-02-07 Thread Alvaro Cabrerizo

Hi:

I've got two different indexes: myIndexA and myIndexB. I wanted to load
both in tomcat, but without merging them. So I've created a new dir myIndexC.
Under myIndexC/indexes I've deployed myIndexA/indexes/part-0 and
myIndexB/indexes/part-0, renamed to part-1. Later I've copied
myIndexA/segments/* and myIndexB/segments/* to myIndexC/segments. I've also
moved linkdb/current/part-0 from myIndexA and myIndexB to myIndexC.

Summarizing, I've moved directories from indexes A and B to C. Then I started
tomcat (searcher.dir points to C) and searches give me hits from both
indexes A and B (based on what explain.jsp says). So, is there any difference
in the results I get (e.g. page ranking) between this process and a real
merge using Nutch's built-in commands?

Thanks.


Re: how to use PorterStemFilter with NutchDocumentAnalyzer

2007-01-29 Thread Alvaro Cabrerizo

Sorry, but I don't understand what you mean by "will that impact my search
results?"



2007/1/23, DS jha [EMAIL PROTECTED]:


yeah - you are right. I can implement this by extending NutchAnalyzer
- however, that will bypass current NutchDocumentAnalyzer and
CommonGrams functionality - will that impact my search results?

Thanks,


On 1/23/07, Alvaro Cabrerizo [EMAIL PROTECTED] wrote:
 Hello:

 At least in nutch-0.8 you can find two analysis plugins, analysis-es and
 analysis-de. The source code of these plugins is the best guide you can
 follow to implement your own Analyzer.

 2007/1/19, DS jha [EMAIL PROTECTED]:
 
  Hello –
 
  I want to use PorterStemFilter (along with LowerCaseTokenizer) from
  Lucene library – and was wondering, how would I do it in Nutch. I can
  extend DocumentAnalyzer class for returning PorterStemFilter  and add
  the plugin the config file - however, in that scenario I think
  NutchDocumentAnalyzer won't get called – so is there a way to invoke
  PorterStemFilter before/after NutchDocumentAnalyzer?
 
 
  Thanks,
  DS Jha
 





Re: exact matches and stemming

2007-01-26 Thread Alvaro Cabrerizo

Maybe you could store both the stemmed word and the original one in your
index, although that will increase the size of the index.
Another possibility would be to develop a WildcardQuery plugin or a
FuzzyQuery plugin, since lucene comes with these capabilities, and avoid
the stemming task altogether. But it is known that wildcard and fuzzy
queries have poor performance.
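The "store both forms" idea above can be sketched in isolation. A real setup would use a Lucene analyzer chain (e.g. a PorterStemFilter); the suffix-stripper below is only a toy stand-in to show the emit-both-tokens pattern:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: at index time, emit both the original token (so exact
// queries keep working) and the stemmed token (so stemmed queries
// match). stem() is a toy stand-in for a real stemmer.
public class DualTokenDemo {
    static String stem(String w) {
        if (w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.endsWith("s"))   return w.substring(0, w.length() - 1);
        return w;
    }

    static List<String> tokens(String word) {
        List<String> out = new ArrayList<>();
        out.add(word);                    // original form, for exact matches
        String s = stem(word);
        if (!s.equals(word)) out.add(s);  // stemmed form, for recall
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("crawling")); // [crawling, crawl]
        System.out.println(tokens("crawl"));    // [crawl]
    }
}
```

The index grows because most words now contribute two terms instead of one, which is the size cost mentioned above.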

Hope it helps.

2007/1/24, Aïcha [EMAIL PROTECTED]:


Hello,

I want to use the FrenchAnalyzer for stop-word and stemming treatment,
but I still want to be able to do exact searches. The problem is that the
FrenchAnalyzer removes characters from the terms when the index is built, so
it isn't possible to get only exact matches from an index created with the
FrenchAnalyzer.

could someone help me,
thanks in advance
Aïcha










Re: how to use PorterStemFilter with NutchDocumentAnalyzer

2007-01-23 Thread Alvaro Cabrerizo

Hello:

At least in nutch-0.8 you can find two analysis plugins, analysis-es and
analysis-de. The source code of these plugins is the best guide you can
follow to implement your own Analyzer.

2007/1/19, DS jha [EMAIL PROTECTED]:


Hello –

I want to use PorterStemFilter (along with LowerCaseTokenizer) from
Lucene library – and was wondering, how would I do it in Nutch. I can
extend DocumentAnalyzer class for returning PorterStemFilter  and add
the plugin the config file - however, in that scenario I think
NutchDocumentAnalyzer won't get called – so is there a way to invoke
PorterStemFilter before/after NutchDocumentAnalyzer?


Thanks,
DS Jha



Re: Searcher doesn't find what expected

2007-01-17 Thread Alvaro Cabrerizo

I recommend you check your index using Luke. With Luke you can inspect
(query, see the structure of, ...) your lucene index in order to discover
whether the problem happens during indexing or during the search.

2007/1/16, kauu [EMAIL PROTECTED]:


So you should show us the logs.
Also, did you change the nutch-site.xml in tomcat?

On 1/16/07, Libor Štefek [EMAIL PROTECTED] wrote:

 Hi,
 I'm using nutch 0.8.1 to index several thousand text files (source code)
 and I use
 intranet crawling method to create an index.

 Everything looks fine, but when I try to search something, it often
 doesn't find
 what it should. I'm sure that the term is in several pages, but I got
 result only
 for some of them.

 I tried to set limits in properties like page sizes, number of links
 etc. but nothing helped.
 There aren't any error messages in logfile during crawl.

 Is there any way how to find a reason for this behavior ?
 How to make nutch more reliable in results?

 Thanks for any hint.
 Libor




--
www.babatu.com




Re: problems to exclude subdirectories in a web site

2007-01-16 Thread Alvaro Cabrerizo

Try to write your excluding patterns before the accepting patterns. If I'm
not wrong, nutch follows the order of the patterns, so with your current
order it first checks +^http://toto.web-site.net and thereby accepts all the
urls you want to skip.

Then your crawl-urlfilter.txt or regex-urlfilter.txt should look like this:

...
# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)

# Website hostname for indexing
+^http://toto.web-site.net

# skip everything else
-.


Hope it helps.
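The first-match-wins ordering described above can be sketched with java.util.regex (the rules here are a hypothetical hardcoded version of the crawl-urlfilter.txt excerpt; the real urlfilter-regex plugin reads them from the file, but the ordering semantics are the same idea):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// First-match-wins URL filtering, as in crawl-urlfilter.txt:
// rules are tried in order, and the first pattern that matches
// decides whether the URL is accepted ('+') or rejected ('-').
public class UrlFilterDemo {
    // true = accept ('+'), false = reject ('-'); insertion order matters.
    static final Map<String, Boolean> RULES = new LinkedHashMap<>();
    static {
        RULES.put("^http://toto\\.web-site\\.net/de/", false);
        RULES.put("^http://toto\\.web-site\\.net/en/", false);
        RULES.put("^http://toto\\.web-site\\.net",     true);
        RULES.put(".",                                 false); // skip everything else
    }

    static boolean accept(String url) {
        for (Map.Entry<String, Boolean> r : RULES.entrySet()) {
            if (Pattern.compile(r.getKey()).matcher(url).find()) {
                return r.getValue();
            }
        }
        return false; // no pattern matched: ignore the URL
    }

    public static void main(String[] args) {
        System.out.println(accept("http://toto.web-site.net/fr/page")); // true
        System.out.println(accept("http://toto.web-site.net/de/page")); // false
        System.out.println(accept("http://other.example.org/"));        // false
    }
}
```

If the accepting pattern were placed first, the /de/ URL would hit it before its exclusion rule and be accepted, which is exactly the misbehavior reported in the original question.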

2007/1/12, yleny @ ifrance. com [EMAIL PROTECTED]:


Hello,

I want to exclude some subdirectories of a website from indexing,
and I have not found the right parameters.
I use Nutch-0.7.2 because it is impossible
for me to index with Nutch-0.8.1 (it crashes).

I want to exclude in my website the subdirectories :
/de/*
/en/*
/fr/mv/*

I tried the lines
-^http://toto.web-site.net/de/([a-z0-9]*)
and
-^http://toto.web-site.net/de/*
in my crawl-urlfilter.txt file, but
they don't work and nutch indexes these urls, which I don't want.
Any idea?

I have the default regex-urlfilter.txt,
and my personal crawl-urlfilter.txt is:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:,  mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse

-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/

# Website hostname for indexing
+^http://toto.web-site.net

# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)

# skip everything else
-.


*** my default regex-urlfilter.txt file is **

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse

-.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.





Re: Written a plugin: now nutch fails with an error

2006-11-29 Thread Alvaro Cabrerizo

2006/11/21, Nicolás Lichtmaier [EMAIL PROTECTED]:



 nutch/hadoop are you using?  conf/log4j.properties file determines
 the logging factor. Maybe setting logging level to INFO or DEBUG may
 help.
 Oh, sorry =). I was trying to index some extra field I calculate from
 the URLs. So I followed the instructions at
 http://wiki.apache.org/nutch/CreateNewFilter .

 I've created two classes: one implementing IndexingFilter and another
 extending RawFieldQueryFilter. And this is the plugin.xml file:
 Did you add my-plugin to the src/plugin/build.xml file? If not, add
 it there and recompile. Then check the class files in the build dir:
 check whether com.pierpoint/.. exists.

I just compiled it, made a jar and put it in an existing nutch
installation. Then I've added it to the regex in the plugin.includes
property in nutch-site.xml

 However, it seems that your error might have been caused by something
 else, since the generator runs neither IndexingFilters nor
 QueryFilters. If you added another plugin (like a url normalizer), I
 suggest you check that one.

Yes, but the real problem is not being able to see anything. Some part of
nutch is doing something ugly: swallowing the real error and displaying
a generic one. That's always a bad thing to do. I've tried to enable
some more debug info but I've failed; I've edited Java's
logging.properties and nutch's log4j.properties (replacing INFO with
DEBUG), but I couldn't get nutch to be more verbose... =/





Have you tried adding your own LOG lines within the code you want to inspect,
at least in your new plugin, then re-compiling and executing?

Hope it helps


Re: Re-injecting URLS, perhaps by removing them from the CrawlDB first?

2006-11-02 Thread Alvaro Cabrerizo

Hi:

At least in nutch 0.8.X you can filter the crawldb, linkdb and segments
using URL patterns (e.g. editing  regex-urlfilter.txt). Just execute this
command:

nutch_home/bin/nutch mergedb

It will answer the usage information:

CrawlDbMerger output_crawldb crawldb1 [crawldb2 crawldb3 ...] [-filter]
   output_crawldb  output CrawlDb
   crawldb1 ...input CrawlDb-s
   -filter use URLFilters on urls in the crawldb(s)

So, in order to use it run a command like:

nutch_home/bin/nutch mergedb output_crawldb your_crawldb -filter

Before running it, you have to edit regex-urlfilter.txt (if you are using
the urlfilter-regex plugin), adding the URL patterns you want to filter.

To check that it works, dump your crawldb and the filtered one using the
readdb nutch command, and make a diff.

Hope it helps.


Re: I can not query myplugin in field category:test

2006-10-16 Thread Alvaro Cabrerizo

You can use the existing subcollection plugin in nutch 0.8.X and extend it
to use regular expressions. Basically, you have to modify the class
org.apache.nutch.collection.Subcollection: change the method filter (lines
146-154) and substitute if (urlString.indexOf(row) != -1) with something
like if (Pattern.matches(row, urlString)).

This approach lets you:

- Use the existing file subcollection.xml to define your
  url-expression/categories
- Use the package java.util.regex to define matching urls


Here is a sample of subcollection.xml, after modifying the subcollection
plugin:

<subcollection>
    <name>myCategory</name>
    <id>myCategory</id>
    <whitelist>
        http://www.categorySite.es/setcionA(.)*sectionB(.)*
    </whitelist>
    <blacklist>
        http://www.categorySite.es/setcionA(.)*sectionB(.)*sectionC(.)*
    </blacklist>
</subcollection>
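The substitution suggested above (substring test replaced by a regex test) can be checked in isolation. This is a sketch contrasting the two behaviors, not the plugin's actual code; the rule and url below are hypothetical examples modeled on the sample config:

```java
import java.util.regex.Pattern;

// Contrast of the subcollection plugin's original substring check
// with the proposed regex check for whitelist/blacklist rows.
public class SubcollectionMatchDemo {
    static boolean substringMatch(String row, String url) {
        return url.indexOf(row) != -1;     // original behaviour: literal containment
    }

    static boolean regexMatch(String row, String url) {
        return Pattern.matches(row, url);  // proposed behaviour: whole-string regex match
    }

    public static void main(String[] args) {
        String rule = "http://www\\.categorySite\\.es/sectionA(.)*sectionB(.)*";
        String url  = "http://www.categorySite.es/sectionA/x/sectionB/y";
        System.out.println(regexMatch(rule, url));      // true
        System.out.println(substringMatch(rule, url));  // false: the rule is a pattern, not a literal
    }
}
```

This is why the change matters: with the substring check, a row containing regex metacharacters never matches any real URL.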




2006/10/14, Ernesto De Santis [EMAIL PROTECTED]:


Hi Chad

The link was a configuration example.

more explained example:
http://www.misite.com/videos/.*=videos  (rule A)

if the fetched url matches rule A, then index a Field named
'category' with value 'videos'.

Later you can search over this category field to filter your searches.

I will send this plugin in another new thread mail. I post the plugin
here, in the list. I don't know another way to share it with you.

Regards
Ernesto.





[EMAIL PROTECTED] escribió:
 couldn't get the link to work but yes if you could share that would be
 great.

 Chad Savage




 Ernesto De Santis wrote:
 I did a url-category-indexer.

  It works with a .properties file that maps urls, written as regexps, to
  categories.
 example:

 http://www.misite.com/videos/.*=videos

 If it seems useful, I can share it.

 Maybe, it could be better config it in a .xml file.

 Regards,
 Ernesto.

 Stefan Neufeind escribió:
 Alvaro Cabrerizo wrote:

  Have you included a node describing your new searcher filter in
  plugin.xml?

 2006/10/11, xu nutch [EMAIL PROTECTED]:

 I have a question about myplugin for indexfilter and queryfilter.
 Can u Help me !
 -
 MoreIndexingFilter.java in add
  doc.add(new Field("category", "test", false, true, false));
 -

 --


 package org.apache.nutch.searcher.more;

 import org.apache.nutch.searcher.RawFieldQueryFilter;

 /** Handles category: query clauses, causing them to search the
 field indexed by
  * BasicIndexingFilter. */
 public class CategoryQueryFilter extends RawFieldQueryFilter {
  public CategoryQueryFilter() {
    super("category");
  }
 }
 ---
 ---

 <property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain text
  via HTTP, and basic indexing and search plugins.
  </description>
 </property>

 ---

 I use luke to query category:test and it works!
 But when I use the tomcat website to query category:test,
 no result is returned.


 In case you get the search working:
 How do you plan to categorize URLs/sites? I'm looking for a solution
 there, since I didn't yet manage to implement something
 URL-prefix-filter based to map categories to URLs or so.


 Regards,
  Stefan




__
 Preguntá. Respondé. Descubrí.
 Todo lo que querías saber, y lo que ni imaginabas,
 está en Yahoo! Respuestas (Beta).
 ¡Probalo ya! http://www.yahoo.com.ar/respuestas







__
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas




Re: I can not query myplugin in field category:test

2006-10-13 Thread Alvaro Cabrerizo

Have you included a node describing your new searcher filter in plugin.xml?

2006/10/11, xu nutch [EMAIL PROTECTED]:

I have a question about myplugin for indexfilter and queryfilter.
Can u Help me !
-
MoreIndexingFilter.java in add
doc.add(new Field("category", "test", false, true, false));
-

--


package org.apache.nutch.searcher.more;

import org.apache.nutch.searcher.RawFieldQueryFilter;

/** Handles category: query clauses, causing them to search the
field indexed by
 * BasicIndexingFilter. */
public class CategoryQueryFilter extends RawFieldQueryFilter {
 public CategoryQueryFilter() {
    super("category");
 }
}
---
---

<property>
 <name>plugin.includes</name>
 <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
 In any case you need at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins.
 </description>
</property>

---

I use luke to query category:test and it works!
But when I use the tomcat website to query category:test,
no result is returned.



Re: category string gets matched as a term

2006-10-10 Thread Alvaro Cabrerizo

It looks like your syntax is correct (category:video searchString). Try to
write a LOG.info line into
org.apache.nutch.searcher.LuceneQueryOptimizer (line 178), just at the
beginning of the optimize method:

public TopDocs optimize(BooleanQuery original,
        Searcher searcher, int numHits,
        String sortField, boolean reverse)
        throws IOException {
    LOG.info("Query - " + original.toString());

Recompile nutch and make a query, for example category:video funny. If your
category plugin works fine you'll get an info line within hadoop.log similar
to this:

+(url:funny^0.0 anchor:funny^0.0 content:funny title:funny^0.0
host:funny^0.0) +category:video

The first part, +(url:funny^0.0 anchor:funny^0.0 content:funny
title:funny^0.0 host:funny^0.0), means that funny must appear in at least
one of those fields (url, anchor...). The second part filters the results
to obtain only the ones tagged as video.

In your case it looks like the word video is being included in the first
part. Check that your plugin implementation is correct, and that plugin.xml
and build.xml are correct. Your plugin.xml should look similar to this:

...
<extension id="..."
    name="..."
    point="org.apache.nutch.searcher.QueryFilter">
  <implementation id="..." class="..."/>
  <parameter name="raw-fields" value="category"/>
</extension>

Hope it helps.

2006/10/3, Dima Gritsenko  [EMAIL PROTECTED]:


Hi,

I have categorized web sites during crawl to provide filtered results
similar to google Video, Images tabs.

But when I enter
category:video MySearchString
nutch matches both video and MySearchString as terms (though it
filters results correctly and displays links only to video-categorized
pages), so the search is not relevant since the video string is matched as
well.

How do I filter category string off during search?

Great thanks.
Dima.




Re: focussed crawling

2006-10-05 Thread Alvaro Cabrerizo

Although I haven't used it: after making a crawl, at least in nutch 0.8,
you can run ./bin/nutch mergedb outputdb your_db -filter. Using
the filter option you can generate a new db, filtering out the links you
want to remove, and use it to make a recrawl.

Hope it helps.


Re: stop an index server

2006-09-29 Thread Alvaro Cabrerizo

Hi:

It works fine. Thanks again.

2006/9/28, Sami Siren [EMAIL PROTECTED]:


hello,

here's an adhoc addition to search server to support shutdown command.

client calls server like this:

bin/nutch 'org.apache.nutch.searcher.DistributedSearch$Client'
-shutdown 127.0.0.1 

--
  Sami Siren

Alvaro Cabrerizo wrote:


 2006/9/27, Sami Siren [EMAIL PROTECTED] mailto:[EMAIL PROTECTED]:

 Alvaro Cabrerizo wrote:
   How could I stop an index server (started with bin/nutch server
 port
   index) knowing the port?
  
   Thanks in advance.
  

 It does not support such a feature. Can you describe a little bit
more
 what are you trying to accomplish something similar to tomcats
SHUTDOWN?


 Sure,
 That's right. If this feature doesn't exist, I'm looking for a clue to
 develop a SHUTDOWN and a RESTART command,  using NUTCH/HADOOP api. The
 idea is to have a group of JAVA classes that lets people execute a
 command like: SERVER_RESTART port or more advanced SERVER_RESTART
 port ip_address.

  Anyway, I can execute ps aux | grep 4 in a shell and find out the
  process number in order to kill it, or I can press ^C to stop it, but
  this is not the solution I'm looking for.


 Thanks, in advance.


 --
   Sami Siren








Re: stop an index server

2006-09-28 Thread Alvaro Cabrerizo

2006/9/27, Sami Siren [EMAIL PROTECTED]:


Alvaro Cabrerizo wrote:
 How could I stop an index server (started with bin/nutch server port
 index) knowing the port?

 Thanks in advance.


It does not support such a feature. Can you describe a little bit more
what you are trying to accomplish? Something similar to Tomcat's SHUTDOWN?



Sure,
That's right. If this feature doesn't exist, I'm looking for a clue to
develop a SHUTDOWN and a RESTART command using the Nutch/Hadoop API. The idea
is to have a group of Java classes that lets people execute a command like
SERVER_RESTART port or, more advanced, SERVER_RESTART port
ip_address.

Anyway, I can execute ps aux | grep 4 in a shell and find out the process
number in order to kill it, or I can hit ^C to stop it, but this is not
the solution I'm looking for.


Thanks, in advance.


--

  Sami Siren




Re: stop an index server

2006-09-26 Thread Alvaro Cabrerizo

Ok,

I'll try to explain it more clearly.

Imagine that you have finished crawling a group of sites and you have a
well-formed index. Then you configure Tomcat, create a nutch-site.xml, and add
the property searcher.dir pointing to a search-servers.txt that contains this
line: 127.0.0.1 4. Then you start Tomcat, and an index server using the
command nutch_home/bin/nutch server 4 myIndexDir. Now you can get the results
from that server via Tomcat, in a distributed way.

At this point I would like to know how to stop the server running on port
4. I can execute ps aux | grep 4 in a shell and find out the process
number in order to kill it, or I can hit ^C to stop it, but this is not
the solution I'm looking for.

I've tried this piece of code (based on
org.apache.nutch.searcher.DistributedSearch):

Configuration conf = NutchConfiguration.create();
InetSocketAddress[] a_InetSocketAddress = new InetSocketAddress[1];
a_InetSocketAddress[0] = new InetSocketAddress("localhost", 4);
Object[][] params = new Object[1][0];
Method get_method = org.apache.hadoop.ipc.Server.class.getMethod("get",
    new Class[] {});
org.apache.hadoop.ipc.Server[] servers =
    (org.apache.hadoop.ipc.Server[]) RPC.call(get_method,
                                              params,
                                              a_InetSocketAddress, conf);
servers[0].stop();

Executing this code gives me a NullPointerException, because RPC.call
returns an array of nulls. If I understand it correctly, when we execute
nutch_home/bin/nutch server 4 myIndexDir, we are enveloping a
NutchBean in an RPC layer that lets us access the methods of
NutchBean via RPC calls, but not the org.apache.hadoop.ipc.Server methods.

Summarizing, the question is how to get the instance of that server (the
org.apache.hadoop.ipc.Server running on port 4) in order to STOP it.

Once I can stop it, I can update the index and restart it.

Thanks for your answer.
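Independent of the Hadoop IPC internals above, the general pattern behind such a SHUTDOWN command can be sketched with a small control socket that the server listens on. This is a self-contained, minimal sketch: the class name, the single-line "SHUTDOWN" wire format, and the port handling are all invented for illustration and are not part of the Nutch or Hadoop APIs.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

/**
 * Hypothetical sketch: a "control port" a search server could listen on,
 * so a remote client can request a clean shutdown instead of being
 * stopped with ^C or kill. All names and the protocol are made up.
 */
public class ControlPort {
    private final ServerSocket socket;
    private volatile boolean shutdownRequested = false;

    public ControlPort(int port) throws Exception {
        socket = new ServerSocket(port); // port 0 = pick any free port
    }

    public int getPort() { return socket.getLocalPort(); }
    public boolean isShutdownRequested() { return shutdownRequested; }

    /** Block, serving control connections, until SHUTDOWN arrives. */
    public void serve() throws Exception {
        while (!shutdownRequested) {
            Socket client = socket.accept();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(client.getInputStream()));
            if ("SHUTDOWN".equals(in.readLine())) {
                // Here the real server would also stop the index server.
                shutdownRequested = true;
            }
            client.close();
        }
        socket.close();
    }

    /** What a SERVER_RESTART-style client would send. */
    public static void requestShutdown(String host, int port) throws Exception {
        Socket s = new Socket(host, port);
        PrintWriter out = new PrintWriter(s.getOutputStream(), true);
        out.println("SHUTDOWN");
        s.close();
    }

    public static void main(String[] args) throws Exception {
        final ControlPort cp = new ControlPort(0);
        Thread t = new Thread(new Runnable() {
            public void run() {
                try { cp.serve(); } catch (Exception ignored) { }
            }
        });
        t.start();
        requestShutdown("127.0.0.1", cp.getPort());
        t.join();
        System.out.println("shutdown requested: " + cp.isShutdownRequested());
    }
}
```

A SERVER_RESTART command would then just send SHUTDOWN and launch bin/nutch server again once the old process has exited.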



2006/9/26, Jim Wilson [EMAIL PROTECTED]:


Do you mean what crawl-urlfilter.txt line you'd need?  I think the
following would do it:

-^http://server:port/

But I'm not convinced that this is what you were asking ...

-- Jim

On 9/26/06, Alvaro Cabrerizo [EMAIL PROTECTED] wrote:

 How could I stop an index server (started with bin/nutch server port
 index) knowing the port?

 Thanks in advance.






getting subcollections

2006-09-11 Thread Alvaro Cabrerizo

Hi,

I would like to know how I could get all the subcollections, and how many
documents belong to each subcollection, after making a query.
The approach I took was to iterate over the results, getting the details of
each one. The problem is that every query I make is limited by numHits
[LuceneQueryOptimizer.optimize(... int numHits ...)], so I can iterate only
over that number of hits. Although I could set numHits to a high value, this
solution doesn't let me inspect every hit (i.e., when the total number of
hits is bigger than numHits).


Thanks in advance.
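One way around the numHits cap is to aggregate over every matching document with a collector callback instead of iterating the top-N results; in the Lucene releases of that era this is a HitCollector passed to Searcher.search(Query, HitCollector). The sketch below is self-contained, so a plain array of field values and a list of matching doc ids stand in for the index and searcher; the field name "subcollection" and all data are invented:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: count documents per subcollection over ALL hits, not just the
 * top numHits. The loop body plays the role of HitCollector.collect(),
 * which in real code would read searcher.doc(doc).get("subcollection").
 */
public class FacetCounter {

    /** Count how many matching docs fall into each subcollection. */
    static Map countBySubcollection(String[] subcollectionByDoc,
                                    int[] matchingDocs) {
        Map counts = new HashMap(); // subcollection name -> Integer count
        for (int i = 0; i < matchingDocs.length; i++) {
            String sub = subcollectionByDoc[matchingDocs[i]];
            Integer n = (Integer) counts.get(sub);
            counts.put(sub, new Integer(n == null ? 1 : n.intValue() + 1));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented example data: stored field value per doc id.
        String[] field = { "news", "blogs", "news", "wiki", "news" };
        int[] matching = { 0, 1, 2, 4 }; // doc ids the query matched
        System.out.println(countBySubcollection(field, matching));
    }
}
```

Since the callback touches every hit exactly once, the counts are exact even when the total number of hits far exceeds numHits.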


Getting subcollections

2006-09-08 Thread Alvaro Cabrerizo

Hi,

I would like to know how I could get all the subcollections, and how many
documents belong to each subcollection, after making a query.
The approach I took was to iterate over the results, getting the details of
each one. The problem is that every query I make is limited by numHits
[LuceneQueryOptimizer.optimize(... int numHits ...) and
org.apache.lucene.search.Searcher], so I can iterate over only a subset of
the results.

Thanks in advance.