Re: Lucene query support in Nutch

2006-10-10 Thread Stefan Neufeind
Cristina Belderrain wrote: On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote: This is *exactly* what I was thinking. Like Stefan, I believe the nutch analyzer is a good foundation and should therefore be extended to support the or operator, and possibly additional capabilities when the need

Searching terms saved in a file

2006-10-10 Thread frgrfg gfsdgffsd
Hi all, As a new Nutch user, I am quite stuck on this: how can I launch a search using a file containing my terms/keywords instead of typing them into search.jsp? Do I have to use the Query.term class? If so, how and where do I use it? Thanks a lot! Mat

Deleting Pages

2006-10-10 Thread Gary Bone
Hi, Does anyone know how to force a page to be deleted? I have run the WebDBWriter class and removed the page from the database, but it still shows up in search results. Further checks using WebDBReader give a 'null' response when looking for the page. Most confusing. Gary

Re: Lucene query support in Nutch

2006-10-10 Thread Tomi NA
2006/10/10, Cristina Belderrain [EMAIL PROTECTED]: On 10/9/06, Tomi NA [EMAIL PROTECTED] wrote: This is *exactly* what I was thinking. Like Stefan, I believe the nutch analyzer is a good foundation and should therefore be extended to support the or operator, and possibly additional

term frequencies for multiple term query

2006-10-10 Thread Erik J
Hello, I use the code below to get the term frequency for the term searched for by the user. However, if the query consists of more than one word (separated by space), or if it consists of a phrase within quotes, the term frequency equals zero with this code. How can I get the term

Re: Lucene query support in Nutch

2006-10-10 Thread Bill Goffe
Tomi said: In conclusion, my position is pragmatic: I welcome the simplest solution to implement the or search. I just believe that it'd be easiest to do that extending the nutch Analyzer. This seems like a very reasonable approach. I too would very much like OR. It would also be nice if it

Re: crawl db distribution on different data nodes

2006-10-10 Thread Dennis Kubes
It completely depends on the number of urls in the crawldb. Dennis jaison Qburst wrote: What will be the maximum size of crawlDb on a single node?

Re: Searching terms saved in a file

2006-10-10 Thread Dennis Kubes
You would have to write something that would loop through the file and then construct a Query object using the addRequired and addProhibited methods to add your terms and phrases. Then pass that into the appropriate NutchBean search method to get your results. Dennis frgrfg gfsdgffsd wrote:
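A minimal sketch of the first half of what Dennis describes: looping through a keyword file and sorting the entries into required and prohibited terms. The file convention used here (one term or phrase per line, a leading '-' for prohibited entries, '#' for comments) and the names TermFileParser and ParsedTerms are assumptions made for illustration, not part of Nutch. The sketch deliberately stops short of the Nutch calls: the two lists would be fed to the Query object's add methods and the query passed to a NutchBean search method, as the reply says.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits the lines of a keyword file into required and prohibited terms. */
class TermFileParser {

    /** Simple holder for the two term lists. */
    static final class ParsedTerms {
        final List<String> required = new ArrayList<>();
        final List<String> prohibited = new ArrayList<>();
    }

    /**
     * Assumed file convention (not defined by Nutch): one term or phrase per line,
     * a leading '-' marks the entry as prohibited, and blank lines or lines
     * starting with '#' are skipped.
     */
    static ParsedTerms parse(List<String> lines) {
        ParsedTerms out = new ParsedTerms();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // nothing to search for on this line
            }
            if (line.startsWith("-")) {
                String term = line.substring(1).trim();
                if (!term.isEmpty()) {
                    out.prohibited.add(term);
                }
            } else {
                out.required.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the keyword file.
        ParsedTerms terms = parse(Files.readAllLines(Paths.get(args[0])));
        // In Nutch, each entry would be added to the Query here instead of printed.
        System.out.println("required:   " + terms.required);
        System.out.println("prohibited: " + terms.prohibited);
    }
}
```

Keeping the file parsing separate from query construction means the same term lists can be built once and reused, whichever search method the query is eventually passed to.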

Re: Database update

2006-10-10 Thread Dennis Kubes
You could write a MapReduce job that uses the parse_data folder as input and, inside the map or reduce class (depending on your logic), uses JDBC to update MySQL. The job configuration would look something like this: JobConf yourjob = new NutchJob(conf); for (int i = 0; i

Re: java.lang.NoSuchMethodError while indexing

2006-10-10 Thread Dennis Kubes
What Java version are you using? You might need Java 5. Dennis Adam Borkowski wrote: Question from a newbie. I've just downloaded version 0.8.1 and am going through the tutorial. I almost got to the end, but after the index command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb

Re: category string gets matched as a term

2006-10-10 Thread Alvaro Cabrerizo
It looks like your syntax is correct (category:video searchString). Try adding a LOG.info line to org.apache.nutch.searcher.LuceneQueryOptimizer (line 178), right at the beginning of the optimize method: public TopDocs optimize(BooleanQuery original, Searcher searcher, int numHits, String sortField,

RE: java.lang.NoSuchMethodError while indexing

2006-10-10 Thread NG-Marketing, M.Schneider
IOException and setMergeFactor...? Check your config and have a look at your merge factor. Set it to 50. What Java version are you using? You might need Java 5. Dennis Adam Borkowski wrote: Question from a newbie. I've just downloaded version 0.8.1 and am going through

summarizer extension

2006-10-10 Thread NG-Marketing, M.Schneider
Hi, how can I add a function to the basic-summarizer so that, if a meta description exists, it shows the meta description, and otherwise it continues with the basic summarizer? Matthias
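The fallback logic asked about here can be sketched independently of Nutch's summarizer plugin interface. Everything named below (DescriptionFirstSummary, FallbackSummarizer, and the method parameters) is hypothetical and only illustrates the "meta description first, otherwise basic summary" decision; how the meta description is obtained from the indexed document is not shown.

```java
/** Hypothetical sketch: prefer a page's meta description, else fall back to a normal summarizer. */
class DescriptionFirstSummary {

    /** Stand-in for whatever produces the regular text summary (an assumption, not a Nutch type). */
    interface FallbackSummarizer {
        String summarize(String text, String query);
    }

    /**
     * Returns the trimmed meta description when one is present and non-blank;
     * otherwise delegates to the fallback summarizer with the page text and query.
     */
    static String summarize(String metaDescription, String text, String query,
                            FallbackSummarizer fallback) {
        if (metaDescription != null && !metaDescription.trim().isEmpty()) {
            return metaDescription.trim();
        }
        return fallback.summarize(text, query);
    }
}
```

Treating a blank description the same as a missing one avoids showing empty summaries for pages that declare the tag without content.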

Recrawl script

2006-10-10 Thread Chris Stephens
How does the depth option work in the 0.8 recrawl script at http://wiki.apache.org/nutch/IntranetRecrawl ? I just want to re-index all of the pages currently in the db and not index any new pages these pages might link to. Should I use a 0 for this? It seems like the fetcher never

Segment size and mergesegs slicing

2006-10-10 Thread Jacob Brunson
For some tests, I ran two fetches on segments which I generated with topN=50. I then tried to merge these segments using mergesegs with slice=200 which resulted in 8 segments. If I only fetched about 100 URLs, why do I end up with 8 segments containing (supposedly) 200 URLs each? What is the

Re: Segment size and mergesegs slicing

2006-10-10 Thread Andrzej Bialecki
Jacob Brunson wrote: For some tests, I ran two fetches on segments which I generated with topN=50. I then tried to merge these segments using mergesegs with slice=200 which resulted in 8 segments. If I only fetched about 100 URLs, why do I end up with 8 segments containing (supposedly) 200

Re: Recrawl script

2006-10-10 Thread Chris Stephens
The -noAdditions feature would be ideal for my situation. Hopefully it will be released soon. Andrzej Bialecki wrote: Jacob Brunson wrote: So the depth number is the number of iterations the recrawl script will go through. In each iteration, it will select a number of URLs from the crawl

RE: Deleting Pages

2006-10-10 Thread Howie Wang
The webdb and the segments are two separate things. The webdb is basically used by the fetcher to keep track of the status of each URL (such as the last fetch time and whether there was an error). The segments contain the data from the fetches themselves, and also the data's index, which is used during searches. So

Re: java.lang.NoSuchMethodError while indexing

2006-10-10 Thread Adam Borkowski
It's OK now. It was my fault. I unfortunately mixed a Xalan jar into the Nutch distribution. After cleaning the classpath, everything went OK. - Original Message - From: Dennis Kubes [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, October 10, 2006 4:35 PM Subject: Re:

Fetcher aborts with hung threads

2006-10-10 Thread Bruno Thiel
All, I downloaded the Nutch nightly build on 22/09/2006. I am crawling the file system, and my current file list, generated by find, is around 80,000 entries (12M). About halfway through, the fetcher issues the message "Aborting with 3 hung threads." Is anybody facing the same problem? Cheers,

Re: Fetcher aborts with hung threads

2006-10-10 Thread Jacob Brunson
I used to have that problem a lot, but not any more. I thought the problem was connected to http://issues.apache.org/jira/browse/NUTCH-344 which was closed on September 24th. I am running http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.8 Revision: 462538 and things seem to work

I can not query myplugin in field category:test

2006-10-10 Thread xu nutch
I have a question about myplugin for indexfilter and queryfilter. Can you help me? In MoreIndexingFilter.java I added: doc.add(new Field("category", "test", false, true, false)); - -- package

RE: Fetcher aborts with hung threads

2006-10-10 Thread Bruno Thiel
Here's an update on my investigations: I have been facing this problem for quite a while now, and there seems to be a correlation with the xls file format plugin. Each time, the thread seems to get stuck parsing xls. -Original Message- From: Jacob Brunson [mailto:[EMAIL PROTECTED] Sent: