Re: Nutch Cannot Find Indexed Pages?

2006-09-14 Thread Dennis Kubes
Does it not have anything in the database or are there entries in the index but nothing is being returned by the search? Dennis victor_emailbox wrote: Can anyone help? Thanks. victor_emailbox wrote: Hi, I followed all the steps in the 0.8 tutorial except that I have only 2 urls in the

Fetcher File Error 404 when crawling through file system

2006-09-14 Thread Bruno Thiel
Hi, I am trying to configure a recent Nutch (0.8+) to fetch directly from the file system instead of over HTTP, which is fairly slow. The fetcher hits a 404 - File not found (see below). When I copy the file:/// URL into lynx, it is found without any problems. 2006-09-15 10:29:57,

Re: 0.8 Intranet Crawl Output/Logging?

2006-09-14 Thread Tomi NA
On 9/14/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: Everyone, thanks for the help with this. I hope to return the assistance, once I am more familiar with 0.8. I am using tail -f now to monitor my test crawls. It also looks like you can use conf/hadoop-env.sh to redirect log file output to

Re: 0.8 Intranet Crawl Output/Logging?

2006-09-14 Thread Renaud Richardet
Hello Jared, [EMAIL PROTECTED] wrote: Everyone, thanks for the help with this. I hope to return the assistance, once I am more familiar with 0.8. I am using tail -f now to monitor my test crawls. It also looks like you can use conf/hadoop-env.sh to redirect log file output to a different locat

RE: 0.8 Intranet Crawl Output/Logging?

2006-09-14 Thread jared.dunne
Everyone, thanks for the help with this. I hope to return the assistance, once I am more familiar with 0.8. I am using tail -f now to monitor my test crawls. It also looks like you can use conf/hadoop-env.sh to redirect log file output to a different location for each of your configurations. One

Re: Filtering pages before indexing

2006-09-14 Thread Andrzej Bialecki
[EMAIL PROTECTED] wrote: Hi, Is there a way to filter pages before they're indexed in Nutch? I try to crawl an Intranet site but only PDF documents should make it to the index (in later stages this will be extended but PDFs are the main focus). I've tried using the regex or suffix filters but

Re: how to combine two run's result for search

2006-09-14 Thread Zaheed Haque
That's the way I set it up at first. This time, I started with a blank slate, unpacked nutch and tomcat, unpacked nutch-0.8.war into the webapps/ROOT and left the deployed app untouched. The above means that you have an empty nutch-site.xml under webapps/ROOT and you have a nutch-default.xml with

Filtering pages before indexing

2006-09-14 Thread roman . spitzbart
Hi, Is there a way to filter pages before they're indexed in Nutch? I try to crawl an Intranet site but only PDF documents should make it to the index (in later stages this will be extended but PDFs are the main focus). I've tried using the regex or suffix filters but this prevents the crawling
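
The question above separates two decisions: which pages the crawler may visit, and which documents end up in the index. A minimal sketch of that separation follows, written in plain Python rather than against Nutch's plugin API. The function names, the example URLs, and the content-type check are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def should_follow(url: str) -> bool:
    """Crawl-time decision (hypothetical helper): keep visiting ordinary
    web pages so that links to PDF documents can still be discovered."""
    return urlparse(url).scheme in ("http", "https")


def should_index(url: str, content_type: str = "") -> bool:
    """Index-time decision (hypothetical helper): only PDFs are indexed.

    A document counts as a PDF if the server reported a PDF content type
    or if the URL path ends in ".pdf".
    """
    path = urlparse(url).path.lower()
    return content_type.lower().startswith("application/pdf") or path.endswith(".pdf")


# Made-up fetched pages: (URL, reported content type).
pages = [
    ("http://intranet.example/index.html", "text/html"),
    ("http://intranet.example/docs/report.pdf", "application/pdf"),
    ("http://intranet.example/download?id=7", "application/pdf"),
]

# Every page is crawled, but only the PDFs are kept for the index.
indexed = [url for url, ctype in pages if should_follow(url) and should_index(url, ctype)]
print(indexed)
```

Keeping the two checks apart means an HTML page can be fetched for its outlinks without being added to the index itself.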

When I run an example on the hadoop 0.6.1 release I get an error.

2006-09-14 Thread Ensheng Wang
[EMAIL PROTECTED] hpp]$ hadoop jar hadoop-0.6.1-examples.jar grep input output 'dfs[a-z.]+' 06/09/14 23:04:50 INFO conf.Configuration: parsing file:/home/wangensh/hadoop-0.
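
The command above passes the pattern 'dfs[a-z.]+' to the grep example. As a rough illustration of what that pattern selects, here is a small Python sketch. The sample text and the `find_matches` helper are made up for this illustration, and the Hadoop example uses Java regular expressions, although for a pattern this simple Python's `re` module should behave the same way.

```python
import re

# Same pattern as in the command line: the literal "dfs" followed by one
# or more lowercase letters or dots.
PATTERN = re.compile(r"dfs[a-z.]+")


def find_matches(text: str) -> list:
    """Return every non-overlapping match of the pattern in text."""
    return PATTERN.findall(text)


# Made-up sample input, not taken from the original message.
sample = "dfs.replication=3 and dfs.name.dir are set; DFS_HOME and dfs alone are not matched"
print(find_matches(sample))  # ['dfs.replication', 'dfs.name.dir']
```

The match stops at the first character outside the class, so "=" ends the first match. A bare "dfs" followed by a space is not matched because at least one more letter or dot is required.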

Re: how to combine two run's result for search

2006-09-14 Thread Tomi NA
On 9/14/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: On 9/14/06, Tomi NA <[EMAIL PROTECTED]> wrote: > On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: > > Hi: > > I have a problem or two with the described procedure... > > > Assuming you have > > > > index 1 at /data/crawl1 > > index 2 at /data/

Re: how to combine two run's result for search

2006-09-14 Thread Zaheed Haque
On 9/14/06, Tomi NA <[EMAIL PROTECTED]> wrote: On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: > Hi: I have a problem or two with the described procedure... > Assuming you have > > index 1 at /data/crawl1 > index 2 at /data/crawl2 Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawl

Re: how to combine two run's result for search

2006-09-14 Thread Tomi NA
On 9/5/06, Zaheed Haque <[EMAIL PROTECTED]> wrote: Hi: I have a problem or two with the described procedure... Assuming you have index 1 at /data/crawl1 index 2 at /data/crawl2 Used ./bin/nutch crawl urls -dir /home/myhome/crawls/mycrawldir to generate an index: luke says the index is vali

Re: How can I modify the crawler?

2006-09-14 Thread Jim Wilson
To answer your question, some more information is needed: 1) How do you decide which "topic" a particular page belongs to? URL segments? The Title? Other html page elements? Latent Semantic Analysis ( http://en.wikipedia.org/wiki/Latent_semantic_indexing)? 2) Given a topic, how will your end

French Analyzer with accents

2006-09-14 Thread Grégory Debord
Hi nutch-users, I have updated my nutch version (0.7.2) to include the analysis-fr plugin as described by Jérôme in the Nutch Wiki (Multi Lingual Support) and NUTCH-261. I've also updated the front-end to take advantage of this analyzer in queries. The French stemming seems to work well (the

Re: Configuring Nutch

2006-09-14 Thread Andrzej Bialecki
Lakshman, Madhusudhan wrote: Hi Group, We have a requirement where we should display the search result along with the snippet (2-3 lines) of the content, something similar to Google, where this snippet is displayed after the title line as shown below: Welcome to Nutch! This is the fir

How can I modify the crawler?

2006-09-14 Thread suxiaoke79
I want to build a topic-based search engine by modifying Nutch. For example, I define a computer topic, so I hope to find only information about computers. I can't find the appropriate point in Fetcher.java where I can insert my own code. Please tell me how can I modify t

Configuring Nutch

2006-09-14 Thread Lakshman, Madhusudhan
Hi Group, We have a requirement where we should display the search result along with the snippet (2-3 lines) of the content, something similar to Google, where this snippet is displayed after the title line as shown below: Welcome to Nutch! This is the first Nutch release as an Apache Luc

Re: caching - filetypes

2006-09-14 Thread Jacob Brunson
I don't know if I completely understand your email. What do you mean by "cache"? So if you go with the standard search results page, there is a link to a cached copy of the page. If the page was html, then there are no problems, however, if the page was binary, it returns an HTTP 500 internal se

Re: 0.8 Intranet Crawl Output/Logging?

2006-09-14 Thread Jacob Brunson
On my system, I run the crawl command in one shell while running this command in another shell to monitor the crawl: tail -f log/hadoop.log Of course this does about the same thing as listed below, but "tail -f" is a little easier to remember. On 9/13/06, Tomi NA <[EMAIL PROTECTED]> wrote: On 9/