Subcollections / Index Filters Questions

2006-09-25 Thread WebDev Freak
I just started using Nutch to index HTML, text, Microsoft documents, and PDF. Our application is Struts-based and we are querying Nutch straight from our application. The query I have going right now is basically searching the whole site. I am trying to figure out two things: 1. How to create sear

term frequency

2006-09-25 Thread Chris K Wensel
Hi all, I'm interested in playing with term frequency values in a Nutch index, on both a per-document and an index-wide scope. For example, something similar to this Lucene FAQ entry: http://tinyurl.com/ra3ys So what is the 'correct' way to inspect the Nutch index for these values? Particularly against t

RE: Automatic crawling

2006-09-25 Thread jared.dunne
Gianni- Here's the recrawl script that Jacob mentioned: http://wiki.apache.org/nutch/IntranetRecrawl [Note: There are 0.7.x and 0.8 versions] Jacob- I noticed that the 0.8 script had an issue after merging too. After it merges the segments, it fails to remove all the segments that it used to

Re: Common terms

2006-09-25 Thread Lourival Júnior
Ok. If you're crawling with these settings you don't need to reindex your segments again. And how about the plugins that you are using? Are you using the language-identifier plugin? If not, try it. Regards, P.S.: I speak Portuguese :) On 9/25/06, carmmello <[EMAIL PROTECTED]> wrote: This issue h

Re: Common terms

2006-09-25 Thread carmmello
This issue happens even when I start a new crawl. So, I'm not reindexing the segments. The indexing is done by Nutch itself, using the intranet method. Do you mean that after this is done, I have to reindex the segments once again? But, if so, why are the English common terms recognized f

Re: Common terms

2006-09-25 Thread Lourival Júnior
Have you reindexed your segments? It's important, because it makes Nutch recognize your common terms. I've tried it, and the only thing I noticed was that the index is larger than the original (before using the common terms). On 9/25/06, carmmello <[EMAIL PROTECTED]> wrote: I'm using Nutch

Common terms

2006-09-25 Thread carmmello
I'm using Nutch 0.7.2 and have added to the common-terms.utf8 in the conf folder (and also under the classes folder, inside the ROOT folder on Tomcat) some common terms in Portuguese, one per line, like: content:da contente:de contente:eu .. However, when I

Re: Exception in thread "main" java.lang.NoClassDefFoundError: srch\nutc

2006-09-25 Thread Steve J.
I had the same exact problem and discovered the source of the Exceptions. I now have Nutch crawling on Windows. The problem is with cygwin and Windows path names that contain spaces (or possibly even other characters). Since I couldn't figure out how to make cygwin paths behave, I copied my Nu

Re : Re : problem with web site indexing

2006-09-25 Thread Aïcha
In fact, the site I want to index is on the web. When I run the crawl, many other sites are indexed; they are referenced in pages of the site I want to index. Up to this point it seems fine. But the problem is that for all these other sites, I have many pages in the index and for my sp

Re: Which Operating-System do you use for Nutch

2006-09-25 Thread Jim Wilson
You can get it working on Windows if you're willing to work for it. To use Nutch OOTB, you have to install Cygwin since the provided Nutch launcher is written in Bash. Members of the community have provided alternatives, such as this Python launcher: http://wiki.apache.org/nutch/CrossPlatformNu

Modifications necessary to upgrade to Hadoop 0.6.2

2006-09-25 Thread Marcel Petrisor
Hi, I searched the Nutch and Hadoop lists and found some modifications mentioned (like UTF8 to Text), and somebody mentioned that with some adjusting it works. Can anybody confirm that using Nutch 0.8/0.9 with Hadoop 0.6.x is robust enough and, if the answer is yes, can list some mod

Re: Ontology plugin in 0.8

2006-09-25 Thread [EMAIL PROTECTED]
Can anyone give me an opinion on how adding an ontology to their search engine helps to improve search results? Thanks in advance, Chad Florian Fricker wrote: Yea, after a closer look it seems to be a problem with your xerces library. To solve this, one needs to update tomcat's xerces librar

Re: Which Operating-System do you use for Nutch

2006-09-25 Thread Brian Cuttler
On Mon, Sep 25, 2006 at 04:03:45PM +0200, Kursun, Mahmut wrote: > Hi, > > I want to install Nutch for testing purposes and would like to know > which OS, Filesystem and sort of Harddiscs other Nutch users prefer. > > What I am going to use is Fedora Core 6 Test 3 with ext3 on a 40-80 GB > IDE Har

Re: Re : problem with web site indexing

2006-09-25 Thread David Podunavac
I think the file being indexed contains something like javascript:something. This makes Nutch think javascript is a protocol, and it throws a malformed URL exception. Try "javascript: something", or go into the code and ignore the MalformedURLException at org.apache.nutch.net.BasicUrlNorm
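The behaviour described in this reply can be reproduced with plain java.net.URL, outside of Nutch. The following is only a sketch under stated assumptions: the class name OutlinkFilterSketch and the method keepParsable are hypothetical and are not part of Nutch; the example just shows that a javascript: link fails URL parsing with a MalformedURLException, and that such links can be skipped instead of aborting the whole page.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Nutch code: drops links that java.net.URL cannot parse.
class OutlinkFilterSketch {

    // Resolves each link against the page URL and keeps only those that parse.
    static List<String> keepParsable(String pageUrl, List<String> links) {
        URL base;
        try {
            base = new URL(pageUrl);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad page URL: " + pageUrl, e);
        }
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            try {
                // "javascript:..." has no registered protocol handler,
                // so this constructor throws MalformedURLException.
                kept.add(new URL(base, link).toString());
            } catch (MalformedURLException e) {
                // Skip the single bad link; the other links are still kept.
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "about.html",               // relative link, resolved against the page
                "javascript:openWindow()",  // rejected: unknown protocol
                "http://example.org/x");    // absolute link, kept as-is
        System.out.println(keepParsable("http://example.com/docs/index.html", links));
    }
}
```

Running main prints only the two links that parsed. Skipping per link, rather than per page, means one javascript: href does not cost the remaining outlinks of that page.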

Re: Which Operating-System do you use for Nutch

2006-09-25 Thread Florian Fricker
Hi, Works perfectly with Mac OS X 10.4.7 as the operating system and HFS-Plus as filesystem. Regards Kursun, Mahmut wrote: Hi, I want to install Nutch for testing purposes and would like to know which OS, Filesystem and sort of Harddiscs other Nutch users prefer. What I am going to use is F

Re: Which Operating-System do you use for Nutch

2006-09-25 Thread Dima Mazmanov
daktion com! - Das Computer-Magazin > Neue Mediengesellschaft mbH > Bayerstr. 26 > 80335 München > Telefon: +49 / 89 / 74 117-641 > Telefax: +49 / 89 / 74 117-132 > [EMAIL PROTECTED] > http://www.com-magazin.de > __ NOD32 1.1773 (20060925) Information __

Re: Which Operating-System do you use for Nutch

2006-09-25 Thread Dima Mazmanov
un > Redaktion com! - Das Computer-Magazin > Neue Mediengesellschaft mbH > Bayerstr. 26 > 80335 München > Telefon: +49 / 89 / 74 117-641 > Telefax: +49 / 89 / 74 117-132 > [EMAIL PROTECTED] > http://www.com-magazin.de > __ NOD32 1.1773 (20060925) Information ___

Which Operating-System do you use for Nutch

2006-09-25 Thread Kursun, Mahmut
Hi, I want to install Nutch for testing purposes and would like to know which OS, Filesystem and sort of Harddiscs other Nutch users prefer. What I am going to use is Fedora Core 6 Test 3 with ext3 on a 40-80 GB IDE Harddisc. That is also to test Fedora Core 6 Test 3. But I am free to install any

java.io.Exception please help me urget

2006-09-25 Thread mohanlal sankaranarayanan
Hi, While I'm running the WordCount example on distributed machines, it produces the following error. Please help me. My hadoop-site.xml is: fs.default.name = localhost:9000, mapred.job.tracker = localhost:9001, dfs.replication = 1. 06/09/25 15:56:21 INFO conf.Configuration: parsing jar:

Re : problem with web site indexing

2006-09-25 Thread Aïcha
Hi, I'm sorry but I still don't succeed in indexing all the content of my web site. In the log I have some errors: 2006-09-25 15:35:42,859 ERROR parse.OutlinkExtractor - getOutlinks java.net.MalformedURLException: unknown protocol: javascript at java.net.URL.<init>(URL.java:574) at java.net.URL.<init>(UR