Re: Plugins: directory not found: plugins

2006-02-07 Thread 盖世豪侠
Hi, do you mean I should create a dir called build and move the plugins dir into it? It seems it doesn't work either. 2006/2/7, Saravanaraj Duraisamy [EMAIL PROTECTED]: Add build\plugins to your classpath. On 2/7/06, 盖世豪侠 [EMAIL PROTECTED] wrote: I try to run Nutch from the command line and I've add…

Re: Plugins: directory not found: plugins

2006-02-07 Thread Jack Tang
Please point plugin.folders (in nutch-default.xml or nutch-site.xml) to the real plugin build destination dir. Of course, you can use an absolute path. /Jack On 2/7/06, 盖世豪侠 [EMAIL PROTECTED] wrote: Hi, do you mean I should create a dir called build and move the plugins dir in? It seems it doesn't work either…
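In concrete terms, the override Jack describes would go in conf/nutch-site.xml. The path below is a hypothetical example; point it at wherever your build actually placed the plugins:

```xml
<!-- conf/nutch-site.xml — overrides nutch-default.xml -->
<property>
  <name>plugin.folders</name>
  <!-- Absolute path to the built plugins dir; adjust to your checkout -->
  <value>/home/nutch/nutch-0.8/build/plugins</value>
  <description>Directories where Nutch should look for plugins.</description>
</property>
```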

Re: Plugins: directory not found: plugins

2006-02-07 Thread Saravanaraj Duraisamy
You build the application and you will get a build folder; add that folder to your classpath. On 2/7/06, 盖世豪侠 [EMAIL PROTECTED] wrote: Hi, do you mean I should create a dir called build and move the plugins dir in? It seems it doesn't work either. 2006/2/7, Saravanaraj Duraisamy [EMAIL PROTECTED]:…

nutch 0.8-devel and url redirect

2006-02-07 Thread Enrico Triolo
I'm switching to nutch-0.8 but I'm facing a problem with URL redirects. To help you understand, I'll explain my problem with a real example: I created an 'urls' directory and inside it an 'urls.txt' file containing only this line: "http://www.punto-informatico.it". If pointed to…

Re: nutch 0.8-devel and url redirect

2006-02-07 Thread Raghavendra Prabhu
Check the URL filters in crawl-urlfilter.txt; see whether the rule is allowed, and see whether the link below matches a URL pattern in that file: http://*punto-informatico.it or http://punto-informatico.it* On 2/7/06, Enrico Triolo [EMAIL PROTECTED] wrote: I'm switching to…

Re: nutch 0.8-devel and url redirect

2006-02-07 Thread Enrico Triolo
Thank you for your reply. My crawl-urlfilter.txt file allows any URL, since I set only this rule: +. (BTW, this is the same rule I set for the 0.7 version.) On 2/7/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote: Check the URL filters in crawl-urlfilter.txt; see whether the rule is allowed, see…
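For reference, crawl-urlfilter.txt holds sign-prefixed regular expressions checked top to bottom, first match wins. That logic can be sketched in Python (the rules below are illustrative, not Enrico's actual "+." config):

```python
import re

# First matching rule decides: "+" accepts the URL, "-" rejects it.
RULES = [
    ("+", re.compile(r"^http://([a-z0-9]*\.)*punto-informatico\.it/")),
    ("-", re.compile(r".")),  # everything else is rejected
]

def accept(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched

print(accept("http://www.punto-informatico.it/"))  # True
print(accept("http://example.com/"))               # False
```

A redirect target, like any other URL, only survives if it passes these rules, which is why checking the filter is a sensible first step.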

Re: Installing nutch

2006-02-07 Thread Bernd Fehling
For those of you who are also reinventing the wheel like me, getting nutch-0.8-dev with MapReduce running on a single box, here are some updates. This is about revision #374443. The DmozParser class mentioned in the quick tutorial for nutch 0.8 and later seems to be in…

Re: Installing nutch

2006-02-07 Thread Zaheed Haque
Hi: Have you looked at the nutch-default.xml config file, under the searcher.dir property? You need to modify this to reflect DFS, where your crawl directory is. I think you will have something like /user/nutch etc. You can find it by trying the following: bin/hadoop dfs and bin/hadoop dfs -ls…
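The override Zaheed describes would look something like this in conf/nutch-site.xml; the DFS path shown is a hypothetical example, so use whatever bin/hadoop dfs -ls actually lists for your crawl:

```xml
<!-- conf/nutch-site.xml — hypothetical example; set the value to the
     DFS directory that "bin/hadoop dfs -ls" shows for your crawl -->
<property>
  <name>searcher.dir</name>
  <value>/user/nutch/crawl</value>
</property>
```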

RE: How deep to go

2006-02-07 Thread Vanderdray, Jacob
If you only want to crawl www.woodward.edu, then change +^http://([a-z0-9]*\.)*woodward.edu/ to: +^http://www.woodward.edu/ Jake. -Original Message- From: Andy Morris [mailto:[EMAIL PROTECTED]] Sent: Monday, February 06, 2006 9:00 PM To: nutch-user@lucene.apache.org Subject:…
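The effect of Jake's change can be checked with a quick script; the URLs are made up, and Python's re stands in for the Java regex engine Nutch uses:

```python
import re

# Broad rule: any subdomain of woodward.edu; narrow rule: www only.
broad  = re.compile(r"^http://([a-z0-9]*\.)*woodward\.edu/")
narrow = re.compile(r"^http://www\.woodward\.edu/")

for url in ("http://www.woodward.edu/admissions",
            "http://mail.woodward.edu/inbox"):
    print(url, bool(broad.match(url)), bool(narrow.match(url)))
```

The narrow rule drops mail.woodward.edu while still admitting everything under www.woodward.edu.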

Re: Installing nutch

2006-02-07 Thread Zaheed Haque
Sorry, you should update your Nutch checkout (svn update) to revision 375624. Cheers On 2/7/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi: Have you looked at the nutch-default.xml config file, under the searcher.dir property? You need to modify this to reflect DFS, where your crawl directory is. I think…

Re: Installing nutch

2006-02-07 Thread Bernd Fehling
Zaheed Haque schrieb: Sorry, you should update your Nutch checkout (svn update) to revision 375624. Cheers Thanks, will do that and redo everything from scratch. Bernd

Re: Speeding up initial searches using cache

2006-02-07 Thread Byron Miller
I use OSCache with great success. I found that an amazing number (more than I assumed) of the queries we get are duplicates of one fashion or another, so on top of warming things up as much as possible in the OS buffer cache, we use OSCache as well. You could also use Squid to cache pages for x amount of…

Re: Categorizing content

2006-02-07 Thread 盖世豪侠
Hi, I think you have to hook into the parsed content from the parse-html plugin and filter the string against your terms. It will of course involve modifying or adding some code. 2006/2/8, Byron Miller [EMAIL PROTECTED]: Is there an easy way to categorize content on parse? I have an extensive list…

Re: Categorizing content

2006-02-07 Thread Jack Tang
Hi Byron, I am wondering whether it would be faster to do this offline? I mean you can re-visit the web db and link db and generate the index. /Jack On 2/8/06, Byron Miller [EMAIL PROTECTED] wrote: Is there an easy way to categorize content on parse? I have an extensive list of adult terms and I would like…

opensearch support

2006-02-07 Thread Geraint Williams
Is OpenSearch being developed? I am using nutch 0.7 and it seems to have some OpenSearch support. However, I failed to get either a Python or a Perl OpenSearch client library working (admittedly these are also in early development). The Perl library seemed to choke at not finding the…

Re: Categorizing content

2006-02-07 Thread 盖世豪侠
It sounds OK, but I think if you don't check it online, you may get a lot of unwanted content in your index. 2006/2/8, Jack Tang [EMAIL PROTECTED]: Hi Byron, I am wondering whether it would be faster to do this offline? I mean you can re-visit the web db and link db and generate the index. /Jack

hadoop-default.xml

2006-02-07 Thread Mike Smith
There is no settings file for Hadoop in conf/. Should there be a hadoop-default.xml? It seems this file is not committed, but it is packaged into the hadoop jar file. Thanks, Mike.

Re: hadoop-default.xml

2006-02-07 Thread Doug Cutting
The file packaged in the jar is used for the defaults. It is read from the jar file, so it should not need to be committed to Nutch. Mike Smith wrote: There is no settings file for Hadoop in conf/. Should there be a hadoop-default.xml? It seems this file is not committed, but it is packaged into…

Re: Categorizing content

2006-02-07 Thread Andrzej Bialecki
Byron Miller wrote: Is there an easy way to categorize content on parse? I have an extensive list of adult terms and I would like to update the meta info on the page if the combination of terms exists, to flag it as adult content so I can exclude it from the search results unless people opt in.
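As a minimal sketch of the term-flagging idea (the term list and threshold here are placeholders; in Nutch this would more naturally live in a parse filter plugin that writes parse metadata):

```python
# Placeholder term list; a real deployment would load Byron's extensive list.
ADULT_TERMS = {"xrated", "explicit", "nsfw"}

def classify(text, threshold=2):
    """Flag text as adult when enough distinct flagged terms appear."""
    words = set(text.lower().split())
    hits = len(words & ADULT_TERMS)
    return "adult" if hits >= threshold else "ok"

print(classify("an explicit and nsfw page"))  # adult
print(classify("a perfectly ordinary page"))  # ok
```

Requiring a combination of distinct terms, rather than any single hit, cuts down on false positives from incidental word use.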

Hadoop Jobtracker fails

2006-02-07 Thread Mike Smith
I am using the new Nutch with Hadoop; the jobtracker fails at initialization with this exception: 060207 121953 Property 'file.separator' is / 060207 121953 Property 'java.vendor.url.bug' is http://java.sun.com/cgi-bin/bugreport.cgi 060207 121953 Property 'sun.io.unicode.encoding' is UnicodeLittle…

new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Rafit Izhak_Ratzin
Hi, I am trying to run the new svn version (375414); I am working under the nutch/trunk directory. When I ran the following command: bin/hadoop jobtracker or bin/hadoop-daemon.sh start jobtracker I got the following message: Exception in thread main java.lang.NoClassDefFoundError:…

Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Mike Smith
The problem is that the Jetty jar files are missing from the SVN. I replaced the Jetty jar files, but I get another exception: 060207 123447 Property 'sun.cpu.isalist' is Exception in thread main java.lang.NullPointerException at org.apache.hadoop.mapred.JobTrackerInfoServer.init(…

Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Mike Smith
I could make it work in a strange way. There seems to be a problem with the hadoop jar file. I downloaded the Hadoop project and built it using ant; then I could start the jobtracker successfully, but when I removed the build folder and just used the hadoop jar file, it failed again. So I…

bug fixes

2006-02-07 Thread Raghavendra Prabhu
Hi, I think NUTCH-94 and NUTCH-96 have also been fixed. Rgds Prabhu

Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Rafit Izhak_Ratzin
I am still getting the following exception: Exception in thread main java.lang.NullPointerException at org.apache.hadoop.mapred.JobTrackerInfoServer.init(JobTrackerInfoServer.java:56) at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:303) at…

Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Mike Smith
Rafit, get the Hadoop project, build it, and use the build folder instead of the jar file. It will work fine then. Something is probably missing from the hadoop jar. M On 2/7/06, Rafit Izhak_Ratzin [EMAIL PROTECTED] wrote: I am still getting the following exception: Exception in thread main…

Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Mike Smith
I finally finished my first successful experience with Nutch/Hadoop. I started from 70,000 seeds and these are the results of a one-cycle crawl: 060207 153956 Statistics for CrawlDb: t1/crawldb 060207 153956 TOTAL urls: 850726 060207 153956 avg score: 1.037 060207 153956 max score:…

Re: new svn version:NoClassDefFoundError - JobTracker

2006-02-07 Thread Rafit Izhak_Ratzin
After copying the build directory from Hadoop to Nutch I can run the crawl cycle; however, I get the following exception (in the jobtracker log) many times: 060207 215603 Server connection on port 50020 from IP... caught: java.lang.RuntimeException: java.lang.ClassNotFoundException:…

Re: Speeding up initial searches using cache

2006-02-07 Thread Chris Lamprecht
Just out of curiosity, does anyone here know how well query caching works in general for an extremely high-volume search engine? It seems like as your search volume goes up, and the number of unique queries goes up with it, the cache hit rate would go down, and caching would help less and less.
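Chris's intuition can be poked at with a toy simulation: query streams are typically heavy-tailed (Zipf-like), so a small popular head keeps the hit rate high even as the number of unique queries grows. All numbers below are illustrative:

```python
import random

random.seed(0)
# Zipf-ish query stream: query at rank r is drawn with probability ~ 1/r.
ranks = list(range(1, 100_001))
weights = [1.0 / r for r in ranks]

cache, hits, total = set(), 0, 200_000
for q in random.choices(ranks, weights=weights, k=total):
    if q in cache:
        hits += 1
    else:
        cache.add(q)  # unbounded cache for simplicity

print(f"hit rate: {hits / total:.2f}")
```

Despite tens of thousands of distinct queries, most traffic lands on the head of the distribution, so a cache like OSCache still pays off; with a uniform query distribution the hit rate would indeed collapse.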

Re: Nutch-general digest, Vol 1 #935 - 8 msgs

2006-02-07 Thread David Wallace
Hi Saravanaraj, For each URL, Nutch reads your filter file from top to bottom until it finds a line (+ or -) that matches the URL. Then it stops reading. Therefore, any files inside E:/Index Samples/Index/ will be INCLUDED, because they match the line that says +^file:/E:/Index Samples/. I…
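David's point about first-match ordering can be illustrated directly; the file path is the one from the thread (the .pdf name is made up), with Python's re standing in for Nutch's Java regexes:

```python
import re

def accept(url, rules):
    # First rule whose pattern matches decides; later rules are ignored.
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

include_first = [("+", r"^file:/E:/Index Samples/"),
                 ("-", r"^file:/E:/Index Samples/Index/")]
exclude_first = [("-", r"^file:/E:/Index Samples/Index/"),
                 ("+", r"^file:/E:/Index Samples/")]

url = "file:/E:/Index Samples/Index/report.pdf"
print(accept(url, include_first))  # True  — the broad "+" rule wins
print(accept(url, exclude_first))  # False — the specific "-" rule wins
```

So to carve an exclusion out of an included tree, the more specific "-" rule must appear above the broader "+" rule.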

Re: Nutch-general digest, Vol 1 #935 - 8 msgs

2006-02-07 Thread Saravanaraj Duraisamy
Hi David, thanks... Is there a way in Nutch to reindex files based on their last-modified date? I have large numbers of PDFs and DOCs in a folder. Do I need to reindex all the files every time I want to update my index? On 2/8/06, David Wallace [EMAIL PROTECTED] wrote: Hi Saravanaraj,…
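One workaround, if Nutch itself doesn't do the incremental selection for you, is to pick out only the files changed since the last index run before feeding them in. A hypothetical sketch of that selection step:

```python
import os

def changed_since(root, cutoff_epoch):
    """Yield files under root whose modification time is newer than cutoff_epoch."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > cutoff_epoch:
                yield path
```

Storing the timestamp of the previous run and passing it as cutoff_epoch turns a full reindex of every PDF and DOC into an update over just the modified files.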