Hi
Do you mean I should create a directory called build and move the plugins directory into it?
It doesn't seem to work either.
2006/2/7, Saravanaraj Duraisamy [EMAIL PROTECTED]:
Add build\plugins
to your classpath
On 2/7/06, 盖世豪侠 [EMAIL PROTECTED] wrote:
I'm trying to run Nutch from the command line and I've added
Please point plugin.folders (in nutch-default.xml or nutch-site.xml) at the directory
where the plugins were actually built. Of course, you can use an absolute path.
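For example, something like this in nutch-site.xml would do it (just a sketch; the path below is an example, point it at wherever your plugins were actually built):

<property>
  <name>plugin.folders</name>
  <value>/path/to/nutch/build/plugins</value>
</property>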
/Jack
On 2/7/06, 盖世豪侠 [EMAIL PROTECTED] wrote:
Hi
Do you mean I should create a directory called build and move the plugins directory into it?
It doesn't seem to work either.
You build the application and you will get a build folder;
add that folder to your classpath.
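For example (an untested sketch; adjust the paths to your own checkout):

cd /path/to/nutch                       # your Nutch checkout (example path)
ant                                     # the build output ends up under ./build
export CLASSPATH=$CLASSPATH:/path/to/nutch/build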
On 2/7/06, 盖世豪侠 [EMAIL PROTECTED] wrote:
Hi
Do you mean I should create a directory called build and move the plugins directory into it?
It doesn't seem to work either.
2006/2/7, Saravanaraj Duraisamy [EMAIL PROTECTED]:
I'm switching to nutch-0.8 but I'm facing a problem with URL redirects.
To help you understand, I'll explain my problem with a real example:
I created an 'urls' directory and inside it I created an 'urls.txt' file
containing only this line: "http://www.punto-informatico.it".
If pointed to
Check the URL filters in crawl-urlfilter.txt.
See whether the rule is allowed, and whether the link below matches a URL
pattern in the crawl-urlfilter.txt file:
http://punto-informatico.it
On 2/7/06, Enrico Triolo [EMAIL PROTECTED] wrote:
I'm switching to
Thank you for your reply.
My crawl-urlfilter.txt file allows any URL, since I set only this rule:
+.
By the way, this is the same rule I set for the 0.7 version.
On 2/7/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote:
Check the URL filters
crawl-urlfilter.txt
see whether the rule is allowed
see
For those of you who, like me, are also reinventing the wheel and getting
nutch-0.8-dev with MapReduce running on a single box, here are some updates.
This is about revision #374443.
The DmozParser class mentioned in the quick tutorial for Nutch
0.8 and later seems to be in
Hi:
Have you looked at the nutch-default.xml config file, under
<name>searcher.dir</name>?
You need to modify this to reflect where your crawl directory is on the DFS. I
think you will have something like /user/nutch etc. You can find
it by trying the following:
bin/hadoop dfs
and
bin/hadoop dfs -ls
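For example, in nutch-site.xml (a sketch; the value below is just an example, use whatever bin/hadoop dfs -ls shows for your crawl directory):

<property>
  <name>searcher.dir</name>
  <value>/user/nutch/crawl</value>
</property>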
If you only want to crawl www.woodward.edu, then change
+^http://([a-z0-9]*\.)*woodward.edu/
To:
+^http://www.woodward.edu/
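(If you want to be strict about the regex you could also escape the dots, e.g. +^http://www\.woodward\.edu/, since an unescaped "." matches any character, but in practice the simpler form above works fine.)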
Jake.
-Original Message-
From: Andy Morris [mailto:[EMAIL PROTECTED]
Sent: Monday, February 06, 2006 9:00 PM
To: nutch-user@lucene.apache.org
Subject:
Sorry, you should update your Nutch:
svn update
to revision 375624
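(i.e., from your nutch/trunk checkout, something like: svn update -r 375624)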
Cheers
On 2/7/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi:
Have you looked at the nutch-default.xml config file, under
<name>searcher.dir</name>?
You need to modify this to reflect where your crawl directory is on the DFS. I
think
Zaheed Haque wrote:
Sorry, you should update your Nutch:
svn update
to revision 375624
Cheers
Thanks, I will do that and redo everything from scratch.
Bernd
I use OSCache with great success.
An amazing amount (more than I assumed) of the queries we get are duplicates
of one fashion or another, so on top of warming things up as much as possible
in the OS buffer cache, we use OSCache as well.
You could also use Squid to cache pages for x amount
of
Hi
I think you have to hook into the parsed content from the parse-html plugin and
filter the text against your terms.
It will of course involve modifying or adding some code.
2006/2/8, Byron Miller [EMAIL PROTECTED]:
Is there an easy way to categorize content on parse?
I have an extensive list
Hi Byron
I am wondering whether it would be faster to do this offline? I mean you can
re-visit the webdb and linkdb and then generate the index.
/Jack
On 2/8/06, Byron Miller [EMAIL PROTECTED] wrote:
Is there an easy way to categorize content on parse?
I have an extensive list of adult terms and I would
like
Is OpenSearch being developed?
I am using Nutch 0.7 and it seems to have some OpenSearch support.
However, I failed to get either a Python or Perl OpenSearch client
library (admittedly these are also in early development). The Perl
library seemed to choke on not finding the
It sounds OK.
But I think if you don't check it online (at parse time), you may end up with a lot of
unwanted content in your index.
2006/2/8, Jack Tang [EMAIL PROTECTED]:
Hi Byron
I am wondering whether it would be faster to do this offline? I mean you can
re-visit the webdb and linkdb and then generate the index.
/Jack
There is no settings file for Hadoop in conf/. Should it be
hadoop-default.xml?
It seems this file is not committed, but it is packaged into the hadoop jar file.
Thanks, Mike.
The file packaged in the jar is used for the defaults. It is read from
the jar file. So it should not need to be committed to Nutch.
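If you want to look at the packaged defaults, something like this should work (the jar name and location may differ in your tree):

unzip -p lib/hadoop-*.jar hadoop-default.xml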
Mike Smith wrote:
There is no settings file for Hadoop in conf/. Should it be
hadoop-default.xml?
It seems this file is not committed, but it is packaged into
Byron Miller wrote:
Is there an easy way to categorize content on parse?
I have an extensive list of adult terms and I would
like to update the meta info on a page if a
combination of the terms exists, to flag it as adult content
so I can exclude it from the search results unless
people opt in.
I am using the new Nutch with Hadoop; the jobtracker fails at initialization with
this exception:
060207 121953 Property 'file.separator' is /
060207 121953 Property 'java.vendor.url.bug' is
http://java.sun.com/cgi-bin/bugreport.cgi
060207 121953 Property 'sun.io.unicode.encoding' is UnicodeLittle
Hi,
I am trying to run with the new svn version (375414); I am working under the
nutch/trunk directory.
When I ran the following command, bin/hadoop jobtracker or bin/hadoop-daemon.sh
start jobtracker,
I got the following message:
Exception in thread main java.lang.NoClassDefFoundError:
The problem is that the Jetty jar files are missing from the SVN. I replaced
the Jetty jar files, but I get another exception:
060207 123447 Property 'sun.cpu.isalist' is
Exception in thread main java.lang.NullPointerException
at org.apache.hadoop.mapred.JobTrackerInfoServer.init(
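(A quick sanity check, assuming the jars are supposed to live under lib/ in the nutch/trunk directory: ls lib | grep -i jetty. If nothing shows up, the jobtracker's info server, which uses Jetty, can't start.)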
I could make it work in a strange way. There seems to be a problem with the hadoop
jar file.
I downloaded the hadoop project and built it using ant. Then I could
start the jobtracker successfully, but when I removed the build folder and
just used the hadoop jar file, it failed again. So I
Hi
I think even
NUTCH-94 and NUTCH-96 have been fixed
Rgds
Prabhu
I am still getting the following exception:
Exception in thread main java.lang.NullPointerException
at
org.apache.hadoop.mapred.JobTrackerInfoServer.init(JobTrackerInfoServer.java:56)
at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:303)
at
Rafit,
Get the hadoop project, build it, and use the build folder instead of the jar
file. It will work fine then. Something is probably missing from the hadoop
jar.
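Something along these lines, for example (paths are placeholders; the point is just to build hadoop from source and use its build output instead of the packaged jar):

cd /path/to/hadoop     # a checkout of the hadoop trunk (example path)
ant                    # produces a build/ directory
# then point nutch at that build/ directory (copy it into the nutch tree or
# add it to the classpath) rather than at the hadoop jar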
M
On 2/7/06, Rafit Izhak_Ratzin [EMAIL PROTECTED] wrote:
I am still getting the following exception:
Exception in thread main
I finally finished my first successful experience with nutch/hadoop. I
started from 70,000 seeds and these are the results of a one-cycle crawl.
060207 153956 Statistics for CrawlDb: t1/crawldb
060207 153956 TOTAL urls: 850726
060207 153956 avg score:1.037
060207 153956 max score:
After copying the build directory from hadoop to nutch,
I can run the crawl cycle; however, I get the following exception (in the jobtracker
log file) a lot of times:
060207 215603 Server connection on port 50020 from IP...
caught: java.lang.RuntimeException: java.lang.ClassNotFoundException:
Just out of curiosity, does anyone here know how well query caching
works in general with an extremely high-volume search engine?
It seems like as your search volume goes up, and the number of unique
queries goes up with it, the cache hit rate would go down, and caching
would help less and less.
Hi Saravanaraj,
For each URL, Nutch reads your filter file from top to bottom, until it
finds a line (+ or -) that matches the URL. Then it stops reading.
Therefore, any files inside E:/Index Samples/Index/ will be INCLUDED,
because they match the line that says +^file:/E:/Index Samples/.
I
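(For example, the order matters: if you wanted to exclude that subdirectory, the more specific '-' line would have to come before the broader '+' line:

-^file:/E:/Index Samples/Index/
+^file:/E:/Index Samples/

With the lines the other way round, the '+' rule matches first and the exclusion never gets a chance.)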
Hi David,
Thanks... Is there a way in Nutch to reindex files based on the last
modified date?
I have large numbers of PDFs and DOCs in a folder. Do I need to reindex
all the files every time I want to update my index?
On 2/8/06, David Wallace [EMAIL PROTECTED] wrote:
Hi Saravanaraj,