+1 for a solution to this pressing issue!
I am seeing the same problem, in my case two symptoms:
1) low fetch speeds
2) crawls abort before completion with an "xxx hung threads" error
message
I am doing a focused crawl on about 70,000 domains.
crawl.ignore.external.links is set to
Adaptive Refetch Interval Patch: http://issues.apache.org/jira/browse/NUTCH-61
(Thanks to Andrzej)
Rgrds. Thomas
On 6/28/06, HUYLEBROECK Jeremy RD-ILAB-SSF
[EMAIL PROTECTED] wrote:
Hey Thomas,
Do you have any pointer to that work?
Thanks
Jeremy.
-Original Message-
There is also
Matt,
AFAIK Nutch does not support fetching arbitrary fetch lists out of the box.
There is a tool in JIRA that supports this, though:
http://issues.apache.org/jira/browse/NUTCH-68.
- Thomas
On 6/25/06, Honda-Search Administrator [EMAIL PROTECTED] wrote:
I'm having a difficult time configuring
In 0.8-dev score is calculated in a ScoringFilter implementation,
default is score-opic plugin
(org.apache.nutch.scoring.opic.OPICScoringFilter).
AFAIK the scoring plugin has to be included in nutch-site. Score
calculation is done as part of updatedb step. Please correct me if I
am wrong about
: TDLN [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org; Honda-Search Administrator
[EMAIL PROTECTED]
Sent: Sunday, June 25, 2006 3:02 AM
Subject: Re: Will pay for someone to help
Matt,
AFAIK Nutch does not support fetching arbitrary fetch lists out of the
box.
There is a tool in JIRA
search engine and
just let Nutch do its thing, knowing that everything will eventually get
indexed.
Matt
- Original Message -
From: TDLN [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org; Honda-Search Administrator
[EMAIL PROTECTED]
Sent: Sunday, June 25, 2006 3:02 AM
Subject: Re: Will pay
Please specify what exact sequence of commands you are using.
For incremental crawling it is best to follow the whole-web style process
as outlined in the tutorial. The one-stop crawl command cannot be used
effectively for that.
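The whole-web style process mentioned above can be sketched as the
following command sequence; the crawldb/segments paths and the SEGMENT
placeholder are assumptions, and exact command names vary between 0.7
and 0.8:

```shell
# Hedged outline of the whole-web crawl cycle; paths and the SEGMENT
# placeholder are assumptions, not output of a real run.
for step in \
  "generate crawl/crawldb crawl/segments" \
  "fetch crawl/segments/SEGMENT" \
  "updatedb crawl/crawldb crawl/segments/SEGMENT" \
  "index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/SEGMENT"
do
  # Each iteration prints one step of the cycle as a bin/nutch command
  echo "bin/nutch $step"
done
```

Repeating generate/fetch/updatedb on a schedule is what gives you the
incremental behaviour the one-stop crawl command cannot.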
HTH Thomas
On 6/23/06, Honda-Search Administrator [EMAIL PROTECTED]
Prune is ok to remove the docs from the index, but it will not prevent
the pages from being refetched, so you might also want to change the
regex-urlfilter (or crawl-urlfilter if you are using the crawl tool)
for that purpose.
Rgrds, Thomas
On 6/22/06, Dima Mazmanov [EMAIL PROTECTED] wrote:
Maybe try again on hadoop-user mailing list?
On 6/20/06, William Choi [EMAIL PROTECTED] wrote:
Hi,
I would like to know whether the input formats that we are supporting
now are only SequenceFileFormat and TextInputFormat. If I want to do something like
indexing files, I would need to
(no need to crawl).
Thanks again,
roberto
On 6/17/06, TDLN [EMAIL PROTECTED] wrote:
Likely org.apache.nutch.net.RegexUrlNormalizer will also change the
URL in the database, thus affecting (re)fetching of your log files.
Thus this might not be the way to go.
Instead you might want to change
You will first have to install Apache Ant (http://ant.apache.org/).
Calling 'ant' in the top level Nutch directory will compile the code.
Calling 'ant tar' will create a distribution tar.
Other targets for testing can be viewed in the build.xml file.
Rgrds, Thomas
On 6/19/06, Honda-Search
You can start here for learning more about Nutch:
http://wiki.apache.org/nutch/
And here is an excellent tutorial that covers getting your custom
fields in the index:
http://wiki.apache.org/nutch/WritingPluginExample
If you have read all this you can come back and we will discuss sorting :)
Unfortunately this is only feasible with *a lot* of custom code.
Probably you will be done sooner refetching and indexing your pages.
Rgrds, Thomas
On 6/19/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Hi,
Is there any way to migrate segments and webdb data generated using 0.7.1 to
0.8-dev
see I'm crawling with a depth of 1, which is intentional. I only
desire to recrawl the specific pages injected each night. I'm wondering if
the 'adddays' parameter is messing me up.
Matt
- Original Message -
From: TDLN [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org; Honda-Search
This is 0.7.2 right?
The QueryFilter implementation code didn't make it through.
Rgrds, Thomas
On 6/23/06, Jayant Kumar Gandhi [EMAIL PROTECTED] wrote:
I also tried with
field="rating" instead of fields="DEFAULT" in plugin.xml, still no luck
On 6/24/06, TDLN [EMAIL PROTECTED] wrote:
Please
RatingQueryFilter() {
  super("rating", 5f);
  LOG.info("Added a rating query");
}
}
On 6/24/06, TDLN [EMAIL PROTECTED] wrote:
This is 0.7.2 right?
The QueryFilter implementation code didn't make it through.
Rgrds, Thomas
On 6/23/06, Jayant Kumar Gandhi [EMAIL PROTECTED] wrote:
I also tried
, TDLN [EMAIL PROTECTED] wrote:
I mean disable the cache link in the search.jsp.
On 6/15/06, TDLN [EMAIL PROTECTED] wrote:
As far as I know, content in the segments is used to generate the
summary in the search results and of course for the cache feature.
If you don't need these you can
from the Nutch process.
HTH Thomas
On 6/16/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Just a +1 for sending your thumbnail-creating code.
Otis
- Original Message
From: TDLN [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wednesday, June 14, 2006 5:07:34 PM
Subject: Re: [Nutch
Has anyone seen this with 0.8?
I think everybody has seen this :)
It *is* intentional and part of how Nutch and MapReduce/Hadoop works, I believe.
Rgrds, Thomas
On 6/16/06, Howie Wang [EMAIL PROTECTED] wrote:
You're right. I guess I misunderstood the term 'hard limit'
when talking about file
Yes, this is the wrong forum :)
This has been discussed many times, please search the archives.
Rgrds, Thomas
On 6/14/06, Dagum, Leo [EMAIL PROTECTED] wrote:
Apologies if this is the wrong forum..
Just downloaded the Nutch 0.7.2 release and tried building, using
jdk1.5.0_03 and ant 1.6.5.
the relevant threads I'd
be very grateful, none of the obvious ones relating to broken builds,
build errors, compile errors etc were helpful.
- leo
-Original Message-
From: TDLN [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 15, 2006 1:44 PM
To: nutch-user@lucene.apache.org
Cc: Dagum, Leo
Hello Marco,
I am creating the thumbnails during the parse phase in a custom
HtmlParseFilter implementation. The images are selected from the
Outlink array.
The disadvantage of this approach is that the thumbs are recreated
when the page is fetched again so just like with the segments you have
Take a look at the ImageJ library: http://rsb.info.nih.gov/ij/
I don't have access to my repository now but as soon as I have I will
send you the code I am using to create thumbnails.
Rgrds, Thomas
On 6/12/06, Marco Pereira [EMAIL PROTECTED] wrote:
Hi everybody,
As I have said on another
I cannot see any other likely cause than that you did not configure
Tomcat to unpack WARs.
Rgrds. Thomas
On 6/5/06, Matthew Holt [EMAIL PROTECTED] wrote:
Hi all,
Just attempting to install a demo Intranet crawl on my local machine.
I followed the tutorial directions step by step and ran the
This error usually occurs when you forget to add the plugin to the
plugin.includes var in nutch-site.xml. Can you check if the proper
conf directory and files are being used? This should be visible from
when Nutch loads its configuration.
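A minimal nutch-site.xml sketch of that setting; the value shown is only
an illustrative subset of the default plugin list, and `query-rating` is
a hypothetical plugin id standing in for whatever plugin was forgotten:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Extend the default pipe-separated list with your plugin id;
         "query-rating" here is a hypothetical example. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|query-rating</value>
  </property>
</configuration>
```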
Rgrds. Thomas
On 6/3/06, Jason Camp [EMAIL PROTECTED]
The syntax for the crawl command is
Crawl urlDir [-dir d] [-threads n] [-depth i] [-topN N]
So your first parameter should point to the *directory* containing the
file with seed urls, not the file itself.
Please fix your syntax and try again.
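A corrected invocation might look like this; the directory layout and
option values are assumptions, only the directory-vs-file point is from
the thread:

```shell
# Sketch (paths are assumptions): the seed file must live *inside* a
# directory, and that directory is what the crawl command takes.
mkdir -p urls
echo "http://www.example.org/" > urls/seeds.txt
# First argument is the directory, not the file:
cmd="bin/nutch crawl urls -dir crawl -threads 10 -depth 3 -topN 50"
echo "$cmd"
```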
Rgrds, Thomas
On 6/3/06, Teruhiko Kurosaka [EMAIL
I am interested in developing such a solution as well.
I am currently storing the thumbnails on the file system under a
system generated name. My indexing plugin stores the filename in the
index. Thumbnails are later served to the client by a separate Apache
HTTP server. This required some changes
like you can open an issue to request a Nutch sandbox project for
image search.
If enough people vote for this issue we may have a chance to
get it created.
Stefan
Am 03.06.2006 um 10:38 schrieb TDLN:
I am interested in developing such a solution as well.
I am currently
(E.g. Nutch defines one URL == one index document.)
Why can't we create a document for every image that is found?
Then it is as if we will have a parse-image plugin just like we have a
parse-html and parse-pdf plugin, with the only difference that it will
be run after all the pages in the
] wrote:
Well I can do the project management side of it, and can volunteer some
time, but have never done this in an open source model before. But I can do
documentation, project management support, and make a decent cheerleader as
well.
Let me know.
r/d
-Original Message-
From: TDLN
in this area I cannot answer your question.
Anyway, now I think it is time to read the Hadoop MapReduce code :)
Rgrds, Thomas
On 6/3/06, Dima Mazmanov [EMAIL PROTECTED] wrote:
Hi, TDLN.
But how will the image data be stored in the Nutch database?
Would it affect the rest of the data in it?
(E.g. Nutch defines one URL
I am seeing IOExceptions running the nightly build from 22-05.
Anybody seen these before?
nutch inject crawl/crawldb urls/
060529 174013 java.io.IOException: config()
at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:66)
at
Did you add the plugins directory to your classpath and does it
contain all of your plugins?
Rgrds, Thomas
On 5/23/06, Murat Ali Bayir [EMAIL PROTECTED] wrote:
Hi everybody, I am running Nutch 0.8 under Windows using Eclipse and
I got the following error. I added the conf directory to my classpath.
Hi Stefan
try running bin/nutch org.apache.nutch.net.URLFilterChecker
Rgrds, Thomas
On 5/22/06, Stefan Neufeind [EMAIL PROTECTED] wrote:
Hi,
is there a way to debug rules for RegexUrlNormalizer, e.g. test the
substitution from commandline?
bin/nutch
Sorry, I was a bit too fast there, the answer applies to the
RegexURLFilter not the RegexUrlNormalizer. I don't think there is a
similar facility for the RegexUrlNormalizer, but let me know if you
find it :)
Rgrds, Thomas
On 5/22/06, TDLN [EMAIL PROTECTED] wrote:
Hi Stefan
try running bin
+1
I would be interested as well.
Rgrds, Thomas Delnoij
On 5/10/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
+1 to this!
I won't be in San Francisco on the 11th, but would be interested in
seeing/listening either in real-time or a recorded version.
Thanks,
Otis
- Original Message
Or maybe one of the mailing list administrators can exert some control
here; Herman's emails are not really adding to the readability of the
archives either :)
http://mail-archive.com/nutch-user%40lucene.apache.org/
Rgrds, Thomas
On 5/3/06, Herman Hardenbol [EMAIL PROTECTED] wrote:
Sorry, I
let me
know and we will disable the autoreply altogether.
kind regards,
John Steenwinkel
IT Services, ISS
Helpdesk Officer on 03 May 2006 at 10:49 +0100 wrote:
- Original Message -
03 May 2006 10:43:56
Message
From: TDLN [EMAIL PROTECTED]
Subject:Spam
Hi Chun Wei.
just google for 'tomcat performance tuning', you will find a lot of helpful
information.
For instance:
http://tomcat.apache.org/articles/performance.pdf
http://www.javaworld.com/channel_content/jw-performance-index.shtml
Rgrds, Thomas
On 4/27/06, Chun Wei Ho [EMAIL PROTECTED]
Since there are a number of file formats, I can't add each of them to the
ignore list.
Why not? You can add something like
-\.(java|class|jar|dll)$
etc.
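Such a suffix pattern can be sanity-checked from the shell before it
goes into regex-urlfilter.txt; a minimal sketch with made-up URLs (the
leading '-' in the config file means "exclude", here we only exercise
the pattern itself):

```shell
# Test the exclusion pattern with grep -E; URLs are hypothetical.
pattern='\.(java|class|jar|dll)$'
echo "http://example.com/lib/foo.jar" | grep -qE "$pattern" && echo "excluded"
echo "http://example.com/page.html"  | grep -qE "$pattern" || echo "kept"
```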
Rgrds, Thomas
An alternative could be that it fetches and shows results only for parsable
documents.
Can anybody help me with this?
I would be very interested in a European user meeting. Berlin would be
fine as well.
Great idea!
Thomas
On 4/22/06, Stefan Groschupf [EMAIL PROTECTED] wrote:
Hi Sami, Hi Dawid, Hi All,
yes if there are enough people interested I would love to get a
European user meeting organized as well.
I think the nutch readdb command only gives statistics for the crawldb
(crawled Pages) and not the index.
Rgrds, Thomas
On 4/18/06, Michael Levy [EMAIL PROTECTED] wrote:
Ben, how about this:
bin/nutch readdb crawled/db -stats
where crawled is the directory holding the index?
Here's a good
I think it is plain Lucene syntax that is expected, for instance:
#delete docs from www.cnn.com
url:"www cnn com"
#delete docs that contain p0rn in their content,
#but not study or research, and which come from www.cnn.com
content:p0rn -content:(study research) +url:"www cnn com"
#
Luke (http://www.getopt.org/luke/) comes in handy for those purposes.
Rgrds, Thomas
On 4/18/06, Benjamin Higgins [EMAIL PROTECTED] wrote:
Hi, I looked through the FAQ but found nothing about getting basic index
statistics, like quite simply, how many pages are in the index.
How can I figure
I disagree that it should be difficult to stay up to date with the main
codeline if you have a lot of local changes.
You can put your code under local version control in subversion and
then use the process described in the Vendor branches chapter of the
subversion book (found here:
My guess is you have to override the searcher.dir property in
nutch-site.xml and have it point to your crawl dir.
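That override might look like the following nutch-site.xml fragment;
the path is an assumption standing in for your actual crawl directory:

```xml
<property>
  <name>searcher.dir</name>
  <!-- Directory containing the index and segments;
       /home/nutch/crawl is a hypothetical example. -->
  <value>/home/nutch/crawl</value>
</property>
```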
Rgrds, Thomas
On 4/5/06, Paul Stewart [EMAIL PROTECTED] wrote:
Hi there...
I was having a number of problems with my install, mainly because I'm
not used to Tomcat and/or Nutch
I am (finally) moving my installation to 0.8-dev. Now I was wondering
if one of the developers
could post their .classpath and .project eclipse settings files. I
have seen those files being posted for 0.7, so I thought I might as
well ask.
Rgrds, Thomas
nutch-users -
both in the whole web and intranet scenarios, I am now getting
060406 154710 Generator: Partitioning selected urls by host, for politeness.
060406 154710 parsing
jar:file:/home/tdelnoij/dev/sandbox/nutch-0.8-dev/lib/hadoop-0.1.0.jar!/hadoop-default.xml
060406 154710 parsing
Oops, this one seems to have been fixed already:
http://mail-archive.com/nutch-user%40lucene.apache.org/msg04130.html
I will give it a shot with the last nightly build.
Rgrds, Thomas
On 4/6/06, TDLN [EMAIL PROTECTED] wrote:
nutch-users -
both in the whole web and intranet scenarios, I am
for the plugins
folder under the build and will load all necessary plugins from there.
Dennis
-Original Message-
From: TDLN [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 06, 2006 7:51 AM
To: nutch-user@lucene.apache.org
Subject: .classpath and .project for 0.8
I am (finally) moving my
That's it, thanks again!
On 4/6/06, Dennis Kubes [EMAIL PROTECTED] wrote:
Here they are zipped up.
-Original Message-
From: TDLN [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 06, 2006 11:44 AM
To: nutch-user@lucene.apache.org
Subject: Re: .classpath and .project for 0.8
Thanks
. parse.getData().get("index") to get the
meta-data value for "index". What am I missing?
Thanks for the pointers!
Ben
On 4/3/06, TDLN [EMAIL PROTECTED] wrote:
It depends if you control the seed pages or not; if you do, you could tag
them index="no"
and skip them during indexing. You would
How can I get the status of the crawl process?
In general this should be apparent from the crawl log.
- number of fetched pages is printed to the logs at certain
intervals (also number of pages/sec etc.)
- number of indexed pages if you use the crawl tool; indexing is
done after all pages
It depends if you control the seed pages or not; if you do, you could tag
them index="no"
and skip them during indexing. You would have to change HtmlParser and
BasicIndexingFilter.
Rgrds, Thomas
On 4/4/06, Benjamin Higgins [EMAIL PROTECTED] wrote:
Hello,
I've gone through the documentation
I am also interested in this. Till now I didn't find any good open source
tools for this purpose, just this one: www.splunk.com.
Rgrds, Thomas
On 3/31/06, Vanderdray, Jacob [EMAIL PROTECTED] wrote:
What open source tools do people like for analyzing nutch search
log files? I'm specifically
Yes! This is great news, thank you so much.
By the way: in the revision of the release notes that you posted (292986),
the changes for 0.7.2 are missing.
Rgrds, Thomas
On 4/1/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote:
Hello all,
The 0.7.2 release of Nutch is now available. This is a bug
Is this the correct revision of the release notes?
http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158
Rgrds, Thomas
On 4/1/06, TDLN [EMAIL PROTECTED] wrote:
Yes! This is great news, thank you so much.
By the way: in the revision of the release notes
Google's and Yahoo's Terms of Service provide interesting reading regarding
such legal issues.
http://www.google.com/terms_of_service.html
http://docs.yahoo.com/info/terms/
Rgrds, Thomas
On 3/30/06, gekkokid [EMAIL PROTECTED] wrote:
Shouldn't be a problem if you're honouring the robots.txt
passwords,
and honor robots.txt and they post it on the web, it is considered
public in that regard.
I am not a lawyer, check Groklaw.
r/d
-Original Message-
From: TDLN [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 30, 2006 3:34 AM
To: nutch-user@lucene.apache.org
Subject: Re
Wojtek,
those commands apply to 0.7.1 (the version I am still working with).
For 0.8 I think you can use 'nutch readdb' and 'nutch readlinkdb'.
How to get the content by URL I don't know, but it should be possible
somehow in 0.8.
Rgrds, Thomas
On 3/27/06, TDLN [EMAIL PROTECTED] wrote
Wojciech,
1. list of crawled pages
There's the 'nutch admin' command:
java org.apache.nutch.tools.WebDBAdminTool (-local | -ndfs namenode:port)
db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]
Using '-textdump' will dump the contents of the WebDB to a text file.
Then there is the
Just create the directory
'/home/scott/downloads/nutch-0.7.1/src/plugin/nutch-extensionpoints/src/java'
and run ant again.
Rgrds, Thomas
On 3/24/06, keren nutch [EMAIL PROTECTED] wrote:
Hi,
I extracted tar -xf nutch-0.7.1.tar.gz and got the info
tar: A lone zero block at 132784
When I
PrefixURLFilter should consume less RAM
than the hashmap presumably underlying your cache, while still
delivering similar lookup speed. But perhaps I'm wrong?)
--Matt
On Mar 19, 2006, at 1:09 PM, TDLN wrote:
I agree with you. That was a bold statement, not necessarily backed
up by
any hard
I don't think there is a plugin that does that. If you're using the
OpenSearchServlet, you could create a ServletFilter that intercepts the
requests and calculates the time it takes to perform a search.
Maybe others have more creative ideas
Rgrds, Thomas
On 3/20/06, Edward Quick [EMAIL
There's the DBUrlFilter as well, that stores the Whitelist in the database:
http://issues.apache.org/jira/browse/NUTCH-100
It performs better than the PrefixURLFilter and also makes managing the
list easier.
Rgrds, Thomas
On 3/15/06, Matt Kangas [EMAIL PROTECTED] wrote:
For a
explain?
--Matt
On Mar 19, 2006, at 3:13 AM, TDLN wrote:
There's the DBUrlFilter as well, that stores the Whitelist in the
database:
http://issues.apache.org/jira/browse/NUTCH-100
It performs better than the PrefixURLFilter and also makes managing
the list easier
I can only speak for myself, but I would need the output from the different
Nutch commands to analyse this problem.
Rgrds, Thomas
On 3/13/06, Richard Braman [EMAIL PROTECTED] wrote:
-Original Message-
From: Alen [mailto:[EMAIL PROTECTED]
Sent: Monday, March 13, 2006 1:42 AM
To:
Unfortunately, in the 0.7 release, the NutchBean does not clean up properly
after itself, so some SegmentReaders and IndexReaders remain open. I think
this is fixed in the current code line. I had similar problems in my app
based on 0.7 - all that helped was killing the processes blocking the
/14/06, Laurent Michenaud [EMAIL PROTECTED] wrote:
It would be interesting to have a fix for 0.7
-Message d'origine-
De : TDLN [mailto:[EMAIL PROTECTED]
Envoyé : mardi 14 mars 2006 12:32
À : nutch-user@lucene.apache.org
Objet : Re: Problems
Unfortunately, in the 0.7 release
Richard.
So would I do something like
1. parse out the citation
2. metadata.put("citation", citation);
Yes, I think that is the way to proceed. And then on implementing the
Indexing and Query Filters, all as described in the WritingPlugin tutorial:
You can start here http://wiki.apache.org/nutch/NutchDistributedFileSystem
Also, I think there have been several posts in the mailing list that contain
such a step-by-step overview.
Rgrds, Thomas
On 3/8/06, Olive g [EMAIL PROTECTED] wrote:
Hi I am new here.
Could someone please let me know
Detailed distributed crawl implementation:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02270.html
I am not sure it applies to 0.7 though, but it has a lot of info.
Rgrds, Thomas
Stefan.
I know people having a 500 million page index and I personally run crawls
with ~300 pages per second.
Sorry, but I have to ask: what kind of setup do you have (network, hw, nutch
version) that you manage so many pages per second?
Unless this is a company secret, it would be very nice to know
You need to do both: seed the WebDB with the 14k urls extracted from the
dmoz
content file AND filter newly found urls against the urls in the mysql
database using the urlfilter-db.
This is significantly faster than adding the 14k urls to the
regex-urlfilter.txt file and checking against that.
for the fields you're interested
in. Instead of fields="DEFAULT" in the example, you'll want
raw-fields="language" and raw-fields="category". Assuming you name the
fields "language" and "category" when you add them to the index.
Jake.
-Original Message-
From: TDLN [mailto:[EMAIL PROTECTED]
Sent
You can follow the tutorial at
http://wiki.apache.org/nutch/WritingPluginExample. Just replace
"recommended" with "category", and it will show you what to do.
(I just implemented a category filter this way ...)
Rgrds, T.
On 2/23/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
Hi,
I have added on
I am still using 0.7.1 - I think the CrawlDatum.setMetaData is only part of
the trunk.
Is it not possible to just hack the MoreIndexingFilter and calculate the
date_indexed field there (similar to how the lastModified field is
calculated), and add a DateIndexedQueryFilter to the
otherwise would be lost, right?
Rgrds, Thomas
On 2/17/06, TDLN [EMAIL PROTECTED] wrote:
I am still using 0.7.1 - I think the CrawlDatum.setMetaData is only part
of the trunk.
Is it not possible to just hack the MoreIndexingFilter and calculate the
date_indexed field there (similar to how