Re: .8 svn - fetcher performance..

2006-06-28 Thread TDLN
+1 for a solution to this pressing issue! I am seeing the same problem, in my case with two symptoms: 1) low fetch speeds 2) crawls ending prematurely, aborting with an "xxx hung threads" error message. I am doing a focused crawl on about 70,000 domains. crawl.ignore.external.links is set to

RE: Will pay for someone to help

2006-06-27 Thread TDLN
Adaptive Refetch Interval Patch: http://issues.apache.org/jira/browse/NUTCH-61 (Thanks to Andrzej) Rgrds. Thomas On 6/28/06, HUYLEBROECK Jeremy RD-ILAB-SSF [EMAIL PROTECTED] wrote: Hey Thomas, Do you have any pointer to that work? Thanks Jeremy. -Original Message- There is also

Re: Will pay for someone to help

2006-06-25 Thread TDLN
Matt, AFAIK Nutch does not support fetching arbitrary fetch lists out of the box. There is a tool in JIRA that supports this though: http://issues.apache.org/jira/browse/NUTCH-68. - Thomas On 6/25/06, Honda-Search Administrator [EMAIL PROTECTED] wrote: I'm having a difficult time configuring

Re: page ranking computation in Nutch 08

2006-06-25 Thread TDLN
In 0.8-dev the score is calculated in a ScoringFilter implementation; the default is the scoring-opic plugin (org.apache.nutch.scoring.opic.OPICScoringFilter). AFAIK the scoring plugin has to be included in nutch-site.xml. Score calculation is done as part of the updatedb step. Please correct me if I am wrong about

Re: Will pay for someone to help

2006-06-25 Thread TDLN
: TDLN [EMAIL PROTECTED] To: nutch-user@lucene.apache.org; Honda-Search Administrator [EMAIL PROTECTED] Sent: Sunday, June 25, 2006 3:02 AM Subject: Re: Will pay for someone to help Matt, AFAIK Nutch does not support fetching arbitrary fetch lists out of the box. There is a tool in JIRA

Re: Will pay for someone to help

2006-06-25 Thread TDLN
search engine and just let nutch do its thing knowing that everything will eventually get indexed. Matt - Original Message - From: TDLN [EMAIL PROTECTED] To: nutch-user@lucene.apache.org; Honda-Search Administrator [EMAIL PROTECTED] Sent: Sunday, June 25, 2006 3:02 AM Subject: Re: Will pay

Re: ERROR when recrawling... can ANYONE help?

2006-06-23 Thread TDLN
Please specify the exact sequence of commands you are using. For incremental crawling it is best to follow the whole-web style process as outlined in the tutorial. The one-stop crawl command cannot be used effectively for that. HTH Thomas On 6/23/06, Honda-Search Administrator [EMAIL PROTECTED]

Re: Deleting documents

2006-06-23 Thread TDLN
Prune is ok to remove the docs from the index, but it will not prevent the pages from being refetched, so you might also want to change the regex-urlfilter (or crawl-urlfilter if you are using the crawl tool) for that purpose. Rgrds, Thomas On 6/22/06, Dima Mazmanov [EMAIL PROTECTED] wrote:

Re: hadoop Input format

2006-06-23 Thread TDLN
Maybe try again on the hadoop-user mailing list? On 6/20/06, William Choi [EMAIL PROTECTED] wrote: Hi, I would like to know whether the only input formats supported right now are SequenceFileInputFormat and TextInputFormat. If I want to do something like indexing files, I would need to

Re: managing content size in segments folder

2006-06-23 Thread TDLN
(no need to crawl). Thanks again, roberto On 6/17/06, TDLN [EMAIL PROTECTED] wrote: Likely org.apache.nutch.net.RegexUrlNormalizer will also change the URL in the database, thus affecting (re)fetching of your log files. Thus this might not be the way to go. Instead you might want to change

Re: Compiling Nutch

2006-06-23 Thread TDLN
You will first have to install Apache Ant (http://ant.apache.org/). Calling 'ant' in the top-level Nutch directory will compile the code. Calling 'ant tar' will create a distribution tar. Other targets, for example for testing, can be viewed in the build.xml file. Rgrds, Thomas On 6/19/06, Honda-Search

Re: Newbie needs help with fielded searching and sorting on custom fields

2006-06-23 Thread TDLN
You can start here for learning more about Nutch: http://wiki.apache.org/nutch/ And here is an excellent tutorial that covers getting your custom fields in the index: http://wiki.apache.org/nutch/WritingPluginExample If you have read all this you can come back and we will discuss sorting :)

Re: Migrating crawled data (urls) from version 0.7.1 to 0.8-dev.

2006-06-23 Thread TDLN
Unfortunately this is only feasible with *a lot* of custom code. Probably you will be done sooner refetching and indexing your pages. Rgrds, Thomas On 6/19/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, Is there any way to migrate segments and webdb data generated using 0.7.1 to 0.8-dev

Re: ERROR when recrawling... can ANYONE help?

2006-06-23 Thread TDLN
see I'm crawling with a depth of 1, which is intentional. I only desire to recrawl the specific pages injected each night. I'm wondering if the 'adddays' parameter is messing me up. Matt - Original Message - From: TDLN [EMAIL PROTECTED] To: nutch-user@lucene.apache.org; Honda-Search

Re: Newbie needs help with fielded searching and sorting on custom fields

2006-06-23 Thread TDLN
This is 0.7.2 right? The QueryFilter implementation code didn't make it through. Rgrds, Thomas On 6/23/06, Jayant Kumar Gandhi [EMAIL PROTECTED] wrote: I also tried with field=rating instead of fields=DEFAULT in plugin.xml, still no luck On 6/24/06, TDLN [EMAIL PROTECTED] wrote: Please

Re: Newbie needs help with fielded searching and sorting on custom fields

2006-06-23 Thread TDLN
RatingQueryFilter() { super("rating", 5f); LOG.info("Added a rating query"); } } On 6/24/06, TDLN [EMAIL PROTECTED] wrote: This is 0.7.2 right? The QueryFilter implementation code didn't make it through. Rgrds, Thomas On 6/23/06, Jayant Kumar Gandhi [EMAIL PROTECTED] wrote: I also tried

Re: managing content size in segments folder

2006-06-17 Thread TDLN
, TDLN [EMAIL PROTECTED] wrote: I mean disable the cache link in the search.jsp. On 6/15/06, TDLN [EMAIL PROTECTED] wrote: As far as I know, content in the segments is used to generate the summary in the search results and of course for the cache feature. If you don't need these you can

Re: [Nutch-general] Cached.jsp for image content type (OFF TOPIC, LONGISH)

2006-06-16 Thread TDLN
from the Nutch process. HTH Thomas On 6/16/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Just a +1 for sending your thumbnail-creating code. Otis - Original Message From: TDLN [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Wednesday, June 14, 2006 5:07:34 PM Subject: Re: [Nutch

Re: Too many open files

2006-06-16 Thread TDLN
Has anyone seen this with 0.8? I think everybody has seen this :) It *is* intentional and part of how Nutch and MapReduce/Hadoop work, I believe. Rgrds, Thomas On 6/16/06, Howie Wang [EMAIL PROTECTED] wrote: You're right. I guess I misunderstood the term hard limit when talking about file

Re: nutch .72 out-of-the-box build issue

2006-06-15 Thread TDLN
Yes, this is the wrong forum :) This has been discussed many times, please search the archives. Rgrds, thomas On 6/14/06, Dagum, Leo [EMAIL PROTECTED] wrote: Apologies if this is the wrong forum.. Just downloaded the nutch .72 release and tried building, using jdk1.5.0_03 and ant 1.6.5.

Re: nutch .72 out-of-the-box build issue

2006-06-15 Thread TDLN
the relevant threads I'd be very grateful, none of the obvious ones relating to broken builds, build errors, compile errors etc were helpful. - leo -Original Message- From: TDLN [mailto:[EMAIL PROTECTED] Sent: Thursday, June 15, 2006 1:44 PM To: nutch-user@lucene.apache.org Cc: Dagum, Leo

Re: Nutch Image Search

2006-06-14 Thread TDLN
Hello Marco, I am creating the thumbnails during the parse phase in a custom HtmlParseFilter implementation. The images are selected from the Outlink array. The disadvantage of this approach is that the thumbs are recreated when the page is fetched again so just like with the segments you have

Re: Cached.jsp for image content type

2006-06-14 Thread TDLN
Take a look at the ImageJ library; http://rsb.info.nih.gov/ij/ I don't have access to my repository now but as soon as I have I will send you the code I am using to create thumbnails. Rgrds, Thomas On 6/12/06, Marco Pereira [EMAIL PROTECTED] wrote: Hi everybody, As I have said on another

Re: Intranet Crawl Demo

2006-06-05 Thread TDLN
I cannot see any other likely cause than that you did not configure Tomcat to unpack WARs. Rgrds. Thomas On 6/5/06, Matthew Holt [EMAIL PROTECTED] wrote: Hi all, Just attempting to install a demo Intranet crawl on my local machine. I followed the tutorial directions step by step and ran the

Re: No scoring plugins problem

2006-06-03 Thread TDLN
This error usually occurs when you forget to add the plugin to the plugin.includes property in nutch-site.xml. Can you check if the proper conf directory and files are being used? This should be visible in the output when Nutch loads its configuration. Rgrds. Thomas On 6/3/06, Jason Camp [EMAIL PROTECTED]

Re: help running 5/31 version of nightly build

2006-06-03 Thread TDLN
The syntax for the crawl command is Crawl urlDir [-dir d] [-threads n] [-depth i] [-topN N] So your first parameter should point to the *directory* containing the file with seed urls, not the file itself. Please fix your syntax and try again. Rgrds, Thomas On 6/3/06, Teruhiko Kurosaka [EMAIL

Re: Image Search

2006-06-03 Thread TDLN
I am interested in developing such a solution as well. I am currently storing the thumbnails on the file system under a system-generated name. My indexing plugin stores the filename in the index. Thumbnails are later served to the client by a separate Apache HTTP server. This required some changes

Re: Re[2]: Image Search

2006-06-03 Thread TDLN
like you can open an issue to request a nutch sandbox project image search. If we get enough people to vote for this issue we may have a chance to get it created. Stefan On 03.06.2006 at 10:38, TDLN wrote: I am interested in developing such a solution as well. I am currently

Re: Image Search

2006-06-03 Thread TDLN
(E.G. Nutch define one url == one index document.) Why can't we create a document for every image that is found? Then it is as if we will have a parse-image plugin just like we have a parse-html and parse-pdf plugin, with the only difference that it will be run after all the pages in the

Re: Re[2]: Image Search

2006-06-03 Thread TDLN
] wrote: Well I can do the project management side of it, and can volunteer some time, but have never done this in an open source model before. But I can do documentation, project management support, and make a decent cheer leader as well. Let me know. r/d -Original Message- From: TDLN

Re: Re[2]: Image Search

2006-06-03 Thread TDLN
in this area I cannot answer your question. Anyway, now I think it is time to read the hadoop MapReduce code :) Rgrds, Thomas On 6/3/06, Dima Mazmanov [EMAIL PROTECTED] wrote: Hi, TDLN. But how will image data be stored in the nutch database? Would it affect the rest of the data in it? (E.G. Nutch define one url

IOException nightly build 22-05

2006-05-29 Thread TDLN
I am seeing IOException's running the nightly build from 22-05. Anybody seen these before? nutch inject crawl/crawldb urls/ 060529 174013 java.io.IOException: config() at org.apache.hadoop.conf.Configuration.init(Configuration.java:66) at

Re: Run-Time Error

2006-05-26 Thread TDLN
Did you add the plugins directory to your classpath and does it contain all of your plugins? Rgrds, Thomas On 5/23/06, Murat Ali Bayir [EMAIL PROTECTED] wrote: Hi everybody, I am running Nutch 0.8 under Windows using Eclipse and I got the following error. I added the conf directory to my classpath.

Re: Debugging rules for RegexUrlNormalizer

2006-05-22 Thread TDLN
Hi Stefan try running bin/nutch org.apache.nutch.net.URLFilterChecker Rgrds, Thomas On 5/22/06, Stefan Neufeind [EMAIL PROTECTED] wrote: Hi, is there a way to debug rules for RegexUrlNormalizer, e.g. test the substitution from commandline? bin/nutch

Re: Debugging rules for RegexUrlNormalizer

2006-05-22 Thread TDLN
Sorry, I was a bit too fast there, the answer applies to the RegexURLFilter not the RegexUrlNormalizer. I don't think there is a similar facility for the RegexUrlNormalizer, but let me know if you find it :) Rgrds, Thomas On 5/22/06, TDLN [EMAIL PROTECTED] wrote: Hi Stefan try running bin

Re: [Nutch-general] Re: Extending Nutch talk, May 11th, Palo Alto, CA

2006-05-10 Thread TDLN
+1 I would be interested as well. Rgrds, Thomas Delnoij On 5/10/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: +1 to this! I won't be in San Francisco on the 11th, but would be interested in seeing/listening either in real-time or a recorded version. Thanks, Otis - Original Message

Re: Fwd: Spam warning.

2006-05-03 Thread TDLN
Or maybe one of the mailing list administrators can exert some control here; Herman's emails are not really adding to the readability of the archives either :) http://mail-archive.com/nutch-user%40lucene.apache.org/ Rgrds, Thomas On 5/3/06, Herman Hardenbol [EMAIL PROTECTED] wrote: Sorry, I

Fwd: Fwd: Spam warning.

2006-05-03 Thread TDLN
let me know and we will disable the autoreply altogether. kind regards, John Steenwinkel IT Services, ISS Helpdesk Officer on 03 May 2006 at 10:49 +0100 wrote: - Original Message - 03 May 2006 10:43:56 Message From: TDLN [EMAIL PROTECTED] Subject:Spam

Re: Optimizing the performance of a Nutch-based web application?

2006-04-27 Thread TDLN
Hi Chun Wei. Just google for 'tomcat performance tuning', you will find a lot of helpful information. For instance: http://tomcat.apache.org/articles/performance.pdf http://www.javaworld.com/channel_content/jw-performance-index.shtml Rgrds, Thomas On 4/27/06, Chun Wei Ho [EMAIL PROTECTED]

Re: unable to filter different file format like .java,.jar,.class with nutch version 0.7.2

2006-04-24 Thread TDLN
Since there are a number of file formats and I can't add each of them to the ignore list. Why not? You can add something like -\.(java|class|jar|dll) etc. Rgrds, Thomas An alternative could be that it fetches and shows results only for parsable documents. Can anybody help me in this
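The exclusion rule suggested in this reply can be tried out in plain Java before it goes into a URL filter file. What follows is only a sketch under stated assumptions: the class name SuffixExclusionDemo and the accept method are hypothetical and not part of Nutch, the leading "-" of the filter rule is modelled as "reject on match", and the trailing "$" is an addition here so that a host such as www.javaworld.com is not rejected just because it contains ".java".

```java
import java.util.regex.Pattern;

class SuffixExclusionDemo {
    // Extensions to reject. The "$" anchor is an assumption added for this
    // sketch; without it the pattern would match ".java" anywhere in the URL.
    static final Pattern EXCLUDED = Pattern.compile("\\.(java|class|jar|dll)$");

    /** Returns true when the URL should be kept, false when the rule excludes it. */
    static boolean accept(String url) {
        return !EXCLUDED.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.org/src/Foo.java",
            "http://example.org/lib/tool.jar",
            "http://www.javaworld.com/index.html"
        };
        for (String url : samples) {
            System.out.println((accept(url) ? "keep   " : "reject ") + url);
        }
    }
}
```

Adding further extensions means extending the alternation inside the parentheses, which is the point of the reply: one rule can cover many file types.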

Re: yes, a European nutch meeting is also planed :)

2006-04-22 Thread TDLN
I would be very interested in a European user meeting. Berlin would be fine as well. Great idea! Thomas On 4/22/06, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi Sami, Hi Dawid, Hi All, yes if there are enough people interested I would love to get a European user meeting organized as well.

Re: Index statistics

2006-04-19 Thread TDLN
I think the nutch readdb command only gives statistics for the crawldb (crawled Pages) and not the index. Rgrds, Thomas On 4/18/06, Michael Levy [EMAIL PROTECTED] wrote: Ben, how about this: bin/nutch readdb crawled/db -stats where crawled is the directory holding the index? Here's a good

Re: USing PruneTool

2006-04-19 Thread TDLN
I think it is plain Lucene syntax that is expected, for instance: #delete docs from www.cnn.com url:www cnn com #delete docs that contain p0rn in their content, #but not study or research, and which come from www.cnn.com content:p0rn -content:(study research) +url:www cnn com #

Re: Index statistics

2006-04-18 Thread TDLN
Luke (http://www.getopt.org/luke/) comes in handy for those purposes. Rgrds, Thomas On 4/18/06, Benjamin Higgins [EMAIL PROTECTED] wrote: Hi, I looked through the FAQ but found nothing about getting basic index statistics, like quite simply, how many pages are in the index. How can I figure

Re: Can nutch fit to thi task ?

2006-04-17 Thread TDLN
I disagree that it should be difficult to stay up to date with the main codeline if you have a lot of local changes. You can put your code under local version control in subversion and then use the process described in the Vendor branches chapter of the subversion book (found here:

Re: Nutch 500 Error

2006-04-06 Thread TDLN
My guess is you have to override the searcher.dir property in nutch-site.xml and have it point to your crawl dir. Rgrds, Thomas On 4/5/06, Paul Stewart [EMAIL PROTECTED] wrote: Hi there... I was having a number of problems with my install, mainly because I'm not used to Tomcat and/or Nutch

.classpath and .project for 0.8

2006-04-06 Thread TDLN
I am (finally) moving my installation to 0.8-dev. Now I was wondering if one of the developers could post their .classpath and .project eclipse settings files. I have seen those files being posted for 0.7, so I thought I might as well ask. Rgrds, Thomas

RuntimeException running Generator

2006-04-06 Thread TDLN
nutch-users - both in the whole web and intranet scenario's, I am now getting 060406 154710 Generator: Partitioning selected urls by host, for politeness. 060406 154710 parsing jar:file:/home/tdelnoij/dev/sandbox/nutch-0.8-dev/lib/hadoop-0.1.0.jar!/hadoop-default.xml 060406 154710 parsing

Re: RuntimeException running Generator

2006-04-06 Thread TDLN
Oops, this one seems to have been fixed already: http://mail-archive.com/nutch-user%40lucene.apache.org/msg04130.html I will give it a shot with the last nightly build. Rgrds, Thomas On 4/6/06, TDLN [EMAIL PROTECTED] wrote: nutch-users - both in the whole web and intranet scenario's, I am

Re: .classpath and .project for 0.8

2006-04-06 Thread TDLN
for the plugins folder under the build and will load all necessary plugins from there. Dennis -Original Message- From: TDLN [mailto:[EMAIL PROTECTED] Sent: Thursday, April 06, 2006 7:51 AM To: nutch-user@lucene.apache.org Subject: .classpath and .project for 0.8 I am (finally) moving my

Re: .classpath and .project for 0.8

2006-04-06 Thread TDLN
That's it, thanks again! On 4/6/06, Dennis Kubes [EMAIL PROTECTED] wrote: Here they are zipped up. -Original Message- From: TDLN [mailto:[EMAIL PROTECTED] Sent: Thursday, April 06, 2006 11:44 AM To: nutch-user@lucene.apache.org Subject: Re: .classpath and .project for 0.8 Thanks

Re: Crawling a file but not indexing it

2006-04-05 Thread TDLN
. parse.getData().get(index) to get the meta-data value for index. What am I missing? Thanks for the pointers! Ben On 4/3/06, TDLN [EMAIL PROTECTED] wrote: It depends if you control the seed pages or not; if you do, you could tag them index=no and skip them during indexing. You would

Re: Crawl status

2006-04-05 Thread TDLN
How can I have the status of the crawl process ? In general this should be apparent from the crawl log. - number of fetched pages is printed to the logs at certain intervals (also number of pages/sec etc.) - number of indexed pages if you use the crawl tool, indexing is done after all pages

Re: Crawling a file but not indexing it

2006-04-03 Thread TDLN
It depends if you control the seed pages or not; if you do, you could tag them index=no and skip them during indexing. You would have to change HtmlParser and BasicIndexingFilter. Rgrds, Thomas On 4/4/06, Benjamin Higgins [EMAIL PROTECTED] wrote: Hello, I've gone through the documentation
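The "tag them index=no and skip them during indexing" idea can be illustrated without any Nutch classes. This is a sketch under assumptions, not the HtmlParser or BasicIndexingFilter change itself: the class IndexTagCheck, the method shouldIndex, and the use of a plain Map to stand in for a page's parse metadata are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class IndexTagCheck {
    /**
     * Returns false for pages whose metadata carries index=no, true otherwise.
     * A missing "index" key means the page is indexed as usual.
     */
    static boolean shouldIndex(Map<String, String> metadata) {
        return !"no".equalsIgnoreCase(metadata.get("index"));
    }

    public static void main(String[] args) {
        Map<String, String> seedPage = new HashMap<>();
        seedPage.put("index", "no");              // tag set on a seed page
        Map<String, String> normalPage = new HashMap<>();

        System.out.println("seed page indexed?   " + shouldIndex(seedPage));
        System.out.println("normal page indexed? " + shouldIndex(normalPage));
    }
}
```

In a real setup the parser would have to copy the tag into the metadata and the indexing step would have to consult it, which is why the reply mentions changing both components.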

Re: Log Analysis

2006-04-01 Thread TDLN
I am also interested in this. Until now I didn't find any good OS tools for this purpose, just this one: www.splunk.com. Rgrds, Thomas On 3/31/06, Vanderdray, Jacob [EMAIL PROTECTED] wrote: What open source tools do people like for analyzing nutch search log files? I'm specifically

Re: Nutch 0.7.2 release

2006-04-01 Thread TDLN
Yes! This is great news, thank you so much. By the way: in the revision of the release notes that you posted (292986), the changes for 0.7.2 are missing. Rgrds, Thomas On 4/1/06, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Hello all, The 0.7.2 release of Nutch is now available. This is a bug

Re: Nutch 0.7.2 release

2006-04-01 Thread TDLN
Is this the correct revision of the release notes? http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/branch-0.7/CHANGES.txt?rev=390158 Rgrds, Thomas On 4/1/06, TDLN [EMAIL PROTECTED] wrote: Yes! This is great news, thank you so much. By the way: in the revision of the release notes

Re: Legal issues

2006-03-30 Thread TDLN
Google's and Yahoo's Terms of Service provide interesting reading regarding such legal issues. http://www.google.com/terms_of_service.html http://docs.yahoo.com/info/terms/ Rgrds, Thomas On 3/30/06, gekkokid [EMAIL PROTECTED] wrote: Shouldn't be a problem if you're honouring the robots.txt

Re: Legal issues

2006-03-30 Thread TDLN
passwords, and honor robots.txt and they post it on the web, it is considered public in that regard. I am not a lawyer, check groklaw. r/d -Original Message- From: TDLN [mailto:[EMAIL PROTECTED] Sent: Thursday, March 30, 2006 3:34 AM To: nutch-user@lucene.apache.org Subject: Re

Re: Getting contents of crawled pages by URL

2006-03-29 Thread TDLN
Wojtek, those commands apply to 0.7.1 (the version I am still working with). For 0.8 I think you can use 'nutch readdb' and 'nutch readlinkdb'. How to get the Content by URL, I don't know, but it should be possible somehow on 0.8. Rgrds, Thomas On 3/27/06, TDLN [EMAIL PROTECTED] wrote

Re: Getting contents of crawled pages by URL

2006-03-27 Thread TDLN
Wojchiech, 1. list of crawled pages There's the 'nutch admin' command: java org.apache.nutch.tools.WebDBAdminTool (-local | -ndfs namenode:port) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k] Using '-textdump' will dump the contents of the WebDB to a text file. Then there is the

Re: Problem with nutch-0.7.1.tar.gz

2006-03-24 Thread TDLN
Just create the directory '/home/scott/downloads/nutch-0.7.1/src/plugin/nutch-extensionpoints/src/java' and run ant again. Rgrds, Thomas On 3/24/06, keren nutch [EMAIL PROTECTED] wrote: Hi, I extracted tar -xf nutch-0.7.1.tar.gz and got the info tar: A lone zero block at 132784 When I

Re: Searching only a whitelist (country specific SE)

2006-03-20 Thread TDLN
PrefixURLFilter should consume less RAM than the hashmap presumably underlying your cache, while still delivering similar lookup speed. But perhaps I'm wrong?) --Matt On Mar 19, 2006, at 1:09 PM, TDLN wrote: I agree with you. That was a bold statement, not necessarily backed up by any hard

Re: Search Time Taken

2006-03-20 Thread TDLN
I don't think there is a plugin that does that. If you're using the OpenSearchServlet, you could create a ServletFilter that intercepts the requests and calculates the time it takes to perform a search. Maybe others have more creative ideas Rgrds, Thomas On 3/20/06, Edward Quick [EMAIL
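The core of the suggestion is to take a timestamp before the search runs and another one after it returns. The sketch below shows only that measurement in plain Java, without the servlet API: SearchTimer, Timed and time are hypothetical names, and the lambda in main is a stand-in for whatever call actually performs the search. In a servlet filter the same two timestamps would wrap the call that passes the request down the filter chain.

```java
import java.util.function.Supplier;

class SearchTimer {
    /** A result together with the wall-clock time it took to produce. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the given search and records how long it took. */
    static <T> Timed<T> time(Supplier<T> search) {
        long start = System.nanoTime();           // monotonic clock, safe for intervals
        T value = search.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a real search call.
        Timed<String> t = time(() -> "hits for 'nutch'");
        System.out.println(t.value + " took " + t.elapsedMillis + " ms");
    }
}
```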

Re: Searching only a whitelist (country specific SE)

2006-03-19 Thread TDLN
There's the DBUrlFilter as well, that stores the Whitelist in the database: http://issues.apache.org/jira/browse/NUTCH-100 It performs better than the PrefixURLFilter and also makes the management of the list more easy. Rgrds, Thomas On 3/15/06, Matt Kangas [EMAIL PROTECTED] wrote: For a
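For readers unfamiliar with whitelist filtering, the basic operation both filters perform is "does this URL start with one of the allowed prefixes?". The sketch below is a deliberately simple illustration and makes no claim about how PrefixURLFilter or the DBUrlFilter are implemented: the class PrefixWhitelist and its methods are hypothetical, and the whitelist lives in an in-memory HashSet rather than a database.

```java
import java.util.HashSet;
import java.util.Set;

class PrefixWhitelist {
    private final Set<String> prefixes = new HashSet<>();

    void add(String prefix) {
        prefixes.add(prefix);
    }

    /** True if any whitelisted prefix is a leading substring of the URL. */
    boolean accepts(String url) {
        // Try every leading substring of the URL; each lookup is a hash probe.
        for (int end = 1; end <= url.length(); end++) {
            if (prefixes.contains(url.substring(0, end))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PrefixWhitelist whitelist = new PrefixWhitelist();
        whitelist.add("http://www.example.nl/");
        whitelist.add("http://news.example.nl/");

        System.out.println(whitelist.accepts("http://www.example.nl/page.html"));
        System.out.println(whitelist.accepts("http://www.example.com/page.html"));
    }
}
```

The cost per URL here grows with the URL length, not with the number of whitelisted prefixes, which is the property that matters when the list holds many thousands of domains.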

Re: Searching only a whitelist (country specific SE)

2006-03-19 Thread TDLN
explain? --Matt On Mar 19, 2006, at 3:13 AM, TDLN wrote: There's the DBUrlFilter as well, that stores the Whitelist in the database: http://issues.apache.org/jira/browse/NUTCH-100 It performs better than the PrefixURLFilter and also makes the management of the list more easy

Re: FW: about nutch

2006-03-14 Thread TDLN
I can only speak for myself, but I would need the output from the different Nutch commands to analyse this problem. Rgrds. Thomas On 3/13/06, Richard Braman [EMAIL PROTECTED] wrote: -Original Message- From: Alen [mailto:[EMAIL PROTECTED] Sent: Monday, March 13, 2006 1:42 AM To:

Re: Problems

2006-03-14 Thread TDLN
Unfortunately, in the 0.7 release, the NutchBean does not clean up properly after itself, so some SegmentReaders and IndexReaders remain open. I think this is fixed in the current code line. I had similar problems in my app based on 0.7 - all that helped was killing the processes blocking the

Re: Problems

2006-03-14 Thread TDLN
/14/06, Laurent Michenaud [EMAIL PROTECTED] wrote: It would be interesting to have a fix for 0.7 -Original Message- From: TDLN [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 14, 2006 12:32 To: nutch-user@lucene.apache.org Subject: Re: Problems Unfortunately, in the 0.7 release

Re: writing a metadata content tag:use case example

2006-03-10 Thread TDLN
Richard. So would I do something like 1. parse out the citation 2. metadata.put(citation, citation); Yes, I think that is the way to proceed. And then on to implementing the Indexing and Query Filters, all as described in the WritingPlugin tutorial:

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread TDLN
You can start here http://wiki.apache.org/nutch/NutchDistributedFileSystem Also, I think there have been several posts in the mailing list that contain such a step-by-step overview. Rgrds, Thomas On 3/8/06, Olive g [EMAIL PROTECTED] wrote: Hi I am new here. Could someone please let me know

Re: help - distributed crawl in 0.7.1

2006-03-08 Thread TDLN
Detailed distributed crawl implementation: http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02270.html I am not sure it applies to 0.7 though, but it has a lot of info. Rgrds, Thomas

Re: project vitality?

2006-03-06 Thread TDLN
Stefan. I know people having 500 mio pages index and I personal run crawls with ~300 pages per second. Sorry, but I have to ask: what kind of setup do you have (network, hw, nutch version) that you manage so many pages per second? Unless this is a company secret, it would be very nice to know

Re: urlfilter-db plugin usage...

2006-02-28 Thread TDLN
You need to do both: seed the WebDB with the 14k urls extracted from the dmoz content file AND filter newly found urls against the urls in the mysql database using the urlfilter-db. This is significantly faster than adding the 14k urls to the regex-urlfilter.txt file and checking against that.

Re: meta in search query string

2006-02-24 Thread TDLN
for the fields you're interested in. Instead of fields=DEFAULT in the example, you'll want raw-fields=language and raw-fields=category. Assuming you name the fields language and category when you add them to the index. Jake. -Original Message- From: TDLN [mailto:[EMAIL PROTECTED] Sent

Re: meta in search query string

2006-02-23 Thread TDLN
You can follow the tutorial at http://wiki.apache.org/nutch/WritingPluginExample. Just replace recommended with category, and it will show you what to do. (I just implemented a category filter this way ...) Rgrds, T. On 2/23/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi, I have added on

Re: Date first indexed

2006-02-17 Thread TDLN
I am still using 0.7.1 - I think the CrawlDatum.setMetaData is only part of the trunk. Is it not possible to just hack the MoreIndexingFilter and calculate the date_indexed field there (similar to how the lastModified field is calculated), and add a DateIndexedQueryFilter to the

Re: Date first indexed

2006-02-17 Thread TDLN
otherwise would be lost, right? Rgrds, Thomas On 2/17/06, TDLN [EMAIL PROTECTED] wrote: I am still using 0.7.1 - I think the CrawlDatum.setMetaData is only part of the trunk. Is it not possible to just hack the MoreIndexingFilter and calculate the date_indexed field there (similar to how