Re: Sorting in nutch-webinterface - how?

2006-05-26 Thread Doug Cutting
Stefan Neufeind wrote: Can you maybe also help me out with sort=title? Lucene's sorting works with indexed, non-tokenized fields. The title field is tokenized. If you need to sort by title then you'd need to add a plugin that indexes another field (e.g., sortTitle) containing the un-tokenized
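
A minimal sketch of such an indexing filter's core, assuming the Lucene 1.9-era Field API that Nutch 0.8 uses (the sortTitle field name comes from the message; everything else here is illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Inside an indexing-filter plugin: add an un-tokenized (hence sortable)
    // copy of the title. Lower-casing gives case-insensitive sort order.
    String title = doc.get("title");
    if (title != null) {
      doc.add(new Field("sortTitle", title.toLowerCase(),
                        Field.Store.NO, Field.Index.UN_TOKENIZED));
    }

The field need not be stored: Lucene sorts from the index, so Field.Store.NO keeps the index smaller.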

Re: .job file?

2006-05-26 Thread Doug Cutting
The .job file is a jar file for submission to Hadoop's MapReduce. It is Hadoop-specific, although very similar to war and ear files. Teruhiko Kurosaka wrote: Nutch's top-level build.xml file's default target is job, and it builds a zip file called nutch-0.8-dev.job. <project name="Nutch"

0.8 release soon?

2006-05-26 Thread Doug Cutting
Andrzej Bialecki wrote: 0.8 is pretty stable now, I think we should start considering a release soon, within the next month's time frame. +1 Are there substantial features still missing from 0.8 that were supported in 0.7? Are there any showstopping bugs, things that worked in 0.7 that are

Re: Can't access nightly build nutch 0.8

2006-05-11 Thread Doug Cutting
The nightly build is not mirrored. It is only available from cvs.apache.org, which has been down, but is now up. http://cvs.apache.org/dist/lucene/nutch/nightly/ Note that no nightly build was done last night, since Subversion was down. Doug Michael Plax wrote: I tried randomly some of

Re: MultiSearcher skewed IDF values

2006-04-28 Thread Doug Cutting
Andrzej Bialecki wrote: Unfortunately, this is still an existing problem, and neither Nutch nor Lucene does the right job here. Please see NUTCH-92 for more information, and a sketch of solution for this issue. Lucene's MultiSearcher now implements this correctly, no? But Nutch's

Re: Problem with sorting index

2006-04-28 Thread Doug Cutting
It sounds like you're sorting a segment index after dedup, rather than a merged index. It also looks like there's a bug in IndexSorter. But you should be able to work around it by merging your segment indexes after deduping, so there are no deletions. Please file a bug in Jira. Doug

Re: Admin Gui beta test (was Re: ATB: Heritrix)

2006-04-28 Thread Doug Cutting
Andrzej Bialecki wrote: I think it should be possible to put your binary at the Apache site, probably Doug will be the right person to talk to ... Have you tried attaching it to a Jira issue? If that fails, you could attach it to a page on the Wiki, no? Doug

Re: java.io.IOException: No input directories specified in

2006-04-26 Thread Doug Cutting
Chris Fellows wrote: I'm having what appears to be the same issue on 0.8 trunk. I can get through inject, generate, fetch and updatedb, but am getting the IOException: No input directories on invertlinks and cannot figure out why. I'm only using nutch on a single local windows machine. Any

Re: How to get Text and Parse data for URL

2006-04-25 Thread Doug Cutting
NutchBean.getContent() and NutchBean.getParseData() do this, but require a HitDetails instance. In the non-distributed case, the only required field of the HitDetails for these calls is url. In the distributed case, the segment field must also be provided, so that the request can be routed
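
A rough sketch of the non-distributed case. The two accessor names come from the message itself, but the HitDetails constructor arguments, return types, and configuration setup shown here are assumptions about the 0.8-era API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.searcher.HitDetails;
    import org.apache.nutch.searcher.NutchBean;
    import org.apache.nutch.util.NutchConfiguration;

    // Look up cached content and parse data for a known URL.
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    HitDetails details = new HitDetails(new String[] { "url" },
                                        new String[] { "http://example.com/" });
    byte[] content = bean.getContent(details);        // raw fetched content
    ParseData parseData = bean.getParseData(details); // outlinks, metadata, etc.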

Re: How to get Text and Parse data for URL

2006-04-25 Thread Doug Cutting
Dennis Kubes wrote: I think that I am not fully understanding the role the segments directory and its contents play. A segment is simply a set of urls fetched in the same round, and data associated with these urls. The content subdirectory contains the raw http content. The parse-text

Re: java.io.IOException: Cannot create file

2006-04-20 Thread Doug Cutting
[EMAIL PROTECTED] wrote: First question. Updatedb won't run against the segment, so what can I do to salvage it? Is the segment salvageable? Probably. I think you're hitting some current bugs in DFS and MapReduce. Once these are fixed, then your updatedbs should succeed! Second question,

Re: java.io.IOException: Cannot create file

2006-04-20 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Actually, I think that updatedb won't run because the fetched segment didn't complete correctly. Don't know whether the instructions in the 0.7 FAQ apply: %touch /index/segments/2005somesegment/fetcher.done Ah. That's different. No, the 0.7 trick probably won't

Re: Using Nutch's distributed search server mode

2006-04-20 Thread Doug Cutting
Scott Simpson wrote: I don't quite understand how to set up distributed searching with relation to DFS (and the Tom White documents don't discuss this either). There are three databases with relation to Nutch: 1. Web database (dfs) 2. Segments (regular fs) 3. The index (regular fs) From your

Re: nutch user meeting in San Francisco: May 18th

2006-04-20 Thread Doug Cutting
Folks can say whether they'll attend at: http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1 Doug

Re: Using Nutch's distributed search server mode

2006-04-17 Thread Doug Cutting
Shawn Gervais wrote: I was not able to use the literal instructions, as my indexes and segments are in DFS while the document presumes a local filesystem installation. Search performance is not good with DFS-based indexes and segments. This is not recommended. Distributed search is not meant

Re: java.net.SocketTimeoutException: Read timed out

2006-04-12 Thread Doug Cutting
Elwin wrote: When I use the httpclient.HttpResponse to get http content in nutch, I often get SocketTimeoutExceptions. Can I solve this problem by enlarging the value of http.timeout in conf file? Perhaps, if you're working with slow sites. But, more likely, you're using too many fetcher
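
If slow sites really are the cause, the timeout can be raised in conf/nutch-site.xml. The property name comes from the message; the value (in milliseconds) is just an example:

    <property>
      <name>http.timeout</name>
      <value>30000</value>
      <description>Network timeout for fetches, in milliseconds.</description>
    </property>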

Re: Question about crawldb and segments

2006-04-12 Thread Doug Cutting
Jason Camp wrote: Unfortunately in our scenario, bandwidth is cheap at our fetching datacenter, but adding additional disk capacity is expensive - so we are fetching the data and sending it back to another cluster (by exporting segments from ndfs, copy, importing). But to perform the copies, you're

Re: plugins directory

2006-04-12 Thread Doug Cutting
mikeyc wrote: Any idea how the 'plugins' directory gets populated? I noticed microformats-hreview was not there. It does exist in the build directory with its jar and class files. Could this be the issue? The plugins directory exists in release builds. When developing, plugins live in

Re: How best to debug failed fetch-reduce task

2006-04-12 Thread Doug Cutting
Shawn Gervais wrote: When I have been at the terminal to observe the timed out process before it is reaped, I have seen that it continues to use 100% of a single processor. strace of the java process did not produce any usable leads. When the reduce task is reassigned, either to the same

Re: When Nutch fetches using mapred ...

2006-04-10 Thread Doug Cutting
Shawn Gervais wrote: When I perform a search large enough to observe the fetch process for an extended period of time (1M pages over 16 nodes, in this case), I notice there is one map task which performs _very_ poorly compared to the others: 4905 pages, 33094 errors, 3.5 pages/s, 432 kb/s,

Re: lost NDFS blocks following network reorg

2006-03-26 Thread Doug Cutting
Ken Krugler wrote: Anyway, curious if anybody has insights here. We've done a fair amount of poking around, to no avail. I don't think there's any way to get the blocks back, as they definitely seem to be gone, and file recovery on Linux seems pretty iffy. I'm mostly interested in figuring out

Re: How to terminate the crawl?

2006-03-21 Thread Doug Cutting
You can limit the number of pages by using the -topN parameter. This limits the number of pages fetched in each round. Pages are prioritized by how well-linked they are. The maximum number of pages that can be fetched is topN*depth. Doug Olena Medelyan wrote: Hi, I'm using the crawl
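
For example, with the one-step crawl command (paths illustrative):

    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
    # at most 3 rounds x 1000 pages/round = 3000 pages fetched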

Re: Delete Files from NDFS

2006-03-21 Thread Doug Cutting
Blocks are not deleted immediately. Check back in a while to see that they're actually removed. Doug Dennis Kubes wrote: Is there a way to delete files from the DFS? I used the dfs -rm option, but the data blocks still are there. Dennis
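
Assuming the 0.8-era dfs shell (the path is illustrative), the file leaves the namespace at once while its blocks are reclaimed later:

    bin/hadoop dfs -rm /user/crawl/old-file   # gone from the listing immediately
    bin/hadoop dfs -ls /user/crawl            # ...but the datanodes delete the
                                              # underlying blocks asynchronously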

Re: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Doug Cutting
Dennis Kubes wrote: Here it is for the list, I will try to put it on the wiki as well. Thanks for writing this! I've added a few comments below. Some things are assumed for this tutorial. First, you will need root level access to all of the boxes you are deploying to. Root access should

Re: Help Setting Up Nutch 0.8 Distributed

2006-03-16 Thread Doug Cutting
Dennis Kubes wrote: localhost:9000: command-line: line 0: Bad configuration option: ConnectTimeout devcluster02:9000: command-line: line 0: Bad configuration option: ConnectTimeout [ ... ] localhost:9000: command-line: line 0: Bad configuration option: ConnectTimeout devcluster02:9000:

Re: Help Setting Up Nutch 0.8 Distributed

2006-03-16 Thread Doug Cutting
Dennis Kubes wrote: : command not foundlaves.sh: line 29: : command not foundlaves.sh: line 32: localhost: ssh: \015: Name or service not known devcluster02: ssh: \015: Name or service not known And still getting this error: 060316 175355 parsing file:/nutch/search/conf/hadoop-site.xml

Re: javascript in summaries [nutch-0.7.1]

2006-03-15 Thread Doug Cutting
Jérôme Charron wrote: I reproduce this with nutch-0.8 with neko html parser (it seems that script tags are not removed). You can switch the html parser implementation to tagsoup. In my tests, all is ok. (property parser.html.impl) Should we switch the default from neko to tagsoup? Are there
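
The switch is a one-property change in conf/nutch-site.xml:

    <property>
      <name>parser.html.impl</name>
      <value>tagsoup</value>
      <description>HTML parser implementation: neko or tagsoup.</description>
    </property>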

Re: Question on scalability

2006-03-15 Thread Doug Cutting
Olive g wrote: Is hadoop/nutch scalable at all, or can I tune some other parameters? I'm not sure what you're asking. How long does it take to run this on a single machine? My guess is that it's much longer. So things are scaling: they're running faster when more hardware is added. In all

Re: Boolean OR QueryFilter

2006-03-15 Thread Doug Cutting
relevant posts in the mailing list archive, but I think I'm missing something. For example, here's a snippet from a post from Doug Cutting: snip that said, one can implement OR as a filter (replacing or altering BasicQueryFilter) that scans for terms whose text is OR in the default field. /snip

Re: Site: invalid Jira link

2006-03-15 Thread Doug Cutting
I just fixed this. Thanks, Doug ArentJan Banck wrote: on: http://lucene.apache.org/nutch/issue_tracking.html http://nagoya.apache.org/jira/browse/Nutch no longer works. Should be: http://issues.apache.org/jira/browse/Nutch - Arent-Jan

Re: Adaptive Refetching

2006-03-08 Thread Doug Cutting
Andrzej Bialecki wrote: What I infer is, 1. For every refetch, the score of files (but not the directory) is increasing. This is curious; it should not be so. However, it's the same in the vanilla version of Nutch (without this patch), so we'll address this separately. The OPIC

Re: Boolean OR QueryFilter

2006-03-08 Thread Doug Cutting
David Odmark wrote: So am I correct in believing that in order to implement boolean OR using Nutch search and a QueryFilter, one must also (minimally) hack the NutchAnalysis.jj file to produce a new analyzer? Also, given that a Nutch Query object doesn't seem to have a method to add a

Re: Adaptive Refetching

2006-03-08 Thread Doug Cutting
Andrzej Bialecki wrote: Doug Cutting wrote: are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links

Re: .8 svn - fetcher performance..

2006-03-07 Thread Doug Cutting
Byron Miller wrote: Anything I should change/tweak on my fetcher config for the .8 release? I'm only getting 5 pages/sec and I was getting nearly 50 on .7 with 125 threads going. Does .8 not use threads like .7 did? Byron, Have you tried again more recently? A number of bugs have been fixed in

Re: Problems with hadoop

2006-03-07 Thread Doug Cutting
Jon Blower wrote: My guess is that the source program is not available on your version of FreeBSD. Try running the source program (with no arguments) from the command line or type man source. Do you see anything? If not, you probably don't have the source program, which is called by the

Re: retry later

2006-03-07 Thread Doug Cutting
Richard Braman wrote: when you get an error while fetching, and you get the org.apache.nutch.protocol.retrylater because the max retries have been reached, nutch says it has given up and will retry later, when does that retry occur? How would you make a fetchlist of all urls that have failed?

Re: Tutorial on the Wiki

2006-03-07 Thread Doug Cutting
Vanderdray, Jacob wrote: I've changed the language a bit. If you're interested, take a look: http://wiki.apache.org/nutch/NutchTutorial This looks great! Thanks so much for adding this to the wiki! We might add something to the Step-by-Step introduction to the effect that: This

Re: still not so clear to me

2006-03-07 Thread Doug Cutting
Richard Braman wrote: Can someone confirm this: You start a crawldb from a list of urls and you generate a fetch list, which is akin to seeding your crawldb. When you fetch, it just fetches those seed urls. When you do your next round of generate/fetch/update, the fetch list will have the

Re: project vitality?

2006-03-06 Thread Doug Cutting
Richard Braman wrote: I really do think nutch is great, but I echo Matthias's comments that the community needs to come together and contribute more back. And that comes with the requirement of making sure volunteers are given access to make their contributions part of the project. Here's how

Re: project vitality?

2006-03-06 Thread Doug Cutting
David Wallace wrote: Also, I've lost count of the number of times someone has posted something to the effect of I'll pay someone to give me Nutch support, simply because they find the existing documentation and mailing lists inadequate. Usually, that person gets told that the best way to get

Re: issues w/ new nutch versions

2006-03-06 Thread Doug Cutting
Florent Gluck wrote: In hadoop jobtracker's log, I can see several tasks being lost as follows: 060306 184155 Aborting job job_hyhtho 060306 184156 Task 'task_m_7qgat2' has been lost. 060306 184156 Aborting job job_hyhtho 060306 184156 Task 'task_m_lph5qs' has been lost. 060306 184156 Aborting

Re: Help with bin/nutch server 8081 crawl

2006-03-06 Thread Doug Cutting
Monu Ogbe wrote: Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query at java.lang.Class.newInstance0(Unknown Source) at java.lang.Class.newInstance(Unknown Source) at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.jav It

Re: Moving tutorial link to wiki

2006-03-06 Thread Doug Cutting
Matthias Jaekle wrote: Maybe we should move the tutorial to the wiki so it can be commented on. +1 +1 Doug

Re: exception during fetch using hadoop

2006-02-24 Thread Doug Cutting
It looks like the child JVM is silently exiting. The error reading child output just shows that the child's standard output has been closed, and the child error says the JVM exited with non-zero. Perhaps you can get a core dump by setting 'ulimit -c' to something big. JVM core dumps can
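
For example, something like this before starting the daemons, so the child JVMs inherit the limit (a sketch):

    ulimit -c unlimited    # allow core files of any size in this shell
    bin/start-all.sh       # restart so the tasktracker children inherit it
    # after a crash, look for a 'core' file in the task's working directory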

Re: url: search fail

2006-02-24 Thread Doug Cutting
0.7 and 0.8 are not compatible. You need to re-crawl. Sorry! Once we have a 1.0 release then we'll make sure things are back-compatible. Doug Martin Gutbrod wrote: I changed from 0.7.1 to one of the latest nightly builds (0.8) and now searches on url: fields fail. E.g. [ url:my.domain.com ]

Re: Link to Search Interface for List

2006-02-16 Thread Doug Cutting
Vanderdray, Jacob wrote: I get the same thing from my linux box. The only reference I can find to linkmap.html is a commented out line in forrest.properties. FWIW: I've already made the changes to my copy of mailing_lists.xml. Let me know if you want me to just send someone that.

Re: Problem/bug setting java_home in hadoop nightly 16.02.06

2006-02-16 Thread Doug Cutting
Have you edited conf/hadoop-env.sh, and defined JAVA_HOME there? Doug Håvard W. Kongsgård wrote: I am unable to set JAVA_HOME in bin/hadoop; is there a bug? I have used nutch 0.7.1 with the same java path. localhost: Error: JAVA_HOME is not set. if [ -f $HADOOP_HOME/conf/hadoop-env.sh ];
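
The relevant line in conf/hadoop-env.sh; the JDK path is just an example:

    # conf/hadoop-env.sh
    export JAVA_HOME=/usr/lib/j2sdk1.5-sun   # point at your local JDK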

Re: The latest svn version is not stable

2006-02-10 Thread Doug Cutting
Rafit Izhak_Ratzin wrote: I just checked out the latest svn version (376446) and built it from scratch. When I tried to run the jobtracker I got the following message in the jobtracker log file: 060209 164707 Property 'sun.cpu.isalist' is Exception in thread main java.lang.NullPointerException

Re: nutch inject problem with hadoop

2006-02-10 Thread Doug Cutting
Michael Nebel wrote: I upgraded to the latest version from the svn today. After some nuts-and-bolts fixes (missing hadoop-site.xml, webapps dir). I just fixed these issues. I finally tried to inject a new set of urls. Doing so, I get the exception below. I am not seeing this. Are you

Re: nutch inject problem with hadoop

2006-02-10 Thread Doug Cutting
Michael Nebel wrote: Now it's complaining about a missing class org/apache/nutch/util/LogFormatter :-( That's been moved to Hadoop: org.apache.hadoop.util.LogFormatter. Doug
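
So code that still compiles against the old class just needs its import updated:

    // before (Nutch 0.7)
    import org.apache.nutch.util.LogFormatter;
    // after (0.8-dev, post-Hadoop split)
    import org.apache.hadoop.util.LogFormatter;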

Re: hadoop-default.xml

2006-02-07 Thread Doug Cutting
The file packaged in the jar is used for the defaults. It is read from the jar file. So it should not need to be committed to Nutch. Mike Smith wrote: There is no setting file for Hadoop in conf/. Should it be hadoop-default.xml? It seems this file is not committed but it is packaged into

Re: Recovering from Socket closed

2006-01-31 Thread Doug Cutting
Chris Schneider wrote: Also, since we've been running this crawl for quite some time, we'd like to preserve the segment data if at all possible. Could someone please recommend a way to recover as gracefully as possible from this condition? The Crawl .main process died with the following

Re: Parsing PDF Nutch Achilles heel?

2006-01-25 Thread Doug Cutting
Steve Betts wrote: I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster, but it does allow it to complete. I find xpdf much faster than PDFBox. http://www.mail-archive.com/nutch-dev@incubator.apache.org/msg00161.html Does this work any better for you? Doug

Re: How do I control log level with MapReduce?

2006-01-19 Thread Doug Cutting
Chris Schneider wrote: I'm trying to bring up a MapReduce system, but am confused about how to control the logging level. It seems like most of the Nutch code is still logging the way it used to, but the -logLevel parameter that was getting passed to each tool's main() method no longer exists

Re: Can't index some pages

2006-01-19 Thread Doug Cutting
Michael Plax wrote: Question summary: Q: How can I set up the crawler in order to index all of a web site? I'm trying to run crawl with the command from the tutorial. 1. In the urls file I have the start page (index.html). 2. In the configuration file conf/crawl-urlfilter.txt the domain was changed. 3. I run: $ bin/nutch

Re: So many Unfetched Pages using MapReduce

2006-01-19 Thread Doug Cutting
Florent Gluck wrote: I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient With the old protocol I got 5 as expected. There have been a number of complaints about unreliable fetching with protocol-httpclient, so
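
The swap is made in the plugin.includes property; a typical 0.8-era value looks roughly like this, though the exact plugin list depends on your setup:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      <description>protocol-http substituted for protocol-httpclient.</description>
    </property>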

Re: Error at end of MapReduce run with indexing

2006-01-19 Thread Doug Cutting
Matt Zytaruk wrote: I am having this same problem during the reduce phase of fetching, and am now seeing: 060119 132458 Task task_r_obwceh timed out. Killing. That is a different problem: a different timeout. This happens when a task does not report status for too long; it is then assumed

Re: Can't index some pages

2006-01-19 Thread Doug Cutting
Matt Kangas wrote: Doug, would it make sense to print a LOG.info() message every time the fetcher bumps into one of these db.max limits? This would help users find out when they need to adjust their configuration. I can prepare a patch if it seems sensible. Sure, this is sensible. But it's
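
The kind of patch being discussed is just a guarded log line at the point where the cap is applied; the variable names below are illustrative, with the limit read from db.max.outlinks.per.page:

    // Illustrative sketch, not the actual patch:
    if (outlinks.length > maxOutlinks) {
      LOG.info("db.max.outlinks.per.page (" + maxOutlinks + ") reached for "
               + url + "; dropping " + (outlinks.length - maxOutlinks) + " links");
    }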

Re: large filter file, time to update db

2006-01-12 Thread Doug Cutting
Insurance Squared Inc. wrote: I'm trying to determine if there's a better way to whitelist a large number of domains than just adding them as a regular expression in the filter. Have a look at the urlfilter-prefix plugin. This is more efficient for filtering urls by a large list of domains.
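
A sketch of using it, with one accepted prefix per line; the file name and the property that points at it follow the era's conventions but should be checked against your nutch-default.xml:

    # conf/prefix-urlfilter.txt
    http://www.example.com/
    http://docs.example.org/

    <property>
      <name>urlfilter.prefix.file</name>
      <value>prefix-urlfilter.txt</value>
    </property>

The urlfilter-prefix plugin also has to be enabled in plugin.includes, in place of (or alongside) urlfilter-regex.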

Re: Full Range of Results Not Showing

2006-01-11 Thread Doug Cutting
Neal Whitley wrote: Now here's another question. How can I obtain the exact number of search results being displayed on the screen? I have been fishing around and cannot find a variable being output to the page with this data. In my example below 81 total matches were found. But because of the

Re: Is any one able to successfully run Distributed Crawl?

2006-01-09 Thread Doug Cutting
Pushpesh Kr. Rajwanshi wrote: Just wanted to confirm that this distributed crawl you did using nutch version 0.7.1 or some other version? And was that a successful distributed crawl using map reduce or some work around for distributed crawl? No, this is 0.8-dev. This was using in early

Re: Multi CPU support

2006-01-09 Thread Doug Cutting
Teruhiko Kurosaka wrote: Can I use MapReduce to run Nutch on a multi CPU system? Yes. I want to run the index job on two (or four) CPUs on a single system. I'm not trying to distribute the job over multiple systems. If the MapReduce is the way to go, do I just specify config parameters
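
On a single multi-CPU box you run the jobtracker and one tasktracker locally and raise the task counts. The property names are from the 0.8-era mapred config; the values are examples:

    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>   <!-- roughly one per CPU -->
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
    </property>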

Re: Multiple anchors on same site - what's better than making these unique?

2006-01-05 Thread Doug Cutting
David Wallace wrote: I've been grubbing around with Nutch for a while now, although I'm still working with 0.7 code. I notice that when anchors are collected for a document, they're made unique by domain and by anchor text. Note that this is only done when collecting anchor texts, not when

Re: Is any one able to successfully run Distributed Crawl?

2006-01-04 Thread Doug Cutting
Earl Cahill wrote: Any chance you could walk through your implementation? Like how the twenty boxes were assigned? Maybe upload your confs somewhere, and outline what commands you actually ran? All 20 boxes are configured identically, running Debian with a 2.4 kernel. These are dual-processor

Re: Is any one able to successfully run Distributed Crawl?

2006-01-02 Thread Doug Cutting
Pushpesh Kr. Rajwanshi wrote: I want to know if anyone is able to successfully run distributed crawl on multiple machines involving crawling millions of pages? and how hard is to do that? Do i just have to do some configuration and set up or do some implementations also? I recently performed a

Re: Linking Document scores together in a query

2005-12-12 Thread Doug Cutting
Can you please describe the higher-level problem you're trying to solve? Doug Matt Zytaruk wrote: Hello, I am trying to implement a system where to get the score for certain documents in a query, I need to average the score of two different documents for that query. Does anyone have any

Re: How to get page content given URL only?

2005-12-12 Thread Doug Cutting
Nguyen Ngoc Giang wrote: I'm writing a small program which just utilizes Nutch as a crawler only, with no search functionality. The program should be able to return page content given a URL input. In the mapred branch this is directly supported by NutchBean. Doug

Re: Incremental crawl w/ map reduce

2005-12-09 Thread Doug Cutting
Did you update the crawldb after the first fetch? The mapred crawler does not update the next-fetch date of pages when the fetch list is generated, as in 0.7. So, until that changes, you must update the crawldb before you next generate a fetch list. Doug Florent Gluck wrote: Hi, As a
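
So each round in the mapred branch looks like this (paths and the segment-name capture are illustrative):

    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s=`ls -d crawl/segments/2* | tail -1`   # newest segment
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s     # must run before the next generate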

Re: mapred branch: IOException in invertlinks (No input directories specified)

2005-12-02 Thread Doug Cutting
Florent Gluck wrote: 8. invertlinks linkdb segments/SEG_NAME This should be instead: invertlinks linkdb segments Doug

Re: Fetch Errors

2005-11-28 Thread Doug Cutting
Ben Halsted wrote: When I check the fetch status pages in the JobTracker web GUI I saw that I was getting on average more errors than pages. 95 pages, 119 errors, 1.0 pages/s, 63 kb/s Is there a way to find out what the errors are? Look in the tasktracker logs. Typically they're max delays

Re: NDFS / WebDB QUestion

2005-11-28 Thread Doug Cutting
Thomas Delnoij wrote: So, say I want to set up a machine as a DataNode that has two or more disks, do I have to configure and set up a DataNode daemon for every disk? How else could I use all disks if the ndfs.data.dir property only accepts one path (assumed I don't want to rely on MS Windows'

Re: Crawl auto updated in nutch?

2005-11-28 Thread Doug Cutting
Håvard W. Kongsgård wrote: - I want to index about 50 – 100 sites with lots of documents; is it best to use the Intranet Crawling or Whole-web Crawling method? The intranet style is simpler and hence a good place to start. If it doesn't work well for you then you might try the whole-web style.

Re: Fetcher url sorting

2005-11-22 Thread Doug Cutting
Matt Zytaruk wrote: Indeed, that does work, although that ends up slowing down the fetch a fair amount because a lot of threads end up idle, waiting, and I was hoping to avoid that slowdown if possible. What should these threads be doing? If you have a site with N pages to fetch, and you

Re: Fetcher url sorting

2005-11-22 Thread Doug Cutting
Matt Zytaruk wrote: Well, if we want to fetch pages from N different sites, ideally we should be able to have N threads running, without any of them having to wait. I guess ideally what the fetcher should probably do is instead of waiting, put the url it was trying to fetch back into the queue

Re: Merging many indexes

2005-11-22 Thread Doug Cutting
Ben Halsted wrote: I'm getting the dreaded Too many open files error. I've checked my system settings for file-max: $ cat /proc/sys/fs/file-nr 2677 1945 478412 $ cat /proc/sys/fs/file-max 478412 What does 'ulimit -n' print? Look in /etc/security/limits.conf to increase the limit. What
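
For example (the numbers are illustrative):

    $ ulimit -n      # per-process open-file limit; often 1024 by default
    1024

    # /etc/security/limits.conf -- raise it for the user running nutch
    nutch  soft  nofile  8192
    nutch  hard  nofile  16384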

Re: merging auto-crawls

2005-11-21 Thread Doug Cutting
Ben Halsted wrote: I've modified the auto-crawl to always use a pre-existing crawldb. If I run it multiple times I get multiple linkdb, segments, indexes, and index directories. Is it possible to merge the results using the bin/nutch comamnds? You should also have it use a single linkdb.

Re: Filesystem structure for the web front-end.

2005-11-21 Thread Doug Cutting
Ben Halsted wrote: I was wondering what the required file structure is for the web gui to work properly. Are all of these required? /db/crawldb /db/index /db/indexes /db/segments /db/linkdb The indexes directory is not used when a merged index is present. The crawldb and

Re: merging auto-crawls

2005-11-21 Thread Doug Cutting
Ben Halsted wrote: When I merge this stuff, do I need to merge the segments/* for each crawl into a single segments directory? Or is there data in the merged index file that will direct the web component to the correct segment? Put the segments in a single directory. The index only has the

Re: sorting on multiple fields

2005-11-21 Thread Doug Cutting
James Nelson wrote: I need to sort the search results on two fields for a project I'm working on, but nutch only seems to support sorting on one. I'm wondering if I missed something and there is actually a way or if there is a reason for restricting sort to one field that I'm not aware of.

Re: Which fields can you call via detail.getvalue(....) out of the box?

2005-11-01 Thread Doug Cutting
The explain page lists all stored fields by calling the toHtml() method of HitDetails. You can also list things with: for (int i = 0; i < detail.getLength(); i++) { String field = detail.getField(i); String value = detail.getValue(i); ... } Doug Byron Miller wrote: I'm looking to see

Re: mapred error on windows

2005-10-31 Thread Doug Cutting
It looks like you are using ndfs but not running any datanodes. An ndfs filesystem requires one namenode and at least one datanode, typically a large number running on different machines. Look at the bin/start-all.sh script for an example of what is started in a typical mapred/ndfs

Re: fetch questions - freezing

2005-10-28 Thread Doug Cutting
Ken van Mulder wrote: Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher gets progressively slower though, dropping down to about ~15 pages/s after about 2-3 hours or so and continues to slow down. I've seen a few references on these lists to the issue, but I'm not clear on

Re: Peak index performance

2005-10-28 Thread Doug Cutting
Byron Miller wrote: For example I've been tweaking max merge/min merge and such and I've been able to double my performance without increasing anything but cpu load. Smaller maxMergeDocs will cost you in the end, since these will eventually be merged during the index optimization at the end.

Re: fetch questions - freezing

2005-10-28 Thread Doug Cutting
Ken Krugler wrote: We're only using the html text parsers, so I don't think that's the problem. Plus we're dumping the thread stack when it hangs, and it's always in the ChunkedInputStream.exhaustInputStream() process (see trace below). The trace did not make it. Have you tried protocol-http

Re: fetch questions - freezing

2005-10-28 Thread Doug Cutting
Ken van Mulder wrote: As a side note, does anyone have any recommendations for profiling software? I've used the standard hprof, which slows down the process too much for my needs, and jmp, which seems pretty unstable. I recommend 'kill -QUIT' as a poor-man's profiler. With a few stack dumps
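
The idea: send SIGQUIT to the JVM a few times and see where the stacks cluster (the pid discovery below is illustrative):

    pid=`ps ax | grep '[f]etcher' | awk '{print $1}'`
    kill -QUIT $pid      # JVM prints a thread dump to its log and keeps running
    sleep 10
    kill -QUIT $pid      # repeat; frames that recur across dumps are the hot spots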

Re: Peak index performance

2005-10-28 Thread Doug Cutting
Byron Miller wrote: <property> <name>indexer.mergeFactor</name> <value>350</value> <description></description> </property> Initially a high index merge factor caused out-of-file-handle errors, but increasing the others along with it seemed to help get around that. That is a very large mergeFactor,

Re: crawl problems

2005-10-19 Thread Doug Cutting
The only link on http://shopthar.com/ to the domain shopthar.com is a link to http://shopthar.com/. So a crawl starting from that page that only visits pages in shopthar.com will only find that one page. % wget -q -O - http://shopthar.com/ | grep shopthar.com <tr><td colspan=2>Welcome to

Re: Nutch Search Speed Concern

2005-10-18 Thread Doug Cutting
TL wrote: You mentioned that as a rule of thumb each node should only have about 20M pages. What's the main bottleneck that's encountered around 20M pages? Disk i/o , cpu speed? Either or both, depending on your hardware, index, traffic, etc. CPU-time to compute results serially can average

Re: Nutch Search Speed Concern

2005-10-17 Thread Doug Cutting
Murray Hunter wrote: We tested search for a 20 Million page index on a dual core 64 bit machine with 8 GB of ram, using storage of the nutch data on another server through linux nfs, and its performance was terrible. It looks like the bottleneck was nfs, so I was wondering how you had your

Re: Do you believe in Clause sanity?

2005-10-17 Thread Doug Cutting
Andy Lee wrote: Not to become a one-person thread or anything (and I'll shut up if this attempt gets no answers), but this seems like a straightforward question. Is there some design principle I'm missing that would be violated if clauses could be removed from a query? No, not that I can

Re: Do you believe in Clause sanity?

2005-10-17 Thread Doug Cutting
Andy Lee wrote: Thanks, Doug. In that case, please consider this a request for a couple of API changes which you may be planning anyway: * addClause() and removeClause() methods in Query. * Setters in Query.Clause for its term/phrase. Please submit a bug report, ideally with a patch file

Re: Unlimited access to a web server for Nutch

2005-10-11 Thread Doug Cutting
Ngoc Giang Nguyen wrote: I'm running Nutch to crawl some specific websites whose web admins I know personally. So is there any way to change the settings of the target web servers such that they give my Nutch higher priority, let's say unlimited access, assuming they are all Apache servers?

Re: a simple map reduce tutorial

2005-10-04 Thread Doug Cutting
Earl Cahill wrote: 1. Sounds like some of you have some glue programs that help run the whole process. Are these going to end up in subversion sometime? I am guessing there is much duplicated effort. I'm not sure what you mean. I set environment variables in my .bashrc, then simply use

Re: mapred Sort Progress Reports

2005-10-04 Thread Doug Cutting
Rod Taylor wrote: Tell me how it behaves during the sort phase. I ran 8 jobs simultaneously. Very high await time (1200) and it was doing about 22MB/sec data writes. Nearly 0 reads from disk (everything would be cached in memory). This is during the sort part? This first writes a big file,

Re: How to get real Explanation instead of crippled HTML version?

2005-10-03 Thread Doug Cutting
Ilya Kasnacheev wrote: So I only get the HTMLised version, which is useless if I only need the page rating (the top Explanation.getValue()). How would I get the page rating (i.e. a number from 0 to 1 showing how relevant a Hit was to a Query) from nutch? Explanations are not a good way to get this, as, for each

Re: MapReduce

2005-10-03 Thread Doug Cutting
Paul van Brouwershaven wrote: The AcceptEnv option is only available with ssh 3.9. Debian currently only has 3.8.1p1 in stable and testing (4.2 in unstable). Is there another way to solve the env. problem? I don't know. The Fedora and Debian systems that I use have AcceptEnv. Doug

Re: mapred Sort Progress Reports

2005-10-03 Thread Doug Cutting
Rod Taylor wrote: I see. Is there any way to speed up this phase? It seems to be taking as long to run the sort phase as it did to download the data. It would appear that nearly 30% of the time for the nutch fetch segment is spent doing the sorts, so I'm well off the 20% overhead number you

Re: mapred Sort Progress Reports

2005-10-03 Thread Doug Cutting
Rod Taylor wrote: Virtually no IO reported at all. Averages about 200kB/sec read and writes are usually 0, but burst to 120MB/sec for under 1 second once every 30 seconds or so. That's strange. I wonder what it's doing. Can you use 'kill -QUIT' to get a thread dump? Try a few of these to

Re: mapred Sort Progress Reports

2005-10-03 Thread Doug Cutting
Try the following on your system: bin/nutch org.apache.nutch.io.TestSequenceFile -fast -count 2000 -megabytes 100 foo Tell me how it behaves during the sort phase. Thanks, Doug

Re: MapRed - how can I get the fetcher logs?

2005-10-03 Thread Doug Cutting
Gal Nitzan wrote: I only have two log files: -rw-r--r-- 1 root root 8090 Oct 3 07:01 nutch-root-jobtracker-kunzon.log -rw-r--r-- 1 root root 4290 Oct 3 07:01 nutch-root-namenode-kunzon.log The tasktracker logs would be on the machines running the tasktracker, which might be

  1   2   >