[Nutch-dev] access to old versions of nutch?

2005-07-23 Thread Luke Baker
Hey, I'm trying to checkout a version of nutch from March 9, however it doesn't seem to be available in the svn repository. Is this because of the switch from incubator to lucene? Is there any way I can get that version? I have a bunch of crawled docs that were crawled with that version an

[Nutch-dev] [jira] Created: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2005-04-21 Thread Luke Baker (JIRA)
: fetcher Reporter: Luke Baker Priority: Minor Attachments: fetchnewonly.patch It would be useful, especially for research/testing purposes, to have a flag for the FetchListTool that makes sure to only include URLs in the fetchlist that have not already been fetched (according to the

[Nutch-dev] [jira] Updated: (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2005-04-21 Thread Luke Baker (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-49?page=all ] Luke Baker updated NUTCH-49: Attachment: fetchnewonly.patch Attached is a patch that provides this functionality to the FetchListTool (generate). > Flag for generate to fetch only new pages

[Nutch-dev] dedup and redirect handling

2005-04-14 Thread Luke Baker
d have a link to www.example.com, but parse_text would be empty. Does that make sense? Does anybody have any brighter ideas on how to rid Nutch of this problem? Thanks, Luke Baker

[Nutch-dev] Optimal segment size?

2005-04-13 Thread Luke Baker
Hey, Is there some sort of optimal or maximum segment size? I have a segment with 3.9 million records and it appears to be taking a really long time to index. The index process has been optimizing the index for over a week. The server I'm running it on is a dual Xeon 3.0 GHz with 2GB of RAM.

Re: [Nutch-dev] Converted Wiki

2005-04-08 Thread Luke Baker
On 04/08/2005 04:27 PM, Chirag Chaman wrote: So, I'm all ready to post the new wiki but hit a little snafu. Basically, I can't seem to log in via my script. Have a Perl script that needs to do the following: 1. Login Chirag, Was your script able to log in previously? It needs to keep track of co

[Nutch-dev] Re: updatedb ioexception

2005-04-04 Thread Luke Baker
Hey, I've figured out what the problem was. Somehow while transferring the data to different servers, the data got very slightly corrupted, which caused this error. Luke On 04/01/2005 09:17 AM, Luke Baker wrote: Hey, When updating the db with a certain segment, I get the following error: 0

[Nutch-dev] updatedb ioexception

2005-04-02 Thread Luke Baker
Hey, When updating the db with a certain segment, I get the following error: 050330 235110 Processing pagesByURL: Sorted 28711.666716283053 instructions/second Exception in thread "main" java.io.IOException: key out of order: gopher://Gopher.wkap..l:70/11gopher_root%3A%5B_journal._jrnl.acbi%5D af

[Nutch-dev] Re: errors:

2005-03-17 Thread Luke Baker
On 03/16/2005 07:20 PM, tigger . wrote: What does this mean: This was a bug in an earlier version of Nutch. Please try upgrading to the newest version in the subversion repository. http://incubator.apache.org/nutch/version_control.html Luke HTTP Status 500 - type Exception report message desc

[Nutch-dev] Re: Simple bug in jsp pages.

2005-03-11 Thread Luke Baker
On 03/11/2005 01:31 PM, Piotr Kosiorowski wrote: Hello all, In latest SVN version (but it was introduced in CVS I think) there is a simple error in two JSP pages. refine-query-init.jsp : String urls = org.apache.nutch.util.NutchConf.get("extension.ontology.urls"); should be: String urls = org.ap

Re: [Nutch-dev] updatedb error

2005-03-02 Thread Luke Baker
, Looks like I'm using the nightly-release from the 15th of February. I'll upgrade to the latest. Thanks, Luke Doug Luke Baker wrote: Has anybody seen this error? I'm trying to update the db from scratch using some previously fetched segments. I get this error on a 2 milli

[Nutch-dev] updatedb error

2005-03-01 Thread Luke Baker
Has anybody seen this error? I'm trying to update the db from scratch using some previously fetched segments. I get this error on a 2 million page segment during the past two attempts to update the db with this segment's data: - 050301 012722 Processing document 200

Re: [Nutch-dev] nutch server

2005-01-24 Thread Luke Baker
servers.txt should designate a search server on each line. So on each line you'd have the IP address of a search server, tab, and the port on which the search server is running. On each search server, you would need to run: bin/nutch server
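The line format described above (search server address, tab, port) can be illustrated with a small sketch. This is not Nutch's own loader; the class name ServersTxtSketch, the method parse, and the sample addresses and port are hypothetical and only show how one such line breaks down.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: reads "address<TAB>port" lines
// in the servers.txt layout described in the message above.
class ServersTxtSketch {
    static List<InetSocketAddress> parse(List<String> lines) {
        List<InetSocketAddress> servers = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\t");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected address<TAB>port: " + line);
            }
            // createUnresolved avoids a DNS lookup; we only record the pair.
            servers.add(InetSocketAddress.createUnresolved(
                    parts[0].trim(), Integer.parseInt(parts[1].trim())));
        }
        return servers;
    }

    public static void main(String[] args) {
        // Sample addresses and port are made up for illustration.
        List<String> lines = Arrays.asList("10.0.0.1\t9999", "10.0.0.2\t9999");
        for (InetSocketAddress s : parse(lines)) {
            System.out.println(s.getHostString() + " -> port " + s.getPort());
        }
    }
}
```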

Re: [Nutch-dev] map&reduce code

2005-01-24 Thread Luke Baker
ibuted system. Once that's working we can start porting Nutch algorithms (db update, dedup, etc.) to use this system. Please be patient. This will all take a while. Doug (or Michael), Is there anything we can do to help or be involved in this pr

Re: [Nutch-dev] Nutch 0.6 release does not compile under java 1.5

2005-01-20 Thread Luke Baker
On 01/20/2005 04:54 PM, Andrzej Bialecki wrote: Jay Yu wrote: Or, only import the classes you want to use (at least from java lib). Who has the permission to change and commit to CVS? This is fixed now. Thanks! Hey, I realize this is a little late now that the changes are committed, but I'd thought

Re: [Nutch-dev] Patch: Multiple threads per host

2005-01-13 Thread Luke Baker
On 12/01/2004 01:42 PM, Luke Baker wrote: Hey, Here's a patch that'll allow users to configure how many threads they want to access the same host at the same time. Right now Nutch only allows one thread at a time to access any given host. The default will still be 1 thread per host. Th
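The idea of a configurable limit on concurrent threads per host can be sketched as follows. This is not the attached patch, whose contents are not shown here; it is a minimal illustration that assumes a per-host semaphore, and the names HostGateSketch, tryEnter, and leave are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical illustration, not the patch from the message above:
// at most maxThreadsPerHost callers may hold a permit for one host.
class HostGateSketch {
    private final int maxThreadsPerHost;
    private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();

    HostGateSketch(int maxThreadsPerHost) {
        this.maxThreadsPerHost = maxThreadsPerHost;
    }

    // Returns true if the caller may access the host now, false if the
    // host already has the maximum number of threads working on it.
    boolean tryEnter(String host) {
        return permits
                .computeIfAbsent(host, h -> new Semaphore(maxThreadsPerHost))
                .tryAcquire();
    }

    // Must be called once for every successful tryEnter on the same host.
    void leave(String host) {
        Semaphore s = permits.get(host);
        if (s != null) {
            s.release();
        }
    }

    public static void main(String[] args) {
        HostGateSketch gate = new HostGateSketch(1); // 1 mirrors the stated default
        System.out.println(gate.tryEnter("www.example.com")); // first thread gets in
        System.out.println(gate.tryEnter("www.example.com")); // second is refused
        gate.leave("www.example.com");
        System.out.println(gate.tryEnter("www.example.com")); // permit is free again
    }
}
```

With a limit of 1 the gate behaves like the one-thread-per-host rule the message describes; a real fetcher would also need to apply its politeness delay, which this sketch leaves out.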

Re: [Nutch-dev] PruneIndexTool

2005-01-04 Thread Luke Baker
On 01/04/2005 10:22 AM, Andrzej Bialecki wrote: Hi, Some time ago I created this tool (see attached code) to address specific needs I and some other people had. If you think it's generally useful I can add it to the net.nutch.tools.* package. I've used it briefly, and it worked well for me. It

Re: [Nutch-dev] improve fetcher thread handling?

2004-12-20 Thread Luke Baker
e some of them. This will not speed up a fetchlist, but it will allow users to not worry about http.max.delays. If we also monitor bandwidth usage, we could use this functionality to do global rate limiting by removing or adding fetcher threads. Luke Baker -

[Nutch-dev] Search Performance

2004-12-17 Thread Luke Baker
of 2 million URLs and those major index structures in RAM what is limiting us to only 20 queries/second? Is there anything we could do to increase our throughput of queries? (adding more RAM? CPUs? faster disks?) Thanks, Luke Baker

Re: [Nutch-dev] Alternative Content Types

2004-12-07 Thread Luke Baker
There [snip] PDF parsing is available in Nutch via the parse-pdf plugin. It uses the PDFBox library. Luke Baker

[Nutch-dev] segslice question

2004-12-06 Thread Luke Baker
parsing. I tried just parsing the whole segment, but it seemed to slow way down after parsing awhile. Thanks, Luke Baker

Re: [Nutch-dev] fetcher.server.delay and http.max.delays

2004-12-03 Thread Luke Baker
lso if you wanted to go even faster, you could try out my recently submitted patch to allow multiple threads to access the host at the same time. You probably wouldn't want to use more threads than you're allowing to access your 1 host. Luke Baker Vince --

Re: [Nutch-dev] updatedb performance issues

2004-12-03 Thread Luke Baker
and comparisons that need to happen on each URL that the update process encounters. For this reason I was forced to use the prefix filter, but it fits my needs just fine. Luke Baker is this normal? we're running on a dual xeon machine w/4 gigs ram. are there any optimizations i can m

[Nutch-dev] fetcher.server.delay and http.max.delays

2004-12-01 Thread Luke Baker
? Currently if your fetchlist has a high percentage of a host and your fetcher.server.delay is set to 0, then you end up getting a ton of http.max.delay errors because the threads aren't really sleeping during their "delay". Thoughts on this? Tha

[Nutch-dev] Patch: Multiple threads per host

2004-12-01 Thread Luke Baker
will wait the fetcher.server.delay only when it pops off the last thread accessing a host. With 1 thread per host this results in identical behavior as currently. Let me know what you think. I've tested it a little and seems to work as it is supposed to. Thanks, Luke Baker diff -Nur n

Re: [Nutch-dev] Experience with a big index

2004-11-29 Thread Luke Baker
eliminates massive amounts of network traffic. This is just one of the key features that I think that should be included in such a system for Nutch. Thanks, Luke Baker [1] http://labs.google.com/papers/mapreduce.html [2] http://www.cs.rochester.edu/sos

Re: [Nutch-dev] split a htmlpage or push and pull

2004-11-18 Thread Luke Baker
oesn't happen because we never have the same URL twice in our webdb, which is where our fetchlist comes from. Any thoughts or suggestions? Luke Baker From my point of view we have a kind of pull mechanism driven by the fetcher and the pulled content from the net will be processed in a pipeline.

Re: [Nutch-dev] [SPAM] url normalization

2004-11-12 Thread Luke Baker
t was fixed for you didn't have to do with URL normalization but rather URL parsing? Meaning for you, Nutch was previously not "parsing" the URLs properly when it was encountering them? I believe code to normalize these URLs should be put in BasicUrlNormalizer.java (and add re

Re: [Nutch-dev] Nutch content filter

2004-11-09 Thread Luke Baker
think it might do some things similar to what you might want to do. You'd have to write your own plugin if you went this route, but it is possible you could do it easily after looking at the creative commons plugin. Hope that helps, Luke Baker

Re: [Nutch-dev] PDF parsing speed

2004-10-31 Thread Luke Baker
On 10/31/2004 12:22 PM, John X wrote: [snip] What are the numbers for kb/s and bytes/page? I have a collection of mostly mswords, ppts and some pdfs, the numbers are 041001 194517 10 status: 0.17712256 pages/s, 8246.524 kb/s, 5959461.5 bytes/page Some files are very large in size: 100 - 300 Mbytes.

[Nutch-dev] PDF parsing speed

2004-10-30 Thread Luke Baker
Hey all, Does anyone else have the problem of the pdf parser taking up so many resources that it slows down the whole parsing process? I ran the fetch with the -noParsing option (thanks John!). I then ran the parser on the documents with the pdf parser enabled. The speed for parsing was quite

Re: [Nutch-dev] NDFS, DistributedSearch - redundant deployment proposal

2004-10-23 Thread Luke Baker
rvers each is on and if it is active on each server. Then when searching, the master sends (or tells clients to send) the searches to a subset of search servers but also tells each search server which "shard(s)" that the server should return results for. Maybe this is already what you

Re: [Nutch-dev] URL Normalization

2004-10-20 Thread Luke Baker
hold the normalized URLs or the Raw URL? Or does it normalize the URL just before doing a fetch? Thankx for taking the time to answer CC- URL normalization occurs when new URLs are injected or found by crawling. So, prior to being stored in WebDB, the URLs are normalized. Luke Baker

Re: [JNK] Re: [Nutch-dev] Memory usage

2004-10-14 Thread Luke Baker
's one email I found from Doug: http://sourceforge.net/mailarchive/message.php?msg_id=8078457 Luke Baker

Re: [Nutch-dev] Memory usage

2004-10-14 Thread Luke Baker
marks other folks have run on their own nutch installations. Hope this helps, Luke Baker

Re: [Nutch-dev] Shell scripts for intranet search/ crawl

2004-09-23 Thread Luke Baker
Dawid Weiss wrote: Re: urls containing parameters, there is a default rule in crawl-urlfilter.txt that removes these: # skip URLs containing certain characters as probable queries, etc. [EMAIL PROTECTED] Maybe I wasn't clear: I know I can remove URLs with those characters, the thing is, I don't

Re: [Nutch-dev] Unit tests failing

2004-09-08 Thread Luke Baker
Andy Hedges wrote: Yes, check out from anonymous cvs a few hours ago. What OS are you running on? Oh, I'm running linux. I'm not familiar enough with Nutch on Windows to help on this one. Sorry. Luke

Re: [Nutch-dev] Unit tests failing

2004-09-08 Thread Luke Baker
Andy Hedges wrote: SegmentMergeTool seems to have stopped working normally - I ran the Unit tests to sanity check and it seems they also point to this problem. I imagine someone familiar with the code can fix it faster than me :D The following test is failing: [junit] Running net.nutch.tools.Te

Re: [Nutch-dev] query filter fields bug

2004-09-07 Thread Luke Baker
Doug Cutting wrote: Luke Baker wrote: Here's a one-liner patch for this problem. Thanks for the help, Doug. Thanks for testing this. I checked it in as FieldQueryFilter.java. Please feel free to now submit a patch for "url:" queries. It should be a pretty simple plugin now!

Re: [Nutch-dev] query filter fields bug

2004-09-07 Thread Luke Baker
Luke Baker wrote: Doug, It fixed the exception I was getting when trying to run a query like the ones I listed. However, if I just search for: url:store It gets translated to: +url:url:store I haven't looked into why that Here's a one-liner patch for this problem. Thanks for the

[Nutch-dev] http code and fetcher speed

2004-09-05 Thread Luke Baker
Hello everyone, I was looking at the code in Http.java. It looks like, when fetching, Nutch _never_ has multiple threads requesting or waiting on the same IP (purpose of BLOCKED_ADDR_TO_TIME). Is this just something that is done so that Nutch is nice to the hosts it crawls? Or is it the wa

Re: [Nutch-dev] query filter fields bug

2004-09-04 Thread Luke Baker
Doug, It fixed the exception I was getting when trying to run a query like the ones I listed. However, if I just search for: url:store It gets translated to: +url:url:store I haven't looked into why that might be happening but I will eventually if you don't get to it first :-) Other than th

Re: [Nutch-dev] UrlNormlize patch

2004-09-04 Thread Luke Baker
Luke Baker wrote: Hey, Attached is a patch which will allow people to easily create their own URL normalization class, just like what can be done with the URL filters. I haven't made any changes in functionality, just made UrlNormalize an interface instead of a class and used it appropri

[Nutch-dev] RegexUrlNormalizer

2004-09-04 Thread Luke Baker
; + +import java.util.List; +import java.util.ArrayList; +import java.util.Iterator; +import java.util.logging.Logger; +import net.nutch.util.LogFormatter; + +import javax.xml.parsers.*; +import org.w3c.dom.*; +import org.apache.oro.text.regex.*; + +import net.nutch.util.*; + +/** Allows users to do

[Nutch-dev] query filter fields bug

2004-09-03 Thread Luke Baker
same is true when using a '.' or '/' in the query using the cc: query filter. Any idea what is causing this? Thanks, Luke Baker

[Nutch-dev] test configuration directory?

2004-09-03 Thread Luke Baker
e should I put this test configuration file? Should there be an special conf/ directory for all tests that would have higher priority over anything in the usual conf/ directory? If yes, where should this directory be and would it just be a matter of changing the main build.xml? Thanks,

[Nutch-dev] UrlNormlize patch

2004-08-27 Thread Luke Baker
). I changed the JUnit test as well, which it passed. Let me know if I need any changes or of any objections as to how things were done. Thanks, Luke Baker diff -Nur --exclude='*.txt' --exclude='*-site.xml' --exclude='*.html' --exclude='*.jar' --exclude=

Re: [Nutch-dev] URL cleaning and munging

2004-05-24 Thread Luke Baker
Andrzej Bialecki on 05/21/2004 03:03 PM wrote: Hi, What would be the best place to put a code for URL cleaning, such as removing ";jsessionid", normalizing de-normalized file references (e.g. "site/html/../../img/logo.gif"), etc? I guess that would be net.nutch.net.UrlNormalizer... or perhaps I
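The two cleanups named in the question (dropping ";jsessionid" and collapsing "../" path segments) can be sketched with the standard java.net.URI class. This is only an illustration, not net.nutch.net.UrlNormalizer; the class name UrlCleanSketch, the method clean, the regular expression, and the sample host are assumptions made for the example.

```java
import java.net.URI;

// Hypothetical sketch, not Nutch's UrlNormalizer: shows the two cleanups
// mentioned in the message above on plain strings.
class UrlCleanSketch {
    static String clean(String url) {
        // Assumed pattern: drop a ";jsessionid=..." path parameter, stopping
        // at the query ('?'), the fragment ('#'), or the end of the string.
        String stripped = url.replaceAll("(?i);jsessionid=[^?#]*", "");
        // URI.normalize() removes "." segments and resolves ".." segments.
        return URI.create(stripped).normalize().toString();
    }

    public static void main(String[] args) {
        System.out.println(clean("http://www.example.com/site/html/../../img/logo.gif"));
        System.out.println(clean("http://www.example.com/shop/cart.jsp;jsessionid=ABC123?item=4"));
    }
}
```

URI.create throws IllegalArgumentException on malformed input, so a crawler would likely want to catch that and decide whether to keep or discard the raw URL.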

[Nutch-dev] URL regex replace

2004-05-15 Thread Luke Baker
of automatic detection, then it might affect where we want the URL regex functionality to go. Thanks for the pointers, Luke Baker

Re: [Nutch-dev] Comments on summaries and score tuning

2004-05-14 Thread Luke Baker
eir site. So count this as a vote for this functionality, but I also hope to start looking at adding the features I'd like in Nutch. Luke Baker Also, in the QueryTranslator currently the same boost value is used for sloppy and exact phrases. My intuition suggests that exact phrases should g

Re: [Nutch-dev] RE: TheServerSide Symposium Nutch presentation

2004-05-08 Thread Luke Baker
Asim Iqbal on 05/08/2004 12:48 AM wrote: "Doug received quite a nice review of his TSSS presentation on Nutch: " This is good news for the Nutch community. Keep up the good work. Asim Iqbal Yes, it is. Doug, any chance we could g

Re: [Nutch-dev] IO exception when searching

2004-05-06 Thread Luke Baker
[EMAIL PROTECTED] on 05/06/2004 01:11 PM wrote: Now a different exception is being thrown. I'm now starting catalina.sh in /usr/local/nutch-nightly/bin under /bin are db/ and segments/ org.apache.jasper.JasperException: /usr/local/nutch-nightly/bin/segments/20040505214118/index not a directory N