Re: IndexOptimizer bug?
Dear Michael,

I wrote a tool, OptimizeIndex.java; it is faster and there are no questions about what it does. One question about yours: after you optimize an index with IndexOptimizer, is the number of hits for 'http' still the same?

Regards,
Ferenc

Michael Nebel wrote:

Hi,

I fixed the problem with the following patch:

--- IndexOptimizer.java	2005-08-04 12:55:54.0 +0200
+++ IndexOptimizer.java.~1.6.~	2005-01-21 00:48:50.0 +0100
@@ -138,7 +138,7 @@
       if (score > minScore) {
         sdq.put(new ScoreDoc(doc, score));
-        if (sdq.size() >= count) {                // if sdq overfull
+        if (sdq.size() > count) {                 // if sdq overfull
           sdq.pop();                              // remove lowest in sdq
           minScore = ((ScoreDoc)sdq.top()).score; // reset minScore
         }

My index shrank from 8.5 GB to 0.5 GB. I found no documentation about the background of this tool. Can anyone tell me what the idea behind it is?

Regards,
Michael

Andy Liu wrote:

I believe this tool is unfinished and unsupported.

On 7/22/05, [EMAIL PROTECTED] wrote:

I found an IndexOptimizer in Nutch. When I run it, it dies with an exception:

Optimizing url:http from 226957 to 22696
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
	at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
	at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
	at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
	at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
	at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
	at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
	at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
	at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)
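For anyone else hitting this: the 2005-era Lucene PriorityQueue backs its heap with a fixed array of length maxSize + 1 and put() does no bounds check, which is where the exception above comes from. A minimal, self-contained sketch of the off-by-one follows - it only tracks queue sizes rather than using a real heap, and all names are hypothetical:

/**
 * Sketch of the IndexOptimizer off-by-one (hypothetical code, not Nutch's).
 * Lucene's old PriorityQueue backs its heap with an array of length
 * maxSize + 1 and put() does no bounds check, so size bookkeeping alone
 * shows where the ArrayIndexOutOfBoundsException comes from.
 */
public class SdqOverflowSketch {

    /** Run the put-then-maybe-pop loop with either "overfull" check. */
    static void simulate(int count, boolean patched) {
        int size = 0;                              // documents in the queue
        for (int doc = 0; doc < count + 2; doc++) {
            if (size == count) {
                // put() would write heap[count + 1], one slot past the end.
                System.out.println((patched ? "patched" : "buggy")
                        + ": ArrayIndexOutOfBoundsException at doc " + doc);
                return;
            }
            size++;                                // sdq.put(new ScoreDoc(doc, score))
            boolean overfull = patched ? size >= count  // patched check
                                       : size > count;  // original check
            if (overfull) {
                size--;                            // sdq.pop(): drop the lowest score
            }
        }
        System.out.println((patched ? "patched" : "buggy")
                + ": no overflow, steady-state size " + size);
    }

    public static void main(String[] args) {
        simulate(5, false); // buggy: put() itself overflows once size reaches count
        simulate(5, true);  // patched: pop keeps size <= count - 1 before each put
    }
}

With the >= check, a pop follows any put that fills the queue, so the heap never holds more than count documents and put() stays within the array.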
Re: Documentation
try: http://wiki.media-style.com/display/nutchDocu/Home

Stefan

On 04.08.2005 at 19:54, Nishant Chandra wrote:

Hi, I am new to Nutch. Are there any articles or tutorials which explain the internal workings of the crawler (crawl strategy), etc.?

Nishant

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Re: near-term plan
Hi Doug,

> The slides from my talk yesterday at OSCON give some hints on how to get
> started. We need a MapReduce tutorial.
> http://wiki.apache.org/nutch/Presentations

Can you explain what this means on page 20: "Scheduling is bottleneck, not disk, network or CPU"?

Thanks,
Stefan
Re: near-term plan
Stefan Groschupf wrote:
> http://wiki.apache.org/nutch/Presentations
> Can you explain what this means on page 20: "Scheduling is bottleneck,
> not disk, network or CPU"?

I mean that neither the CPUs, disks, nor network are at 100% of capacity. Disks are running around 50% busy, CPUs a bit higher, and the network switch has lots of bandwidth left. (Although, if we used multiple racks connected with gigabit links, those inter-rack links would already be near capacity.) So sometimes the CPU is busy generating random data and stuffing it into a buffer, and sometimes the disk is busy writing data, but we're not keeping both busy at the same time all the time. Perhaps more threads/processes and/or bigger buffers would increase the utilization--I have not tried to tune things for this benchmark. But I am not disappointed with this performance. Rather, I think it is fast enough that, with real applications with non-trivial map and reduce functions, NDFS will not be a bottleneck.

Doug
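To make the threads-and-buffers remark concrete, here is a generic producer/consumer sketch - not Nutch code, and the file name and buffer sizes are arbitrary - that keeps one thread generating random data while another writes it out:

import java.io.FileOutputStream;
import java.util.Random;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Generic illustration (not Nutch code) of overlapping CPU and disk work:
 * one thread fills buffers with random data while another writes them,
 * so neither has to idle waiting for the other.
 */
public class OverlapSketch {
    private static final byte[] POISON = new byte[0]; // end-of-stream marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(4); // small buffer pool

        Thread producer = new Thread(() -> {        // CPU-bound side
            Random random = new Random();
            try {
                for (int i = 0; i < 64; i++) {
                    byte[] buf = new byte[1 << 20]; // 1 MB per buffer
                    random.nextBytes(buf);
                    queue.put(buf);                 // blocks when the pool is full
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        try (FileOutputStream out = new FileOutputStream("random.dat")) {
            byte[] buf;
            while ((buf = queue.take()) != POISON) { // disk-bound side
                out.write(buf);
            }
        }
        producer.join();
    }
}

With a single thread the CPU idles during every write and the disk idles during every fill; with the queue in between, whichever side is slower sets the pace while the other stays busy.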
Re: near-term plan
Hello,

I think it is a good idea to release ASAP. I wanted to contribute my code for fault-tolerant searching - it is taking more time than I expected because, as some of you know, in the meantime I became a father. But I hope I will be able to send something for comments early next week. I will look at JIRA to check whether some more bugs can be fixed before the deadline proposed by Andrzej.

Regards,
Piotr

Andrzej Bialecki wrote:
>> Doug Cutting wrote:
>> Here's a near-term plan for Nutch.
>>
>> 1. Release Nutch 0.7, based on current trunk. We should do this ASAP.
>> Are there bugs in trunk that we need to fix before this can be done?
>> The trunk will be copied to a 0.7 release branch.
>
> I'll be back from vacation in 3-4 days; I hope I can do some work in
> the meantime. I'd like to close some bugs marked Major (e.g. the
> multi-line protocol properties), and perhaps integrate the RSS parser
> before the release. Other than that I think we should do it ASAP. So,
> I would propose a deadline of Aug 8 for the last commits, and then
> perhaps Aug 15 for the release?
>
>> 2. Merge the mapred branch to trunk.
>>
>> 3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a
>> separate project for distributed computing tools. If the Lucene PMC
>> approves this, it would be a new Lucene sub-project, a Nutch sibling.
>
> I concur. They are very useful at times in unrelated projects.
Re: near-term plan
Doug,

I also ran into this when I was testing NDFS: the system would have to wait for the namenode to tell the datanodes what data to receive and which data to replicate. I'm currently setting up Lustre to see how it works; it operates at the kernel level. Do you think the namenode would perform better if it were not Java?

I plan on running a system where the namenode (metadata) server will have to perform thousands of I/Os a second: concurrently updating indexes of multiple segments simultaneously, updating the db on one machine, and fetching multiple segments on multiple machines, all accessing the same logical filesystem at the same time. The way the namenode responded, it took a few seconds to replicate data to other datanodes, and it took time to start the copying of data. When writing an index, imagine having to wait 1-10 seconds per file to be written (if queued); that will cause serious problems.

Also, I was able to saturate gigabit with NDFS (well, about 50-60 MBytes/sec; it's hard to get better than that with copper), it just took a few seconds to ramp up to speed, and that includes file copying and replication.

-Jay

PS: Where can I find out about MapReduce? I read the presentations, but I don't get the core concept of it.

PPS: VIA chips aren't very FPU-powerful; try an Opteron for your namenode. I bet you will see a huge improvement in speed, even over Xeons, P4s, etc. I was only able to test 5 machines, but I was able to saturate 50-60 MB/sec to each (mainly replication throughput, running replication level 1).
Re: near-term plan
Jay Pound wrote:
> Doug, I also ran into this when I was testing NDFS: the system would
> have to wait for the namenode to tell the datanodes what data to
> receive and which data to replicate.

When did you test this? Which version of Nutch? How many nodes? My benchmark results are from just a few days ago. There have been a lot of fixes in the past week, and NDFS now works much better.

> I'm currently setting up Lustre to see how it works; it operates at
> the kernel level. Do you think the namenode would perform better if it
> were not Java? I plan on running a system where the namenode (metadata)
> server will have to perform thousands of I/Os a second: concurrently
> updating indexes of multiple segments simultaneously, updating the db
> on one machine, and fetching multiple segments on multiple machines,
> all accessing the same logical filesystem at the same time.

While running the benchmark, the namenode was typically using only 2% of its 1GHz CPU.

> PS: Where can I find out about MapReduce? I read the presentations,
> but I don't get the core concept of it.

http://labs.google.com/papers/mapreduce.html

> PPS: VIA chips aren't very FPU-powerful; try an Opteron for your
> namenode. I bet you will see a huge improvement in speed, even over
> Xeons, P4s, etc. I was only able to test 5 machines, but I was able to
> saturate 50-60 MB/sec to each (mainly replication throughput, running
> replication level 1).

VIA is not my first choice of CPU; it's simply what the Internet Archive has given me to use. With hundreds of datanodes a VIA-based namenode could become a bottleneck. Right now it is not.

Doug
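For those who, like Jay, want the core concept without reading the whole paper: the programmer supplies just two functions, map and reduce, and the framework handles distribution and the grouping in between. A single-process toy word count - hypothetical code, not the Nutch mapred API - looks like this:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy word count illustrating the MapReduce model from the Google paper.
 * Single-process sketch, not the Nutch mapred API: the real framework's job
 * is to run map() and reduce() across many machines and do the grouping
 * (the "shuffle") in between.
 */
public class WordCountSketch {

    /** map: one input record -> a list of (key, value) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    /** reduce: a key plus all values emitted for it -> the final value. */
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        String[] input = { "the quick brown fox", "the lazy dog", "the fox" };

        // Shuffle phase: group every emitted value by its key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }

        grouped.forEach((word, counts) ->
                System.out.println(word + "\t" + reduce(word, counts)));
    }
}

In the real framework the grouped map output is partitioned across machines, which is why neither user function needs to know anything about distribution.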
Detecting unmodified content patches (Re: near-term plan)
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> So, I would propose a deadline of Aug 8 for the last commits, and then
>> perhaps Aug 15 for the release?
>
> Sounds good to me. Thanks for helping with this!

Unfortunately, the patches related to detecting unmodified content will have to wait until after the release. Here's the problem:

It's quite easy to add this checking and recording capability to all fetcher plugins, fetchlist generation, and db update tools, and I've done this in my local patches. However, after a while I discovered a serious problem in the way Nutch currently manages the phasing out of old segment data.

If we assume that we always refresh after some fixed interval (30 days, or whatever), then we can safely delete segments older than 30 days. If the interval varies, then we could potentially be stuck with some segments holding very old (but still valid) data. This is very inefficient, because after a while a given segment might contain only a couple of such pages, and the rest of its pages would have to be removed again and again by deduplication because newer versions exist in newer segments.

Moreover (and this is the worst problem), if such segments are lost, the information in the webdb must be updated in a way that forces refetching, even though If-Modified-Since or the MD5 signature indicates the page is unchanged since the last fetch. Currently the only way to do this is to add days - but if we use a variable refetch interval, that doesn't make much sense.

I think we need a better way to track which pages are missing from the segments and have to be re-fetched, or a better db update mechanism for when we lose some segments. Perhaps we should extend the Page to record which segment holds the latest version of the page? But segments don't have unique IDs now (a directory name is too fragile and too easily changed)...

Related question: in the FetchListEntry we have a fetch flag. I think that after minor modifications to the FetchListTool (to generate only entries that we are supposed to fetch) we could get rid of this flag, or change its semantics to mean "unconditionally fetch, even if unmodified". Any comments?

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|   Information Retrieval, Semantic Web
___|||__||  \|  || |    Embedded Unix, System Integration
http://www.sigram.com   Contact: info at sigram dot com
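One possible shape for the bookkeeping Andrzej describes, as a sketch only - PageRecord, the stable segment ID, and all field names here are hypothetical, not the actual Page/webdb API:

import java.util.Set;

/**
 * Hypothetical sketch of the bookkeeping discussed above -- not the actual
 * Nutch Page/webdb API. The idea: give segments a stable ID (not a directory
 * name) and record in each page which segment holds its latest version, so
 * the db update tool can force a refetch when that segment is lost, instead
 * of blindly "adding days".
 */
class PageRecord {
    String url;
    byte[] md5;            // signature of the last fetched content
    long lastFetchTime;    // millis since epoch
    long fetchInterval;    // variable per page, millis
    long latestSegmentId;  // hypothetical stable segment ID

    /** Normal case: due for refetch once this page's own interval elapses. */
    boolean isDue(long now) {
        return now - lastFetchTime >= fetchInterval;
    }

    /**
     * Force a refetch when the segment holding the latest version no longer
     * exists, even if If-Modified-Since/MD5 say the content is unchanged.
     */
    boolean mustFetch(Set<Long> liveSegments, long now) {
        return isDue(now) || !liveSegments.contains(latestSegmentId);
    }
}

Under a scheme like this the fetchlist could be generated from mustFetch() alone, which fits the suggestion of dropping the fetch flag or narrowing it to mean "unconditionally fetch".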
[jira] Closed: (NUTCH-65) index-more plugin can't parse large set of modification-date
[ http://issues.apache.org/jira/browse/NUTCH-65?page=all ]

Andrzej Bialecki closed NUTCH-65:
---------------------------------
    Resolution: Fixed

Patches applied. Thanks!

> index-more plugin can't parse large set of modification-date
> -------------------------------------------------------------
>
>          Key: NUTCH-65
>          URL: http://issues.apache.org/jira/browse/NUTCH-65
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>  Environment: nutch 0.7, java 1.5, linux
>     Reporter: Lutischán Ferenc
>
> I found a problem in MoreIndexingFilter.java. When I index segments, I
> get a large list of error messages:
> can't parse errorenous date: Wed, 10 Sep 2003 11:59:14
> or
> can't parse errorenous date: Wed, 10 Sep 2003 11:59:14GMT
> I modified the source code (I didn't make a 'patch'):
> Original (lines 137-138):
> DateFormat df = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz");
> Date d = df.parse(date);
> New:
> DateFormat df = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss", Locale.US);
> Date d = df.parse(date.substring(0, 25));
> The modified code works fine.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
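A standalone check of the reporter's workaround (this reproduces the fix as reported, not necessarily the committed Nutch patch; the class name is made up). Truncating to the first 25 characters removes the "GMT" that sometimes trails the timestamp with no separating space:

import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/**
 * Standalone check of the workaround from NUTCH-65 as reported (not the
 * committed Nutch patch): parse only the first 25 characters so a trailing
 * "GMT" with no separating space can't break parsing.
 */
public class DateFixSketch {
    public static void main(String[] args) throws ParseException {
        String[] samples = {
            "Wed, 10 Sep 2003 11:59:14",
            "Wed, 10 Sep 2003 11:59:14GMT", // no space before the zone
        };
        DateFormat df = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss", Locale.US);
        for (String s : samples) {
            // "Wed, 10 Sep 2003 11:59:14" is exactly 25 characters.
            Date d = df.parse(s.substring(0, 25));
            System.out.println(s + " -> " + d);
        }
    }
}

Note the substring trick relies on two-digit days, as RFC 1123 dates use; a more robust fix would try several formats in turn.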