Re: IndexOptimizer bug?

2005-08-04 Thread [EMAIL PROTECTED]

Dear Michael,

I wrote a tool, OptimizeIndex.java; it is faster, and there are no open 
questions about what it does.
After you optimize the index with IndexOptimizer, is the number of results 
for a search on 'http' still the same?


Regards,
   Ferenc

Michael Nebel wrote:


Hi,

I fixed the problem with the following patch:

--- IndexOptimizer.java 2005-08-04 12:55:54.0 +0200
+++ IndexOptimizer.java.~1.6.~  2005-01-21 00:48:50.0 +0100
@@ -138,7 +138,7 @@
 
       if (score > minScore) {
         sdq.put(new ScoreDoc(doc, score));
-        if (sdq.size() >= count) {                 // if sdq overfull
+        if (sdq.size() > count) {                  // if sdq overfull
           sdq.pop();                               // remove lowest in sdq
           minScore = ((ScoreDoc)sdq.top()).score;  // reset minScore
         }
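
(Side note for readers of this patch: the code keeps only the `count` 
highest-scoring documents in a fixed-size priority queue, and the 
ArrayIndexOutOfBoundsException reported further down in this thread comes 
from putting one element too many into that queue. Below is a stand-alone 
sketch of the same keep-the-top-N idea using java.util.PriorityQueue; the 
class and method names are invented for illustration and this is not the 
actual IndexOptimizer code.)

// Sketch only: keep the top `count` scores in a java.util.PriorityQueue
// (a min-heap), so the lowest retained score is always at the head.
// Checking the size before adding means the queue can never overflow,
// which is the effect the patch above aims for with Lucene's fixed-size queue.
import java.util.PriorityQueue;

class TopScores {
    private final int count;                               // max entries to keep
    private final PriorityQueue<Float> sdq = new PriorityQueue<Float>();

    TopScores(int count) { this.count = count; }

    void offer(float score) {
        if (sdq.size() < count) {
            sdq.add(score);                                // still room
        } else if (score > sdq.peek()) {
            sdq.poll();                                    // drop the current lowest...
            sdq.add(score);                                // ...so the size never exceeds count
        }
    }

    float minScore() { return sdq.isEmpty() ? 0.0f : sdq.peek(); }
}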

My index shrank from 8.5 GB to 0.5 GB. I found no documentation 
about the background of this tool. Can anyone tell me what the idea 
behind it is?


Regards

Michael



Andy Liu wrote:


I believe this tool is unfinished and unsupported.

On 7/22/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:


I found an IndexOptimizer in Nutch.
When I run it, it throws an exception:

Optimizing url:http from 226957 to 22696
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22697
    at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:46)
    at org.apache.nutch.indexer.IndexOptimizer$OptimizingTermPositions.seek(IndexOptimizer.java:153)
    at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:325)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:296)
    at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:270)
    at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:234)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:96)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:578)
    at org.apache.nutch.indexer.IndexOptimizer.optimize(IndexOptimizer.java:215)
    at org.apache.nutch.indexer.IndexOptimizer.main(IndexOptimizer.java:235)








Re: Documentation

2005-08-04 Thread Stefan Groschupf

try:
http://wiki.media-style.com/display/nutchDocu/Home

Stefan

On 04.08.2005 at 19:54, Nishant Chandra wrote:


Hi,
I am new to Nutch. Are there any articles/tutorials that explain the
internal workings of the crawler (crawl strategy), etc.?

Nishant




---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: near-term plan

2005-08-04 Thread Stefan Groschupf

Hi Doug,
The slides from my talk yesterday at OSCON give some hints on how  
to get started.  We need a MapReduce tutorial.


http://wiki.apache.org/nutch/Presentations


Can you explain what this means on page 20:
- "scheduling is bottleneck, not disk, network or CPU"?

Thanks.
Stefan 

Re: near-term plan

2005-08-04 Thread Doug Cutting

Stefan Groschupf wrote:

http://wiki.apache.org/nutch/Presentations


Can you explain what this means on page 20:
- "scheduling is bottleneck, not disk, network or CPU"?


I mean that neither the CPUs, the disks, nor the network is at 100% of 
capacity.  Disks are running around 50% busy, CPUs a bit higher, and the 
network switch has lots of bandwidth left.  (Although, if we used multiple 
racks connected with gigabit links, these inter-rack links would already be 
near capacity.)  So sometimes the CPU is busy generating random data and 
stuffing it in a buffer, and sometimes the disk is busy writing data, 
but we're not keeping both busy at the same time all the time.  Perhaps 
more threads/processes and/or bigger buffers would increase the 
utilization--I have not tried to tune things for this benchmark.  But I 
am not disappointed with this performance.  Rather, I think that it is 
fast enough that with real applications, with non-trivial map and 
reduce functions, NDFS will not be a bottleneck.


Doug


Re: near-term plan

2005-08-04 Thread Piotr Kosiorowski

Hello,
I think it is a good idea to release ASAP. I wanted to contribute my code
for fault-tolerant searching - it is taking more time than I expected, 
because, as some of you know, in the meantime I became a father. But I hope 
I will be able to send something for comments early next week. I will look 
at Jira to check whether some more bugs can be fixed before the deadline 
proposed by Andrzej.

Regards
Piotr


Andrzej Bialecki wrote:

Doug Cutting wrote:


Here's a near-term plan for Nutch.

1. Release Nutch 0.7, based on current trunk.  We should do this ASAP. 
Are there bugs in trunk that we need to fix before this can be done? 
The trunk will be copied to a 0.7 release branch.




I'll be back from vacation in 3-4 days, I hope I can do some work in the 
meantime; I'd like to close some bugs marked with Major (e.g. the 
multi-line protocol properties), and perhaps integrate the RSS parser 
before the release. Other than that I think we should do it ASAP. So, I 
would propose a deadline of Aug 8 for the last commits, and then perhaps 
Aug 15 for the release?



2. Merge the mapred branch to trunk.

3. Move the packages org.apache.nutch.{io,ipc,fs,ndfs,mapred} into a 
separate project for distributed computing tools.  If the Lucene PMC 
approves this, it would be a new Lucene sub-project, a Nutch sibling.



I concur. They are very useful at times in unrelated projects.






Re: near-term plan

2005-08-04 Thread Jay Pound
Doug, I also ran into this when I was testing NDFS: the system would have to
wait for the namenode to tell the datanodes what data to receive and which
data to replicate. I'm currently setting up Lustre to see how it works; it
operates at the kernel level. Do you think the namenode would perform better
if it were not written in Java? I plan on running a system where the namenode
(metadata) server will have to perform thousands of I/Os a second, concurrently
updating indexes of multiple segments simultaneously, updating the db on one
machine, and fetching multiple segments on multiple machines, all accessing
the same logical filesystem at the same time. The way the namenode responded,
it took a few seconds to replicate data to other datanodes, and it took time
to start copying data; if you are writing an index, imagine having to wait
1-10 seconds per file to be written (if queued) - that will cause serious
problems. Also, I was able to saturate gigabit with NDFS (well, about 50-60
MBytes a second; it's hard to get better than that with copper), it just took
a few seconds to ramp up to speed, including file copying and replication.
-Jay
PS: Where can I find out about MapReduce? I read the presentations, but I
don't get the core concept of it.

PPS: VIA chips aren't very FPU-powerful; try an Opteron for your namenode - I
bet you will see a huge improvement in speed, even over Xeons, P4s, etc. I was
only able to test 5 machines, but I was able to saturate 50-60 MB a second to
each (mainly replication throughput running level 1)



Re: near-term plan

2005-08-04 Thread Doug Cutting

Jay Pound wrote:

Doug, I also ran into this when I was testing NDFS: the system would have to
wait for the namenode to tell the datanodes what data to receive and which
data to replicate


When did you test this?  Which version of Nutch?  How many nodes?  My 
benchmark results are from just a few days ago.  There have been a lot of 
fixes in the past week, and NDFS now works much better.



I'm currently setting up Lustre to see how it works; it operates at the 
kernel level. Do you think the namenode would perform better if it were not 
written in Java? I plan on running a system where the namenode (metadata) 
server will have to perform thousands of I/Os a second, concurrently updating 
indexes of multiple segments simultaneously, updating the db on one machine, 
and fetching multiple segments on multiple machines, all accessing the same 
logical filesystem at the same time.


While running the benchmark, the namenode was typically using only 2% of 
its 1 GHz CPU.



PS: Where can I find out about MapReduce? I read the presentations, but I
don't get the core concept of it.


http://labs.google.com/papers/mapreduce.html
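
The short version: a map function turns each input record into intermediate 
(key, value) pairs, and a reduce function combines all the values that share 
a key. Below is a toy, single-process sketch of that idea in plain Java - it 
is not Nutch's actual MapReduce API, just an illustration of the concept:

// Toy word count in the MapReduce style. Purely illustrative; a real
// MapReduce run partitions the map output across machines and runs many
// map and reduce tasks in parallel.
import java.util.*;

public class ToyMapReduce {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the quick fox", "the lazy dog", "the fox");

        // "map" phase: every document is turned into (word, 1) pairs;
        // here a pair is just a two-element String[] { word, "1" }.
        List<String[]> pairs = new ArrayList<String[]>();
        for (String doc : docs)
            for (String word : doc.split("\\s+"))
                pairs.add(new String[] { word, "1" });

        // "shuffle/sort" phase: group all values by key.
        SortedMap<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String[] pair : pairs) {
            List<Integer> values = grouped.get(pair[0]);
            if (values == null) grouped.put(pair[0], values = new ArrayList<Integer>());
            values.add(Integer.parseInt(pair[1]));
        }

        // "reduce" phase: combine the values for each key (here: sum them).
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}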


PPS: VIA chips aren't very FPU-powerful; try an Opteron for your namenode - I
bet you will see a huge improvement in speed, even over Xeons, P4s, etc. I was
only able to test 5 machines, but I was able to saturate 50-60 MB a second to
each (mainly replication throughput running level 1)


VIA is not my first choice of CPU; it's simply what the Internet Archive 
has given me to use.  With hundreds of datanodes a VIA-based namenode 
could become a bottleneck.  Right now it is not.


Doug


Detecting unmodified content patches (Re: near-term plan)

2005-08-04 Thread Andrzej Bialecki

Doug Cutting wrote:

Andrzej Bialecki wrote:

So, I would propose a deadline of Aug 8 for the last commits, and then 
perhaps Aug 15 for the release?



Sounds good to me.  Thanks for helping with this!


Unfortunately, the patches related to detecting unmodified content 
will have to wait until after the release.


Here's the problem: It's quite easy to add this checking and recording 
capability to all fetcher plugins, fetchlist generation and db update 
tools, and I've done this in my local patches. However, after a while I 
discovered a serious problem in the way Nutch currently manages phasing 
out of old segment data. If we assume that we always refresh after some 
fixed interval (30 days, or whatever), then we can safely delete 
segments older than 30 days. If the interval varies, then potentially we 
could be stuck with some segments with very old (but still valid) data. 
This is very inefficient, because in a single given segment there might 
be only a couple of such pages left after a while, and the rest of them 
would have to be removed again and again by deduplication because newer 
pages would exist in newer segments.


Moreover (and this is the worst problem), if such segments are lost, the 
information in the webdb must be updated in a way that forces refetching, even 
though If-Modified-Since or the MD5 indicates that the page is still unchanged 
since the last fetch. Currently the only way to do this is to "add days" - but 
if we use a variable refetch interval, that doesn't make much sense. I think 
we need to track in a better way which pages are missing from the segments and 
have to be re-fetched, or we need a better DB update mechanism for when we 
lose some segments.


Perhaps we should extend the Page to record which segment holds the 
latest version of the page? But segments don't have unique ID's now (a 
directory name is too fragile and too easily changed) ...


Related question: in the FetchListEntry we have a "fetch" flag. I think 
that after minor modifications to the FetchListTool (to generate only the 
entries we are supposed to fetch) we could get rid of this flag, or change 
its semantics to mean "unconditionally fetch, even if unmodified".


Any comments?

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Closed: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-08-04 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-65?page=all ]
 
Andrzej Bialecki  closed NUTCH-65:
--

Resolution: Fixed

Patches applied. Thanks!

 index-more plugin can't parse large set of modification-date
 -------------------------------------------------------------

          Key: NUTCH-65
          URL: http://issues.apache.org/jira/browse/NUTCH-65
      Project: Nutch
         Type: Bug
   Components: indexer
  Environment: nutch 0.7, java 1.5, linux
     Reporter: Lutischán Ferenc


 I found a problem in MoreIndexingFilter.java.
 When I index segments, I get a large list of error messages:
 can't parse erroneous date: Wed, 10 Sep 2003 11:59:14 or
 can't parse erroneous date: Wed, 10 Sep 2003 11:59:14GMT
 I modified the source code (I didn't make a 'patch'):
 Original (lines 137-138):
 DateFormat df = new SimpleDateFormat("EEE MMM dd HH:mm:ss yyyy zzz");
 Date d = df.parse(date);
 New:
 DateFormat df = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss", Locale.US);
 Date d = df.parse(date.substring(0,25));
 The modified code works fine.
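
A common way to make this kind of parsing robust (shown here only as a 
sketch, not as the patch that was actually committed) is to try several 
candidate patterns in turn and fall back to the next one on failure:

// Sketch: try a list of candidate date formats until one parses.
// The patterns below are illustrative guesses for strings like
// "Wed, 10 Sep 2003 11:59:14"; they are not taken from the committed fix.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class LenientDateParser {
    private static final String[] PATTERNS = {
        "EEE, dd MMM yyyy HH:mm:ss zzz",
        "EEE, dd MMM yyyy HH:mm:ss",
        "EEE MMM dd HH:mm:ss yyyy zzz",
    };

    static Date parse(String date) {
        for (String pattern : PATTERNS) {
            try {
                return new SimpleDateFormat(pattern, Locale.US).parse(date.trim());
            } catch (ParseException ignored) {
                // fall through and try the next pattern
            }
        }
        return null;  // caller can log the unparsable date and skip the field
    }
}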

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira