Nutch returns irrelevant site

2005-12-07 Thread Aled Jones
Hi

I'm currently setting up a nutch search engine that searches travel
websites.  It works quite well but sometimes returns odd results.
One good example:
One of the 100 or so sites I've asked it to crawl is
http://www.hfholidays.co.uk/ .  This site is mainly about walking
holidays and has many pages with the word "walking" in them, so when I
type "walking" into Nutch I'd expect it to turn up. However, the
first result I get back for the keyword "walking" is
http://www.hfholidays.co.uk/email.asp .  This page doesn't have the word
"walking" in it anywhere.
Could someone please explain whether this is a bug or just the way Nutch works?
I've got an idea of how Google works; if Nutch works in a similar fashion,
does this page appear because it is linked from many pages containing the word
"walking"?

Thanks
Aled





This e-mail and any attachments are strictly confidential and intended solely 
for the addressee. They may contain information which is covered by legal, 
professional or other privilege. If you are not the intended addressee, you 
must not copy the e-mail or the attachments, or use them for any purpose or 
disclose their contents to any other person. To do so may be unlawful. If you 
have received this transmission in error, please notify us as soon as possible 
and delete the message and attachments from all places in your computer where 
they are stored. 

Although we have scanned this e-mail and any attachments for viruses, it is 
your responsibility to ensure that they are actually virus free.
 



Re: ad feed for nutch

2005-12-07 Thread Byron Miller
phpadsnew is ok... not easy to integrate with a keyword
based system such as search.

I've used Inclick before with moderate success... it was
under heavy development at the time, however the
developers seem to have a strong base to work from.

In my experience it's not affordable to really do
your own PPC and try to compete.  Backfill with
Google-specific sites or establish a mutually beneficial
relationship with a 2nd/3rd-tier PPC engine that will
co-market with you.

-byron

--- Thomas Delnoij [EMAIL PROTECTED] wrote:

 It should be fairly easy to integrate PhpAdsNew with Nutch:
 http://phpadsnew.com/.

 Rgrds, Thomas

 On 12/7/05, Greg Cohen [EMAIL PROTECTED] wrote:

  Glenn,

  I'm trying to put together a project that will also require ad serving,
  but want it to be open source and give greater transparency to the
  advertisers than they get today with google and overture.  If you start
  developing one, were you thinking of making this an open source project?

  Thanks.

  -greg

  -Original Message-
  From: Insurance Squared Inc. [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, December 06, 2005 3:37 AM
  To: nutch-user@lucene.apache.org
  Subject: ad feed for nutch

  Has anyone had any luck with advertising/ad management systems being
  integrated into nutch? Not just something for the owner to admin ads,
  but to allow external advertisers to manage their accounts/bids, that
  kind of thing.

  I'm drawing up plans for one if none are available, but clearly
  something that's already running would be nicer.

  Thanks,
  -glenn




Re: ad feed for nutch

2005-12-07 Thread Insurance Squared Inc.


Thank you Byron and Greg for your comments.  Upon reflection I think
I'll do as Byron has suggested and attempt an ad feed from a third party
until I've got enough of a base to make my own system with my own
advertisers worthwhile (at which point, Greg, yes, I'd be happy to make
it available).

My concern with a Google feed is that I don't think the base feed
integrates well - I don't want to be promoting 'ads by Google' and
would like some control over design.  I've done some research and I'm
going to check out searchfeed.com.  I'm sure if I get big enough, Yahoo
or Google will provide something custom.  I know some folks that have a
Yahoo feed and claim they're very happy with it; however, they've got
such huge volume that they get more consideration than a bit
player such as myself.  I've heard rumours that MSN is lining up some
beta testers for their ad feed as well; no info on what they're doing.

Regards,

Glenn


Byron Miller wrote:

 phpadsnew is ok... not easy to integrate with a keyword
 based system such as search.

 I've used Inclick before with moderate success... it was
 under heavy development at the time, however the
 developers seem to have a strong base to work from.

 In my experience it's not affordable to really do
 your own PPC and try to compete.  Backfill with
 Google-specific sites or establish a mutually beneficial
 relationship with a 2nd/3rd-tier PPC engine that will
 co-market with you.

 -byron



Nutch and Google Maps together for Real Estate search.

2005-12-07 Thread Benny Krauss
I think I found a website that puts Nutch and Google Maps together for
real estate search.

http://www.realestateadvisor.com/

Nutch is amazing.


Re: Nutch and Google Maps together for Real Estate search.

2005-12-07 Thread Benny Krauss
Actually, I just made a guess. When I typed search.jsp in the site root
directory, the file was there, even though some errors popped up.

On 12/7/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

 Interesting! What makes you think that they use nutch?

 Am 07.12.2005 um 16:48 schrieb Benny Krauss:

  I think I found a website that puts Nutch and Google Maps together
  for real estate search.
 
  http://www.realestateadvisor.com/
 
  Nutch is amazing.




Re: Nutch and Google Maps together for Real Estate search.

2005-12-07 Thread Diane Palla
Yeah, I see it too.

At http://www.realestateadvisor.com/search.jsp - a URL that I entered
directly, not one linked off the site -

I see an invalid page with Nutch Java exceptions:

root cause 
java.lang.NullPointerException
 at 
org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
 at 
org.apache.nutch.searcher.NutchBean.init(NutchBean.java:82)

Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
[EMAIL PROTECTED]




Benny Krauss [EMAIL PROTECTED]
12/07/2005 11:42 AM
To: nutch-user@lucene.apache.org
Subject: Re: Nutch and Google Maps together for Real Estate search.

Actually, I just made a guess. When I typed search.jsp in the site root
directory, the file was there, even though some errors popped up.

On 12/7/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

 Interesting! What makes you think that they use nutch?

 Am 07.12.2005 um 16:48 schrieb Benny Krauss:

  I think I found a website that puts Nutch and Google Maps together
  for real estate search.

  http://www.realestateadvisor.com/

  Nutch is amazing.




Re: try to restart aborted crawl

2005-12-07 Thread Piotr Kosiorowski

Hi,
I had the same problems with JVM crashes, and it was in fact a hardware
problem (memory). It can also be a problem with your software config
(but as far as I remember you are using a quite standard configuration).
I doubt it has anything to do with Nutch, except that Nutch stresses the
JVM/whole box, so the problem shows up more easily than during normal system usage.

Regards,
Piotr

wmelo wrote:
The biggest problem is not restarting the crawl, but the problem that
led to the failure itself, more precisely:
 
Exception in thread main java.io.IOException: key out of order:
http://web.mit.edu/is/about/index.html after http://web.mit.edu/is/?ut/index.html;
 
This kind of problem occurs, for me, almost all the time (together with
another that says there is some problem with Java HotSpot),
preventing me from really using Nutch.
 
I have reported those two problems before, without any answer.  I don't
know whether this is a bug in Nutch (or in Lucene, I have no idea).
The only thing I know is that both issues are very big non-conformities
that should be corrected as soon as possible.
 
Wmelo









NDFS problem on mapred branch

2005-12-07 Thread Hamza Kaya
Hi,
We have a mapred setup on 4 machines (1 namenode and 3 datanodes).
I can access the file system from these machines without any problem.
However, when I tried to write a file to the NDFS from a machine other than
these 4 machines,
I got the following error:

~/nutch-mapred/bin$ ./nutch ndfs -put nutch nutch
051206 152157 parsing file:/home/agmlab/nutch-mapred/conf/nutch-default.xml
051206 152158 parsing file:/home/agmlab/nutch-mapred/conf/nutch-site.xml
051206 152158 No FS indicated, using default:192.168.15.118:9001
051206 152158 Client connection to 192.168.15.118:9001: starting
Exception in thread main java.io.IOException: Cannot create file
/user/agmlab/nutch on client NDFSClient_1904460956
        at org.apache.nutch.ipc.Client.call(Client.java:294)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:128)
        at $Proxy0.create(Unknown Source)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.nextBlockOutputStream(NDFSClient.java:537)
        at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.init(NDFSClient.java:512)
        at org.apache.nutch.ndfs.NDFSClient.create(NDFSClient.java:74)
        at org.apache.nutch.fs.NDFSFileSystem.createRaw(NDFSFileSystem.java:67)
        at org.apache.nutch.fs.NFSDataOutputStream$Summer.init(NFSDataOutputStream.java:41)
        at org.apache.nutch.fs.NFSDataOutputStream.init(NFSDataOutputStream.java:129)
        at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:175)
        at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:162)
        at org.apache.nutch.fs.NDFSFileSystem.doFromLocalFile(NDFSFileSystem.java:174)
        at org.apache.nutch.fs.NDFSFileSystem.copyFromLocalFile(NDFSFileSystem.java:149)
        at org.apache.nutch.fs.NDFSShell.copyFromLocal(NDFSShell.java:46)
        at org.apache.nutch.fs.NDFSShell.main(NDFSShell.java:234)

From the same machine I was able to list the files and create directories.
What may be the problem?

Thanks.

--
Hamza Kaya


Re: NDFS problem on mapred branch

2005-12-07 Thread Andrzej Bialecki

Hamza Kaya wrote:


Hi,
We have a mapred setup on 4 machines (1 namenode and 3 datanodes).
I can access the file system from these machines without any problem.
However, when I tried to write a file to the NDFS on a machine other than
these 4 machines
I got the following error:

~/nutch-mapred/bin$ ./nutch ndfs -put nutch nutch
 



Could you try the same, but using absolute paths? The NDFS client has no 
notion of a relative or current directory, so file names must always 
be absolute, i.e. start with a leading "/".
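A client-side guard along the lines Andrzej suggests - rejecting non-absolute NDFS paths before any RPC is issued - might look like the sketch below. This is illustrative only, not Nutch code; the class and method names are made up:

```java
public class NdfsPathCheck {
    /** NDFS has no notion of a current directory, so require a leading '/'.
     *  Here we refuse relative paths outright; a client could instead
     *  prepend a conventional home prefix such as "/user/name/". */
    static String requireAbsolute(String path) {
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException(
                "NDFS paths must be absolute, got: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(requireAbsolute("/user/agmlab/nutch")); // accepted
        try {
            requireAbsolute("nutch"); // relative, like the -put in the error above
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

So the failing command would become `./nutch ndfs -put nutch /user/agmlab/nutch` (absolute target).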


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: NDFS problem on mapred branch

2005-12-07 Thread Stefan Groschupf

I had the same problem; you will find it in the mail archive.
In my case, one box was unable to connect to the other.
There can be two causes: a firewall may block the ports, or (a
common case) the network DNS name and the name the box uses to
identify itself to other boxes are different.
Check that you can telnet to the port and/or ping all boxes from
all other boxes, using the names that are set up in host.conf.
Please let me know if this was also the problem in your case, since
other people have hit it as well and we should perhaps add it to
the FAQ.



Stefan
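Part of that checklist (do names resolve at all, and to the address peers actually use?) can be automated. The sketch below only demonstrates the resolution step with localhost; real diagnostics would loop over the cluster's host names. Names here are illustrative:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostCheck {
    /** Resolve a name and report the address, as a basic stand-in for
     *  the "does every box agree on every other box's name?" test. */
    static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return "UNRESOLVED";
        }
    }

    public static void main(String[] args) {
        // On a sane box, localhost resolves to a loopback address; a
        // namenode's hostname should likewise resolve to the address
        // the datanodes actually connect to, on every machine.
        System.out.println(resolve("localhost"));
    }
}
```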

Am 06.12.2005 um 14:48 schrieb Hamza Kaya:


 Hi,
 We have a mapred setup on 4 machines (1 namenode and 3 datanodes).
 I can access the file system from these machines without any problem.
 However, when I tried to write a file to the NDFS from a machine other
 than these 4 machines I got the following error:

 ~/nutch-mapred/bin$ ./nutch ndfs -put nutch nutch
 051206 152157 parsing file:/home/agmlab/nutch-mapred/conf/nutch-default.xml
 051206 152158 parsing file:/home/agmlab/nutch-mapred/conf/nutch-site.xml
 051206 152158 No FS indicated, using default:192.168.15.118:9001
 051206 152158 Client connection to 192.168.15.118:9001: starting
 Exception in thread main java.io.IOException: Cannot create file
 /user/agmlab/nutch on client NDFSClient_1904460956
 [...stack trace snipped...]

 From the same machine I was able to list the files and create directories.
 What may be the problem?

 Thanks.

 --
 Hamza Kaya


---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: Nutch returns irrelevant site

2005-12-07 Thread Piotr Kosiorowski
You can use the "explain" page to find out why this page is scored the way it
is. I would expect anchor text to be the main component of it.

Regards
Piotr
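The effect in question - a page ranking on incoming anchor text alone - can be illustrated with a toy scorer. The field weights and formula below are made up for illustration; they are not Nutch's actual scoring:

```java
import java.util.List;

public class AnchorScoreDemo {
    // Toy score: weighted sum of query-term frequency in the body text
    // and in incoming anchor text. Weights are illustrative only.
    static double score(String query, String body, List<String> anchors) {
        double bodyTf = countOccurrences(body.toLowerCase(), query);
        double anchorTf = 0;
        for (String a : anchors) anchorTf += countOccurrences(a.toLowerCase(), query);
        return 1.0 * bodyTf + 2.0 * anchorTf; // anchor text often boosted higher
    }

    static int countOccurrences(String text, String term) {
        int count = 0, idx = 0;
        while ((idx = text.indexOf(term, idx)) != -1) { count++; idx += term.length(); }
        return count;
    }

    public static void main(String[] args) {
        // email.asp: no "walking" in the body, but many links say "walking ..."
        double emailPage = score("walking", "contact us by email",
                List.of("walking holidays", "walking breaks", "uk walking"));
        // a content page: "walking" appears twice in the body, no anchors
        double contentPage = score("walking", "walking holidays and walking tours",
                List.of());
        System.out.println(emailPage > contentPage); // prints true
    }
}
```

The "explain" page shows the real per-field breakdown for a given hit, which is the authoritative way to confirm this.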

Aled Jones wrote:

Hi

I'm currently setting up a nutch search engine that searches travel
websites.  It works quite well but sometimes returns odd results.
One good example:
One of the 100 or so sites I've asked it to crawl is
http://www.hfholidays.co.uk/ .  This site is mainly about walking
holidays and has many pages with the word "walking" in them, so when I
type "walking" into Nutch I'd expect it to turn up. However, the
first result I get back for the keyword "walking" is
http://www.hfholidays.co.uk/email.asp .  This page doesn't have the word
"walking" in it anywhere.
Could someone please explain whether this is a bug or just the way Nutch works?
I've got an idea of how Google works; if Nutch works in a similar fashion,
does this page appear because it is linked from many pages containing the word
"walking"?

Thanks
Aled












Re: Upgrading from Nutch 0.7.1 to 0.8

2005-12-07 Thread Stefan Groschupf

Dave,

here is a step-by-step tutorial for setting up 0.8 on a set of boxes:
http://wiki.media-style.com/display/nutchDocu/setup+a+map+reduce+multi+box+system

Maybe this can help you.
Stefan

Am 07.12.2005 um 17:50 schrieb Goldschmidt, Dave:


Hello,



Any caveats or pitfalls in upgrading from Nutch 0.7.1 to the latest 0.8
nightly build?  I'd like to rebuild a 1-machine 0.7.1 environment, then
distribute it out to 2 machines using NDFS.



Thanks!

DaveG





---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




Re: searching while crawling.

2005-12-07 Thread Stefan Groschupf

Hi.
You do the generate, fetch, update, index cycle, and then you can add
that segment to be searched.
I prefer to have a "ready" folder and a "working" folder; once a
segment is fully indexed, my shell script just moves it to "ready".

After 30 days I delete the segment to start from the beginning.
HTH
Stefan
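The working/ready rotation described above could be sketched in Java as below. The directory and segment names are illustrative, and this is not a Nutch tool, just the file-move idea:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SegmentRotation {
    /** Move a freshly indexed segment from working/ to ready/ so the
     *  searcher only ever sees complete segments. */
    static Path promote(Path working, Path ready, String segment) throws IOException {
        Files.createDirectories(ready);
        return Files.move(working.resolve(segment), ready.resolve(segment));
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("nutch-demo");
        Path working = Files.createDirectories(base.resolve("working"));
        Path ready = base.resolve("ready");
        Files.createDirectory(working.resolve("segment-20051207"));

        promote(working, ready, "segment-20051207");
        System.out.println(Files.exists(ready.resolve("segment-20051207"))); // prints true
    }
}
```

Expiring segments after 30 days would be a similar walk over ready/, comparing each entry's Files.getLastModifiedTime against the cutoff before deleting.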

Am 07.12.2005 um 07:53 schrieb K.A.Hussain Ali:



Hi all,

While crawling using Nutch, can we search over the
segments already crawled and indexed?

I had some errors while doing it that way; I don't get any hits.
Kindly send me your suggestions for overcoming this. Also,
should we search only after the whole crawl ends?

Thanks in advance
regards
-Hussain


---
company:http://www.media-style.com
forum:http://www.text-mining.org
blog:http://www.find23.net




RE: Class Not Found

2005-12-07 Thread Vanderdray, Jacob
Sorry it took so long, but I downloaded and installed ant from
ant.apache.org and was able to build the war file without a problem
(once I'd gotten /etc/ant.conf from the rpm out of the way).  So if
anyone else hits this, just install either the source or binary
distribution from ant.apache.org instead of using an rpm.

Thanks,
Jake.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Saturday, December 03, 2005 1:32 PM
To: nutch-user@lucene.apache.org
Subject: Re: Class Not Found

Vanderdray, Jacob wrote:
   I installed ant from an rpm.  It is possible that the rpm I
 grabbed just doesn't have everything I need.

I have seen this problem too using ant installed from rpm.  I recommend 
downloading ant from Apache.

Doug


Re: Setting up a crawler for a country.

2005-12-07 Thread Insurance Squared Inc.

Mostly an FYI post on those working with country specific SE's:

Just to continue on this topic: the country-code TLD I'm looking at
doesn't provide any information, so we're back to crawling to find
domains.  To add to the complexity, there are lots of people who register
.com's as their main domain here, instead of the country-specific TLD.


So our intended solution is to hack the filter so that it only crawls
and follows sites that match the specific TLD *or* match ARIN's IP list
for addresses in the country.  (Note: ARIN publishes a list of IP
assignments by country.)  Not perfect, but it sure beats hand review.
We'll assume that if they're hosted here, they're likely a site
relevant to the country. 

For any remaining sites we're going to offer a manual submission 
service.  The sites will be reviewed manually, then added into the 
filter.  (We've got a preliminary PHP program that does this running now.)

ARIN's IP mapping to country isn't quite perfect.  For example, our 
servers are located here yet show up as being in the range of another 
country.  I expect I'll occasionally review the list of sites we've 
added manually and look for trends in the IP address list to see if we 
have any missing ranges.  At that point I can pull those domains from 
the filter and just add them to the IP address list.


I'm concerned that adding a huge range of IPs to check will slow the 
crawler down.  However, of the four bytes in an IP address, there are 
only about 10 possibilities in the first byte (i.e. the 
000.XXX.XXX.XXX part).  So we'll check just the first byte, then continue 
to drill down if there's a match.


HTH. 
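A minimal sketch of that two-stage prefilter - cheap first-octet test, full range scan only on a hit. The octets and ranges below are made up for illustration, not real ARIN data:

```java
import java.util.Set;

public class CountryIpFilter {
    // Illustrative first-octet set and [start, end] ranges -- NOT real ARIN data.
    static final Set<Integer> FIRST_OCTETS = Set.of(62, 80, 193, 212);
    static final long[][] RANGES = {
        {ip("62.0.0.0"), ip("62.31.255.255")},
        {ip("193.16.0.0"), ip("193.17.255.255")},
    };

    /** Pack a dotted-quad address into a comparable long. */
    static long ip(String dotted) {
        String[] p = dotted.split("\\.");
        return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
             | (Long.parseLong(p[2]) << 8)  |  Long.parseLong(p[3]);
    }

    /** Cheap first-octet test first; only on a hit do the full range scan. */
    static boolean inCountry(String dotted) {
        int firstOctet = Integer.parseInt(dotted.substring(0, dotted.indexOf('.')));
        if (!FIRST_OCTETS.contains(firstOctet)) return false;  // fast reject
        long addr = ip(dotted);
        for (long[] r : RANGES) {
            if (addr >= r[0] && addr <= r[1]) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(inCountry("62.10.4.1"));   // true: inside a range
        System.out.println(inCountry("10.0.0.1"));    // false: fast-rejected
        System.out.println(inCountry("212.5.5.5"));   // false: octet listed, no range
    }
}
```

Sorting the ranges and binary-searching would keep the full scan cheap even with thousands of country ranges.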




Matt Kangas wrote:

glenn, i know that verisign makes this available for .com and .net as
TLD zone files.

for ccTLDs like .us and .uk, you'll have to see if the TLD registrar
provides the same. the following page has some useful links to these
folks:

http://www.dnsstuff.com/info/dnslinks.htm

--matt

On Nov 29, 2005, at 10:23 AM, Insurance Squared Inc. wrote:

Along these same lines (as I'm interested in a similar
country-specific project), is there any place to get a list of all the
domains for a specific TLD to use to seed nutch?  i.e. if I wanted
to get a list of all currently registered .it, .de, or .ca's?
I've looked without success.  I'm thinking that this information
isn't available due to spamming issues; however, in the paper you
referenced they discuss crawling an entire TLD, which seemed to
indicate they may have access to this info.


Thanks,
Glenn


Ken Krugler wrote:



Is there anyone that can implement a country crawler?  I  
estimate around 40m documents.  Please send me info about your  
prev work and how much time it would take to setup and money :-)




Check out the paper titled Crawling a Country: Better Strategies
than Breadth-First for Web Page Ordering by Ricardo Baeza-Yates and
others. They were using a crawl of Chilean domains to test
strategies for efficient crawling, so it seems like it would be of
interest to you.

The main problem we've run into in doing similar limited-domain
crawls is that you wind up with many fewer hosts, and thus more
URLs/host in any given fetch loop. The restriction of being polite
(one thread per host) leads to lots of retry errors caused by
fetcher threads blocking on a host (IP address) that is already
being accessed by another fetcher thread, and thus lower
pages/second throughput.

So we've been making some mods to Nutch to improve our performance,
but it's not debugged yet... getting closer, though.

-- Ken





--
Matt Kangas / [EMAIL PROTECTED]





Luke and Indexes

2005-12-07 Thread Bryan Woliner
I have a couple very basic questions about Luke and indexes in
general. Answers to any of these questions are much appreciated:

1. In the Luke overview tab, what does Index version refer to?

2. Also in the overview tab, if Has Deletions? is equal to yes,
where are the possible sources of deletions? Dedup? Manual deletions
through luke?

3. Is there any way (w/ Luke or otherwise) to get a file listing all
of the docs in an index? Basically, is there an index equivalent of
this command (which outputs all the URLs in a segment):

bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

4. Finally, my last question is the one I'm most perplexed by:

I called bin/nutch segread -list -dir for a particular segments
directory and found out that one directory had 93 entries. BUT, when I
opened up the index of that segment in Luke, there were only 23
documents (and 3 deletions)! Where did the rest of the URLs go??

Thanks ahead of time for any helpful suggestions,
Bryan


Re: fetch of file:///F:/xxx/xxx/xxx.txt failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

2005-12-07 Thread Arun Kaundal
I am unable to understand what you want to say. Is it possible for you to send
me the configuration in the form of an attachment?
With thanks


On 12/8/05, Hasan Diwan [EMAIL PROTECTED] wrote:


 On Dec 5, 2005, at 4:57 AM, Arun Kaundal wrote:

   I am getting a protocol not found error. What configuration setting
  is required for my case? Please come up with a solution soon; I have
  been waiting on my posting for a long time.
 In your crawl-filter.txt:
 -^(file|ftp|mailto): # remove the word file, leaving
  # -^(ftp|mailto):
 -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
 [EMAIL PROTECTED]
 +^http*://([a-z0-9]*\.)*/
 +^https*://([a-z0-9]*\.)*/
 +^file:///* # Add this
 -.


 Cheers,
 Hasan Diwan [EMAIL PROTECTED]






Re: [Nutch-general] RE: Speed of indexing

2005-12-07 Thread ogjunk-nutch
It's slightly different, actually.
mergeFactor only controls the rate of Lucene index segment creation. 
It doesn't control in-memory stuff.  That is what minMergeDocs controls
- it controls the size of the in-memory buffer.  If you are curious about
this and other Lucene details, they are described in Lucene in Action -
http://www.lucenebook.com/ - code examples are free to download.

Otis
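The buffering/merging distinction can be sketched with a toy simulation. This is a deliberately simplified model (real Lucene merges segments by generation, not all at once); the numbers are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSim {
    int minMergeDocs, mergeFactor;
    int buffered = 0;                           // in-memory docs
    List<Integer> segments = new ArrayList<>(); // on-disk segment sizes
    int flushes = 0, merges = 0;

    MergeSim(int minMergeDocs, int mergeFactor) {
        this.minMergeDocs = minMergeDocs;
        this.mergeFactor = mergeFactor;
    }

    void addDocument() {
        buffered++;
        if (buffered >= minMergeDocs) flush(); // minMergeDocs: when RAM hits disk
    }

    void flush() {
        segments.add(buffered);
        buffered = 0;
        flushes++;
        // mergeFactor: how many on-disk segments accumulate before a merge.
        // Simplified: merge everything once mergeFactor segments exist.
        if (segments.size() >= mergeFactor) {
            int total = 0;
            for (int s : segments) total += s;
            segments.clear();
            segments.add(total);
            merges++;
        }
    }

    public static void main(String[] args) {
        MergeSim sim = new MergeSim(10, 3);
        for (int i = 0; i < 100; i++) sim.addDocument();
        System.out.println(sim.flushes + " flushes, " + sim.merges
                + " merges, segments " + sim.segments);
        // prints: 10 flushes, 4 merges, segments [90, 10]
    }
}
```

Raising minMergeDocs means fewer, larger flushes (more RAM, less disk I/O); raising mergeFactor means merges happen less often but involve more segments, which is the general trade-off behind the speedups discussed below.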


--- Goldschmidt, Dave [EMAIL PROTECTED] wrote:

 Thanks, I'm trying to get a better understanding of this.  Does anyone
 have experience working with these parameters for large datasets (7-20M
 documents)?
 
 What's the interplay between mergeFactor and minMergeDocs?
 
 I think the mergeFactor specifies how many documents to store in memory
 before writing to disk, yes?  But this may be overridden with the
 minMergeDocs parameter, which specifies how many documents must be
 buffered in memory before being merged -- DOES THIS MEAN that nothing is
 written to disk until minMergeDocs is reached?
 
 If I understand it correctly, the mergeFactor also specifies when to
 merge Lucene (not Nutch) segments into a new segment.  If I've a
 mergeFactor of 10, after processing 10 documents in memory, a Lucene
 segment may be written to disk.  When 10 Lucene segments exist, they are
 merged into a single 100-document segment, etc.  Back to minMergeDocs,
 DOES THIS MEAN that all this mergeFactor-based merging occurs in memory
 until minMergeDocs is reached?
 
 What about using RAMDirectory instead?
 
 Help?
 
 Thanks!
 DaveG
 
 
 -Original Message-
 From: Stefan Groschupf [mailto:[EMAIL PROTECTED] 
 Sent: Monday, December 05, 2005 4:29 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Speed of indexing
 
 The Lucene wiki and the Lucene in Action book provide at least a
 description of the formula, but no magic formula.
 Just check the Nutch configuration file. Every value that minimizes
 disk access (increases memory usage) improves speed.
 
 Stefan
 
 Am 05.12.2005 um 22:24 schrieb Goldschmidt, Dave:
 
  Hello,
 
  In searching for solutions, I found an old post from Doug on tuning
  these parameters -- but this old message applied to ~30,000 documents
  only:
 
  http://marc.theaimsgroup.com/?l=lucene-userm=110235452004799w=2
 
  I've upped both the mergeFactor and minMergeDocs to 1000 and was able
  to get ~200 records/second -- not bad, but is this the best I can do?
  That's still ~8 hours of indexing time (I haven't hit the 'updatedb'
  phase yet!).
 
  I'm going to keep playing with these parameters -- BUT does anyone
  have a FORMULA for tuning these parameters given the memory, Java heap
  size, etc.?  :-)
 
  Thanks,
  DaveG
 
 
  -Original Message-
  From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
  Sent: Monday, December 05, 2005 2:43 PM
  To: nutch-user@lucene.apache.org
  Subject: RE: Speed of indexing
 
  Hi, no additional plugins enabled -- just an out of the box build.
 
  And, no, I haven't set any nutch-site settings -- the defaults:
  indexer.minMergeDocs is 50, indexer.maxMergeDocs is 2147483647,
  indexer.mergeFactor is 50.  Any rule-of-thumb formula for setting
  these values?
 
  Note I've upped the number of open files from 1024 to 4096.
 
  Thanks,
  DaveG
 
  -Original Message-
  From: Byron Miller [mailto:[EMAIL PROTECTED]
  Sent: Monday, December 05, 2005 2:36 PM
  To: nutch-user@lucene.apache.org
  Subject: Re: Speed of indexing
 
  Which plugins do you have enabled? Have you optimized
  any of your nutch-site settings yet?
 
  -byron
 
  --- Goldschmidt, Dave [EMAIL PROTECTED]
  wrote:
 
  Hello,
 
  I'm currently indexing ~50 segments, each ~2GB in size, for a total of
  only ~7,000,000 pages.  From the log output, I see an index rate of ~72
  records/second.  Doing the math, this is over 24 hours of time to index
  these segments.
 
  Does this sound slow?  If so, any suggestions as to how to tune this?
  Note I'm using Nutch 0.7.1 on a Linux box with dual CPUs, 2GB of memory
  and a 250GB partition to play with.
 
  Thanks,
  DaveG
 
 
 
 
 
 
 
 ---
 company:http://www.media-style.com
 forum:http://www.text-mining.org
 blog:http://www.find23.net
 
 
 
 
 ___
 Nutch-general mailing list
 Nutch-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/nutch-general