Re: Fetcher2 Slow

2009-03-17 Thread Roger Dunk
Now that the soon-to-be-released v1 uses Fetcher2 as the default (or as the 
only fetcher available?), I would think that the slowness problem a number 
of users are facing might be addressed?


In short the case for me is like this:

Nutch trunk revision 755143
JDK 1.6_12 on Linux

The crawl list consists of ~40,000 URLs from dmoz, so they are naturally 
well distributed among hosts (i.e. mostly unique hosts).


Config options:
fetcher.threads.fetch = 80
fetcher.threads.per.host = 80
fetcher.server.delay = 0

The result?

Most of the time, something like this:

activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0

If I'm lucky, it might fetch around 1 page per second (or less).
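
To put a number on how idle the fetcher is, the status line can be parsed and the spin-waiting share tracked over time. What follows is a minimal sketch, not Nutch code: the class and method names are made up for illustration, and it assumes the status line has exactly the comma-separated key=value shape shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FetcherStatus {
    /** Parses "key=value, key=value" pairs from a fetcher status line. */
    public static Map<String, Integer> parse(String line) {
        Map<String, Integer> fields = new LinkedHashMap<>();
        for (String pair : line.split(",")) {
            String[] kv = pair.trim().split("=", 2);
            if (kv.length == 2) {
                fields.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
            }
        }
        return fields;
    }

    /** Fraction of active threads that are idle (spin-waiting). */
    public static double idleFraction(Map<String, Integer> fields) {
        int active = fields.getOrDefault("activeThreads", 0);
        int spinning = fields.getOrDefault("spinWaiting", 0);
        return active == 0 ? 0.0 : (double) spinning / active;
    }

    public static void main(String[] args) {
        Map<String, Integer> s =
            parse("activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0");
        System.out.println("idle fraction: " + idleFraction(s));
    }
}
```

An idle fraction that stays close to 1 while the fetchlist still has URLs left suggests the threads are starved of work rather than limited by bandwidth.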

What I have noticed is that if I let it run for a while, cancel the fetch, 
and start it again from the beginning, it runs very quickly for a while 
before it slows right down to a trickle again. My guess is that hosts 
already cached by my caching NS are fetched quickly, but new lookups are 
taking an age and slowing things down. However, I don't believe my NS is 
slow by any means. Furthermore, the old Fetcher1 never had this problem.
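
The DNS guess can be checked independently of Nutch by timing lookups for a sample of hosts from the fetchlist. The following is only a sketch: DnsTimer, Resolver and the 500 ms threshold are assumptions chosen for illustration. Note that InetAddress.getByName is subject to JVM and OS caching, so a second run over the same hosts will look faster than the first.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DnsTimer {
    /** Lookup strategy, injectable so the timing logic can be tested offline. */
    public interface Resolver {
        void resolve(String host) throws UnknownHostException;
    }

    /** Default resolver: the JVM's own lookup (hypothetical name SYSTEM). */
    public static final Resolver SYSTEM = new Resolver() {
        public void resolve(String host) throws UnknownHostException {
            InetAddress.getByName(host);
        }
    };

    /** Returns hosts whose lookup failed or took longer than thresholdMillis. */
    public static List<String> slowHosts(List<String> hosts, Resolver resolver,
                                         long thresholdMillis) {
        List<String> slow = new ArrayList<String>();
        for (String host : hosts) {
            long start = System.nanoTime();
            boolean failed = false;
            try {
                resolver.resolve(host);
            } catch (UnknownHostException e) {
                failed = true;
            }
            long elapsedMillis = (System.nanoTime() - start) / 1000000L;
            if (failed || elapsedMillis > thresholdMillis) {
                slow.add(host);
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        // Pass host names taken from the fetchlist as arguments.
        System.out.println(slowHosts(Arrays.asList(args), SYSTEM, 500));
    }
}
```

If most uncached hosts show up in the result, the resolver is a plausible bottleneck; if none do, the slowdown is more likely inside the fetcher itself.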


Any ideas where to look to track this down?

Thanks,
Roger

--
From: Roger Dunk ro...@at.com.au
Sent: Thursday, February 05, 2009 2:16 PM
To: nutch-user@lucene.apache.org
Subject: Re: Fetcher2 Slow

It makes no difference if I set fetcher.threads.per.host to 1 or 100, 
which I assume is what you were suggesting?


I also stated that the majority of pages to fetch were from unique hosts, 
so I believe the value of this parameter should not really come into play.


Cheers...
Roger

--
From: Laurent Laborde kerdez...@gmail.com
Sent: Tuesday, February 03, 2009 5:51 PM
To: nutch-user@lucene.apache.org
Subject: Re: Fetcher2 Slow


On Tue, Feb 3, 2009 at 4:10 AM, Roger Dunk ro...@at.com.au wrote:

Hi all,

I'm having no luck whatsoever using Fetcher2, as even with 50 threads 
enabled and parsing disabled, I have 48 or 49 threads SpinWaiting, and 0 
hosts in the queue. I do however have some 50,000 pages to fetch, the 
majority of which are from unique hosts.


The regular fetcher works as expected, fetching concurrently from 50 
hosts.


There is a configuration parameter limiting the number of concurrent
fetcher threads per unique host.

--
F4FQM
Kerunix Flan
Laurent Laborde




Re: Fetcher2 Slow

2009-03-17 Thread Roger Dunk
Andrzej stated in NUTCH-669 that some people reported performance issues 
with Fetcher2, i.e. that it doesn't use the available bandwidth. These 
reports are unconfirmed, and they may have been caused by suboptimal URL / 
host distribution in a fetchlist - but it would be good to review the 
synchronization and threading aspects of Fetcher2.


To address this, I've tried just now generating a fetchlist using 
generate.max.per.host = 1 (which gave me 35,000 unique hosts) to guarantee 
unique hosts, but the problem still remains.


Therefore, I believe it's clearly not an issue of suboptimal URL / host 
distribution. If you require any further information to confirm my report, 
you need only ask!


Cheers...
Roger





Re: nutch 0.7

2009-03-17 Thread Mayank Kamthan
Hello people,
Please provide a pointer to the 0.7 release. I need it urgently.

Thanks and regards,
Mayank.

On Mon, Mar 16, 2009 at 2:23 PM, Mayank Kamthan mkamt...@gmail.com wrote:

 Hi!

 I need nutch 0.7. Can someone please provide me a pointer to it to
 download.
 When I try via the Apache site it leads me to nutch 0.9.
 Please give a pointer for the 0.7 release.

 Regards,
 Mayank.
 --
 Mayank Kamthan




-- 
Mayank Kamthan


Re: nutch 0.7

2009-03-17 Thread W
Just check out the code from the svn branch and build it yourself. I
think it's easy enough.

On Tue, Mar 17, 2009 at 5:21 PM, Mayank Kamthan mkamt...@gmail.com wrote:
 Hello ppl,
 Please provide a pointer to 0.7 release.. I need it  urgently..

 Thanks  n regards,
 Mayank.

 On Mon, Mar 16, 2009 at 2:23 PM, Mayank Kamthan mkamt...@gmail.com wrote:

 Hi!

 I need nutch 0.7. Can someone please provide me a pointer to it to
 download.
 When I try via the Apache site it leads me to nutch 0.9.
 Please give a pointer for the 0.7 release.

 Regards,
 Mayank.
 --
 Mayank Kamthan




 --
 Mayank Kamthan




-- 
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

 021-99325243
Y! : hawking_123
Linkedln : http://www.linkedin.com/in/wildanmaulana


Re: Fetcher2 Slow

2009-03-17 Thread Andrzej Bialecki

Roger Dunk wrote:
Andrzej stated in NUTCH-669 that some people reported performance 
issues with Fetcher2, i.e. that it doesn't use the available bandwidth. 
These reports are unconfirmed, and they may have been caused by 
suboptimal URL / host distribution in a fetchlist - but it would be good 
to review the synchronization and threading aspects of Fetcher2.


To address this, I've tried just now generating a fetchlist using 
generate.max.per.host = 1 (which gave me 35,000 unique hosts) to 
guarantee unique hosts, but the problem still remains.


Therefore, I believe it's clearly not an issue of suboptimal URL / host 
distribution. If you require any further information to confirm my 
report, you need only ask!


Thanks for reporting this. Yes, we need more information - it's best if 
you create a JIRA issue, because then it will be easier to send attachments.


What we need at this moment is:

* the fetchlist - just zip the crawl_generate and attach it.
* nutch-site.xml and hadoop-site.xml (if you run in a distributed mode).
* cmd-line parameters, specifically the number of threads and -noParsing
* information about your environment (OS, cpu/mem, heapsize, JVM version).


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Original tags, attribute defs, multiword tokens, how is this done.

2009-03-17 Thread Lukas, Ray
 
I have some basic questions about Nutch. Can someone point me in the
right direction, or if you have time, maybe just blast out an answer. 

Question One:
I can see the terms that come from the web page. Can I also record in the
index which tag each term came from? In other words, if "ice cream" came
from an h1 tag, I want to know.

Question Two:
"Ice cream" is two words, so in the index it will be stored as two
entries. How can I tell Nutch (Lucene) that this and similar phrases
should be treated as one token? I know that I will somehow need to
supply a dictionary of these terms, but is it possible, and if so, how?

Question Three (I will start hunting for this):
I have not hunted around for this yet, but since I am asking questions:
how can I add more stop words to the stop word list?

Question Four (I will start hunting for this):
Last one, promise. The indexes themselves: is there an explanation
written up for each of the fields in the index?


Thanks for the help
Ray


Re: Original tags, attribute defs, multiword tokens, how is this done.

2009-03-17 Thread vishal vachhani
Please see the inline comments.

On Tue, Mar 17, 2009 at 7:34 PM, Lukas, Ray ray.lu...@idearc.com wrote:


 I have some basic questions about Nutch. Can someone point me in the
 right direction, or if you have time, maybe just blast out an answer.

 Question One:
 I can see the terms that come from the web page. Can I set up a way to
 also add these things to the index. In other words, if ice cream came
 from a h1 tag I want to know.


Modify the index plugin to include such changes; you can add more fields
in the plugin. Of course, you also need to modify the HTML parser so that
it keeps a record of the headings in the document being parsed. For example,
you can include a Heading field in the index that contains the terms found
in the document's headings.

While searching, you can give more boost to a document if the query terms are
found in the Heading field. For that you need to modify query formulation;
for more, see the documentation on Lucene query formulation.
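
As a rough illustration of the two steps described above (collect heading text for its own field, then weight that field at query time), here is a self-contained Java sketch. It is not a Nutch plugin: the class name, the field names "heading" and "content", and the boost value are assumptions, and the regular expression stands in for the DOM walk a real parse plugin would do. The query string uses Lucene's classic query syntax (field:"phrase"^boost); how to feed such a query into Nutch's own query handling is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadingField {
    // Matches <h1>..</h1> through <h6>..</h6>; the backreference pairs the tags.
    private static final Pattern HEADING = Pattern.compile(
        "<h([1-6])[^>]*>(.*?)</h\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Collects heading text so it can be stored in a separate index field. */
    public static List<String> extractHeadings(String html) {
        List<String> headings = new ArrayList<String>();
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            // Drop nested inline tags and collapse whitespace.
            headings.add(m.group(2).replaceAll("<[^>]+>", "")
                                   .replaceAll("\\s+", " ").trim());
        }
        return headings;
    }

    /** Builds a Lucene-syntax query that weights heading matches more heavily. */
    public static String boostedQuery(String phrase, float headingBoost) {
        String quoted = "\"" + phrase.replace("\"", "\\\"") + "\"";
        return "heading:" + quoted + "^" + headingBoost + " content:" + quoted;
    }

    public static void main(String[] args) {
        System.out.println(extractHeadings("<h1>Ice <b>Cream</b></h1><p>text</p>"));
        System.out.println(boostedQuery("ice cream", 4.0f));
    }
}
```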




 Question Two:
 Ice Cream is really two words. But in the index it will be stored as
 two entries. How can I tell Nutch (Lucene) that this and other things
 are to be treated as one Token.. I know that somehow I will need to
 supply a dictionary of these terms, but is it possible.. and if so how?


If you have a multi-word extractor (MWE) or a dictionary, you can invoke the
MWE or look up the dictionary before indexing a document, create an MWE field
in the index, and give more boost if the query terms are found in that field.
In some sense Lucene/Nutch ranking already handles this; for more details see
the coord factor in Lucene ranking.

However, if you still want to give more boost to multi-word terms, you can
do it by setting a high boost in the Lucene query. Again, see Lucene query
formulation.
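
The dictionary lookup mentioned above could look roughly like the following Java sketch, which joins adjacent tokens into one token when the pair appears in a phrase dictionary. Everything here is an assumption for illustration: the class name, the underscore-joined output form, and the restriction to two-word phrases. In a real setup this logic would sit in an analyzer or indexing plugin, not in a standalone class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MultiWordMerger {
    /**
     * Joins adjacent tokens that form a dictionary phrase into one token,
     * e.g. [ice, cream] -> [ice_cream]. Only two-word phrases are handled.
     */
    public static List<String> merge(List<String> tokens, Set<String> phrases) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()) {
                String pair = tokens.get(i).toLowerCase() + " "
                            + tokens.get(i + 1).toLowerCase();
                if (phrases.contains(pair)) {
                    out.add(pair.replace(' ', '_'));
                    i += 2; // consume both words of the phrase
                    continue;
                }
            }
            out.add(tokens.get(i).toLowerCase());
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = new HashSet<String>(Arrays.asList("ice cream"));
        System.out.println(
            merge(Arrays.asList("I", "like", "Ice", "Cream", "cones"), phrases));
    }
}
```

The same merge has to be applied to query text as well, otherwise a search for the two separate words will not match the joined token.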



 Question Three ( is will start hunting for this ):
 I have to hunt around for this so.. I have not yet.. but since I am
 asking questions.. How can I add more stop words into the stop word
 list?


You can look at the SMART system's stop word list. Or you can generate one
using frequency analysis on a document collection if you are looking for
domain-specific stop words.
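
One simple form of the frequency analysis mentioned above is to count in how many documents each term appears and to treat terms present in nearly all of them as stop word candidates. The Java sketch below assumes plain-text documents already in memory; the class name, the tokenisation on non-word characters, and the cut-off fraction are illustrative choices, not anything prescribed by Nutch or Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class StopWordCandidates {
    /** Returns terms found in at least minFraction of the documents, sorted. */
    public static List<String> find(List<String> documents, double minFraction) {
        Map<String, Integer> docFreq = new TreeMap<String, Integer>();
        for (String doc : documents) {
            // Count each term once per document (document frequency).
            Set<String> seen = new HashSet<String>(
                Arrays.asList(doc.toLowerCase().split("\\W+")));
            for (String term : seen) {
                if (term.isEmpty()) continue;
                Integer old = docFreq.get(term);
                docFreq.put(term, old == null ? 1 : old + 1);
            }
        }
        List<String> candidates = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minFraction * documents.size()) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "the crawler fetched the page",
            "the index holds the terms",
            "parse the page and index it");
        System.out.println(find(docs, 0.9));
    }
}
```

The candidates still need a manual review, since a term that is frequent in a narrow collection may be exactly what users search for.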



 Question Four ( is will start hunting for this ):
 Last one, promise.. The indexes themselves. Is there an explanation
 written up for each of the fields in the index.


I'm not sure, but look at the Nutch wiki; you might find something.




 Thanks for the help
 Ray



Re: Original tags, attribute defs, multiword tokens, how is this done.

2009-03-17 Thread Eric J. Christeson


On Mar 17, 2009, at 9:04 AM, Lukas, Ray wrote:


Question Four ( is will start hunting for this ):
Last one, promise.. The indexes themselves. Is there an explanation
written up for each of the fields in the index.



http://wiki.apache.org/nutch/IndexStructure
is the closest thing I've found apart from reading the code.

Eric

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University, Fargo, North Dakota, USA







wild card query in nutch

2009-03-17 Thread Raagu

Hello people,
I have used nutch-0.9 to crawl my application. When searching, it does not
return results for a query that is only part of an indexed word. For example,
the word "Message" is indexed and the search query is "essa"; it does not
match "Message" and so returns no results.
How do I make Nutch search in these cases? How can I enable wildcard
queries in Nutch? Should I write a plugin for that?

Please reply ASAP

-- 
View this message in context: 
http://www.nabble.com/wild-card-query-in-nutch-tp22564901p22564901.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: Index Disaster Recovery

2009-03-17 Thread Eric J. Christeson


On Mar 16, 2009, at 7:55 PM, Otis Gospodnetic wrote:



Eric,

There are a couple of ways you can back up a Lucene index built by  
Solr:


1) have a look at the Solr replication scripts, specifically  
snapshooter.  This script creates a snapshot of an index.  It's  
typically triggered by Solr after its commit or optimize calls,  
when the index is stable and not being modified.  If you use  
snapshooter to create index snapshots, you could simply grab a  
snapshot and there is your backup.


2) have a look at Solr's new replication mechanism (info on the  
Solr Wiki), which does something similar to the above, but without  
relying on replication (shell) scripts.  It does everything via HTTP.


In my 10 years of using Lucene and N years of using Solr and Nutch  
I've never had index corruption.  Nowadays Lucene even has  
transactions, so it's much harder (theoretically impossible) to  
corrupt the index.


Thank you for the information.  I happened to read about snapshooter  
about 10 minutes after I sent that message, but didn't know about  
replication.  It inspires confidence that you haven't experienced  
index corruption in your years of using this technology.


Eric

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University



Professional Nutch Support and Distribution

2009-03-17 Thread Dennis Kubes
Wanted to gauge community interest in having a certified Nutch 
distribution with support?  Similar to what Lucid Imagination is doing 
for Solr and Lucene and what Cloudera is providing for Hadoop.  Anybody 
interested?


Dennis


Re: Task failed to report status when merging segments

2009-03-17 Thread Justin Yao
I raised heap size to 2GB for each child in mapred.child.java.opts and 
the segment merging succeeded.



Justin Yao wrote:

Hi,

I encountered an error when trying to merge segments using the latest 
Nutch nightly build.

I have 3 Hadoop nodes and all servers have CentOS 5.2 installed.

Every time I tried to merge segments using the command:

nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

it failed with the error message:

Task attempt: /default-rack/10.9.17.206
Cleanup Attempt: /default-rack/10.9.17.206

Task attempt_200903161037_0001_r_03_0 failed to report status for 
1200 seconds. Killing!


Then another child task was launched, and later I got this error message:

org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to 
create file 
/user/justin/crawl/MERGEDsegments/20090316143643/crawl_generate/part-3 
for DFSClient_attempt_200903161037_0001_r_03_1 on client 10.6.180.2 
because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1055)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:998)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:301)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:894)

    at org.apache.hadoop.ipc.Client.call(Client.java:697)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
    at $Proxy1.create(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.create(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2585)
    at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454)
    at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:190)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
    at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
    at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:306)
    at org.apache.nutch.segment.SegmentMerger$SegmentOutputFormat$1.ensureSequenceFile(SegmentMerger.java:252)
    at org.apache.nutch.segment.SegmentMerger$SegmentOutputFormat$1.write(SegmentMerger.java:211)
    at org.apache.nutch.segment.SegmentMerger$SegmentOutputFormat$1.write(SegmentMerger.java:194)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
    at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:479)
    at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:113)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
    at org.apache.hadoop.mapred.Child.main(Child.java:158)


Here is the log of namenode:

2009-03-16 17:03:20,794 WARN  hdfs.StateChange - DIR* NameSystem.startFile: failed to create file /user/justin/crawl/MERGEDsegments/20090316143643/crawl_generate/part-3 for DFSClient_attempt_200903161037_0001_r_03_1 on client 10.6.180.2 because current leaseholder is trying to recreate file.
2009-03-16 17:04:20,798 WARN  hdfs.StateChange - DIR* NameSystem.startFile: failed to create file /user/justin/crawl/MERGEDsegments/20090316143643/crawl_generate/part-3 for DFSClient_attempt_200903161037_0001_r_03_1 on client 10.6.180.2 because current leaseholder is trying to recreate file.
2009-03-16 17:05:20,803 WARN  hdfs.StateChange - DIR* NameSystem.startFile: failed to create file /user/justin/crawl/MERGEDsegments/20090316143643/crawl_generate/part-3 for DFSClient_attempt_200903161037_0001_r_03_1 on client 10.6.180.2 because current leaseholder is trying to recreate file.


I checked the processes on the Hadoop node; the failed reduce process was 
never killed and kept running.


I've tried to merge segments several times and it always failed with the 
same error.


Has anyone encountered this problem before? Is there any solution to 
avoid it? Any suggestions would be appreciated.


Thanks,


--
Justin Yao
Snooth
o: 646.723.4328
c: 718.662.6362
jus...@snooth.com

Snooth -- Over 2 million ratings and counting...


nutch - solr integration advantages

2009-03-17 Thread Bartosz Gadzimski

Hello,

It's hard for me to get the big picture of why to use Solr for indexing and 
searching.


Could someone try to describe this a little bit?

Do I understand correctly that Nutch does the crawling and Solr just the 
indexing and searching?


Any help would be great.

Thanks,
Bartosz


Re: nutch - solr integration advantages

2009-03-17 Thread Andrew Smith
Hello Bartosz. I can only really describe my own experiences, and what I have
done with Nutch/Solr is pretty simple.

My reasons for using Nutch/Solr were that the query interface to Solr is
more powerful (Nutch is optimised for speed instead) and that I felt it
would be easier for me to integrate Solr into my Python/Django front end
than it would be to integrate Nutch using OpenSearch.

Thanks

Andy


2009/3/17 Bartosz Gadzimski bartek...@o2.pl

 Hello,

 It's hard for me to get big picture of why to use solr as indexing and
 searching.

 Could someone try to describe this a little bit?

 I understand that nutch is doing crawling and solr just indexing and
 searching?

 Any help would be great.

 Thanks,
 Bartosz



Re: The Future of Nutch

2009-03-17 Thread Marc Boucher
Dennis, Otis et al,

My very small team has kept silent for a long time. We've been playing
with Nutch, Hadoop and to a lesser extent Solr for about 2 years now.
Before I get into my thoughts on what direction things should take I
would like to offer a thought on why Nutch is not as active as other
groups.

I think in part it's because of what Nutch represents, and that's the
ability to create large-scale search. Some developers would rather
use Nutch and associated tools and keep quiet about it because of
their goals, which in some cases might mean competing against the likes
of Google, Yahoo, Ask, MSN Live etc. For my part I'm not going to
compete with those companies on large-scale search but I can see
competition in the vertical markets. And while Solr is hot these days,
it's intended primarily for the enterprise market, which is very
different from the large-scale and vertical markets.

Now on to the future. I agree with many of the thoughts Otis put forward.

While Nutch has its problems, other than Heritrix there is no other
open source system available, and Nutch's ability to perform web-wide
crawls must be preserved. However, I'm thinking we should have a modular
approach to Nutch. For instance, why just one fetcher? Why not keep
the current one but also allow for the possibility of using Droids?
Parsing can and should include Tika. I'm not sure about outsourcing
indexing and searching to Solr but that could be a modular option as
well.

I'm not sure if Nutch should become a top level project and move out
from under Lucene. Lucene has great visibility, and for many reasons.
If Nutch was moved, would it still attract enough attention? It's been
noted that developer interest in Nutch is different from that in Lucene,
Solr etc. On the other hand it might do Nutch good to go TLP as maybe then
it would attract more developers, especially if it was packaged
differently.

My thoughts. And hopefully in the near future my small team will be
able to contribute to Nutch in a meaningful way.

Marc Boucher
http://hyperix.com

On Mon, Mar 16, 2009 at 5:50 PM, Otis Gospodnetic
ogjunk-nu...@yahoo.com wrote:

 Hello,


 Comments inlined.

 - Original Message 
 From: Dennis Kubes ku...@apache.org
 To: nutch-user@lucene.apache.org
 Sent: Friday, March 13, 2009 8:19:37 PM

 With the release of Nutch 1.0 I think it is a good time to begin a discussion
 about the future of Nutch.  Here are some things to consider, and I would love
 to hear everyone's views on this.

 Nutch's original intention was as a large-scale www search engine.  That is a
 very specific goal.  Only a few people and organizations actually use it on
 that level.  (I just happen to be one of them as most of my work focuses on
 large scale web search as opposed to vertical search).

 Yes, there are fewer parties doing large scale web crawling.  Still, as there 
 is no alternative fetcher+parser+indexer+searcher capable of handling large 
 scale deployments like Nutch (or maybe Heritrix has the same scaling 
 capabilities?), I think Nutch's ability to perform web-wide crawls, etc. 
 should be preserved.

 Many, perhaps most, people using Nutch these days are either using parts of
 Nutch, such as the crawler, or are targeting vertical or intranet type search
 engines.  This can be seen in how many people have already started using the
 Solr integration features.  So while Nutch was originally intended as a www
 search, IMO most people aren't using it for that purpose.


 That's my experience, too.  I think we can have both under the same Nutch 
 roof.

 Since there are different purposes for different users, would it be good to
 consider moving Nutch to a top level apache project out from under the Lucene
 umbrella?  This would then allow the creation of nutch sub-projects, such as
 nutch-solr, nutch-hbase.  Thoughts?


 I disagree, at least in the near term.  There is nothing preventing those 
 sub-projects existing under Nutch today.  Both Solr and Lucene have the 
 contrib area where similar sub-projects live.  I think it's not a matter of 
 being a TLP, but rather attracting enough developer interest, then user 
 interest, and then contributor interest, so that these sub-projects can be 
 created, maintained, advanced.  Right now, Solr gets a TON of attention, as 
 does Lucene.  Nutch gets the least developer attention, and for some reason 
 the nutch-user subscribers feel a bit different from solr-user or java-user 
 subscribers.

 Many parts of Nutch have also been implemented in other projects.  For example,
 Tika for the parsers, Droids for the Crawler.  It begs the question what are
 Nutch's core features going forward.  When I think about search (again my
 perspective is large scale), I think crawling or acquisition of data, parsing,
 analysis, indexing, deployment, and searching.  I personally think that there is
 much room for improvement in crawling and especially analysis.  Nutch shouldn't
 just be about the shell but also the 

Re: Professional Nutch Support and Distribution

2009-03-17 Thread Marc Boucher
This sounds interesting. I might be interested in this.

Marc Boucher
http://hyperix.com

On Tue, Mar 17, 2009 at 12:31 PM, Dennis Kubes ku...@apache.org wrote:
 Wanted to gauge community interest in having a certified Nutch distribution
 with support?  Similar to what Lucid Imagination is doing for Solr and
 Lucene and what Cloudera is providing for Hadoop.  Anybody interested?

 Dennis



Re: The Future of Nutch

2009-03-17 Thread Dennis Kubes

Marc,

Glad you responded.  Always good to hear people's thoughts.

Marc Boucher wrote:

Dennis, Otis et al,

My very small team has kept silent for a long time. We've been playing
with Nutch, Hadoop and to a lesser extent Solr for about 2 years now.
Before I get into my thoughts on what direction things should take I
would like to offer a thought on why Nutch is not as active as other
groups.

I think in part it's because of what Nutch represents, and that's the
ability to create large-scale search. Some developers would rather
use Nutch and associated tools and keep quiet about it because of
their goals, which in some cases might mean competing against the likes
of Google, Yahoo, Ask, MSN Live etc. For my part I'm not going to
compete with those companies on large-scale search but I can see
competition in the vertical markets. And while Solr is hot these days,
it's intended primarily for the enterprise market, which is very
different from the large-scale and vertical markets.


I completely agree.  The group of people/companies that are creating 
large scale search solutions whether whole web or vertical is much 
smaller than say enterprise search or even the potential uses for Hadoop.




Now on to the future. I agree with many of the thoughts Otis put forward.

While Nutch has its problems, other than Heritrix there is no other
open source system available, and Nutch's ability to perform web-wide
crawls must be preserved. However, I'm thinking we should have a modular
approach to Nutch. For instance, why just one fetcher? Why not keep
the current one but also allow for the possibility of using Droids?
Parsing can and should include Tika. I'm not sure about outsourcing
indexing and searching to Solr but that could be a modular option as
well.


Yup.  It should IMO also be easy to install and configure.  I was having 
a discussion today where the main topic was: could we give Nutch a nice 
graphical web interface for configuration, where you could drop it in, 
change some options, and create a customized vertical search over x 
domains?




I'm not sure if Nutch should become a top level project and move out
from under Lucene. Lucene has great visibility, and for many reasons.
If Nutch was moved, would it still attract enough attention? It's been
noted that developer interest in Nutch is different from that in Lucene,
Solr etc. On the other hand it might do Nutch good to go TLP as maybe then
it would attract more developers, especially if it was packaged
differently.


Part of this is about releases.  Currently releases are voted on by 
Lucene PMC members and it takes 3 members to confirm a vote.  There are 
only 2 Nutch committers on the Lucene PMC.  So for releases, not that we 
have had many recently, other Lucene PMC members who may not be actively 
associated with Nutch would need to vote to release.  If Nutch was a TLP 
there would be a Nutch PMC which would most likely include all current 
Nutch committers.  The other option may be to add another Nutch committer 
to the Lucene PMC.




My thoughts. And hopefully in the near future my small team will be
able to contribute to Nutch in a meaningful way.


Any and every contribution is welcome.

Dennis



Marc Boucher
http://hyperix.com

On Mon, Mar 16, 2009 at 5:50 PM, Otis Gospodnetic
ogjunk-nu...@yahoo.com wrote:

Hello,


Comments inlined.

- Original Message 

From: Dennis Kubes ku...@apache.org
To: nutch-user@lucene.apache.org
Sent: Friday, March 13, 2009 8:19:37 PM

With the release of Nutch 1.0 I think it is a good time to begin a discussion
about the future of Nutch.  Here are some things to consider, and I would love
to hear everyone's views on this.

Nutch's original intention was as a large-scale www search engine.  That is a
very specific goal.  Only a few people and organizations actually use it on that
level.  (I just happen to be one of them as most of my work focuses on large
scale web search as opposed to vertical search).

Yes, there are fewer parties doing large scale web crawling.  Still, as there 
is no alternative fetcher+parser+indexer+searcher capable of handling large 
scale deployments like Nutch (or maybe Heritrix has the same scaling 
capabilities?), I think Nutch's ability to perform web-wide crawls, etc. should 
be preserved.


Many, perhaps most, people
using Nutch these days are either using parts of Nutch, such as the crawler, or
are targeting towards vertical or intranet type search engines.  This can be
seen in how many people have already started using the Solr integration
features.  So while Nutch was originally intended as a www search, IMO most
people aren't using it for that purpose.


That's my experience, too.  I think we can have both under the same Nutch roof.


Since there are different purposes for different users, would it be good to
consider moving Nutch to a top level apache project out from under the Lucene
umbrella?  This would then allow the creation of nutch sub-projects, such as
nutch-solr, nutch-hbase.  

Re: The Future of Nutch

2009-03-17 Thread Marc Boucher
Dennis,

That adds another dimension to the issue which I had not considered.
One avenue, as you suggest, would be to add another committer to the
Lucene PMC. If that does not work, then maybe going the route of TLP is
the best option.

Marc


 Part of this is about releases.  Currently releases are voted on by Lucene
 PMC members and it takes 3 members to confirm a vote.  There are only 2
 Nutch committers on the Lucene PMC.  So for releases, not that we have had
 many recently, other Lucene PMC members who may not be actively associated
 with Nutch would need to vote to release.  If Nutch was a TLP there would be
 a Nutch PMC which would most likely include all current Nutch committers.
  The other may be to add another Nutch committer to the Lucene PMC.


 My thoughts. And hopefully in the near future my small team will be
 able to contribute to Nutch in a meaningful way.

 Any and every contribution is welcome.

 Dennis



Nutch-based Application for Windows

2009-03-17 Thread John Whelan

Hi All,

For fun, I created a Windows-based installer for Nutch and added an
administrative GUI to it. If interested, you can grab it from FreewareFiles:
http://www.freewarefiles.com/WhelanLabs-Search-Engine-Manager_program_47202.html

Regards,
John
-- 
View this message in context: 
http://www.nabble.com/Nutch-based-Application-for-Windows-tp22572158p22572158.html
Sent from the Nutch - User mailing list archive at Nabble.com.



embed nutch crawl in an application

2009-03-17 Thread n_developer

Generally a Nutch crawl is done through Cygwin. If I don't want to run Cygwin,
and I want to start the crawl from an application of my own, what can I do?

Also, I want Nutch to perform wildcard query searches (as in, if the search
query is book*, it should return all results containing terms that start with
that prefix followed by any text). This is possible in Luke (Lucene). But how
can I implement it in Nutch search?
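
One possible starting point for the first question, offered as a sketch and not a tested recipe: the one-step crawl is implemented by the class org.apache.nutch.crawl.Crawl, which the "nutch crawl" script invokes, so another Java program can start it in a child JVM without going through the shell script. In the Java sketch below, CrawlLauncher, the "--run" switch and the directory layout behind the classpath are assumptions for illustration. Hadoop releases of this period may themselves call Unix utilities, so on Windows this alone may not remove the Cygwin dependency.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class CrawlLauncher {
    /** Builds a "java -cp ... org.apache.nutch.crawl.Crawl ..." command line. */
    public static List<String> buildCommand(String classpath, String urlDir,
                                            String crawlDir, int depth, int topN) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add("org.apache.nutch.crawl.Crawl");
        cmd.add(urlDir);
        cmd.add("-dir");
        cmd.add(crawlDir);
        cmd.add("-depth");
        cmd.add(String.valueOf(depth));
        cmd.add("-topN");
        cmd.add(String.valueOf(topN));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Assumed layout: conf/, the Nutch jar and lib/* under a "nutch" directory.
        String sep = File.pathSeparator;
        String cp = "nutch/conf" + sep + "nutch/nutch.jar" + sep + "nutch/lib/*";
        List<String> cmd = buildCommand(cp, "urls", "crawl", 3, 50);
        System.out.println(cmd);
        // Only launch the child JVM when explicitly asked to.
        if (args.length > 0 && args[0].equals("--run")) {
            ProcessBuilder pb = new ProcessBuilder(cmd);
            pb.inheritIO(); // show the crawl's log output in this console
            int exitCode = pb.start().waitFor();
            System.out.println("crawl exited with code " + exitCode);
        }
    }
}
```

Running the crawl in a separate process keeps its heap settings and any failure away from the calling application; calling Crawl.main directly inside the same JVM is the other obvious route, at the cost of sharing both.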
-- 
View this message in context: 
http://www.nabble.com/embed-nutch-crawl-in-an-application-tp22572933p22572933.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: embed nutch crawl in an application

2009-03-17 Thread MyD

This is an interesting question. If you know how to run the Crawl process
from within another Java program, please let me know. Thanks in advance.



n_developer wrote:
 
 Generally nutch crawl in done thru cygwin. If i dont want to run cygwin,
 and i want to crawl an application from an application of my own what can
 i do?
 
 N also i want nutch to perform wildcard query search(as in, if search
 query is book*, then it shd return al search results whic contain isbn
 followed by any text) This is possible in luke lucene. But hw can i
 implement it in nutch search?
 

-- 
View this message in context: 
http://www.nabble.com/embed-nutch-crawl-in-an-application-tp22572933p22573211.html
Sent from the Nutch - User mailing list archive at Nabble.com.