[jira] Created: (NUTCH-736) how long it takes nutch 1.0 to fetch

2009-05-14 Thread Filipe Antunes (JIRA)
how long it takes nutch 1.0 to fetch


 Key: NUTCH-736
 URL: https://issues.apache.org/jira/browse/NUTCH-736
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Intel 2.8 Core2Duo OS X 10.5.6
Reporter: Filipe Antunes


I need an opinion about how long it takes Nutch 1.0 to fetch a web site.

At the moment I'm indexing 3000 sites (medical area): universities, clinics, 
hospitals, associations, journals (html, doc, PDF, txt, xls).
So far I have 5 segments (64Gb) and it's fetching the 6th.
Using an Intel 2.8 Core2Duo on OS X 10.5.6 with a 4Mbit internet connection (the 
machine is throttled to 64Mbits during the day (8 hours)), and this fetch 
started one month ago.

Does anyone have statistics on how long Nutch 1.0 takes for a site (# of pages)?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Nutch/Solr: storing the page cache in Solr

2009-05-14 Thread Andrzej Bialecki

Siddhartha Reddy wrote:
I'm trying to patch Nutch to allow the page cache to be added to the 
Solr index when using the SolrIndexer tool. Is there any reason this is 
not done by default? The Solr schema even has the "cache" field, but it 
is left empty.




This issue is more complicated. We would also need to handle non-string 
content, such as various binary formats (PDF, Office, images, etc.), and 
there is no support for this in Solr (yet).


Additionally, storing large binary blobs in a Lucene index has some 
performance consequences.


Currently Nutch uses Solr for searching, and a separate (set of) segment 
servers for content serving.


I'm enclosing a patch of the changes I have made. I have done some 
testing and this seems to work fine. Can someone please take a look at 
it and let me know if I'm doing anything wrong? I'm especially not sure 
about the character encoding to assume when converting the Content 
(which is stored as byte[]) to a String; I'm getting the encoding from 
Metadata (using the key Metadata.ORIGINAL_CHAR_ENCODING) but it is 
always null.


The patch looks OK, if handling String content is all you need. The char 
encoding should be available in ParseData.getMeta().
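
For illustration, a minimal sketch of that conversion, assuming Nutch 1.0's 
Content, ParseData and Metadata classes (all mentioned above); the wrapper 
class name and the UTF-8 fallback are hypothetical, not part of the patch 
under discussion:

  import java.io.UnsupportedEncodingException;
  import org.apache.nutch.metadata.Metadata;
  import org.apache.nutch.parse.ParseData;
  import org.apache.nutch.protocol.Content;

  public class ContentToString {
    // Prefer the encoding detected at parse time, as suggested above;
    // fall back to UTF-8 (an assumption) when nothing was detected.
    static String toText(Content content, ParseData parseData)
        throws UnsupportedEncodingException {
      String encoding = parseData.getMeta(Metadata.ORIGINAL_CHAR_ENCODING);
      if (encoding == null) {
        encoding = "UTF-8"; // hypothetical fallback
      }
      return new String(content.getContent(), encoding);
    }
  }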


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] Updated: (NUTCH-736) how long it takes nutch 1.0 to fetch

2009-05-14 Thread Filipe Antunes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Filipe Antunes updated NUTCH-736:
-

Description: 
I need an opinion about how long it takes Nutch 1.0 to fetch a web site.

At the moment I'm indexing 3000 sites (medical area): universities, clinics, 
hospitals, associations, journals (html, doc, PDF, txt, xls).
So far I have 5 segments (64Gb) and it's fetching the 6th.
Using an Intel 2.8 Core2Duo on OS X 10.5.6 with a 4Mbit internet connection (the 
machine is throttled to 64Kbytes during the day (8 hours)), and this fetch 
started one month ago.

Does anyone have statistics on how long Nutch 1.0 takes for a site (# of pages)?

  was:
I need an opinion about how long it takes Nutch 1.0 to fetch a web site.

At the moment I'm indexing 3000 sites (medical area): universities, clinics, 
hospitals, associations, journals (html, doc, PDF, txt, xls).
So far I have 5 segments (64Gb) and it's fetching the 6th.
Using an Intel 2.8 Core2Duo on OS X 10.5.6 with a 4Mbit internet connection (the 
machine is throttled to 64Mbits during the day (8 hours)), and this fetch 
started one month ago.

Does anyone have statistics on how long Nutch 1.0 takes for a site (# of pages)?


 how long it takes nutch 1.0 to fetch
 

 Key: NUTCH-736
 URL: https://issues.apache.org/jira/browse/NUTCH-736
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Affects Versions: 1.0.0
 Environment: Intel 2.8 Core2Duo OS X 10.5.6
Reporter: Filipe Antunes

 I need an opinion about how long it takes Nutch 1.0 to fetch a web site.
 At the moment I'm indexing 3000 sites (medical area): universities, clinics, 
 hospitals, associations, journals (html, doc, PDF, txt, xls).
 So far I have 5 segments (64Gb) and it's fetching the 6th.
 Using an Intel 2.8 Core2Duo on OS X 10.5.6 with a 4Mbit internet connection (the 
 machine is throttled to 64Kbytes during the day (8 hours)), and this fetch 
 started one month ago.
 Does anyone have statistics on how long Nutch 1.0 takes for a site (# of pages)?




Regarding Solr1.3 and Nutch 0.9 Integration

2009-05-14 Thread malli j
Dear Paul,
  We are trying to integrate Solr 1.3 and Nutch 0.9 and are
facing a few problems. Below I am including the error stack trace
and the JDK version. Please help us out with this problem.
  Finally, let us know if you have
any useful documents on Solr-Nutch integration.

Environment we are working in:

JDK: 1.6
Nutch: 0.9
Solr: 1.3
Articles referred to for integration:
http://wiki.apache.org/nutch/RunningNutchAndSolr
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
OS: Windows XP


Error stack trace we got when updating Solr with the crawler-fetched content:

2009-05-13 20:22:27,175 WARN  indexer.SolrClientAdapter - Could not index
document, reason: Bad Request

Bad Request

request: http://localhost:8080/solr/update?wt=javabin&version=2.2
org.apache.solr.common.SolrException: Bad Request

Bad Request

request: http://localhost:8080/solr/update?wt=javabin&version=2.2
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
at
org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
at
org.apache.nutch.indexer.SolrClientAdapter.index(SolrClientAdapter.java:75)
at
org.apache.nutch.indexer.SolrIndexer$OutputFormat$1.write(SolrIndexer.java:118)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:298)
at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:238)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:313)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
2009-05-13 20:22:27,190 INFO  indexer.Indexer - Executing commit
2009-05-13 20:22:28,221 INFO  indexer.Indexer - SolrIndexer: done

Regards,
Mallik.J.


The Future of Nutch, reactivated

2009-05-14 Thread Andrzej Bialecki

Hi all,

I'd like to revive this thread and gather additional feedback so that we
end up with concrete conclusions. Much of what I write below others have
said before; I'm trying here to express it as it looks from my point
of view.

Target audience
===
I think that the Nutch project is experiencing an identity crisis now -
we are not sure what our target audience is, and we cannot satisfy
everyone. I think there are the following groups of Nutch users:

1. Large-scale Internet crawl & search: actually, there are only a few
such users, because it takes considerable resources to manage operations
on that scale. Scalability, manageability and ranking/spam prevention
are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall
into this category. Modularity, flexibility in implementing custom
processing, and the ability to modify workflows and to use only some
Nutch components seem to be the chief concerns here. Scalability too,
but only up to a volume of ~100-200 million documents.

3. Small- to medium-scale enterprise search: there's a sizeable number
of Nutch users that fall into this category, for historical reasons.
Link-based ranking and resource discovery are not that important here,
but integration with Windows networking, Microsoft formats and
databases, as well as realtime indexing and easy index maintenance, are
crucial. This class of users often has to heavily customize Nutch to get
any sensible result. Also, this is where Solr really shines, so there is
little benefit in using Nutch here. I predict that Nutch will have fewer
and fewer users of this type.

4. Single desktop to small intranet search: as above, but the emphasis
is on ease of use out of the box, and an often-requested feature is a
GUI frontend. Currently, IMHO, Nutch is too complex and requires too
much command-line operation to make this use case attractive to casual
users.

What is the target audience that we as a community want to support? By
this I mean not only the moral support, but also active participation in
the development process. From the place where we are at the moment we
could go in any of the above directions.

Core competence
===
This is a simple but important point. Currently we maintain several
major subsystems in Nutch that are implemented by other projects, and
often in a better way. The plugin framework (and dependency injection)
and content parsing are two areas that we have to delegate to
third-party libraries, such as Tika and OSGi or some other simple IoC
container - probably there are other components that we don't have to do
ourselves. Another thing that I'd love to delegate is distributed search
and index maintenance - either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see
the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling),
discovery and classification of new resources, strategies for crawling
specific sets of URLs (hosts and domains) under bandwidth and netiquette
constraints, etc.

* web graph analysis - this includes link-based ranking, mirror
detection (and URL aliasing) but also link spam detection and a more
complex control over the crawling frontier.

Anything more? I'm not sure - perhaps I would add template detection and
pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 has already made some steps in this direction, with the new
link analysis package and the pluggable FetchSchedule and Signature. A
lot remains to be done here, and we are still spending a lot of
resources on dealing with issues outside this core competence.
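
As a concrete illustration of that pluggability, here is a minimal sketch of
a custom Signature implementation; the class name and the hashing choice are
hypothetical, and the exact base-class contract should be checked against the
Nutch 1.0 source:

  import org.apache.hadoop.io.MD5Hash;
  import org.apache.nutch.crawl.Signature;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;

  // Hypothetical example: fingerprint pages by their extracted text so
  // that markup-only changes are not treated as new content.
  public class TextOnlySignature extends Signature {
    public byte[] calculate(Content content, Parse parse) {
      String text = (parse != null) ? parse.getText() : "";
      return MD5Hash.digest(text.getBytes()).getDigest();
    }
  }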

---

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community
of users, contributors and committers, so that the project is most
useful to our target audience. At this point there are few active
committers, so I don't think we can cover more than one direction at a
time ... ;)

* we need to re-architect Nutch to focus on our core competence, and
delegate what we can to other projects.

Feel free to comment on the above, make suggestions or corrections. I'd
like to wrap it up in a concise mission statement that would help us set
the goals for the next couple of months.

--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





[Nutch Wiki] Trivial Update of HttpAuthenticationSchemes by susam

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by susam:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes

--
  == Introduction ==
- This is a feature in Nutch, developed by Susam Pal, that allows the crawler 
to authenticate itself to websites requiring NTLM, Basic or Digest 
authentication. This feature cannot do POST-based authentication that depends 
on cookies. More information on this can be found at: HttpPostAuthentication
+ This is a feature in Nutch that allows the crawler to authenticate itself to 
websites requiring NTLM, Basic or Digest authentication. This feature cannot 
do POST-based authentication that depends on cookies. More information on this 
can be found at: HttpPostAuthentication
  
  == Necessity ==
  There were two plugins already present, viz. 'protocol-http' and 
'protocol-httpclient'. However, 'protocol-http' did not support HTTP 1.1, 
HTTPS, or the NTLM, Basic and Digest authentication schemes. 
'protocol-httpclient' supported HTTPS and had code for NTLM authentication, 
but the NTLM authentication didn't work due to a bug. Some portions of 
'protocol-httpclient' were re-written to solve these problems and to provide 
additional features, such as authentication support for proxy servers and 
better inline documentation for the properties used to configure 
authentication.
@@ -108, +108 @@

  Once you have checked the items listed above and you are still unable to fix 
the problem, or you are confused about any point listed above, please mail the 
issue with the following information:
  
   1. Version of Nutch you are running.
-  1. Complete code in ''conf/httpclient-auth.xml' file.
+  1. Complete code in 'conf/httpclient-auth.xml' file.
   1. Relevant portion from 'logs/hadoop.log' file. If you are clueless, send 
the complete file.
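  
  For reference, a minimal sketch of what a 'conf/httpclient-auth.xml' file 
might look like, following the format this wiki page describes; the host, 
realm and credential values below are placeholders:
  
  <auth-configuration>
    <!-- Example credentials; tried for the scopes listed inside. -->
    <credentials username="user1" password="secret">
      <default/>
      <authscope host="intranet.example.com" port="80" realm="login"/>
    </credentials>
  </auth-configuration>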
  


[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
  -  private static class LuceneDocumentWrapper implements Writable {
  +  public static class LuceneDocumentWrapper implements Writable { ).
  
+ 
+ HI, I to faced problems to integrate solr and nutch.After , some work i found 
the below article and integrated successfully. 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
+ 


[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
  +  public static class LuceneDocumentWrapper implements Writable { ).
  
  
- HI, I to faced problems to integrate solr and nutch.After , some work i found 
the below article and integrated successfully. 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
  
+ Hi, I too faced problems in integrating Solr and Nutch. After some work I 
found the article below and integrated them successfully. 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
+ 


[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
  d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the 
following fragment into it
  
  <requestHandler name="/nutch" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">
        content^0.5 anchor^1.0 title^1.2
      </str>
      <str name="pf">
        content^0.5 anchor^1.5 title^1.2 site^1.5
      </str>
      <str name="fl">
        url
      </str>
      <str name="mm">
        2&lt;-1 5&lt;-2 6&lt;90%
      </str>
      <int name="ps">100</int>
      <bool name="hl">true</bool>
      <str name="q.alt">*:*</str>
      <str name="hl.fl">title url content</str>
      <str name="f.title.hl.fragsize">0</str>
      <str name="f.title.hl.alternateField">title</str>
      <str name="f.url.hl.fragsize">0</str>
      <str name="f.url.hl.alternateField">url</str>
      <str name="f.content.hl.fragmenter">regex</str>
    </lst>
  </requestHandler>
  
  6. Start Solr


[Nutch Wiki] Update of RunningNutchAndSolr by amitkumar

2009-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The following page has been changed by amitkumar:
http://wiki.apache.org/nutch/RunningNutchAndSolr

--
   * apt-get install sun-java6-jdk subversion ant patch unzip
  
  == Steps ==
-  Setup
  
  The first step to get started is to download the required software 
components, namely Apache Solr and Nutch.
  
- 1. Download Solr version 1.3.0 or LucidWorks for Solr from Download page
+ '''1.''' Download Solr version 1.3.0 or LucidWorks for Solr from Download page
  
- 2. Extract Solr package
+ '''2.''' Extract Solr package
  
- 3. Download Nutch version 1.0 or later (Alternatively download the 
nightly version of Nutch that contains the required functionality)
+ '''3.''' Download Nutch version 1.0 or later (Alternatively download the 
nightly version of Nutch that contains the required functionality)
  
- 4. Extract the Nutch package
+ '''4.''' Extract the Nutch package   tar xzf apache-nutch-1.0.tar.gz
  
- tar xzf apache-nutch-1.0.tar.gz
- 
- 5. Configure Solr
+ '''5.''' Configure Solr
- 
  For the sake of simplicity we are going to use the example
  configuration of Solr as a base.
  
- a. Copy the provided Nutch schema from directory
+ '''a.''' Copy the provided Nutch schema from directory
  apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf 
(override the existing file)
  
  We want to allow Solr to create the snippets for search results so we need to 
store the content in addition to indexing it:
  
- b. Change schema.xml so that the stored attribute of field "content" is 
true.
+ '''b.''' Change schema.xml so that the stored attribute of field 
"content" is true.
  
  <field name="content" type="text" stored="true" indexed="true"/>
  
  We want to be able to tweak the relevancy of queries easily, so we'll create 
a new dismax request handler configuration for our use case:
  
- d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the 
following fragment into it
+ '''d.''' Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste 
the following fragment into it
  
  <requestHandler name="/nutch" class="solr.SearchHandler">
  
@@ -93, +89 @@

  
  </requestHandler>
  
- 6. Start Solr
+ '''6.''' Start Solr
  
  cd apache-solr-1.3.0/example
  java -jar start.jar
  
- 7. Configure Nutch
+ '''7. Configure Nutch'''
  
  a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its 
contents with the following (we specify our crawler name and active plugins, 
and limit the maximum url count for a single host per run to 100):
  
  <?xml version="1.0"?>
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>nutch-solr-integration</value>
    </property>
    <property>
      <name>generate.max.per.host</name>
      <value>100</value>
    </property>
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
  </configuration>
  
- b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf,
- replace its content with the following:
+ '''b.''' Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace 
its content with the following:
  
  -^(https|telnet|file|ftp|mailto):
   
@@ -135, +143 @@

  # deny anything else
  -.
  
- 8. Create a seed list (the initial urls to fetch)
+ '''8.''' Create a seed list (the initial urls to fetch)
  
  mkdir urls
  echo "http://www.lucidimagination.com/" > urls/seed.txt
  
- 9. Inject seed url(s) to nutch crawldb (execute in nutch directory)
+ '''9.''' Inject seed url(s) to nutch crawldb (execute in nutch directory)
  
  bin/nutch inject crawl/crawldb urls
  
- 10. Generate fetch list, fetch and parse content
+ '''10.''' Generate fetch list, fetch and parse content
  
  bin/nutch generate crawl/crawldb crawl/segments
  
@@ -166, +174 @@

  
  Now a full fetch cycle is completed. Next you can repeat step 10 a couple 
more times to get some more content.
  
- 11. Create linkdb
+ '''11.''' Create linkdb
  
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  
- 12. Finally index all content from all segments to Solr
+ '''12.''' Finally index all content from all segments to Solr
  
  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb 
crawl/segments/*
  


Re: The Future of Nutch, reactivated

2009-05-14 Thread Mattmann, Chris A
Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on
similar threads from Otis and from Dennis. My personal pet projects for
Nutch2:

* refactored Nutch core data structures, modeled as POJOs
* refactored Nutch architecture where crawling/indexing/parsing/scoring/etc.
are insulated from the underlying messaging substrate (e.g., crawl over JMS,
EJB, Hadoop, RMI, etc.; crawl using Heritrix; parse using Tika or some other
framework; etc.)
* simpler Nutch deployment mechanisms (separate the Nutch deployment package
from the source code package); think about using Maven2

+1 to all of those and other ideas for how to improve the project's focus.

Cheers,
Chris



The Future of Nutch, reactivated

2009-05-14 Thread Kirby Bohling
All,

Sorry that I didn't reply, and thus this isn't threaded properly.
I've lurked on the list via the RSS feed; I subscribed so I could put
in my two cents' worth.  I've recently started using git to maintain a
local branch of Nutch.  My hope is to get my employer to let me
contribute just the engineering work back to Nutch.  We'd like to
customize Nutch in various ways and use that as the basis of internal
R&D and potentially some products that we'd not contribute.  The other
things that just make Nutch more flexible I'd like to contribute back.

I've been working with Nutch on and off since sometime in November or
so for my job.  A couple of thoughts:

1. Nutch is too monolithic.
2. Nutch does the heavy lifting of a framework for a distributed system well.
3. Nutch doesn't really keep all its various pieces up to date very well.
4. Nutch requires at least a Bachelor's in Nutch to deal with it.
5. Documentation in the Wiki is out of date, or it is hard to tell which
versions various things work with.
6. Nutch isn't very friendly to simple requests when a complex
hack could be found instead (see recursive file:// handling).

My most recent task was actually to update Tika to 0.3 and then
use the Tika parsing of the docx format for indexing.  There were
several interesting problems, but I want to get permission from my
employer first and just show the patches.

I think we fall into category #2 (we wish we could fall into
category #1, but such is life).  We want to make our intranet
searchable on a large scale, and would like to apply the indexing and
retrieval in a number of R&D projects.  We also have an interest in
using Nutch/Lucene/Hadoop in a number of other problems unrelated to
Internet search.

A couple of things that I'd like to help do (or see done) that would
make Nutch far more framework-like, so I can assemble the pieces and
parts into what I need:

1. Get Nutch and its various components into a public Maven
repository, and have public scripts to do the publishing.  I don't care
if that is via Ant with Ivy extensions or by switching to a Maven build
system.  I've actually started with both approaches.  I'm much better
with Maven, but I think Ivy is more likely to be acceptable to the
project.  I'd like to see this done with Hadoop and any other core
components.  For now, I'm just maintaining a local POM file that
pushes my builds into our local Maven repository (a sketch follows
below).  I'm going to do this one way or another, and would love to
hear any feedback on an approach that is acceptable to be contributed
back to Nutch.
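
For what it's worth, a minimal sketch of the kind of one-off publishing I
mean, using Maven's stock deploy:deploy-file goal; the coordinates, jar path
and repository URL are all hypothetical:

  mvn deploy:deploy-file \
    -DgroupId=org.apache.nutch \
    -DartifactId=nutch \
    -Dversion=1.0-local \
    -Dpackaging=jar \
    -Dfile=build/nutch-1.0.jar \
    -DrepositoryId=internal \
    -Durl=http://maven.example.com/repository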

2. Clearly segregate Plugins from Core from the bits that make it an
Application.  I've had fun problems with ClassLoaders, and it seems
that the interface plugins are allowed to access is anything in
Core or its existing libraries.  It would seem better to have a
Core runtime, which plugins can depend upon, that is relatively
minimal.  Identify the pieces of Nutch which are there to make it into
a program you can run, and push those into a separate place.  For APIs
with multiple implementations, it would be nice not to be forced to use
the same one the Core does when a plugin is written.

3. As you stated earlier, use OSGi for a plugin system and some type
of dependency injection rather than hand-parsed XML files.  I've had
problems with the PluginClassloader (I wanted to use Tika in my
plugin, and because of the plugin/classloader setup, I had to push the
POI libraries into the lib directory rather than into the
src/plugin/plugin-XXX/lib directory).  Well, that was the first
approach; the second was to hack the PluginClassloader to not delegate
to the parent for the org.apache.tika package and then provide Tika
in the plugin, and it all worked.  Using a well-known plug-in system
would have made this much easier.  A rough sketch of the delegation
hack follows below.
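
This is not the actual PluginClassloader code; the class name and structure
are made up, but it shows the parent-delegation exception I mean:

  import java.net.URL;
  import java.net.URLClassLoader;

  // Resolve one package prefix from the plugin's own jars before asking
  // the parent, so the plugin's bundled copy (e.g. org.apache.tika) wins
  // over whatever the core classpath ships.
  public class PluginFirstClassLoader extends URLClassLoader {
    private final String pluginFirstPrefix;

    public PluginFirstClassLoader(URL[] urls, ClassLoader parent,
        String prefix) {
      super(urls, parent);
      this.pluginFirstPrefix = prefix;
    }

    @Override
    protected synchronized Class<?> loadClass(String name, boolean resolve)
        throws ClassNotFoundException {
      if (name.startsWith(pluginFirstPrefix)) {
        Class<?> c = findLoadedClass(name);
        if (c == null) {
          try {
            c = findClass(name);                    // plugin's own URLs first
          } catch (ClassNotFoundException e) {
            return super.loadClass(name, resolve);  // fall back to parent
          }
        }
        if (resolve) {
          resolveClass(c);
        }
        return c;
      }
      return super.loadClass(name, resolve);        // normal parent-first path
    }
  }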

4. Help transition to using 3rd-party libraries; Nutch still has
an SWF parser that went unmaintained in 2002.  Flash has moved a long
way, so it would seem sensible to either jettison that code or update to
newer versions of the same library by the same project (SWF2).  Not
that I care about Flash, but it seems that parsing isn't something
Nutch proper is focused on.

5. With whatever build system is chosen, figure out how to set up a
Maven build to construct out-of-tree Nutch plugins without having to
manually deal with all of the various dependencies and packaging
details.

6. Better support for running out of an IDE.  The instructions work
and are very helpful.  It'd be much nicer to see the use of tools or
scripts to generate a saner setup than is currently there (having
each plugin be a project in Eclipse would be a huge help in debugging
weird classpath issues).  Right now, running and compiling inside of
Eclipse isn't at all similar to running it outside if you have any
kind of classloader issue or multiple conflicting libraries.  Not
that there are any in-tree right now, but I can see how future ones
could exist.

7. Make each plugin be its own deliverable (even if