Hi all,
ApacheCon is over, and our 1.0 release has been out for some
time, so I think it's a good moment to discuss the next steps
in Nutch development.
Let me share with you the topics I identified and presented in the
ApacheCon slides, and some topics that are worth discussing based on
various conversations I had there, and the discussions we had on our
mailing list:
1. Avoid duplication of effort
--
Currently we spend significant effort on implementing functionality that
other projects are dedicated to. Instead of doing the same work, and
sometimes poorly, we should concentrate on delegating and reusing:
* Use Tika for content parsing: this will require some effort and
collaboration with the Tika project, to improve Tika's ability to handle
more complex formats well (e.g. hierarchical compound documents such as
archives, mailboxes, RSS), and to contribute any missing parsers (e.g.
parse-swf).
* Use Solr for indexing and search: it is hard to justify the effort of
developing and maintaining our own search server - Solr offers much more
functionality, configurability, performance and ease of integration than
our relatively primitive search server. Our integration with Solr needs
to be improved so that it's easier to set up and operate.
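Just to make the shape of that integration concrete: a Solr update is, at
bottom, an XML message POSTed to the server's /update handler. A toy builder
for such a message might look like this (a sketch for illustration only -
class and method names are made up, and a real integration would use SolrJ
rather than hand-built XML):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrUpdateBuilder {
    // Build a Solr XML update message (<add><doc>...</doc></add>)
    // for a single document from a map of field name -> value.
    public static String buildAdd(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(e.getKey()).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    // Minimal XML escaping so field values can't break the message.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "http://example.com/");
        doc.put("content", "hello & goodbye");
        System.out.println(buildAdd(doc));
    }
}
```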
* Use database-like storage abstraction: this may seem like a serious
departure from the current architecture, but I don't mean that we should
switch to an SQL DB ... what this means is that we should provide an
option to use HBase, as well as the current plain MapFile-s (and perhaps
other types of DBs, such as Berkeley DB or SQL, if it makes sense) as
our storage. There is a very promising initial port of Nutch to HBase,
currently tightly coupled to the HBase API (which is both good and
bad) - it provides several improvements over our current storage, so
I think it's worth using as the new default, but let's see if we can
make it more abstract.
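To sketch what "more abstract" could mean here: crawl code would program
against a small storage interface, with MapFile-, HBase- or other backends
plugged in behind it. The names below (PageStore etc.) are purely
illustrative, not actual Nutch or HBase APIs; the in-memory implementation
just stands in for a real backend:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical storage abstraction: the crawler sees only this interface,
// so MapFile, HBase, Berkeley DB or SQL backends become interchangeable.
interface PageStore {
    void put(String url, byte[] content);
    Optional<byte[]> get(String url);
}

// Trivial in-memory backend, standing in for a MapFile/HBase implementation.
class InMemoryPageStore implements PageStore {
    private final Map<String, byte[]> rows = new HashMap<>();
    public void put(String url, byte[] content) { rows.put(url, content); }
    public Optional<byte[]> get(String url) {
        return Optional.ofNullable(rows.get(url));
    }
}

public class StoreDemo {
    public static void main(String[] args) {
        PageStore store = new InMemoryPageStore();
        store.put("http://example.com/", "hello".getBytes());
        System.out.println(store.get("http://example.com/").isPresent());
    }
}
```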
* Plugins: the initial OSGi port looks good, but I'm not sure yet
whether the benefits of OSGi outweigh the cost of this change ...
* Shard management: this is currently an Achilles' heel of Nutch, where
users are left on their own ... If we switch to using HBase then at
least on the crawling side the shard management will become much easier.
This still leaves the problem of deploying new content to search
server(s). The candidate framework for this side of the shard management
is Katta + patches provided by Ted Dunning (see ???). If we switch to
using Solr we would have to also use the Katta / Solr integration, and
perhaps Solr/Hadoop integration as well. This is a complex mix of
half-ready components that needs to be thought through carefully ...
* Crawler Commons: during our Crawler MeetUp all representatives agreed
that we should collect a few components that are nearly the same across
all projects, collaborate on their development, and use them as an
external dependency. The candidate components are:
- robots.txt parsing
- URL filtering and normalization
- page signature (fingerprint) implementations
- page template detection and removal (a.k.a. main content extraction)
- possibly others, like URL redirection tracking, PageRank
calculation, crawler trap detection etc.
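As a rough illustration of the kind of shared component meant here, URL
normalization mostly comes down to a handful of rules every crawler
re-implements: lowercase the scheme and host, drop default ports, resolve
"." and ".." path segments, strip fragments. A minimal sketch built on
java.net.URI (names are hypothetical; a shared crawler-commons version
would carry a much richer, configurable rule set) could look like:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlNormalizer {
    public static String normalize(String url) throws URISyntaxException {
        // normalize() resolves "." and ".." path segments
        URI u = new URI(url.trim()).normalize();
        String scheme = u.getScheme() == null ? "http" : u.getScheme().toLowerCase();
        String host = u.getHost() == null ? "" : u.getHost().toLowerCase();
        int port = u.getPort();
        // drop default ports so equivalent URLs compare equal
        if (("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443)) {
            port = -1;
        }
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        // rebuild without the fragment, which is irrelevant for crawling
        return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(normalize("HTTP://Example.COM:80/a/../b?q=1#frag"));
    }
}
```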
2. Make Nutch easier to use
---
This, as you may remember from our earlier discussions, raises the
question: who is the target audience of Nutch?
In my opinion, the main users of Nutch are vertical search engines, and
this is the audience that we should cater to. There are many reasons for
this:
- Nutch is too complex and too heavy for those that need to crawl up to
a few thousand pages. Now that the Droids project exists it's probably
not worth the effort to attempt a complete re-design of Nutch so that it
fits the needs of this group - Nutch is based on map-reduce, and it's not
likely we would want to change that, so this means there will always be
a significant overhead for small jobs. I'm not saying we should not make
Nutch easier to use, but for small crawls Nutch is overkill. Also, in
many cases these users don't realize that they don't do any frontier
discovery and expansion, and what they really need is Solr.
- at the other end of the spectrum, there are very few companies that
want to do large, web-scale crawling - this is costly, and
requires a solid business plan and serious funding. These users are
prepared anyway to spend significant effort on customizations and
problem-solving, or they may want to use only some parts of Nutch. Often
they are also not too eager to contribute back to the project - either
because of their proprietary nature or because their customizations are
not useful for a general audience.
The remaining group is interested in medium-size, high-quality crawling
(focused, with good spam and junk controls) - which means either
enterprise search or vertical search.