I am pleased to announce the availability of Apache Nutch 1.0.
Apache Nutch, a subproject of Apache Lucene, is open source web-search
software. It builds on Lucene Java, adding web-specifics such as a
crawler, a link-graph database, and parsers for HTML and other document formats.
Is it possible to use Heritrix as Nutch's crawler?
On Sat, Mar 28, 2009 at 3:53 PM, Sami Siren ssi...@gmail.com wrote:
I am pleased to announce the availability of Apache Nutch 1.0.
Apache Nutch, a subproject of Apache Lucene, is open source web-search
software. It builds on Lucene Java,
To a point, yes. Heritrix writes its output in ARC format. You can then use
o.a.n.tools.arc.ArcSegmentCreator to convert the ARC files to
segments. From there you can run other tools on the segments as normal.
What you won't get is Heritrix access to the crawldb.
Dennis
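For anyone following along, the conversion step described above might be invoked roughly as sketched below. This is only an illustration, not a tested recipe: it assumes the tool is the class that ships in the Nutch source tree as org.apache.nutch.tools.arc.ArcSegmentCreator, that bin/nutch will run a fully qualified class name, and that the tool takes the ARC input path followed by the segments output directory. The paths crawl/arcs and crawl/segments and the helper name arc_to_segments_command are made up for the example; check the class's own usage message before relying on the argument order.

```python
import shlex

# Assumed fully qualified name of the ARC-to-segments tool in Nutch.
ARC_TOOL = "org.apache.nutch.tools.arc.ArcSegmentCreator"


def arc_to_segments_command(arc_dir, segments_dir, nutch_bin="bin/nutch"):
    """Build the argument list for converting Heritrix ARC files to segments.

    Assumption: the tool expects <arcFiles> then <segmentsOutDir>.
    """
    return [nutch_bin, ARC_TOOL, arc_dir, segments_dir]


if __name__ == "__main__":
    # Hypothetical input and output locations, for illustration only.
    cmd = arc_to_segments_command("crawl/arcs", "crawl/segments")
    # Print a shell-safe command line rather than running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The sketch only prints the command, so it can be inspected and adjusted before anything is run against a real Nutch installation.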
Ryan Smith wrote:
Dennis,
Thank you. OK, then one other question please :). I want to use Heritrix
together with hbase-writer, the Heritrix plugin that writes records
directly to HBase:
http://code.google.com/p/hbase-writer/
(HBase runs on top of Hadoop.)
Would it be feasible/make sense for someone (maybe myself)
That is already in the works. See:
https://issues.apache.org/jira/browse/NUTCH-650
Dennis
Hi Sami,
Thank you so much for the good news. Is there going to be documentation for
Solr integration? Sorry to Otis, I know you are going to ask me to try to
find it out by myself ;)
Thanks! - Tony