FWIW, there is a Heritrix plugin that writes to HBase as a back-end store.
Maybe it will help with making a Nutch plugin?
http://code.google.com/p/hbase-writer
-Ryan
On Mon, Feb 8, 2010 at 4:32 AM, Hua Su huas...@gmail.com wrote:
Hi all,
Any recent progress on HBase integration? There is a
Dennis, Thanks a lot.
-Ryan
2009/3/28 Tony Wang ivyt...@gmail.com
Hi Sami,
Thank you so much for the good news. Is there going to be documentation for
Solr integration? Sorry to Otis, I know you are going to ask me to try to
find it out by myself ;)
Thanks! - Tony
On Sat, Mar 28, 2009
Is it possible to use heritrix as nutch's crawler?
On Sat, Mar 28, 2009 at 3:53 PM, Sami Siren ssi...@gmail.com wrote:
I am pleased to announce the availability of Apache Nutch 1.0.
Apache Nutch, a subproject of Apache Lucene, is open source web-search
software. It builds on Lucene Java,
to convert the arc files to segments.
From there you can run other tools on the segments as normal. What you
won't get is Heritrix access to the crawldb.
Dennis
Ryan Smith wrote:
Is it possible to use heritrix as nutch's crawler?
On Sat, Mar 28, 2009 at 3:53 PM, Sami Siren ssi
One way is to enable debug logging in log4j so you can see the
headers that httpclient passes back and forth to the web server.
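
A minimal sketch of the debug-logging suggestion above, assuming the crawl
uses the Commons HttpClient 3.x based plugin (protocol-httpclient) and that
logging is configured in Nutch's conf/log4j.properties. The logger names are
the ones Commons HttpClient 3.x documents; the file location and the plugin
in use are assumptions, not details from this thread.

```properties
# Assumption: appended to conf/log4j.properties of the Nutch install.
# Logs the request and response headers exchanged with the web server.
log4j.logger.httpclient.wire.header=DEBUG
# Logs HttpClient's own internal activity (connections, auth, redirects).
log4j.logger.org.apache.commons.httpclient=DEBUG
```

The header logger is usually enough; full wire logging is much noisier.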
On Thu, Dec 11, 2008 at 10:29 AM, George Herlin [EMAIL PROTECTED] wrote:
I have read that if one sets the plugin.includes property to use
OK, so you merge your other crawls into the same search dir; that's
understood, thanks.
My other question concerns searching in Nutch. Right now,
it returns links to file:///x/y/z/.../foo.html and I was wondering if
there was a simple way to change that link to be
tell. Don't
understand the logic, but there you are. Note, if you use a web server, be
aware you will have to disable the IGNORE.INTERNAL setting in nutch-site.xml
(you'll be messing around a lot in here).
Cheers,
Winton
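
A sketch of what the override mentioned above might look like, assuming the
"IGNORE.INTERNAL" setting refers to the db.ignore.internal.links property
from nutch-default.xml. That mapping is an assumption; the message does not
give the exact property name.

```xml
<!-- Assumption: placed inside the <configuration> element of conf/nutch-site.xml. -->
<!-- Overrides the nutch-default.xml value so that links between pages on the
     same host are kept rather than ignored. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```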
At 2:40 PM -0400 7/3/08, Ryan Smith wrote:
Is there a simple way to have
at 7:17 PM, Winton Davies [EMAIL PROTECTED]
wrote:
Hi Ryan,
I just used the regular intranet crawl, didn't try to do the inject
W
At 6:16 PM -0400 7/5/08, Ryan Smith wrote:
Winton,
I added the override property to nutch-site.xml (I saw the one in
nutch-default.xml after your email
Is there a simple way to have Nutch index a folder full of other folders and
HTML files?
I was hoping to avoid having to run Apache to serve the HTML files and then
have Nutch crawl the site on Apache.
Thank you,
-Ryan