So things like NUTCH-666 should be ported to SOLR to use on Nutch 2.0?
ok. problem is some things like NUTCH-666 are required both on indexing
and
definitely on the SOLR side from now on
btw, when is Nutch 2.0 planned to be released? 1.1 took several years...
we are still at a very early
Nope, that changes nothing. Just checked out my log file:
2010-06-24 17:13:40,410 INFO plugin.PluginRepository - Plugins: looking in:
/~/apache-nutch-1.1-bin/plugins
2010-06-24 17:13:41,439 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2010-06-24 17:13:41,439 INFO
Just tried it in nutch-1.0 with the same kind of behavior:
hc.me...@server01:~/nutch-1.0 ./bin/nutch plugin urlnormalizer-regex
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
http://www.myinputurl.com
Plugin 'urlnormalizer-regex' not present or inactive.
(it is present and it is
hi hannes,
i have identified your problem.
your nutch-site.xml plugin.includes property contains a newline after
urlnormalizer-(basic|pass|regex), which breaks pattern matching in
PluginRepository.java.
<property>
<name>plugin.includes</name>
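To illustrate the diagnosis above, here is a minimal sketch of how a stray newline in an include expression can defeat a full-string regex match. It assumes the plugin id is tested with `Matcher.matches()` against the compiled `plugin.includes` value; the class `PluginIncludesDemo` and the helper `isIncluded` are hypothetical names for this example and are not taken from PluginRepository.java.

```java
import java.util.regex.Pattern;

public class PluginIncludesDemo {
    // Hypothetical helper: full-match a plugin id against the include expression.
    // Assumption: the real check uses full-match semantics like this one.
    static boolean isIncluded(String includeRegex, String pluginId) {
        return Pattern.compile(includeRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String clean = "protocol-http|urlnormalizer-(basic|pass|regex)";
        // Same expression, but with the newline that an XML editor may leave
        // inside the <value> element.
        String withNewline = "protocol-http|urlnormalizer-(basic|pass|regex)\n";

        // The newline becomes a literal character the id would have to contain.
        System.out.println(isIncluded(clean, "urlnormalizer-regex"));
        System.out.println(isIncluded(withNewline, "urlnormalizer-regex"));

        // One possible workaround: strip all whitespace before compiling.
        String stripped = withNewline.replaceAll("\\s+", "");
        System.out.println(isIncluded(stripped, "urlnormalizer-regex"));
    }
}
```

In this sketch the second check fails because the last alternative now ends in a literal newline, which no plugin id contains. Keeping the whole value on one line in nutch-site.xml avoids the problem at the source.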
On Wed, Jun 23, 2010 at 5:27 PM, Dennis Kubes ku...@apache.org wrote:
You may still see some urls that *seem* to be outside of your domains list
while using the domain urlfilter. Remember the following:
1. Urls are checked in order of domain suffix, domain name, and
hostname. If you
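The check order described in point 1 can be sketched roughly as follows. This is not the domain urlfilter's actual code: the class `DomainFilterSketch` and the method `accept` are hypothetical, and the sketch makes the simplifying assumption that the suffix is the last host label and the domain name is the last two labels (a real implementation would need a proper suffix list).

```java
import java.util.Set;

public class DomainFilterSketch {
    // Hypothetical check: accept a host if its suffix, domain name, or full
    // hostname appears in the allowed set, tested in that order.
    static boolean accept(String host, Set<String> allowed) {
        String h = host.toLowerCase();
        String[] labels = h.split("\\.");
        int n = labels.length;
        // Simplifying assumption: suffix = last label, domain = last two labels.
        String suffix = labels[n - 1];
        String domain = n >= 2 ? labels[n - 2] + "." + suffix : h;
        if (allowed.contains(suffix)) return true;   // 1. domain suffix
        if (allowed.contains(domain)) return true;   // 2. domain name
        return allowed.contains(h);                  // 3. hostname
    }

    public static void main(String[] args) {
        System.out.println(accept("www.apache.org", Set.of("org")));
        System.out.println(accept("lucene.apache.org", Set.of("apache.org")));
        System.out.println(accept("www.example.com", Set.of("apache.org")));
    }
}
```

Under these assumptions, a bare suffix entry such as `org` admits every host ending in that suffix, so it is worth checking which level each entry in the list actually matches.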
Awesome... Thank you very very much :-)
On Thu, Jun 24, 2010 at 6:55 PM, reinhard schwab reinhard.sch...@aon.at wrote:
hi hannes,
i have identified your problem.
your nutch-site.xml plugin.includes property contains a newline after
urlnormalizer-(basic|pass|regex), which breaks pattern
Hi,
I would like to crawl a list of pages but only index PDFs. From what I
gather I can add an exclusion for all non .pdf extensions in
crawl-urlfilter.txt.
However, I would also like to apply an additional restriction, that I only
index pages that match a certain query. In my head, this
On 2010-06-24 10:56, arkadi.kosmy...@csiro.au wrote:
Hi,
It looks like Tika does not include a PostScript parser. At least the
copy that comes with Nutch 1.1. Is this right? I just want to double
check because PostScript is a major file format. I get errors Can't
retrieve Tika parser for
Please do I am very interested in how a solution like that works for you
in terms of performance.
Dennis
On 06/24/2010 12:10 PM, Emmanuel de Castro Santana wrote:
Thank you for all the help, Dennis, all of this is valid information to me!
I am trying a solution using Memcache, will post