I have been using something similar to this for a while because we came
from Google Search Appliance and had googleon and googleoff all over the
place. I don't really like having to patch the parse-html plugin every time
I do an upgrade; I wish I could move that into its own plugin somehow.
+0
On Wed, Apr 29, 2015 at 7:22 PM, Jorge Luis Betancourt González
jlbetanco...@uci.cu wrote:
+1
- run small test crawl in local mode and index into Solr.
- Original Message -
From: Sebastian Nagel wastl.na...@googlemail.com
To: user@nutch.apache.org
Sent: Wednesday, April 29,
+0
Thanks to all the contributors for the hard work.
On Sat, Jan 10, 2015 at 1:36 PM, Renato Marroquín Mogrovejo
renatoj.marroq...@gmail.com wrote:
Tests pass and signature looks good.
Here is my +1 (non-binding) Thanks for driving this Lewis!
Renato M.
2015-01-09 9:58 GMT+01:00 Lewis
Using Nutch 1.7
Out of the blue all of my crawl jobs started failing a few days ago. I
checked the user logs and nobody logged into the server and there were no
reboots or any other obvious issues. There is plenty of disk space. Here
is the error I'm getting, any help is appreciated:
still exist in your hdfs.
On 06/24/2014 12:30 AM, John Lafitte wrote:
What would be a case where you would want to keep the segments? I'm
considering automatically deleting them after sending the data to Solr.
On May 3, 2014 2:29 AM, chethan chethan.p...@gmail.com wrote:
Thanks for your reply!
Regards,
--
Chethan Prasad
On Sat, May 3, 2014 at 12:22 PM,
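The delete-after-indexing idea above could be sketched like this (paths and the Solr URL are placeholders, and the segment is only removed if solrindex exits successfully):

```shell
# index one finished segment into Solr, then delete it on success only
SEGMENT=crawl/segments/20140503123456   # example segment name
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
  -linkdb crawl/linkdb "$SEGMENT" \
  && rm -rf "$SEGMENT"
```

Gating the `rm -rf` on the exit status of solrindex avoids losing a segment whose documents never reached Solr.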
Hi Zabini,
I'm a little unclear whether you are having a problem with Nutch following the
links or indexing the pages. Have you tried both of these to verify the
links and index data?
https://wiki.apache.org/nutch/bin/nutch%20parsechecker
https://wiki.apache.org/nutch/bin/nutch%20indexchecker
The
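The two checker tools linked above can be run from the Nutch home directory roughly like this (the URL is a placeholder):

```shell
# show what Nutch parses out of a page: outlinks, extracted text, metadata
bin/nutch parsechecker -dumpText http://example.com/

# show which fields would actually be sent to the indexer for that page
bin/nutch indexchecker http://example.com/
```

Running both makes it easy to tell whether the problem is in link extraction (parsechecker shows no outlinks) or in indexing (indexchecker shows missing fields).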
Hi Shane,
The way I save on bandwidth for internal sites is by adding the internal IP
into the hosts file for the domain. If it's the local machine you can
probably just point it to 127.0.0.1 but I kind of wonder if you would be
saving anything but a DNS lookup... I'm not a networking guru but
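The hosts-file trick described above amounts to an entry like this (hostname and address are examples):

```
# /etc/hosts — point the crawled hostname at the internal address
10.0.0.5    intranet.example.com
```

With this in place the crawler reaches the site over the internal interface without any change to the Nutch configuration, though as noted, the main saving may just be the DNS lookup.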
gethue seems pretty awesome, but it looks like it's more for reporting and
metrics. Will it really administer Nutch?
On Tue, Apr 8, 2014 at 7:07 AM, Talat Uyarer ta...@uyarer.com wrote:
Hi anunpak,
If you want to use a pretty UI, you can use Hue [0].
[0] http://www.gethue.com
Talat
reddibabu,
I cannot resolve wiki.ibm.com so I'm guessing nutch can't either. Is that
an internal dns record?
On Wed, Apr 2, 2014 at 11:54 PM, reddibabu reddybabu...@gmail.com wrote:
Hi All,
I am using Apache Nutch 1.7. I am able to crawl and index almost all
sites except wiki pages.
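To confirm the DNS guess above, a quick check from the machine running the crawler would be (hostname taken from the question):

```shell
# can the crawler host resolve the name at all?
nslookup wiki.ibm.com
# or, equivalently
host wiki.ibm.com
```

If neither resolves, Nutch will fail on that host too, and the fix is on the DNS/hosts side rather than in the Nutch configuration.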
) and (link.ignore.internal.host == true),
resp. (link.ignore.internal.domain == true);
cf. the explanations about that in the wiki.
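The two properties mentioned above are ordinary Nutch properties and can be overridden in conf/nutch-site.xml; a minimal sketch (values shown are the ones being discussed, not recommendations):

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>link.ignore.internal.host</name>
  <value>true</value>
</property>
<property>
  <name>link.ignore.internal.domain</name>
  <value>true</value>
</property>
```

The defaults and full descriptions are in conf/nutch-default.xml.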
2014-03-26 4:09 GMT+01:00 John Lafitte jlafi...@brandextract.com:
Thanks for that Sebastian. So given the hint you've given me, I'm trying
to generate the scoring using
I setup a script that uses freegen to manually index new/updated URLs. I
thought it was working great, but now I'm just realizing that Solr returns
a score of 0 for these new documents. I thought the score was calculated
independent from what Nutch does, just uses the content and other metadata.
… in continuous crawls, OPIC scores run out of control.
Sebastian
On 03/25/2014 08:31 PM, John Lafitte wrote:
I setup a script that uses freegen to manually index new/updated URLs. I
thought it was working great, but now I'm just realizing that Solr
returns
a score of 0 for these new
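The freegen workflow described above can be sketched as follows (all paths, the segment timestamp, and the Solr URL are placeholders):

```shell
# generate a fetch segment directly from a hand-picked list of URLs
echo "http://example.com/updated-page.html" > urls/freegen-seed.txt
bin/nutch freegen urls crawl/segments

# then fetch, parse, update the crawldb and index that segment as usual
bin/nutch fetch crawl/segments/2014*
bin/nutch parse crawl/segments/2014*
bin/nutch updatedb crawl/crawldb crawl/segments/2014*
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb \
  -linkdb crawl/linkdb crawl/segments/2014*
```

Because freegen bypasses the normal generate step, documents indexed this way may carry a different (or zero) Nutch score, which matches the symptom described in the question.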
I am still new to this, but I believe Solr creates the score field. There
is also a boost field from Nutch that you can save into Solr. You'll have
to create your Solr query to sort by both; score, boost, title was the
most logical sorting to me.
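A query along the lines suggested above might look like this (core name and query are placeholders; sorting on boost assumes the Nutch boost field is indexed in the Solr schema):

```
http://localhost:8983/solr/select?q=content:nutch&sort=score+desc,boost+desc&fl=title,score,boost
```

Note that score is Solr's relevance score and is the default sort, so the boost clause only matters as a tiebreaker here.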
On Sat, Mar 22, 2014 at 11:16 PM, azhar2007
This is a good question. I was looking for an out-of-the-box Solr/Nutch
solution to replace our google mini. We use rackspace and they don't seem
to offer it. Even though it was a bit of a learning curve, I feel it was
good to know how to actually set it up and configure it.
You might look at
I haven't done it myself but it's documented here:
http://wiki.apache.org/nutch/HttpAuthenticationSchemes
I'm not sure how you would do it with forms-based auth, but if it's a
custom app you might be able to just automatically grant it access if
the user agent and/or IP match up.
On Fri, Mar
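For the HTTP-auth route, the wiki page linked above describes a credentials file for protocol-httpclient; a sketch of what such a file might look like (host, realm, and credentials are made-up examples, and the exact schema should be taken from the wiki page, not from this sketch):

```xml
<!-- conf/httpclient-auth.xml (illustrative only) -->
<auth-configuration>
  <credentials username="crawler" password="secret">
    <authscope host="intranet.example.com" port="80" realm="Protected"/>
  </credentials>
</auth-configuration>
```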
/NUTCH-1733
On Tue, Mar 18, 2014 at 8:18 AM, remi tassing tassingr...@gmail.com wrote:
Hi John,
Try freegen for the second question:
http://wiki.apache.org/nutch/bin/nutch_freegen
Remi
On Tuesday, March 18, 2014, John Lafitte jlafi...@brandextract.com
wrote:
We are just starting out using
We are just starting out using nutch and solr but I have a couple of issues
I can't find any answers for.
1. Some of the HTML files we index are UTF-8 and contain a BOM. Nutch
seems to capture it and store it as some strange characters. I can
fix it by removing the BOM and indexchecker
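The BOM removal mentioned above can be done before the files are crawled, e.g. with GNU sed (the file name is a placeholder; the BOM is the three bytes EF BB BF):

```shell
# page.html begins with the UTF-8 BOM (bytes EF BB BF); recreate that here
printf '\357\273\277<html><title>Home</title></html>' > page.html

# strip the BOM in place before the page is crawled/indexed (GNU sed)
sed -i '1s/^\xef\xbb\xbf//' page.html

head -c 6 page.html   # -> <html>
```

Stripping the BOM at the source avoids the stray characters ending up in the indexed content in the first place.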
0
On Tue, Mar 4, 2014 at 1:28 PM, S.L simpleliving...@gmail.com wrote:
+1
On Tue, Mar 4, 2014 at 12:50 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
Hi All Nutch'ers,
This thread is a VOTE for releasing Apache Nutch 1.8. The release
candidate
comprises the following
I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem with indexing
RSS that has channel/title and then channel/image/title: it tries to add both
of them and then fails during solrindex because title isn't multivalued.
I've used nutch indexchecker and I see the two titles being returned. The
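One way to work around the error described above is to allow multiple values for title in the Solr schema; a sketch of the field definition in schema.xml (the field type name is an example and should match the schema in use):

```xml
<!-- schema.xml: let the title field accept both channel/title
     and channel/image/title values -->
<field name="title" type="text_general" indexed="true" stored="true"
       multiValued="true"/>
```

The alternative is to drop or rename one of the two titles on the Nutch side (e.g. in the parse/index filter chain) before documents reach Solr.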