Re: Remove Header Footer and Menus from crawled content

2015-10-01 Thread John Lafitte
I have been using something similar to this for a while because we came from Google Search Appliance and had googleon and googleoff all over the place. I don't really like having to patch the parse-html plugin every time I do an upgrade; I wish I could move that into its own plugin somehow.
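
To illustrate the idea behind the googleon/googleoff markers mentioned above, here is a minimal Python sketch. It is not the parse-html patch itself (Nutch plugins are written in Java), and it assumes the markers are HTML comments of the form `<!--googleoff: index-->` and `<!--googleon: index-->`. The function name `strip_googleoff` is hypothetical.

```python
import re

# Matches everything from a googleoff marker up to the next googleon marker.
# Assumption: markers are HTML comments such as <!--googleoff: index-->.
_GOOGLEOFF_BLOCK = re.compile(
    r"<!--\s*googleoff:\s*(?:index|all)\s*-->"
    r".*?"
    r"<!--\s*googleon:\s*(?:index|all)\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_googleoff(html: str) -> str:
    """Return html with every googleoff ... googleon region removed."""
    return _GOOGLEOFF_BLOCK.sub("", html)


page = (
    "<p>Keep this.</p>"
    "<!--googleoff: index--><nav>Menu</nav><!--googleon: index-->"
    "<p>Keep this too.</p>"
)
print(strip_googleoff(page))
```

A regular expression is enough for a sketch like this; a real plugin would more likely walk the parsed DOM so that unbalanced markers do not swallow the rest of the page.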

Re: [MASSMAIL]Re: [VOTE] Release Apache Nutch 1.10

2015-04-29 Thread John Lafitte
+0 On Wed, Apr 29, 2015 at 7:22 PM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote: +1 - run small test crawl in local mode and index into Solr. - Original Message - From: Sebastian Nagel wastl.na...@googlemail.com To: user@nutch.apache.org Sent: Wednesday, April 29,

Re: [VOTE] Release Apache Nutch 2.3

2015-01-10 Thread John Lafitte
+0 Thanks to all the contributors for the hard work. On Sat, Jan 10, 2015 at 1:36 PM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Tests pass and signature looks good. Here is my +1 (non-binding) Thanks for driving this Lewis! Renato M. 2015-01-09 9:58 GMT+01:00 Lewis

File not found error

2014-06-24 Thread John Lafitte
Using Nutch 1.7. Out of the blue, all of my crawl jobs started failing a few days ago. I checked the user logs: nobody logged into the server, and there were no reboots or any other obvious issues. There is plenty of disk space. Here is the error I'm getting; any help is appreciated:

Re: File not found error

2014-06-24 Thread John Lafitte
/2014 12:30 AM, John Lafitte wrote: Using Nutch 1.7. Out of the blue, all of my crawl jobs started failing a few days ago. I checked the user logs: nobody logged into the server, and there were no reboots or any other obvious issues. There is plenty of disk space. Here is the error I'm

Re: File not found error

2014-06-24 Thread John Lafitte
still exist in your hdfs. On 06/24/2014 12:30 AM, John Lafitte wrote: Using Nutch 1.7. Out of the blue, all of my crawl jobs started failing a few days ago. I checked the user logs: nobody logged into the server, and there were no reboots or any other obvious issues. There is plenty

Re: Nutch 1.7 - deleting segments

2014-05-03 Thread John Lafitte
What would be the case where you would want to keep the segments? I'm considering automatically deleting them after sending the data to Solr. On May 3, 2014 2:29 AM, chethan chethan.p...@gmail.com wrote: Thanks for your reply! Regards, -- Chethan Prasad On Sat, May 3, 2014 at 12:22 PM,

Re: Don't fetch all urls in a page

2014-04-16 Thread John Lafitte
Hi Zabini, I'm a little unclear if you are having a problem with nutch following the links or indexing the pages. Have you tried both of these to verify the links and index data? https://wiki.apache.org/nutch/bin/nutch%20parsechecker https://wiki.apache.org/nutch/bin/nutch%20indexchecker The

Re: Index web folders.

2014-04-09 Thread John Lafitte
Hi Shane, The way I save on bandwidth for internal sites is by adding the internal IP into the hosts file for the domain. If it's the local machine you can probably just point it to 127.0.0.1 but I kind of wonder if you would be saving anything but a DNS lookup... I'm not a networking guru but

Re: Control and monitor nutch-1.x via Web interface ?

2014-04-08 Thread John Lafitte
gethue seems pretty awesome, but it looks like it's more for reporting and metrics; will it really administer Nutch? On Tue, Apr 8, 2014 at 7:07 AM, Talat Uyarer ta...@uyarer.com wrote: Hi anunpak, If you want a pretty UI, you can use Hue [0]. [0] http://www.gethue.com Talat

Re: Unable to crawl wiki pages through Nutch

2014-04-02 Thread John Lafitte
reddibabu, I cannot resolve wiki.ibm.com, so I'm guessing Nutch can't either. Is that an internal DNS record? On Wed, Apr 2, 2014 at 11:54 PM, reddibabu reddybabu...@gmail.com wrote: Hi All, I am using Apache Nutch 1.7. I am able to crawl and index almost all sites except wiki pages.

Re: Freegen and Solr score

2014-03-26 Thread John Lafitte
) and (link.ignore.internal.host == true) resp. (link.ignore.internal.domain == true) cf. explanations about that in the wiki. 2014-03-26 4:09 GMT+01:00 John Lafitte jlafi...@brandextract.com: Thanks for that Sebastian. So given the hint you've given me, I'm trying to generate the scoring using

Freegen and Solr score

2014-03-25 Thread John Lafitte
I set up a script that uses freegen to manually index new/updated URLs. I thought it was working great, but now I'm just realizing that Solr returns a score of 0 for these new documents. I thought the score was calculated independently of what Nutch does and just uses the content and other metadata

Re: Freegen and Solr score

2014-03-25 Thread John Lafitte
. in continuous crawls, OPIC scores run out of control. Sebastian On 03/25/2014 08:31 PM, John Lafitte wrote: I set up a script that uses freegen to manually index new/updated URLs. I thought it was working great, but now I'm just realizing that Solr returns a score of 0 for these new

Re: Ranking Algorithm

2014-03-23 Thread John Lafitte
I am still new to this, but I believe solr creates the score field. There is also a boost field from nutch that you can save into solr. You'll have to create your solr query to sort by both. Score, boost, title was the most logical sorting to me. On Sat, Mar 22, 2014 at 11:16 PM, azhar2007
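
The sort order described above can be sketched as a Solr query string. This is a minimal Python example under stated assumptions: the `select?` path prefix, the field names `boost` and `title`, and the sort directions are guesses rather than anything confirmed in the thread, and it assumes `title` is a single-valued field that Solr is able to sort on. The helper name `build_select_query` is hypothetical.

```python
from urllib.parse import urlencode


def build_select_query(terms: str) -> str:
    """Build a query string that sorts by score, then boost, then title."""
    params = {
        "q": terms,
        # Assumed field names and directions: relevance first, then the
        # boost stored by Nutch, then title as a tie-breaker.
        "sort": "score desc, boost desc, title asc",
    }
    return "select?" + urlencode(params)


print(build_select_query("nutch"))
```

Ties on `score` fall through to `boost`, and ties on both fall through to `title`, which matches the "score, boost, title" ordering described in the message.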

Re: Ready to use nutch

2014-03-21 Thread John Lafitte
This is a good question. I was looking for an out-of-the-box Solr/Nutch solution to replace our Google Mini. We use Rackspace and they don't seem to offer it. Even though it was a bit of a learning curve, I feel it was good to know how to actually set it up and configure it. You might look at

Re: Crawling an authenticated site

2014-03-21 Thread John Lafitte
I haven't done it myself, but it's documented here: http://wiki.apache.org/nutch/HttpAuthenticationSchemes I'm not sure how you would do it with forms-based auth, but if it's a custom app you might be able to just automatically grant it access if the user agent and/or IP match up. On Fri, Mar

Re: Usage Scenarios

2014-03-18 Thread John Lafitte
/NUTCH-1733 On Tue, Mar 18, 2014 at 8:18 AM, remi tassing tassingr...@gmail.com wrote: Hi John, Try freegen for the second question: http://wiki.apache.org/nutch/bin/nutch_freegen Remi On Tuesday, March 18, 2014, John Lafitte jlafi...@brandextract.com wrote: We are just starting out using

Usage Scenarios

2014-03-17 Thread John Lafitte
We are just starting out using Nutch and Solr, but I have a couple of issues I can't find any answers for. 1. Some of the HTML files we index are UTF-8 and contain a BOM. Nutch seems to capture it and store it as some strange characters. I can fix it by removing the BOM and indexchecker
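
As a rough illustration of the BOM issue described above, the Python sketch below strips a leading UTF-8 BOM before the content is processed. It is a standalone example, not anything Nutch does internally, and the helper name `decode_without_bom` and the sample HTML are made up for illustration.

```python
import codecs


def decode_without_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # The "utf-8-sig" codec behaves like "utf-8" but skips an initial BOM.
    return raw.decode("utf-8-sig")


html = "<html><title>Example</title></html>"
with_bom = codecs.BOM_UTF8 + html.encode("utf-8")
without_bom = html.encode("utf-8")

# Both inputs decode to the same text once the BOM is skipped.
print(decode_without_bom(with_bom) == decode_without_bom(without_bom))

# One common way a BOM shows up as stray characters is when the bytes are
# read with a single-byte encoding such as Latin-1.
print(repr(with_bom.decode("latin-1")[:3]))
```

Files without a BOM pass through unchanged, so the same decoding step can be applied to every document.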

Re: [VOTE] Apache Nutch 1.8 Release Candidate #1

2014-03-04 Thread John Lafitte
0 On Tue, Mar 4, 2014 at 1:28 PM, S.L simpleliving...@gmail.com wrote: +1 On Tue, Mar 4, 2014 at 12:50 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi All Nutch'ers, This thread is a VOTE for releasing Apache Nutch 1.8. The release candidate comprises the following

multivalues returned unexpectedly

2014-02-24 Thread John Lafitte
I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem indexing RSS that has channel/title followed by channel/image/title: it tries to add both of them, then fails when doing solrindex because title isn't multivalued. I've used nutch indexchecker and I see the two titles being returned. The
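
The two-title situation described above can be reproduced with a small Python sketch. It only shows why a feed with both channel/title and channel/image/title yields two values when every title element is collected; it does not reflect how Nutch's feed parsing is implemented. The sample feed and the variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed with a channel title and an image title.
RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <image>
      <title>Example Feed Logo</title>
      <url>http://example.com/logo.png</url>
    </image>
  </channel>
</rss>"""

channel = ET.fromstring(RSS).find("channel")

# Collecting every descendant <title> picks up the image title as well.
all_titles = [el.text for el in channel.iter("title")]

# Looking only at direct children of <channel> returns a single title.
direct_title = channel.findtext("title")

print(all_titles)
print(direct_title)
```

Restricting the lookup to direct children of the channel is one way to end up with a single value for a field that is not multivalued.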