how does nutch handle cookies ?

2012-04-05 Thread Rémy Amouroux
Hi all, I'm wondering how Nutch handles cookies defined while fetching a page. 1) Are those cookies used when Nutch is crawling URLs generated from that page? 2) Is there a way to configure Nutch so that the values of some of those cookies are considered part of the identity of the page (as

Re: Crawl and extract data

2012-04-05 Thread Lewis John Mcgibbney
Hi Mansour, On Wed, Apr 4, 2012 at 10:05 PM, Mansour Al Akeel mansour.alak...@gmail.com wrote: I understand that I need to implement a way to process each of the pages for these sites in a different way. Mostly XML processing and regexp (any expert advice here). This is extremely vague,

Fwd: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
-- Forwarded message -- From: alessio crisantemi alessio.crisant...@gmail.com Date: 5 April 2012 22:32 Subject: request about snippets To: user@nutch.apache.org Dear all, I configured Nutch (1.4) to work with Solr (1.4.1) and successfully crawled and indexed my website. I

Re: how does nutch handle cookies ?

2012-04-05 Thread Sebastian Nagel
Hi Rémy, I'm wondering how Nutch handles cookies defined while fetching a page. 1) Are those cookies used when Nutch is crawling URLs generated from that page? Generally, cookies are ignored. But have a look at https://issues.apache.org/jira/browse/NUTCH-827 Your problem is almost

Re: request about snippets (with attachement)

2012-04-05 Thread Lewis John Mcgibbney
Hi Alessio, You need to determine in which field the unwanted content exists. Once you've done this you could write an indexing filter to remove this from your document prior to indexing. Lewis On Thu, Apr 5, 2012 at 9:41 PM, alessio crisantemi alessio.crisant...@gmail.com wrote:

Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
Dear Lewis, thank you for your fast reply. But that's exactly my problem! I don't understand which field it is that creates this row. But I see a date (e.g. Mercoledì Apr 04) followed by the word parent and, after that, the names of the categories (Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video

Re: request about snippets (with attachement)

2012-04-05 Thread Lewis John Mcgibbney
I can't see any of your attachments as they're not permitted on list. Can you provide a URL? On Thu, Apr 5, 2012 at 9:56 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: Dear Lewis, thank you for your fast reply. But that's exactly my problem! I don't understand which field it is that

Re: request about snippets (with attachement)

2012-04-05 Thread Markus Jelsma
Seems to me it's just the breadcrumb of the page popping up in Solr's highlighter snippet? On Thu, 5 Apr 2012 22:02:31 +0100, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I can't see any of your attachments as they're not permitted on list. Can you provide a URL? On Thu, Apr 5,

Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
What is a 'breadcrumb', Markus? On 5 April 2012 23:08, Markus Jelsma markus.jel...@openindex.io wrote: Seems to me it's just the breadcrumb of the page popping up in Solr's highlighter snippet? On Thu, 5 Apr 2012 22:02:31 +0100, Lewis John Mcgibbney lewis.mcgibb...@gmail.com

Re: request about snippets (with attachement)

2012-04-05 Thread alessio crisantemi
Here is part of the results: [2] Live Score - GiocoNews - Tutto su casinò, poker, giochi online http://www.gioconews.it/live-score.html Live Score - *Gioco*News - Tutto su casinò, poker, giochi online Mercoledì Apr 04 Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score Home Live

meta tags HTML??

2012-04-05 Thread Manuel Antonio Novoa Proenza
Hi, I am new to Nutch. My English is not good, so I am using a translator. My question is how Nutch can collect and process all the meta tags of an HTML document for later indexing on a Solr server. I am also interested in knowing how to collect the ALT