Hi all
I'm wondering how Nutch handles cookies set while fetching a page.
1) Are those cookies used when Nutch crawls the URLs generated from that page?
2) Is there a way to configure Nutch so that the values of some of those cookies are
considered part of the identity of the page (as
Hi Mansour,
On Wed, Apr 4, 2012 at 10:05 PM, Mansour Al Akeel mansour.alak...@gmail.com
wrote:
I understand that I need to
implement a way to process each of the pages for these sites in a
different way. Mostly XML processing and regexp (any expert advice
here).
This is extremely vague,
-- Forwarded message --
From: alessio crisantemi alessio.crisant...@gmail.com
Date: 5 April 2012 22:32
Subject: request about snippets
To: user@nutch.apache.org
Dear all,
I configured Nutch (1.4) to work with Solr (1.4.1), and I successfully crawled and
indexed my website.
I
Hi Rémy,
I'm wondering how Nutch handles cookies set while fetching a page.
1) Are those cookies used when Nutch crawls the URLs generated from that
page?
Generally, cookies are ignored. But have a look at
https://issues.apache.org/jira/browse/NUTCH-827
Your problem is almost
Hi Alessio,
You need to determine in which field the unwanted content exists. Once
you've done this you could write an indexing filter to remove this from
your document prior to indexing.
Lewis
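A minimal sketch of the cleaning step such a filter might perform, assuming the unwanted text is a fixed, known run (for example the site's menu labels) inside the document's text field. The class name `BreadcrumbStripper` and the method `clean` are hypothetical and not part of Nutch; in Nutch 1.x this logic would be called from a plugin class implementing `org.apache.nutch.indexer.IndexingFilter`, which is not shown here.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Nutch class): removes a known navigation run from page text. */
final class BreadcrumbStripper {
    // Assumption: the unwanted text is a fixed string, matched literally.
    private final Pattern unwanted;

    BreadcrumbStripper(String unwantedText) {
        // Pattern.quote makes characters such as '/' or '.' match literally.
        this.unwanted = Pattern.compile(Pattern.quote(unwantedText));
    }

    /** Returns the text with every occurrence of the unwanted run removed. */
    String clean(String content) {
        if (content == null) {
            return "";
        }
        // Replace with a space so neighbouring words are not glued together,
        // then collapse the whitespace left behind.
        String stripped = unwanted.matcher(content).replaceAll(" ");
        return stripped.replaceAll("\\s+", " ").trim();
    }
}
```

An indexing filter would apply `clean` to the value of the affected field before the document is sent to Solr. Which field that is has to be determined first, as described above.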
On Thu, Apr 5, 2012 at 9:41 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
Dear Lewis, thank you for your fast reply.
But that is exactly my problem! I don't understand which field creates
this row.
I see a date (e.g. Mercoledì Apr 04) followed by the word "parent",
and after that the names of the categories (Home NEWSLOT/VLT SCOMMESSE
ONLINE LOTTERIE Politica Video
I can't see any of your attachments as they're not permitted on list.
Can you provide a URL?
On Thu, Apr 5, 2012 at 9:56 PM, alessio crisantemi
alessio.crisant...@gmail.com wrote:
Dear Lewis, thank you for your fast reply.
But that is exactly my problem! I don't understand which is the field that
Seems to me it's just the breadcrumb of the page popping up in Solr's
highlighter snippet?
On Thu, 5 Apr 2012 22:02:31 +0100, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:
I can't see any of your attachments as they're not permitted on list.
Can you provide a URL?
On Thu, Apr 5,
What is a 'breadcrumb', Markus?
On 5 April 2012 23:08, Markus Jelsma
markus.jel...@openindex.io wrote:
Seems to me it's just the breadcrumb of the page popping up in Solr's
highlighter snippet?
On Thu, 5 Apr 2012 22:02:31 +0100, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com
Here is part of the results:
[2] Live Score - GiocoNews - Tutto su casinò, poker, giochi
online http://www.gioconews.it/live-score.html Live
Score - *Gioco*News - Tutto su casinò, poker, giochi online Mercoledì Apr
04 Home NEWSLOT/VLT SCOMMESSE ONLINE LOTTERIE Politica Video Live Score
Home Live
Hi,
I am new to Nutch. My English is not good, so I am writing with the help of a
translator.
I need to know how Nutch can collect and process, for later indexing on a
Solr server, all the meta tags of an HTML document. I am also
interested in knowing how to collect the ALT