This is not very clear:
There is a big difference between removing garbage for the indexingFilter and
removing search results... I think you want the first one.
You just need to build a custom Parser that will filter out the tags you don't
want.
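
A minimal sketch of the tag-filtering idea in plain Java, assuming the page is available as a W3C DOM tree. The class TagStripper and its methods are hypothetical and not part of Nutch; the sketch uses the JDK's XML parser on well-formed XHTML only to stay self-contained, and a real HTML page would likely need a more tolerant parser.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class.
final class TagStripper {

    /** Removes every element whose tag name is in unwanted, together with its subtree. */
    static void strip(Node node, Set<String> unwanted) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before a possible removal
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && unwanted.contains(child.getNodeName().toLowerCase(Locale.ROOT))) {
                node.removeChild(child);
            } else {
                strip(child, unwanted);
            }
            child = next;
        }
    }

    /** Parses well-formed XHTML, strips the unwanted tags and returns the remaining text. */
    static String textWithout(String xhtml, Set<String> unwanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            strip(doc.getDocumentElement(), unwanted);
            // Collapse runs of whitespace so the result is easy to index or compare.
            return doc.getDocumentElement().getTextContent().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div>menu</div><p>main text</p>"
                + "<script>x()</script></body></html>";
        System.out.println(textWithout(page, Set.of("div", "script")));
    }
}
```

The filtering happens while the tag structure still exists, so the text passed on to indexing no longer contains the unwanted parts.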
: indexing just certain content
Date: Sun, 11 Oct 2009 11:02:21 +0200
To: nutch-user@lucene.apache.org
MilleBii wrote:
Andrzej,
The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.
I have a different use case: I want to keep everything as in standard indexing
_AND_ also extract a part to be indexed in a dedicated field (which will
be
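
A rough sketch of this second use case in plain Java: the full text is kept as usual and the text of one chosen tag is additionally copied into a dedicated field. The class DedicatedFieldExtractor, the field name "content" and the map standing in for an index document are all assumptions made for illustration, not Nutch APIs.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class.
final class DedicatedFieldExtractor {

    /** Returns the full page text under "content" plus the text of every tag element under fieldName. */
    static Map<String, String> extract(String xhtml, String tag, String fieldName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> fields = new LinkedHashMap<>();
            // Standard indexing: everything on the page.
            fields.put("content", normalize(doc.getDocumentElement().getTextContent()));
            // Dedicated field: only the text found inside the chosen tag.
            StringBuilder part = new StringBuilder();
            NodeList hits = doc.getElementsByTagName(tag);
            for (int i = 0; i < hits.getLength(); i++) {
                part.append(hits.item(i).getTextContent()).append(' ');
            }
            fields.put(fieldName, normalize(part.toString()));
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    /** Trims the text and collapses runs of whitespace into single spaces. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }
}
```

Calling `DedicatedFieldExtractor.extract(page, "h1", "headline")` would return both fields, so the dedicated field could be searched or weighted separately while the default field stays unchanged.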
are using Solr, so I just have to index the important content... the
search will be performed with Solr, so I guess I don't need the QueryFilter.
Best regards
Date: Sat, 10 Oct 2009 16:04:10 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: indexing just certain content
an HTML page :( If I find this piece, the rest
will be a piece of cake :)
Date: Sat, 10 Oct 2009 16:41:44 +0200
Subject: Re: indexing just certain content
From: mille...@gmail.com
To: nutch-user@lucene.apache.org
Andrzej,
Great !!!
I did not realize you could put your own
What I want is exactly what is explained in this second post: How to ignore search
results that don't have related keywords in main body?
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: indexing just certain content
Date: Sat, 10 Oct 2009 15:35:31 +
yes
Don't think it will work because at the indexing filter stage all the HTML
tags are gone from the text.
I think you need to modify the HTML parser to filter out the tags you want
to get rid of.
In some use case I have I would like to perform 'intelligent indexing', ie
use the tag information to
or to find a class which could filter an HTML page
and delete certain tags from it
Thx.
Date: Fri, 9 Oct 2009 22:04:41 +0530
From: g...@srijan.in
To: nutch-user@lucene.apache.org
Subject: Re: indexing just certain content
On Fri, 9 Oct 2009 18:00:41 +0200
MilleBii mille...@gmail.com wrote
BELLINI ADAM wrote:
Hi,
Thanks for your detailed answer... you saved me a lot of time; I was thinking
of starting to create an HTML tag filter class.
Maybe I can create my own HTML parser, as I do for parsing and indexing
DublinCore metadata... it sounds possible, don't you think?
i just hv
just certain content
2009 19:16:44 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: indexing just certain content
To: nutch-user@lucene.apache.org
Subject: Re: indexing just certain content
Date: Fri, 9 Oct 2009 16:39:31 -0700
Can you please just tell us in plain English what the creativecommons plugin
is for?
I mean, if I include this plugin in my nutch-site.xml, what will
I get as a result?
I
In this class, BasicIndexingFilter.java, I think that before adding the
content to the document I could parse it again to filter certain div tags?
text = parse.getText();
// I have to parse and filter the text here before adding it to the document
new_Filtred_text =
Adam,
You could turn off all the indexing plugins and write your own plugin
that only indexes certain meta content from your intranet - giving you
complete control of the fields indexed.
Eric
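
As a rough illustration of indexing only certain meta content, the sketch below keeps just a whitelist of metadata fields. The class MetaWhitelist and the plain maps are stand-ins chosen for this example, not Nutch classes, and the key names used in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Nutch class.
final class MetaWhitelist {

    /** Returns only the metadata entries whose key is in allowed, in their original order. */
    static Map<String, String> select(Map<String, String> meta, Set<String> allowed) {
        Map<String, String> indexed = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : meta.entrySet()) {
            if (allowed.contains(entry.getKey())) {
                indexed.put(entry.getKey(), entry.getValue());
            }
        }
        return indexed;
    }
}
```

For example, `MetaWhitelist.select(meta, Set.of("dc.title"))` would pass on only the `dc.title` entry and drop everything else, which is the kind of control over indexed fields described above.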
On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:
hi
does anybody know if it's possible to