RE: indexing just certain content

2009-10-11 Thread BELLINI ADAM

We are using Solr, and I don't know how to remove search results; that's why I
don't want to index the garbage data, and why I'm looking to remove that data in
the parse operation. Yes, I want to filter the data out of the HTML, and this is
my big problem. In my post I'm asking whether there is a Java class that deletes
sections from an HTML page. I only know the sections I want to delete (they come
from a template). I can't construct a new HTML file by taking only the sections
I need, because I don't know those sections and don't know whether their HTML
tags are well-formed. The only thing I know is that the sections I want to
remove are DIV sections, and that those are well-formed.
So the big deal is: removing known sections from an HTML file (without knowing
the other sections).
I will try to write such a class to clean those HTML files.
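
A minimal sketch of such a class, using only the JDK's org.w3c.dom API. The class name SectionStripper, its method names, and the idea of identifying the unwanted DIV sections by their id attribute are assumptions made for illustration; nothing here is Nutch code. The parse helper only accepts well-formed XHTML, whereas inside Nutch the page would already have been turned into a DOM tree by the HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: deletes known DIV sections from a DOM tree. */
public class SectionStripper {

    /** Removes every div under root whose id attribute is in unwantedIds. */
    public static void removeDivsById(Node root, Set<String> unwantedIds) {
        // Collect first, remove afterwards, so the tree is not modified
        // while it is being traversed.
        List<Node> doomed = new ArrayList<>();
        collect(root, unwantedIds, doomed);
        for (Node n : doomed) {
            Node parent = n.getParentNode();
            if (parent != null) {
                parent.removeChild(n);
            }
        }
    }

    private static void collect(Node node, Set<String> unwantedIds, List<Node> doomed) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "div".equalsIgnoreCase(node.getNodeName())
                && unwantedIds.contains(((Element) node).getAttribute("id"))) {
            doomed.add(node);
            return; // the whole subtree goes away, no need to descend
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), unwantedIds, doomed);
        }
    }

    /** Test helper: parses a well-formed XHTML string into a DOM document. */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }
}
```

The other sections never need to be known: everything that is not explicitly listed stays in the tree, and the remaining text can then be read with getTextContent().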



  

RE: indexing just certain content

2009-10-11 Thread MilleBii
This is not very clear:
There is a big difference between removing garbage for the IndexingFilter and
removing search results... I think you want the first one.

You just need to build a custom Parser that will filter out the tags you don't
want.


RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

What I want is exactly what is explained in this second post: "How to ignore
search results that don't have related keywords in main body?"




  

RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

Yes.

MilleBii,

I told you before that I created a DublinCore metadata parser and indexer, so
I parsed my HTML and created fields to hold my DC metadata. My missing piece is
how to delete sections from an HTML page :( If I find this piece, the rest
will be a piece of cake :)




  

RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

Hi,
you said: '...* in your HtmlParseFilter you need to extract from DOM tree the
parts that you want ...'
But my problem is: I don't know what to extract, because I don't know all the
pages I'm indexing. I only know what not to index.

1- I only know what not to index. All pages have some sections that I don't want
to index. Since I know those sections, I want to take them out of the document
and keep the rest of the important content.

The sections are headers, top menus, right menus, left menus and some other
sections:

 bla bla 
   bla bla 
   bla bla 
  bla bla 
Maybe I could find some Java classes which can delete sections from an HTML
page?! If I found one, I guess it would be easier to use.

2- You said not to forget the indexing backend options: could you tell me what
they are?

3- We are using Solr, so I just have to index the important content. The
search will be performed with Solr, so I guess I don't need the QueryFilter.

Best regards




> 
  

Re: indexing just certain content

2009-10-10 Thread MilleBii
Andrzej,

Great !!!
I did not realize you could put your own content in ParseData.metadata and
read it back in the IndexingFilter... this was my missing piece in the
puzzle, for the rest I knew what to do.

Thanks,





-- 
-MilleBii-


Re: indexing just certain content

2009-10-10 Thread Andrzej Bialecki

MilleBii wrote:

> Andrzej,
>
> The use case you are thinking of is: at the parsing stage, filter out garbage
> content and index only the rest.
>
> I have a different use case: I want to keep everything as in standard indexing
> _AND_ also extract a part to be indexed in a dedicated field (which will
> be boosted at search time). In my case, certain parts of a document have more
> importance than others.
>
> So I would like either
> 1. to access the HTML representation at indexing time... not possible, or I did
> not find how
> 2. to create a dual representation of the document: plain & standard, and
> filtered
>
> I think option 2 is much better because it better fits the model and allows
> for a lot of other use cases.


Actually, creativecommons provides hints how to do this .. but to be 
more explicit:


* in your HtmlParseFilter you need to extract from DOM tree the parts 
that you want, and put them inside ParseData.metadata. This way you will 
preserve both the original text, and your special parts that you extracted.


* in your IndexingFilter you will retrieve the parts from 
ParseData.metadata and add them as additional index fields (don't forget 
to specify indexing backend options).


* in your QueryFilter plugin.xml you declare that QueryParser should 
pass your special fields without treating them as terms, and in the 
implementation you create a BooleanClause to be added to the translated 
query.
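
The first two steps above can be sketched roughly as follows. This is only an illustration of the data flow, not Nutch code: plain java.util.Map objects stand in for ParseData.metadata and for the index document, and the class name, the method names, the metadata key "x-abstract" and the field name "abstract_field" are all made up for the example. The real HtmlParseFilter and IndexingFilter signatures are deliberately not reproduced here.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for the parse-filter / indexing-filter hand-off. */
public class DualRepresentationSketch {

    /** Made-up metadata key shared by both stages. */
    public static final String META_KEY = "x-abstract";

    /** Parse stage: pull the special part out of the DOM and stash it in metadata. */
    public static void parseStage(Node root, String elementId, Map<String, String> parseMetadata) {
        Element hit = findById(root, elementId);
        if (hit != null) {
            parseMetadata.put(META_KEY, hit.getTextContent().trim());
        }
    }

    /** Index stage: keep the full text and add the special part as its own field. */
    public static Map<String, String> indexStage(String fullText, Map<String, String> parseMetadata) {
        Map<String, String> doc = new HashMap<>();
        doc.put("content", fullText);            // original text is preserved
        String special = parseMetadata.get(META_KEY);
        if (special != null) {
            doc.put("abstract_field", special);  // dedicated field, can be boosted
        }
        return doc;
    }

    /** Depth-first search for the first element with the given id attribute. */
    private static Element findById(Node node, String id) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && id.equals(((Element) node).getAttribute("id"))) {
            return (Element) node;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Element found = findById(children.item(i), id);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Test helper: parses a well-formed XHTML string into a DOM document. */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }
}
```

The point of the hand-off is that the indexing stage never needs the HTML: everything tag-dependent is decided at parse time and carried forward as metadata.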



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: indexing just certain content

2009-10-10 Thread MilleBii
Andrzej,

The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.

I have a different use case: I want to keep everything as in standard indexing
_AND_ also extract a part to be indexed in a dedicated field (which will
be boosted at search time). In my case, certain parts of a document have more
importance than others.

So I would like either
1. to access the HTML representation at indexing time... not possible, or I did
not find how
2. to create a dual representation of the document: plain & standard, and
filtered

I think option 2 is much better because it better fits the model and allows
for a lot of other use cases.

best regards,




-- 
-MilleBii-


RE: indexing just certain content

2009-10-09 Thread BELLINI ADAM


Yes, I did read the code, but I didn't understand what 'the Creative Commons
license' is; that's why I asked what creativecommons means.
But as you said, I have to be familiar with DOM manipulation to understand the
code... so let's start learning DOM.
Thanks




  

Re: indexing just certain content

2009-10-09 Thread Ken Krugler
> Can you please just tell us in plain English what the creativecommons plugin
> is for?
> I mean, if I include this plugin in my nutch-site.xml, what will I get as a
> result?

I think Andrzej is suggesting that you read the code.

If you look at the beginning of the CCParseFilter.java file, you'll see:

/** Adds metadata identifying the Creative Commons license used, if any. */
public class CCParseFilter implements HtmlParseFilter {

The key routine that you need to implement is:

  /** Adds metadata or otherwise modifies a parse of an HTML document, given
   * the DOM tree of a page. */
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {

So it seems that this plugin would be a great place for you to start.

But you'll need to dig into the code, be familiar with DOM manipulation, etc.


-- Ken
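
As a rough idea of the kind of DOM manipulation involved, here is a small sketch that walks a DOM tree (a DocumentFragment is just another org.w3c.dom.Node) and collects its text while skipping unwanted DIV sections. It uses only the JDK's DOM API. The class name FilteredTextExtractor and the choice to identify sections by id attribute are assumptions for illustration; how the resulting text would be written back into the ParseResult is not shown, since that depends on the Nutch version.

```java
import java.util.Set;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper: extracts text from a DOM tree, skipping known sections. */
public class FilteredTextExtractor {

    /** Returns the text under root, ignoring any div whose id is in skipIds. */
    public static String extract(Node root, Set<String> skipIds) {
        StringBuilder sb = new StringBuilder();
        walk(root, skipIds, sb);
        return sb.toString();
    }

    private static void walk(Node node, Set<String> skipIds, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "div".equalsIgnoreCase(node.getNodeName())
                && skipIds.contains(((Element) node).getAttribute("id"))) {
            return; // unwanted section: ignore it and everything inside it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // single space between text chunks
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), skipIds, sb);
        }
    }
}
```

Unlike removing nodes, this leaves the DOM untouched, so other filters that run later still see the complete page.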








RE: indexing just certain content

2009-10-09 Thread BELLINI ADAM

Hi,

Can you please just tell us in plain English what the creativecommons plugin is
for?
I mean, if I include this plugin in my nutch-site.xml, what will I get as a
result?
Thanks





> 
  

Re: indexing just certain content

2009-10-09 Thread Andrzej Bialecki

BELLINI ADAM wrote:

> Hi,
>
> Thanks for your detailed answer... you saved me a lot of time. I was thinking
> of starting to create an HTML tag filter class.
> Maybe I can create my own HTML parser, as I did for parsing and indexing
> DublinCore metadata... it sounds possible, don't you think?
>
> I also just have to create or find a class which can filter an HTML page
> and delete certain tags from it.


Guys, please take a look at how HtmlParseFilters are implemented - for 
example the creativecommons plugin. I believe that's exactly the 
functionality that you are looking for.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: indexing just certain content

2009-10-09 Thread BELLINI ADAM

Hi,

Thanks for your detailed answer... you saved me a lot of time. I was thinking
of starting to create an HTML tag filter class.
Maybe I can create my own HTML parser, as I did for parsing and indexing
DublinCore metadata... it sounds possible, don't you think?

I also just have to create or find a class which can filter an HTML page
and delete certain tags from it.

Thanks.





  

Re: indexing just certain content

2009-10-09 Thread Gora Mohanty
On Fri, 9 Oct 2009 18:00:41 +0200
MilleBii  wrote:

> Don't think it will work because at the indexing filter stage all
> the HTML tags are gone from the text.
> 
> I think you need to modify the HTML parser to filter out the tags
> you want to get rid of.
> 
> In one of my use cases I would like to perform 'intelligent
> indexing', i.e. use the tag information to extract specific fields
> to be indexed along with the main text. The reverse case of yours.
> To date I have not found a way to do it.
> So if you find a solution I'm with you.
[...]

This is something that we would also be interested in. Actually,
we even have a working solution to extract content from between
start/stop tags, written by our colleagues from a partner company.

There are a couple of things that we would like to fix with this
solution:
(a) It directly modifies HtmlParser.java, which is obviously
unmaintainable.
(b) It is a solution for specific tags, rather than picking them
up from configuration parameters.
(c) We have not yet traced the complete execution path for Nutch,
i.e., when is the parser called, when are filters called, etc.
Is there a document anywhere about this? We were thinking of a
filter, but from what you say above, that is the wrong stage.
(d) Ideally, whatever solution we come up with would be contributed
back to Nutch, which also helps us from a maintenance
standpoint. Is there a defined process for getting external
plugins accepted into Nutch?
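
For point (b), one possible shape for picking the sections up from configuration is sketched below. It is only an illustration: java.util.Properties stands in for whatever configuration object the plugin would actually receive, and both the class name SkipListConfig and the property name "htmlfilter.skip.div.ids" are invented for this example.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

/** Hypothetical helper: reads the list of sections to skip from configuration. */
public class SkipListConfig {

    /** Made-up property name; a comma-separated list of div ids. */
    public static final String SKIP_IDS_KEY = "htmlfilter.skip.div.ids";

    /** Returns the configured ids, trimmed, without empty entries. */
    public static Set<String> readSkipIds(Properties conf) {
        String raw = conf.getProperty(SKIP_IDS_KEY, "");
        Set<String> ids = new LinkedHashSet<>();
        for (String part : raw.split(",")) {
            String id = part.trim();
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return Collections.unmodifiableSet(ids);
    }
}
```

With the list kept outside the code, the same filter could serve sites with different templates without touching HtmlParser.java.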

We are willing to put in some time into this, starting the coming
week. Where can we start a brainstorming Wiki for this? Is the
Nutch Wiki the right place?

Regards,
Gora


Re: indexing just certain content

2009-10-09 Thread MilleBii
I don't think it will work, because at the indexing filter stage all the HTML
tags are gone from the text.

I think you need to modify the HTML parser to filter out the tags you want
to get rid of.

In one of my use cases I would like to perform 'intelligent indexing', i.e.
use the tag information to extract specific fields to be indexed along with
the main text. The reverse case of yours. To date I have not found a way to do
it.
So if you find a solution I'm with you.







-- 
-MilleBii-


Re: indexing just certain content

2009-10-07 Thread BELLINI ADAM


In the BasicIndexingFilter.java class, I think that before adding the
content to the document I could parse it again to filter out certain div tags?

String text = parse.getText();

// I have to parse and filter the text here before adding it to the document

String filteredText = myParser_New_method(text);

doc.add("content", filteredText);

What do you think about that?
  

Re: indexing just certain content

2009-10-05 Thread Eric
Look at the source code for the basic indexing plugin - it indexes the  
title tags and some other tags: should be a good starting point.


Eric





RE: indexing just certain content

2009-10-05 Thread BELLINI ADAM

Hi,

But how will I get the HTML tag?
Is there any Nutch method to get the tag from the content?
Thanks




> 
  

Re: indexing just certain content

2009-10-05 Thread Eric

Adam,

You could turn off all the indexing plugins and write your own plugin  
that only indexes certain meta content from your intranet - giving you  
complete control of the fields indexed.


Eric

On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:

> Hi,
>
> Does anybody know if it's possible to index just certain content? I
> mean, I need to avoid indexing some garbage and repetitive data on my
> intranet.
>
> In other words, is it possible to tell the indexer not to index the
> content between certain tags, like:
>
> please don't index this bla bla bla
>
> Thanks to all
