RE: indexing just certain content

2009-10-11 Thread MilleBii
This is not very clear.
There is a big difference between removing garbage for the IndexingFilter and
removing search results. I think you want the first one.

You just need to build a custom parser that filters out the tags you don't
want.


RE: indexing just certain content

2009-10-11 Thread BELLINI ADAM

We are using Solr, and I don't know how to remove search results. That's why I
don't want to index the garbage data, and why I'm wondering about removing it
during the parse operation. Yes, I want to filter the data out of the HTML,
and this is my big problem. In my post I'm asking whether there is a Java class
that deletes sections from an HTML page. I only know the sections I want to
delete (they come from a template). I can't build a new HTML file by taking
only the sections I need, because I don't know those sections and don't know
whether their HTML tags are well formed. The only thing I know is that the
sections I want to remove are DIV sections, and that those are well formed.
So the big deal is: removing known sections from an HTML file, without knowing
the other sections.
I will try to write such a class to clean those HTML files.
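
One way to remove known sections without knowing anything about the rest of the page is to work on the DOM instead of the raw HTML string. The sketch below is not Nutch code. The class name DivStripper and its methods are invented for illustration, it uses only the JDK's org.w3c.dom API, and it builds its DOM from well-formed XHTML, whereas inside Nutch the DOM would come from the HTML parser. The div ids are sample template ids.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class DivStripper {

    /** Removes every div whose id is in unwantedIds, at any depth below root. */
    public static void strip(Node root, Set<String> unwantedIds) {
        // Collect first, then remove, so the child list is not modified while walking it.
        List<Node> doomed = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element
                    && "div".equalsIgnoreCase(child.getNodeName())
                    && unwantedIds.contains(((Element) child).getAttribute("id"))) {
                doomed.add(child);
            } else {
                strip(child, unwantedIds); // keep looking inside sections we keep
            }
        }
        for (Node node : doomed) {
            root.removeChild(node);
        }
    }

    /** Demo only: parses well-formed XHTML, strips the divs, returns the remaining text. */
    public static String textWithout(String xhtml, Set<String> unwantedIds) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        strip(doc, unwantedIds);
        return doc.getDocumentElement().getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id='header'>site banner</div>"
                + "<div id='left_menu'>nav links</div>"
                + "<div id='main'>the important content</div>"
                + "</body></html>";
        System.out.println(
                textWithout(page, Set.of("header", "top_menu", "left_menu", "right_menu")));
    }
}
```

The element name is compared case-insensitively because some HTML parsers report element names in upper case. Everything not explicitly listed is kept, so the other sections of the page never need to be known.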




Re: indexing just certain content

2009-10-10 Thread Andrzej Bialecki

MilleBii wrote:

Andrzej,

The use case you are thinking of is: at the parsing stage, filter out garbage
content and index only the rest.

I have a different use case. I want to keep everything as in standard indexing
_AND_ also extract a part to be indexed in a dedicated field (which will
be boosted at search time). In my case, certain parts of a document are more
important than others.

So I would like either
1. to access the HTML representation at indexing time (not possible, or I did
not find how), or
2. to create a dual representation of the document: plain standard, and
filtered document.

I think option 2 is much better because it fits the model better and allows
for a lot of other use cases.


Actually, creativecommons provides hints how to do this .. but to be 
more explicit:


* in your HtmlParseFilter you need to extract from DOM tree the parts 
that you want, and put them inside ParseData.metadata. This way you will 
preserve both the original text, and your special parts that you extracted.


* in your IndexingFilter you will retrieve the parts from 
ParseData.metadata and add them as additional index fields (don't forget 
to specify indexing backend options).


* in your QueryFilter plugin.xml you declare that QueryParser should 
pass your special fields without treating them as terms, and in the 
implementation you create a BooleanClause to be added to the translated 
query.
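
The first two steps can be sketched with plain JDK classes. This is only an illustration under stated assumptions: a java.util.Map stands in for ParseData.metadata, a second Map stands in for the index document, and the class, method, key and field names (DualRepresentation, extractPart, buildIndexFields, "x-important", "important") are invented here. A real plugin would implement Nutch's HtmlParseFilter and IndexingFilter interfaces instead.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Stand-in sketch, not Nutch code.
public class DualRepresentation {
    // Hypothetical metadata key and index field name.
    static final String META_KEY = "x-important";
    static final String FIELD = "important";

    /** Parse-time step: copy the text of the first matching element into the metadata. */
    public static void extractPart(Document dom, String tagName, Map<String, String> metadata) {
        NodeList hits = dom.getElementsByTagName(tagName);
        if (hits.getLength() > 0) {
            metadata.put(META_KEY, hits.item(0).getTextContent().trim());
        }
    }

    /** Index-time step: full text as usual, plus the extracted part as its own field. */
    public static Map<String, String> buildIndexFields(String fullText,
                                                       Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("content", fullText);
        String special = metadata.get(META_KEY);
        if (special != null) {
            fields.put(FIELD, special);
        }
        return fields;
    }

    /** Demo only: runs both steps on a well-formed XHTML string. */
    public static Map<String, String> run(String xhtml, String tagName) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        Map<String, String> metadata = new LinkedHashMap<>();
        extractPart(dom, tagName, metadata);
        return buildIndexFields(dom.getDocumentElement().getTextContent(), metadata);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                run("<html><body><h1>Title</h1><p>Body text</p></body></html>", "h1"));
    }
}
```

The original text is left untouched in the "content" field, and the extracted part travels separately, which is what makes the dual representation possible.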



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

Hi,
you said: '...* in your HtmlParseFilter you need to extract from DOM tree the
parts that you want ...'
but my problem is that I don't know what to extract, because I don't know all
the pages I'm indexing. I only know what not to index.

1 - All pages have some sections that I don't want to index. Since I know those
sections, I want to remove them from the document and keep the rest of the
important content.

The sections are headers, top menus, right menus, left menus and some other
sections:

<div id='header'> bla bla </div>
<div id='top_menu'> bla bla </div>
<div id='left_menu'> bla bla </div>
<div id='right_menu'> bla bla </div>

Maybe I could find some Java classes that can delete sections from an HTML
page? If I found one, I guess it would be easier to use.

2 - You said not to forget the indexing backend options: could you tell me
what they are?

3 - We are using Solr, so I just have to index the important content. The
search will be performed with Solr, so I guess I don't need the QueryFilter.

best regards





RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

Yes, MilleBii.

I told you before that I created a DublinCore metadata parser and indexer, so
I parsed my HTML and created fields to hold my DC metadata. My missing piece is
how to delete sections from an HTML page :( If I find this piece, the rest
will be a piece of cake :)




 Date: Sat, 10 Oct 2009 16:41:44 +0200
 Subject: Re: indexing just certain content
 From: mille...@gmail.com
 To: nutch-user@lucene.apache.org
 
 Andrzej,
 
 Great !!!
 I did not realize you could put your own content in ParseData.metadata and
 read it back in the IndexingFilter... this was my missing piece in the
 puzzle, for the rest I knew what to do.
 
 Thanks,
 
 
 
 -- 
 -MilleBii-
  

RE: indexing just certain content

2009-10-10 Thread BELLINI ADAM

What I want is exactly what is explained in this second post: "How to ignore
search results that don't have related keywords in main body?"





Re: indexing just certain content

2009-10-09 Thread MilleBii
I don't think it will work, because at the indexing filter stage all the HTML
tags are gone from the text.

I think you need to modify the HTML parser to filter out the tags you want
to get rid of.

In one of my use cases I would like to perform 'intelligent indexing', i.e.
use the tag information to extract specific fields to be indexed along with
the main text: the reverse of your case. To date I have not found a way to do
it.
So if you find a solution, I'm with you.







-- 
-MilleBii-


Re: indexing just certain content

2009-10-09 Thread Gora Mohanty
This is something that we would also be interested in. Actually,
we even have a working solution to extract content from between
start/stop tags, written by our colleagues from a partner company.

There are a couple of things that we would like to fix with this
solution:
(a) It directly modifies HtmlParser.java, which is obviously
unmaintainable.
(b) It is a solution for specific tags, rather than picking them
up from configuration parameters.
(c) We have not yet traced the complete execution path for Nutch,
i.e., when is the parser called, when are filters called, etc.
Is there a document anywhere about this? We were thinking of a
filter, but from what you say above, that is the wrong stage.
(d) Ideally, whatever solution we come up with would be contributed
back to Nutch, which also helps us from a maintenance
standpoint. Is there a defined process for getting external
plugins accepted into Nutch?

We are willing to put in some time into this, starting the coming
week. Where can we start a brainstorming Wiki for this? Is the
Nutch Wiki the right place?

Regards,
Gora
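
Point (b) could be approached by reading the ids from a configuration value instead of hard-coding them. In the sketch below java.util.Properties stands in for the Nutch configuration object, and both the class name FilterConfig and the property name are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

// Stand-in sketch, not Nutch code.
public class FilterConfig {
    // Hypothetical property name; a real plugin would define and document its own.
    public static final String KEY = "htmlfilter.exclude.div.ids";

    /** Parses a comma-separated list of ids, ignoring blanks and surrounding spaces. */
    public static Set<String> excludedIds(Properties conf) {
        Set<String> ids = new LinkedHashSet<>();
        for (String raw : conf.getProperty(KEY, "").split(",")) {
            String id = raw.trim();
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "header, top_menu, left_menu, right_menu");
        System.out.println(excludedIds(conf));
    }
}
```

With the ids coming from configuration, the same filter could serve sites with different templates without touching HtmlParser.java.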


RE: indexing just certain content

2009-10-09 Thread BELLINI ADAM

Hi,

Thanks for your detailed answer. You saved me a lot of time; I was about to
start writing an HTML tag filter class.
Maybe I can create my own HTML parser, as I did for parsing and indexing
DublinCore metadata. It sounds possible, don't you think?

I just have to create, or find, a class that can filter an HTML page
and delete certain tags from it.

Thanks.






Re: indexing just certain content

2009-10-09 Thread Andrzej Bialecki


Guys, please take a look at how HtmlParseFilters are implemented - for 
example the creativecommons plugin. I believe that's exactly the 
functionality that you are looking for.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: indexing just certain content

2009-10-09 Thread BELLINI ADAM

Hi,

Can you please just tell us in plain English what the creativecommons plugin
is for? I mean, if I include this plugin in my nutch-site.txt, what will I get
as a result?
Thanks.






Re: indexing just certain content

2009-10-09 Thread Ken Krugler
can you plz just tell us in english what the plugin creativecommons  
is for ?
i mean if i will include this plugin in my nutch-site.txt, what will  
i have as result ?


I think Andrzej is suggesting that you read the code.

If you look at the beginning of the CCParseFilter.java file, you'll see:

/** Adds metadata identifying the Creative Commons license used, if any. */
public class CCParseFilter implements HtmlParseFilter {

The key routine that you need to implement is:

  /** Adds metadata or otherwise modifies a parse of an HTML document, given
   * the DOM tree of a page. */
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {


So it seems that this plugin would be a great place for you to start.

But you'll need to dig into the code, be familiar with DOM  
manipulation, etc.


-- Ken
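
For readers new to DOM manipulation, walking the kind of tree that the filter receives might look like the sketch below. It uses only the JDK's org.w3c.dom API. The class name FragmentWalker and its methods are invented for illustration, and the DocumentFragment is built from well-formed XHTML purely for the demo; in a real filter it would be the doc argument.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class FragmentWalker {

    /** Returns the ids of all div elements below root, in document order. */
    public static List<String> divIds(Node root) {
        List<String> ids = new ArrayList<>();
        collect(root, ids);
        return ids;
    }

    private static void collect(Node node, List<String> ids) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "div".equalsIgnoreCase(node.getNodeName())) {
            String id = ((Element) node).getAttribute("id"); // "" when the attribute is absent
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        // Depth-first walk over the children.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, ids);
        }
    }

    /** Demo only: wraps a parsed XHTML document in a DocumentFragment. */
    public static DocumentFragment fragmentOf(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        DocumentFragment fragment = doc.createDocumentFragment();
        fragment.appendChild(doc.getDocumentElement()); // moves the root element into the fragment
        return fragment;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id='header'>h</div>"
                + "<div id='main'><div id='inner'>x</div></div></body></html>";
        System.out.println(divIds(fragmentOf(page)));
    }
}
```

Listing the ids first is a cheap way to check which template sections a page actually contains before deciding what to strip or extract.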







RE: indexing just certain content

2009-10-09 Thread BELLINI ADAM


Yes, I did read the code, but I didn't understand what 'the Creative Commons
license' is; that's why I asked what creativecommons means.
But as you said, I have to be familiar with DOM manipulation to understand the
code, so let's start by learning DOM.
Thanks.





Re: indexing just certain content

2009-10-07 Thread BELLINI ADAM


In the BasicIndexingFilter.java class, I think that before adding the content
to the document I could parse it again to filter out certain div tags?

String text = parse.getText();

// parse and filter the text here before adding it to the document
// (filterText is a helper I would still have to write)
String filteredText = filterText(text);

doc.add("content", filteredText);

What do you think about that?
  

Re: indexing just certain content

2009-10-05 Thread Eric

Adam,

You could turn off all the indexing plugins and write your own plugin  
that only indexes certain meta content from your intranet - giving you  
complete control of the fields indexed.


Eric

On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:



Hi,

Does anybody know if it's possible to index just certain content? I mean, I
need to avoid indexing some garbage and repetitive data on my intranet.

In other words, is it possible to tell the indexer not to index the content
between certain div tags, like:

<div id='bla bla'>
please don't index this bla bla bla
</div>

Thanks to all
