RE: indexing just certain content
We are using Solr, and I don't know how to remove search results. That is why I don't want to index the garbage data at all, and why I am looking at removing it during the parse step. Yes, I want to filter the data out of the HTML, and that is my big problem. In my post I was asking whether there is a Java class that deletes sections from an HTML page.

I only know the sections I want to delete (they come from a template). I cannot build a new HTML file from just the sections I need, because I don't know those sections and I don't know whether their HTML tags are well formed. The only thing I know is that the sections I want to remove are DIV sections, and that those are well formed.

So the real problem is removing known sections from an HTML file without knowing the other sections. I will try to write such a class to clean these HTML files.

> From: mille...@gmail.com
> Subject: RE: indexing just certain content
> Date: Sun, 11 Oct 2009 11:02:21 +0200
> To: nutch-user@lucene.apache.org
>
> This is not very clear:
> There is big difference between removing garbage for indexingFilter and
> removing search results... I think you want the first one.
>
> You just need to build a custom Parser that will filter out the tags you dont
> want
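The task described above (drop known DIV sections, keep everything else without knowing what it is) can be sketched with the standard `org.w3c.dom` API. This is only a minimal sketch under assumptions that are not in the thread: the unwanted sections are identified by an `id` attribute, the ids `header` and `left-menu` are invented for illustration, and the input is well-formed XHTML so the JDK XML parser can read it. The class name `DivStripper` and its methods are hypothetical. Whether removing nodes this way changes the text Nutch finally indexes depends on when Nutch extracts the text, which this sketch does not address.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
class DivStripper {

    /** Removes every div under root whose id attribute is in unwantedIds. */
    static void removeDivs(Node root, Set<String> unwantedIds) {
        // Collect first, remove afterwards, so the tree is not
        // modified while it is being traversed.
        List<Node> toRemove = new ArrayList<>();
        collect(root, unwantedIds, toRemove);
        for (Node n : toRemove) {
            Node parent = n.getParentNode();
            if (parent != null) {
                parent.removeChild(n);
            }
        }
    }

    private static void collect(Node node, Set<String> ids, List<Node> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) node;
            // Case-insensitive: some HTML parsers report element names in upper case.
            if ("div".equalsIgnoreCase(e.getNodeName())
                    && ids.contains(e.getAttribute("id"))) {
                out.add(e);
                return; // the whole subtree goes away, no need to descend
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), ids, out);
        }
    }

    /** Parses well-formed XHTML, strips the listed divs, returns the remaining text. */
    static String stripToText(String xhtml, String... unwantedIds) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            removeDivs(doc, new HashSet<>(Arrays.asList(unwantedIds)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"header\">Top menu</div>"
                + "<p>Important text</p>"
                + "<div id=\"left-menu\">Links</div></body></html>";
        System.out.println(stripToText(page, "header", "left-menu"));
    }
}
```

Because `removeDivs` takes a plain `Node`, it could in principle be handed any DOM subtree, not only a `Document` built by the JDK parser.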
RE: indexing just certain content
This is not very clear. There is a big difference between removing garbage for the IndexingFilter and removing search results... I think you want the first one.

You just need to build a custom parser that filters out the tags you don't want.
RE: indexing just certain content
What I want is explained exactly in this second post: "How to ignore search results that don't have related keywords in main body?"

> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: indexing just certain content
> Date: Sat, 10 Oct 2009 15:35:31 +
>
> [...] my missing piece is how to delete sections form an html page :( if i will find
> this piece the rest will be like a peice of cake :)
RE: indexing just certain content
Yes, MilleBii.

I told you before that I created a DublinCore metadata parser and indexer, so I parsed my HTML and created fields to hold my DC metadata. My missing piece is how to delete sections from an HTML page :( If I find this piece, the rest will be a piece of cake :)

> Date: Sat, 10 Oct 2009 16:41:44 +0200
> Subject: Re: indexing just certain content
> From: mille...@gmail.com
> To: nutch-user@lucene.apache.org
>
> Andrzej,
>
> Great !!!
> I did not realize you could put your own content in ParseData.metadata and
> read it back in the IndexingFilter... this was my missing piece in the
> puzzle, for the rest I knew what to do.
> [...]
RE: indexing just certain content
Hi,

You said: '... in your HtmlParseFilter you need to extract from DOM tree the parts that you want ...', but my problem is that I don't know what to extract, because I don't know all the pages I am indexing. I only know what not to index.

1. All pages have some sections that I don't want to index. Since I know those sections, I want to take them out of the document and keep the rest, which is the important content. The sections are headers, top menus, right menus, left menus and some other sections:

   bla bla bla bla bla bla bla bla

   Maybe I could find some Java classes that can delete sections from an HTML page? If I found one, I guess it would be easier to use.

2. You said not to forget the indexing backend options. Could you tell me what they are?

3. We are using Solr, so I just have to index the important content. The search will be performed with Solr, so I guess I don't need the QueryFilter.

Best regards

> Date: Sat, 10 Oct 2009 16:04:10 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: indexing just certain content
>
> [...]
> * in your HtmlParseFilter you need to extract from DOM tree the parts
> that you want, and put them inside ParseData.metadata. This way you will
> preserve both the original text, and your special parts that you extracted.
> [...]
Re: indexing just certain content
Andrzej,

Great! I did not realize you could put your own content in ParseData.metadata and read it back in the IndexingFilter. That was the missing piece of my puzzle; I knew what to do for the rest.

Thanks,

2009/10/10 Andrzej Bialecki
> [...]
> * in your HtmlParseFilter you need to extract from DOM tree the parts that
> you want, and put them inside ParseData.metadata. This way you will preserve
> both the original text, and your special parts that you extracted.
>
> * in your IndexingFilter you will retrieve the parts from
> ParseData.metadata and add them as additional index fields (don't forget to
> specify indexing backend options).
> [...]

--
-MilleBii-
Re: indexing just certain content
MilleBii wrote:
> [...]
> I have a different use case, I want to keep everything as standard indexing
> _AND_ also extract part for being indexed in a dedicated field (which will
> be boosted at search time). In a document certain part have more importance
> than others in my case.
>
> So I would like either
> 1. to access html representation at indexing time... not possible or did not
> find how
> 2. create a dual representation of the document, plain & standard, filtered
> document
>
> I think option 2. is much better because it better fits the model and allows
> for a lot of different other use cases.

Actually, creativecommons provides hints how to do this .. but to be more explicit:

* in your HtmlParseFilter you need to extract from DOM tree the parts that you want, and put them inside ParseData.metadata. This way you will preserve both the original text, and your special parts that you extracted.

* in your IndexingFilter you will retrieve the parts from ParseData.metadata and add them as additional index fields (don't forget to specify indexing backend options).

* in your QueryFilter plugin.xml you declare that QueryParser should pass your special fields without treating them as terms, and in the implementation you create a BooleanClause to be added to the translated query.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
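The first two steps above (extract parts of the DOM at parse time, carry them in metadata, read them back at index time) can be modelled in plain Java. This is not Nutch code: a `Map<String, String>` stands in for ParseData.metadata and for the index document, and the class `DualRepresentationSketch`, the metadata key `x-headings`, the field name `headings` and the choice of `h1` as the "important" element are all assumptions made for illustration. The point it tries to show is only that the index-time step never sees the DOM, just the plain text plus whatever the parse-time step stored.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Plain-Java model of the data flow; every name here is hypothetical.
class DualRepresentationSketch {

    /** Parse-time step: pull the text of the chosen elements out of the DOM. */
    static Map<String, String> parseStage(Node dom) {
        StringBuilder sb = new StringBuilder();
        collectText(dom, "h1", sb);
        Map<String, String> metadata = new HashMap<>(); // stand-in for ParseData.metadata
        metadata.put("x-headings", sb.toString().trim());
        return metadata;
    }

    private static void collectText(Node node, String tagName, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && tagName.equalsIgnoreCase(node.getNodeName())) {
            sb.append(node.getTextContent()).append(' ');
            return;
        }
        NodeList kids = node.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            collectText(kids.item(i), tagName, sb);
        }
    }

    /** Index-time step: no DOM available, only the full text and the metadata. */
    static Map<String, String> indexStage(String fullText, Map<String, String> metadata) {
        Map<String, String> fields = new LinkedHashMap<>(); // stand-in for the index document
        fields.put("content", fullText); // the standard representation is kept as is
        String headings = metadata.get("x-headings");
        if (headings != null && !headings.isEmpty()) {
            fields.put("headings", headings); // the extra, dedicated field
        }
        return fields;
    }

    /** Runs both steps on a well-formed XHTML string. */
    static Map<String, String> run(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> metadata = parseStage(doc);
            String fullText = doc.getDocumentElement().getTextContent();
            return indexStage(fullText, metadata);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<html><body><h1>Title</h1> <p>Body text</p></body></html>"));
    }
}
```

In a real plugin the two steps would live in separate classes, an HtmlParseFilter and an IndexingFilter, as the message above describes.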
Re: indexing just certain content
Andrzej,

The use case you have in mind is: at the parsing stage, filter out garbage content and index only the rest.

I have a different use case. I want to keep everything as in standard indexing _AND_ also extract a part to be indexed in a dedicated field (which will be boosted at search time). In my case, certain parts of a document are more important than others.

So I would like either

1. to access the HTML representation at indexing time (not possible, or I did not find how), or
2. to create a dual representation of the document: the plain, standard document and a filtered document.

I think option 2 is much better, because it fits the model better and allows for a lot of other use cases.

Best regards,

2009/10/9 Andrzej Bialecki
> Guys, please take a look at how HtmlParseFilters are implemented - for
> example the creativecommons plugin. I believe that's exactly the
> functionality that you are looking for.

--
-MilleBii-
RE: indexing just certain content
Yes, I did read the code, but I didn't understand what 'the Creative Commons license' is; that's why I asked what creativecommons means. But as you said, I have to be familiar with DOM manipulation to understand the code, so let's start by learning DOM.

Thanks

> From: kkrugler_li...@transpac.com
> To: nutch-user@lucene.apache.org
> Subject: Re: indexing just certain content
> Date: Fri, 9 Oct 2009 16:39:31 -0700
>
> I think Andrzej is suggesting that you read the code.
> [...]
> So it seems that this plugin would be a great place for you to start.
>
> But you'll need to dig into the code, be familiar with DOM
> manipulation, etc.
>
> -- Ken
Re: indexing just certain content
> can you plz just tell us in english what the plugin creativecommons
> is for ?
> i mean if i will include this plugin in my nutch-site.xml, what will
> i have as result ?

I think Andrzej is suggesting that you read the code.

If you look at the beginning of the CCParseFilter.java file, you'll see:

    /** Adds metadata identifying the Creative Commons license used, if any. */
    public class CCParseFilter implements HtmlParseFilter {

The key routine that you need to implement is:

    /** Adds metadata or otherwise modifies a parse of an HTML document, given
     * the DOM tree of a page. */
    public ParseResult filter(Content content, ParseResult parseResult,
                              HTMLMetaTags metaTags, DocumentFragment doc) {

So it seems that this plugin would be a great place for you to start.

But you'll need to dig into the code, be familiar with DOM manipulation, etc.

-- Ken
RE: indexing just certain content
Hi,

Can you please just tell us in plain English what the creativecommons plugin is for? I mean, if I include this plugin in my nutch-site.xml, what will I get as a result?

Thanks

> Date: Fri, 9 Oct 2009 19:16:44 +0200
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: indexing just certain content
>
> Guys, please take a look at how HtmlParseFilters are implemented - for
> example the creativecommons plugin. I believe that's exactly the
> functionality that you are looking for.
Re: indexing just certain content
BELLINI ADAM wrote:
> [...] i was thinking to start to create an HTML tag filter class.
> mabe i can create my own HTML parser ! as i do for parsing and indexing
> DublinCore metadata...it sounds possible don't you think so ?
>
> i just hv to create also or to find a class which could filter an HTML
> pages and delete certain tag from it

Guys, please take a look at how HtmlParseFilters are implemented - for example the creativecommons plugin. I believe that's exactly the functionality that you are looking for.

--
Best regards,
Andrzej Bialecki <><
RE: indexing just certain content
Hi,

Thanks for your detailed answer. You saved me a lot of time; I was about to start writing an HTML tag filter class. Maybe I can create my own HTML parser, as I did for parsing and indexing DublinCore metadata. It sounds possible, don't you think?

I just have to create, or find, a class that can filter an HTML page and delete certain tags from it.

Thanks.

> Date: Fri, 9 Oct 2009 22:04:41 +0530
> From: g...@srijan.in
> To: nutch-user@lucene.apache.org
> Subject: Re: indexing just certain content
>
> This is something that we would also be interested in. Actually,
> we even have a working solution to extract content from between
> start/stop tags, written by our colleagues from a partner company.
> [...]
Re: indexing just certain content
On Fri, 9 Oct 2009 18:00:41 +0200
MilleBii wrote:

> Don't think it will work because at the indexing filter stage all
> the HTML tags are gone from the text.
>
> I think you need to modify the HTML parser to filter out the tags
> you want to get rid of.
>
> In some use case I have I would like to perform 'intelligent
> indexing', ie use the tag information to extract specific fields
> to be indexed along with the main text. A reverse case of yours.
> Todate I did not find a way to do it.
> So if you find a solution I'm with you.
[...]

This is something that we would also be interested in. Actually, we even have a working solution to extract content from between start/stop tags, written by our colleagues from a partner company.

There are a couple of things that we would like to fix with this solution:

(a) It directly modifies HtmlParser.java, which is obviously unmaintainable.
(b) It is a solution for specific tags, rather than picking them up from configuration parameters.
(c) We have not yet traced the complete execution path for Nutch, i.e., when is the parser called, when are filters called, etc. Is there a document anywhere about this? We were thinking of a filter, but from what you say above, that is the wrong stage.
(d) Ideally, whatever solution we come up with would be contributed back to Nutch, which also helps us from a maintenance standpoint. Is there a defined process for getting external plugins accepted into Nutch?

We are willing to put in some time into this, starting the coming week. Where can we start a brainstorming Wiki for this? Is the Nutch Wiki the right place?

Regards,
Gora
Re: indexing just certain content
I don't think it will work, because at the indexing filter stage all the HTML tags are gone from the text.

I think you need to modify the HTML parser to filter out the tags you want to get rid of.

In one of my use cases I would like to perform 'intelligent indexing', i.e. use the tag information to extract specific fields to be indexed along with the main text. It is the reverse of your case. To date I have not found a way to do it, so if you find a solution I'm with you.

2009/10/7 BELLINI ADAM
> in this class the BasicIndexingFilter.java, I think before adding the
> contenent to the document i could parse it again to filter certain div tags ??
> [...]
> what do you think about that ?

--
-MilleBii-
Re: indexing just certain content
In the BasicIndexingFilter.java class, I think that before adding the content to the document I could parse it again to filter out certain div tags?

    String text = parse.getText();

    // I have to parse and filter the text here before adding it to the document.
    // myParserNewMethod is a method I would still have to write.
    String newFilteredText = myParserNewMethod(text);

    // add the filtered text instead of parse.getText()
    doc.add("content", newFilteredText);

What do you think about that?
Re: indexing just certain content
Look at the source code for the basic indexing plugin. It indexes the title tags and some other tags, so it should be a good starting point.

Eric

On Oct 5, 2009, at 1:20 PM, BELLINI ADAM wrote:
> hi,
> but how will i get the HTML tag ? is there any nutch method to get
> from the content the tag ??
> thx
RE: indexing just certain content
Hi,

But how will I get the HTML tag? Is there any Nutch method to get the tag from the content?

Thanks

> Subject: Re: indexing just certain content
> From: e...@lakemeadonline.com
> Date: Mon, 5 Oct 2009 13:09:17 -0700
> To: nutch-user@lucene.apache.org
>
> Adam,
>
> You could turn off all the indexing plugins and write your own plugin
> that only indexes certain meta content from your intranet - giving you
> complete control of the fields indexed.
>
> Eric
Re: indexing just certain content
Adam,

You could turn off all the indexing plugins and write your own plugin that only indexes certain meta content from your intranet - giving you complete control of the fields indexed.

Eric

On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:
> hi
>
> does anybody know if it's possible to index just certain content ? i
> mean i need to dont index some garbage and repetitive data on my
> intranet.
>
> in other way if it is possible to tell the indexer dont index the
> content between certain tags
> like:
>
> plz dont index this bla bla bla
>
> thx to all