Re: how can I index only a portion of html content?
Jayant Kumar Gandhi wrote: It is possible in many ways. One of the ways to do it without using the HTML parser plugin is to do cloaking for your bot.

Hi, Could I please know all possible methods for achieving this? This seems to be a common problem, but I failed to find decent answers on this forum. I'm using a content management system named Infoglue to create my website. Most of the pages on my site have an identical navigation bar, header and footer. The content in these sections shows up in the search results.

In a related question, what does de-duplication in Nutch mean and how does it work? Is it possible to configure Nutch to remove duplicate content such as navigation bars during its de-duplication process?

Regards, Winz

-- View this message in context: http://www.nabble.com/how-can-I-index-only-a-portion-of-html-content--tp5149557p25832007.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to ignore search results that don't have related keywords in main body?
Venkateshprasanna wrote: Hi, You can very well think of doing that if you know that you will crawl and index only a selected set of web pages which follow the same design. Otherwise it turns into a never-ending process, i.e. finding the sections, frames, divs, spans, CSS classes and the like on each of the web pages. Scalability would obviously be an issue.

Hi, Could I please know how we can ignore template items like the header, footer and menu/navigation while crawling and indexing pages that follow the same design? I'm using a content management system called Infoglue to develop my website. A standard template is applied to all the pages on the website. The search results from Nutch show content from the menu/navigation bar multiple times. I need to get rid of menu/navigation content from the search results. Please guide me regarding this.

Thanks, Vinay
Re: indexing just certain content
MilleBii wrote: Andrzej, The use case you are thinking of is: at the parsing stage, filter out garbage content and index only the rest. I have a different use case: I want to keep everything as standard indexing _AND_ also extract a part to be indexed in a dedicated field (which will be boosted at search time). In my case, certain parts of a document have more importance than others. So I would like either 1. to access the HTML representation at indexing time (not possible, or I did not find how), or 2. to create a dual representation of the document: the plain standard one plus a filtered one. I think option 2 is much better because it fits the model better and allows for a lot of other use cases.

Actually, creativecommons provides hints on how to do this, but to be more explicit:

* In your HtmlParseFilter you need to extract the parts that you want from the DOM tree and put them inside ParseData.metadata. This way you preserve both the original text and the special parts that you extracted.
* In your IndexingFilter you retrieve the parts from ParseData.metadata and add them as additional index fields (don't forget to specify the indexing backend options).
* In your QueryFilter plugin.xml you declare that QueryParser should pass your special fields through without treating them as terms, and in the implementation you create a BooleanClause to be added to the translated query.

-- Best regards, Andrzej Bialecki, http://www.sigram.com (Information Retrieval, Semantic Web, Embedded Unix, System Integration). Contact: info at sigram dot com
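The first step above (pull selected DOM parts aside so they survive alongside the normal plain text) can be sketched outside of Nutch. This is only an illustration: the class and method names are mine, a plain Map stands in for ParseData.metadata, and the JDK XML parser is used in place of the lenient HTML DOM that parse-html (via NekoHTML/TagSoup) would hand a real HtmlParseFilter, so the input here must be well-formed XHTML.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class PartExtractor {
    // Collect the text of every element whose id is in `wanted`, keyed by
    // that id -- a stand-in for stuffing the parts into ParseData.metadata
    // while the full page text is indexed as usual.
    static Map<String, String> extractParts(String xhtml, Set<String> wanted) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> parts = new HashMap<>();
        collect(doc.getDocumentElement(), wanted, parts);
        return parts;
    }

    static void collect(Element el, Set<String> wanted, Map<String, String> parts) {
        if (wanted.contains(el.getAttribute("id"))) {
            parts.put(el.getAttribute("id"), el.getTextContent().trim());
        }
        NodeList kids = el.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) {
                collect((Element) kids.item(i), wanted, parts);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id='abstract'>Key text</div><div id='body'>Rest</div></body></html>";
        System.out.println(extractParts(page, Set.of("abstract")));  // prints {abstract=Key text}
    }
}
```

In a real plugin the map entries would become metadata values in the filter's Parse result, and the IndexingFilter would read them back and add a field per entry.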
Re: How to ignore search results that don't have related keywords in main body?
winz wrote: Could I please know how we can ignore template items like the header, footer and menu/navigation while crawling and indexing pages that follow the same design? [...] The search results from Nutch show content from the menu/navigation bar multiple times. I need to get rid of menu/navigation content from the search results.

If all you index is this particular site, then you know the positions of the navigation items, right? Then you can remove these elements in your HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these elements.

-- Best regards, Andrzej Bialecki
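The "remove these elements" idea can be sketched as a small DOM pruner. Again a hedged illustration, not Nutch code: the class name and the id-based selection are my own assumptions, and the JDK XML parser requires well-formed XHTML, whereas a patched DOMContentUtils would walk the lenient DOM that parse-html builds.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;

public class NavStripper {
    // Drop every element whose id is in `skip`, then return the remaining
    // text -- roughly what skipping those nodes during text extraction
    // in DOMContentUtils would produce.
    static String textWithout(String xhtml, Set<String> skip) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        prune(doc.getDocumentElement(), skip);
        return doc.getDocumentElement().getTextContent().trim();
    }

    static void prune(Element el, Set<String> skip) {
        NodeList kids = el.getChildNodes();
        // Iterate backwards because we remove children as we go.
        for (int i = kids.getLength() - 1; i >= 0; i--) {
            Node kid = kids.item(i);
            if (kid instanceof Element) {
                Element child = (Element) kid;
                if (skip.contains(child.getAttribute("id"))) {
                    el.removeChild(child);
                } else {
                    prune(child, skip);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id='nav'>Home | About</div><p>Real article text.</p></body></html>";
        System.out.println(textWithout(page, Set.of("nav")));  // prints: Real article text.
    }
}
```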
RE: indexing just certain content
Hi, you said: "* in your HtmlParseFilter you need to extract from DOM tree the parts that you want..." But my problem is that I don't know what to extract, because I don't know all the pages I'm indexing; I only know what not to index.

1 - I only know what not to index. All pages have some sections that I won't index; since I know those sections, I want to strip them from the document and keep the rest of the important content. The sections are headers, top menus, right menus, left menus and some other sections:

<div id="header"> bla bla </div>
<div id="top_menu"> bla bla </div>
<div id="left_menu"> bla bla </div>
<div id="right_menu"> bla bla </div>

Maybe I could find some Java classes which can delete sections from an HTML page? If I found one, I guess it would be easier to use.

2 - You said "don't forget indexing backend options": could you tell me what they are?

3 - We are using Solr, so I just have to index the important content. The search will be performed with Solr, so I guess I don't need the QueryFilter.

Best regards

Date: Sat, 10 Oct 2009 16:04:10 +0200 From: a...@getopt.org To: nutch-user@lucene.apache.org Subject: Re: indexing just certain content (quoted text snipped)
RE: indexing just certain content
Yes MilleBii, I told you before that I created a Dublin Core metadata parser and indexer, so I parsed my HTML and created fields to get my DC metadata. My missing piece is how to delete sections from an HTML page :( If I find this piece, the rest will be a piece of cake :)

Date: Sat, 10 Oct 2009 16:41:44 +0200 Subject: Re: indexing just certain content From: mille...@gmail.com To: nutch-user@lucene.apache.org

Andrzej, Great!!! I did not realize you could put your own content in ParseData.metadata and read it back in the IndexingFilter... this was my missing piece of the puzzle; for the rest I knew what to do. Thanks,

2009/10/10 Andrzej Bialecki a...@getopt.org (quoted text snipped)

-- -MilleBii-
RE: How to ignore search results that don't have related keywords in main body?
Hi guys, this is just what I'm talking about in my post "indexing just certain content"; you can read it, maybe it could help you. I was asking how to get rid of the garbage sections in a document and to parse only the important data, so I guess you will create your own parser and indexer. But the problem is how we can delete those garbage sections from the HTML. Try to read my post; maybe we could merge our two threads (I don't know if we can merge threads on this mailing list) so we only have to track one post.

Best regards

Date: Sat, 10 Oct 2009 17:31:57 +0200 From: a...@getopt.org To: nutch-user@lucene.apache.org Subject: Re: How to ignore search results that don't have related keywords in main body? (quoted text snipped)
RE: indexing just certain content
What I want is exactly what is explained in this second post: "How to ignore search results that don't have related keywords in main body?"

From: mbel...@msn.com To: nutch-user@lucene.apache.org Subject: RE: indexing just certain content Date: Sat, 10 Oct 2009 15:35:31 + (quoted text snipped)
Re: How to ignore search results that don't have related keywords in main body?
BELLINI ADAM wrote: I was asking how to get rid of the garbage sections in a document and to parse only the important data... but the problem is how we can delete those garbage sections from the HTML.

What is garbage? Can you define it in terms of a regex pattern or an XPath expression that points to specific elements in the DOM tree? If you crawl a single site (or a few) with well-defined templates, then you can hardcode some rules for removing unwanted parts of the page. If you can't do this, then there are some heuristic methods to solve this. There are two groups of methods:

* Page at a time (local): this group of methods considers only the current page that you analyze. The quality of filtering is usually limited.
* Groups of pages (e.g. per site): these methods consider many pages at a time and try to find a recurring theme among them. Since you first need to accumulate some pages, this can't be done on the fly, i.e. it requires a separate post-processing step.

The easiest to implement in Nutch is the first approach (page at a time). There are many possible implementations, e.g. based on text patterns, on the visual position of elements, on DOM tree patterns, on block content characteristics, etc. Here, for example, is a simple method:

* Collect text from the page in blocks, where each block fits within structural tags (div and table tags). Also collect the number of links in each block.
* Remove a percentage of the smallest blocks where the link count is high; these are likely navigational elements.
* Reconstruct the whole page from the remaining blocks.

-- Best regards, Andrzej Bialecki
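The block/link-count heuristic above could be sketched roughly as follows. This is only an illustration of the idea: the class name, the regex-based block splitting, and the "3 links and under 200 characters" threshold are my own simplistic assumptions, and it returns extracted text rather than reconstructed HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkDensityFilter {
    private static final Pattern LINK = Pattern.compile("(?i)<a\\b");

    // Split the page into blocks on <div>/<table> boundaries, drop short
    // blocks dominated by links (likely navigation), rejoin the rest.
    static String stripNavBlocks(String html) {
        String[] blocks = html.split("(?i)</?(?:div|table)[^>]*>");
        StringBuilder keep = new StringBuilder();
        for (String block : blocks) {
            String text = block.replaceAll("<[^>]+>", " ").trim();
            if (text.isEmpty()) continue;
            int links = 0;
            Matcher m = LINK.matcher(block);
            while (m.find()) links++;
            // Heuristic threshold: short block + many links => navigation.
            boolean navLike = links >= 3 && text.length() < 200;
            if (!navLike) keep.append(text).append('\n');
        }
        return keep.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<div><a href='/'>Home</a> <a href='/a'>About</a> <a href='/c'>Contact</a></div>"
                    + "<div>This is the actual article body, long enough to keep.</div>";
        System.out.println(stripNavBlocks(page));  // prints: This is the actual article body, long enough to keep.
    }
}
```

A production version would tune the thresholds per site (or use relative percentiles, as the post suggests) rather than hardcode them.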
RE: How to ignore search results that don't have related keywords in main body?
As I explained in my post, the sections I don't want to index are headers, top menus, right menus and left menus; this is what I mean by garbage:

<div id="header"> bla bla </div>
<div id="top_menu"> bla bla </div>
<div id="left_menu"> bla bla </div>
<div id="right_menu"> bla bla </div>

Each page contains the same header and menu sections, and I don't want to index them because they are the same. So on each page I just want to parse those sections to get outlinks, but not index them, so I have to create a filtered content (without those sections). But how do I construct this content, since I don't know all the blocks and tags these pages will contain, and I don't even know whether they are well-formed (it's just HTML)? The only thing I'm sure about is that there is a template which applies to all pages; this template is the div sections described above (menus, left menus, etc.). So I guess the easiest solution is to find a Java class which takes an HTML file and certain sections (div id = 'header') as parameters, deletes those sections from the HTML file, and produces new, cleaned HTML.

Date: Sat, 10 Oct 2009 18:21:47 +0200 From: a...@getopt.org To: nutch-user@lucene.apache.org Subject: Re: How to ignore search results that don't have related keywords in main body? (quoted text snipped)
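A class of the kind asked for ("take an HTML file plus section ids, delete those sections") can be sketched without an XML parser, which matters here because the pages are not guaranteed to be well-formed. The class name and the depth-tracking approach are my own; the id-matching regex is deliberately rough and a real implementation would use a lenient HTML parser such as NekoHTML or TagSoup instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionRemover {
    private static final Pattern DIV = Pattern.compile("(?i)<div\\b[^>]*>|</div>");

    // Delete the <div> whose id attribute equals `id`, including everything
    // nested inside it. Tracks div nesting depth rather than relying on the
    // page being well-formed XML.
    static String removeDiv(String html, String id) {
        Pattern open = Pattern.compile(
                "(?i)<div\\b[^>]*id\\s*=\\s*['\"]" + Pattern.quote(id) + "['\"][^>]*>");
        Matcher m = open.matcher(html);
        if (!m.find()) return html;          // section not present: leave page unchanged
        int start = m.start();
        int depth = 1;
        Matcher d = DIV.matcher(html);
        d.region(m.end(), html.length());    // scan only past the opening tag
        while (depth > 0 && d.find()) {
            depth += d.group().startsWith("</") ? -1 : 1;
        }
        int end = (depth == 0) ? d.end() : html.length();  // unclosed div: drop to end
        return html.substring(0, start) + html.substring(end);
    }

    public static void main(String[] args) {
        String page = "<body><div id='header'>logo <div>sub</div></div><p>keep me</p></body>";
        System.out.println(removeDiv(page, "header"));  // prints: <body><p>keep me</p></body>
    }
}
```

Calling removeDiv once per template id ("header", "top_menu", "left_menu", "right_menu") would produce the cleaned HTML to hand to the text extractor, while the unmodified page can still be used for outlink extraction.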
OutOfMemoryError: Java heap space
hi all, I am getting the JVM error below during a recrawl, specifically during the execution of $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

I am running on a single machine: Linux 2.6.24-23-xen x86_64, 4 GB RAM, java-6-sun, nutch-1.0, JAVA_HEAP_MAX=-Xmx1000m. Any suggestions? I am about to raise my heap max to -Xmx2000m. I haven't encountered this before when running with the above specs, so I am not sure what could have changed. Any suggestions will be greatly appreciated. Thanks.

2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce
2009-10-11 14:30:15,801 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce
2009-10-11 14:31:19,197 INFO [org.apache.hadoop.mapred.TaskRunner] - Communication exception: java.lang.OutOfMemoryError: Java heap space
  at java.util.ResourceBundle$Control.getCandidateLocales(ResourceBundle.java:2220)
  at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1229)
  at java.util.ResourceBundle.getBundle(ResourceBundle.java:715)
  at org.apache.hadoop.mapred.Counters$Group.getResourceBundle(Counters.java:218)
  at org.apache.hadoop.mapred.Counters$Group.<init>(Counters.java:202)
  at org.apache.hadoop.mapred.Counters.getGroup(Counters.java:410)
  at org.apache.hadoop.mapred.Counters.incrAllCounters(Counters.java:491)
  at org.apache.hadoop.mapred.Counters.sum(Counters.java:506)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:222)
  at org.apache.hadoop.mapred.Task$1.run(Task.java:418)
  at java.lang.Thread.run(Thread.java:619)
2009-10-11 14:31:22,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce
2009-10-11 14:31:25,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce
2009-10-11 14:31:40,002 WARN [org.apache.hadoop.mapred.LocalJobRunner] - job_local_0001 java.lang.OutOfMemoryError: Java heap space
  at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
  at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
  at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
  at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
  at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
  at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
  at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
  at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
  at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:101)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
  at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
  at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
  at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
  at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
  at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
  at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:377)
  at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:113)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
  at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
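In local mode the whole mergesegs reduce runs in one JVM, so the default 1000 MB heap can be too small for large segments. A hedged sketch of the usual knobs (the NUTCH_HEAPSIZE variable is what bin/nutch in Nutch 1.0 turns into -Xmx; verify against your copy of the script):

```shell
# Raise the client JVM heap: bin/nutch reads NUTCH_HEAPSIZE (in MB)
# and converts it to -Xmx, so the script itself needs no editing.
export NUTCH_HEAPSIZE=2000
bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
```

When running on a real Hadoop cluster instead of LocalJobRunner, the task JVMs do not inherit the client's heap; there the equivalent setting is mapred.child.java.opts (e.g. -Xmx2000m) in the Hadoop configuration.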