RE: How to ignore search results that don't have related keywords in main body?
Take a look at the creativecommons plugin for an example of how to manipulate the DOM in your new parser.
RE: indexing just certain content
This is not very clear: there is a big difference between removing garbage before indexing (in the parser or an IndexingFilter) and removing search results... I think you want the first one. You just need to build a custom parser that filters out the tags you don't want.
RE: indexing just certain content
We are using Solr. I don't know how to remove search results, which is why I don't want to index the garbage data in the first place, and why I am wondering about removing that data during the parse operation. Yes, I want to filter the data out of the HTML, and this is my big problem.

In my post I am asking whether there is a Java class that deletes a section from an HTML file. Since I only know the sections I want to delete (it's a template), I am not able to construct a new HTML file by keeping only the sections I need: I don't know those sections, and I don't know whether their tags are well-formed. The only things I know are that the sections I want to remove are DIV sections, and that those DIVs are well-formed. So the big deal is: removing known sections from an HTML file, without knowing anything about the other sections. I will try to write such a class to clean those HTML files.
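As far as I know there is no ready-made Nutch class that does exactly this, but a small utility along these lines is straightforward to write. Below is a rough, untested sketch using NekoHTML (one of the HTML parsers Nutch's parse-html plugin can use), which balances broken markup for you; the class name and the assumption that the template DIVs can be matched by their id attributes are mine:

    import java.io.StringReader;

    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class TemplateStripper {

        /** Returns the DOM of the page with every DIV whose id matches removed. */
        public static Document strip(String html, String... idsToRemove) throws Exception {
            // NekoHTML balances unclosed and misnested tags, so the rest of the
            // document does not have to be well-formed.
            DOMParser parser = new DOMParser();
            parser.parse(new InputSource(new StringReader(html)));
            Document doc = parser.getDocument();

            // NekoHTML reports element names in upper case by default.
            NodeList divs = doc.getElementsByTagName("DIV");
            // Walk backwards: the NodeList is live and shrinks as nodes are removed.
            for (int i = divs.getLength() - 1; i >= 0; i--) {
                Node div = divs.item(i);
                Node id = div.getAttributes().getNamedItem("id");
                if (id == null) continue;
                for (String target : idsToRemove) {
                    if (target.equals(id.getNodeValue())) {
                        div.getParentNode().removeChild(div);
                        break;
                    }
                }
            }
            return doc;
        }
    }

A custom parser (or an HtmlParseFilter) would call something like strip(html, "header", "footer", "menu") before extracting the text; those ids are placeholders for whatever section ids your template actually uses.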
RE: OutOfMemoryError: Java heap space
I guess your segments are too big. Try to merge just a few of them in one shot: if you have N segments, start by merging just N-1; if you still get the error, try N-2, and so on, until you find the largest number of segments you can merge in one shot. Thanks.

Subject: OutOfMemoryError: Java heap space
From: fa...@butterflycluster.net
To: nutch-user@lucene.apache.org
Date: Sun, 11 Oct 2009 15:26:14 +1100

Hi all,

I am getting the JVM error below during a recrawl, specifically during the execution of:

    $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

I am running on a single machine: Linux 2.6.24-23-xen x86_64, 4G RAM, java-6-sun, nutch-1.0, JAVA_HEAP_MAX=-Xmx1000m. Any suggestions? I am about to raise my heap max to -Xmx2000m. I haven't encountered this before when running with the above specs, so I am not sure what could have changed. Any suggestions will be greatly appreciated. Thanks.

    2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce
    2009-10-11 14:30:15,801 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce
    2009-10-11 14:31:19,197 INFO [org.apache.hadoop.mapred.TaskRunner] - Communication exception:
    java.lang.OutOfMemoryError: Java heap space
        at java.util.ResourceBundle$Control.getCandidateLocales(ResourceBundle.java:2220)
        at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1229)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:715)
        at org.apache.hadoop.mapred.Counters$Group.getResourceBundle(Counters.java:218)
        at org.apache.hadoop.mapred.Counters$Group.<init>(Counters.java:202)
        at org.apache.hadoop.mapred.Counters.getGroup(Counters.java:410)
        at org.apache.hadoop.mapred.Counters.incrAllCounters(Counters.java:491)
        at org.apache.hadoop.mapred.Counters.sum(Counters.java:506)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:222)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:418)
        at java.lang.Thread.run(Thread.java:619)
    2009-10-11 14:31:22,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce
    2009-10-11 14:31:25,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce
    2009-10-11 14:31:40,002 WARN [org.apache.hadoop.mapred.LocalJobRunner] - job_local_0001
    java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
        at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
        at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:101)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
        at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:377)
        at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:113)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
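To make the advice above concrete: instead of passing the whole crawl/segments/* glob to mergesegs, list an explicit subset of segments and merge in several passes. The segment directory names below are hypothetical:

    # Merge only a few segments per invocation to keep the reducer's heap usage down.
    $NUTCH_HOME/bin/nutch mergesegs crawl/MERGED1 \
        crawl/segments/20091011142956 crawl/segments/20091011153012
    $NUTCH_HOME/bin/nutch mergesegs crawl/MERGED2 \
        crawl/segments/20091011161127 crawl/segments/20091011170001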
Re: Incremental Whole Web Crawling
When I set generate.update.db to true and then run generate, it only runs twice: it generates 100K for the first fetchlist, 62.5K for the second, and 0 for the third, on a seed list of 1.6M. I don't understand this; for a topN of 100K it should run 16 times and create 16 distinct fetchlists, if I am not mistaken.

Eric

On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:

Hey,

Never mind. I found generate.update.db in nutch-default.xml and set it to true.

Regards,
Gaurang

2009/10/5 Gaurang Patel gaurangtpa...@gmail.com:

Hey Andrzej,

Can you tell me where to set this property (generate.update.db)? I am trying to run a crawl scenario similar to the one Eric is running.

-Gaurang

2009/10/5 Andrzej Bialecki a...@getopt.org:

Eric wrote: Andrzej, just to make sure I have this straight: set the generate.update.db property to true, then run bin/nutch generate crawl/crawldb crawl/segments -topN 100000, 16 times?

Yes. When this property is set to true, each fetchlist will be different, because the records for those pages that are already on another fetchlist will be temporarily locked. Please note that this lock only holds for one week, so you need to fetch all segments within one week of generating them. You can fetch and run updatedb in arbitrary order, so once you have fetched some segments you can run the parsing and updatedb on just those segments, without waiting for all 16 segments to be processed.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Eric Osgood - Cal Poly - Computer Engineering, Moon Valley Software
eosg...@calpoly.edu, e...@lakemeadonline.com
www.calpoly.edu/~eosgood, www.lakemeadonline.com
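For reference, the usual way to set this is to override it in conf/nutch-site.xml rather than editing nutch-default.xml in place. A minimal sketch (the description text is my own paraphrase of what the thread says the property does):

    <!-- conf/nutch-site.xml -->
    <property>
      <name>generate.update.db</name>
      <value>true</value>
      <description>If true, generate marks the selected URLs in the crawldb
      so that successive generate runs produce disjoint fetchlists.</description>
    </property>

With that in place, repeating the same command produces the distinct fetchlists discussed above:

    $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments -topN 100000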
RE: OutOfMemoryError: Java heap space
That seems to work; thanks for that.

Hi there,

I ended up using a topN during my generate phase. I didn't want to do this, although it seems to fix my problem. What I have also been observing is that the reduce phase seems to take a very long time on large segments. Could anyone shed some light on the ratio of segment size to processing time on a standard machine, say a 2G RAM, 4-core server? The segments in my test environment are now about 3 MB or so each; at one point I had 600 MB segments, and mergesegs seemed to take forever and eventually stopped responding. The last I heard from the process was something like:

    2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce reduce

over and over, and then finally silence. Any insight?
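If the underlying issue is that any single merged segment grows too large, it may also be worth looking at the -slice option that, if I remember correctly, SegmentMerger accepts in Nutch 1.0: it splits the merged output into multiple segments of at most the given number of URLs. A hypothetical invocation, with an arbitrary slice size:

    # Merge everything, but cap each output segment at 50000 URLs.
    $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -slice 50000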