RE: How to ignore search results that don't have related keywords in main body?

2009-10-11 Thread MilleBii
Just use the creativecommons library to manipulate the DOM in your new parser.


RE: indexing just certain content

2009-10-11 Thread MilleBii
This is not very clear:
There is a big difference between removing garbage for the IndexingFilter and
removing search results... I think you want the first one.

You just need to build a custom Parser that will filter out the tags you don't
want.
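
If it helps to see the wiring: in practice this is a plugin hooked into the
HtmlParseFilter extension point. A rough plugin.xml sketch (the plugin id, jar
name and class are placeholders, not an existing Nutch plugin):

<plugin id="parse-cleanup" name="Template Cleanup Filter"
        version="0.0.1" provider-name="example.org">
   <runtime>
      <library name="parse-cleanup.jar">
         <export name="*"/>
      </library>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>
   <extension id="org.example.parse.cleanup" name="Template Cleanup Filter"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="TemplateCleanupFilter"
                      class="org.example.parse.TemplateCleanupFilter"/>
   </extension>
</plugin>

The plugin also has to be added to the plugin.includes pattern in
conf/nutch-site.xml so the parser picks it up.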


RE: indexing just certain content

2009-10-11 Thread BELLINI ADAM

we are using SOLR, and I don't know how to remove search results; that's why I
don't want to index the garbage data, and that's why I'm wondering whether to
remove that data in the parse operation. Yes, I want to filter the data out of
the HTML, and this is my big problem. In my post I'm asking if there is a Java
class that deletes sections from an HTML file. Since I only know the sections I
want to delete (it's a template), I'm not able to construct a new HTML file by
taking only the sections I need: I don't know what those other sections are, and
I don't know whether all the HTML tags are well formed (the only things I know
are that the sections I want to remove are DIV sections and that those DIVs are
well formed).
So the big deal is: removing known sections from an HTML file, without knowing
the other sections.
I will try to construct such a class to clean those HTML files.
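
Something along these lines is what I have in mind. Just a rough sketch over a
plain org.w3c.dom tree; the class name and the idea of passing in the known
template ids are mine, not an existing Nutch class:

import java.util.Set;

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Strips known template sections (DIVs with given ids) out of a DOM tree. */
public class TemplateSectionRemover {

  /** Walks the tree depth-first and drops any div whose id is in idsToRemove. */
  public static void removeDivs(Node node, Set<String> idsToRemove) {
    NodeList children = node.getChildNodes();
    // iterate backwards so removals do not shift the indices still to be visited
    for (int i = children.getLength() - 1; i >= 0; i--) {
      Node child = children.item(i);
      if (child.getNodeType() == Node.ELEMENT_NODE) {
        Element el = (Element) child;
        if ("div".equalsIgnoreCase(el.getTagName())
            && idsToRemove.contains(el.getAttribute("id"))) {
          node.removeChild(child);   // drop the whole known section
          continue;                  // nothing left to recurse into
        }
      }
      removeDivs(child, idsToRemove); // keep looking deeper
    }
  }
}

The same walk could be called from a custom HtmlParseFilter on the DOM that
parse-html builds (NekoHTML reports tag names in upper case, hence the
equalsIgnoreCase).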



 From: mille...@gmail.com
 Subject: RE: indexing just certain content
 Date: Sun, 11 Oct 2009 11:02:21 +0200
 To: nutch-user@lucene.apache.org
 
 This is not very clear:
 There is a big difference between removing garbage for the IndexingFilter and
 removing search results... I think you want the first one.
 
 You just need to build a custom Parser that will filter out the tags you don't
 want.
  

RE: OutOfMemoryError: Java heap space

2009-10-11 Thread BELLINI ADAM

I guess your segments are too big...
Try to merge just a few of them in one shot.
If you have N segments, try to start by merging just N-1; if you still have the
error, try N-2... until you find the largest number of segments you can merge in
one shot.
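
Something like this, for example (untested sketch; a batch size of 3 is just an
example, and the paths are the ones from your command):

# merge the segments three at a time instead of all of them in one run
segs=(crawl/segments/*)
for ((i = 0; i < ${#segs[@]}; i += 3)); do
  $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments_$i "${segs[@]:i:3}"
done

Each run writes its merged output into its own crawl/MERGEDsegments_<i>
directory, so nothing gets overwritten between batches.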

thx


 Subject: OutOfMemoryError: Java heap space
 From: fa...@butterflycluster.net
 To: nutch-user@lucene.apache.org
 Date: Sun, 11 Oct 2009 15:26:14 +1100
 
 hi all,
 
 I am getting the JVM error below during a recrawl, specifically during the
 execution of
 
 $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
 
 i am running on a single machine:
 Linux 2.6.24-23-xen  x86_64
 4G RAM
 java-6-sun
 nutch-1.0
 JAVA_HEAP_MAX=-Xmx1000m 
 
 Any suggestions? I am about to up my heap max to -Xmx2000m.
 
 I haven't encountered this before when running with the above specs, so I am
 not sure what could have changed.
 Any suggestions will be greatly appreciated.
 
 Thanks.
 
 
  
  
  2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
  2009-10-11 14:30:15,801 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
  2009-10-11 14:31:19,197 INFO [org.apache.hadoop.mapred.TaskRunner] - Communication exception: java.lang.OutOfMemoryError: Java heap space
        at java.util.ResourceBundle$Control.getCandidateLocales(ResourceBundle.java:2220)
        at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1229)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:715)
        at org.apache.hadoop.mapred.Counters$Group.getResourceBundle(Counters.java:218)
        at org.apache.hadoop.mapred.Counters$Group.<init>(Counters.java:202)
        at org.apache.hadoop.mapred.Counters.getGroup(Counters.java:410)
        at org.apache.hadoop.mapred.Counters.incrAllCounters(Counters.java:491)
        at org.apache.hadoop.mapred.Counters.sum(Counters.java:506)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:222)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:418)
        at java.lang.Thread.run(Thread.java:619)

  2009-10-11 14:31:22,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
  2009-10-11 14:31:25,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
  2009-10-11 14:31:40,002 WARN [org.apache.hadoop.mapred.LocalJobRunner] - job_local_0001
  java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
        at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
        at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:101)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
        at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:377)
        at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:113)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
 
  

Re: Incremental Whole Web Crawling

2009-10-11 Thread Eric Osgood
When I set generate.update.db to true and then run generate, it only runs
twice: it generates 100K for the 1st gen, 62.5K for the 2nd gen, and 0 for the
3rd gen, on a seed list of 1.6M. I don't understand this: for a topN of 100K it
should run 16 times and create 16 distinct lists, if I am not mistaken.


Eric


On Oct 5, 2009, at 10:01 PM, Gaurang Patel wrote:


Hey,

Never mind. I got *generate.update.db* in *nutch-default.xml* and set it to
true.
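
(For the record, the same override can live in conf/nutch-site.xml instead of
editing nutch-default.xml directly; a minimal snippet would be:

<property>
  <name>generate.update.db</name>
  <value>true</value>
</property>

so an upgrade of Nutch does not clobber the change.)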

Regards,
Gaurang

2009/10/5 Gaurang Patel gaurangtpa...@gmail.com


Hey Andrzej,

Can you tell me where to set this property (generate.update.db)? I am trying
to run a crawl scenario similar to the one Eric is running.

-Gaurang

2009/10/5 Andrzej Bialecki a...@getopt.org

Eric wrote:



Andrzej,

Just to make sure I have this straight: set the generate.update.db property to
true, then run

bin/nutch generate crawl/crawldb crawl/segments -topN 100000

16 times?




Yes. When this property is set to true, each fetchlist will be different,
because the records for the pages that are already on another fetchlist will be
temporarily locked. Please note that this lock holds only for one week, so you
need to fetch all segments within one week of generating them.

You can fetch and updatedb in an arbitrary order, so once you have fetched some
segments you can run the parsing and updatedb just for those segments, without
waiting for all 16 segments to be processed.
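
In script form, one full cycle per generated segment looks roughly like this
(the topN matches Eric's numbers, and picking up the newest segment directory
with ls/tail is just one way to do it):

bin/nutch generate crawl/crawldb crawl/segments -topN 100000
segment=`ls -d crawl/segments/* | tail -1`   # the fetchlist generate just created
bin/nutch fetch $segment -noParsing          # -noParsing because we parse below
bin/nutch parse $segment
bin/nutch updatedb crawl/crawldb $segment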



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



RE: OutOfMemoryError: Java heap space

2009-10-11 Thread fadzi
hi there,

that seems to work; thanks for that.

i ended up using a topN during my generate phase, but i didn't want to do this,
although it does seem to fix my problem.

what i have been observing also is that the reduce phase seems to take a very
long time on large segments.

could anyone shed some light on the ratio of segment size to processing time on
a standard machine, say:
2G RAM
4-core server?

my segments in my test environment are now about 3MB or so each; at one point i
had a 600MB segment and mergesegs seemed to take forever and eventually stopped
responding.. the last i heard from the process was something like:

2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] -
reduce > reduce .. ..

over and over then finally silence...

any insight?
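
for now i will also try giving the merge job more heap; bin/nutch picks up the
NUTCH_HEAPSIZE environment variable (in MB, it is what sets JAVA_HEAP_MAX), so
something like:

export NUTCH_HEAPSIZE=2000
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*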


 I guess your segments are too big...
 Try to merge just a few of them in one shot.
 If you have N segments, try to start by merging just N-1; if you still have
 the error, try N-2... until you find the largest number of segments you can
 merge in one shot.

 thx


 Subject: OutOfMemoryError: Java heap space
 From: fa...@butterflycluster.net
 To: nutch-user@lucene.apache.org
 Date: Sun, 11 Oct 2009 15:26:14 +1100

 hi all,

 I am getting the JVM error below during a recrawl, specifically during
 the execution of

 $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

 i am running on a single machine:
 Linux 2.6.24-23-xen  x86_64
 4G RAM
 java-6-sun
 nutch-1.0
 JAVA_HEAP_MAX=-Xmx1000m

 Any suggestions? I am about to up my heap max to -Xmx2000m.

 I haven't encountered this before when running with the above specs, so I am
 not sure what could have changed.
 Any suggestions will be greatly appreciated.

 Thanks.


 
 
  2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
  2009-10-11 14:30:15,801 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
  2009-10-11 14:31:19,197 INFO [org.apache.hadoop.mapred.TaskRunner] - Communication exception: java.lang.OutOfMemoryError: Java heap space
        at java.util.ResourceBundle$Control.getCandidateLocales(ResourceBundle.java:2220)
        at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1229)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:715)
        at org.apache.hadoop.mapred.Counters$Group.getResourceBundle(Counters.java:218)
        at org.apache.hadoop.mapred.Counters$Group.<init>(Counters.java:202)
        at org.apache.hadoop.mapred.Counters.getGroup(Counters.java:410)
        at org.apache.hadoop.mapred.Counters.incrAllCounters(Counters.java:491)
        at org.apache.hadoop.mapred.Counters.sum(Counters.java:506)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:222)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:418)
        at java.lang.Thread.run(Thread.java:619)

  2009-10-11 14:31:22,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
  2009-10-11 14:31:25,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
  2009-10-11 14:31:40,002 WARN [org.apache.hadoop.mapred.LocalJobRunner] - job_local_0001
  java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
        at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
        at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
        at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
        at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:101)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
        at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
        at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
        at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:377)
        at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:113)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)

