Merging issues!

2009-10-07 Thread tittutomen

I've tried to merge small indexes into an existing index.

What I've observed is that after merging, the index size is decreasing! I started
with an index having 85K documents, then merged 5K docs each cycle. After 3
cycles, instead of becoming 100K, it has strangely ended up as an index of only
80K documents.

Any idea what goes wrong?

-TT



Re: Targeting Specific Links

2009-10-07 Thread Andrzej Bialecki

Eric Osgood wrote:

Andrzej,

How would I check for a flag during fetch?


You would check for a flag during generation - please see 
ScoringFilter.generatorSortValue(); that's where you can check for a 
flag and set the sort value to Float.MIN_VALUE, so that the link will 
never be selected for fetching.


And you would put the flag in CrawlDatum metadata when ParseOutputFormat 
calls ScoringFilter.distributeScoreToOutlinks().
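
A rough sketch of how the two pieces could fit together, assuming the Nutch 1.0 
ScoringFilter signatures. Only the two relevant methods are shown (the rest of 
the interface would be no-ops), and isRelevant() plus the metadata key are 
made-up names:

import java.util.Collection;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.scoring.ScoringFilterException;

// ... inside a class implementing org.apache.nutch.scoring.ScoringFilter ...

// Made-up metadata key used as the flag.
private static final Text RELEVANT_FLAG = new Text("x-relevant");

// Called from ParseOutputFormat: tag the outlinks you actually want.
public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
    Collection<Map.Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
    throws ScoringFilterException {
  for (Map.Entry<Text, CrawlDatum> target : targets) {
    if (isRelevant(target.getKey().toString())) {        // hypothetical test
      target.getValue().getMetaData().put(RELEVANT_FLAG, new Text("1"));
    }
  }
  return adjust;
}

// Called from the Generator: anything without the flag sorts to the bottom
// and is never selected for fetching.
public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
    throws ScoringFilterException {
  if (!datum.getMetaData().containsKey(RELEVANT_FLAG)) {
    return Float.MIN_VALUE;
  }
  return initSort;
}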




Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page while 
still keeping a total of X links per page: if I find the links I want, I 
add them to the list up to X, and if I don't reach X, I add other links 
until X is reached. This way, I don't waste crawl time on non-relevant 
links.


You can modify the collection of target links passed to 
distributeScoreToOutlinks() - this way you can affect both which links 
are stored and what kind of metadata each of them gets.
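
For that "up to X links per page" logic, a sketch along these lines might do 
(again assuming the Nutch 1.0 signature; MAX_LINKS and isRelevant() are made-up 
names): keep the wanted links first, then pad with other links up to the cap.

// Made-up per-page cap (the "X" above) and hypothetical relevance test.
private static final int MAX_LINKS = 20;

public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
    Collection<Map.Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
    throws ScoringFilterException {
  List<Map.Entry<Text, CrawlDatum>> wanted = new ArrayList<Map.Entry<Text, CrawlDatum>>();
  List<Map.Entry<Text, CrawlDatum>> others = new ArrayList<Map.Entry<Text, CrawlDatum>>();
  for (Map.Entry<Text, CrawlDatum> t : targets) {
    if (isRelevant(t.getKey().toString())) {
      wanted.add(t);
    } else {
      others.add(t);
    }
  }
  // Keep relevant links first, then pad with other links until MAX_LINKS.
  List<Map.Entry<Text, CrawlDatum>> keep = new ArrayList<Map.Entry<Text, CrawlDatum>>();
  keep.addAll(wanted.subList(0, Math.min(MAX_LINKS, wanted.size())));
  if (keep.size() < MAX_LINKS) {
    keep.addAll(others.subList(0, Math.min(MAX_LINKS - keep.size(), others.size())));
  }
  targets.clear();
  targets.addAll(keep);
  return adjust;
}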


As I said, you can also use just plain URLFilters to filter out unwanted 
links, but that API gives you much less control because it's a simple 
yes/no decision that considers just the URL string. The advantage is that 
it's much easier to implement than a ScoringFilter.
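
By comparison, a minimal URLFilter sketch might look like this (the class name 
and regex are made up; the setConf()/getConf() methods from Configurable are 
omitted here):

import java.util.regex.Pattern;
import org.apache.nutch.net.URLFilter;

// Simple yes/no filter on the URL string alone.
public class RelevantOnlyURLFilter implements URLFilter {

  private static final Pattern WANTED = Pattern.compile(".*/products/.*");

  public String filter(String urlString) {
    // Return the URL to keep it, or null to drop it.
    return WANTED.matcher(urlString).matches() ? urlString : null;
  }
}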



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



ApacheCon US

2009-10-07 Thread Grant Ingersoll
Just a friendly reminder to all about Lucene ecosystem events at 
ApacheCon US this year. We have two days of talks on pretty much 
every project under Lucene (see http://lucene.apache.org/#14+August+2009+-+Lucene+at+US+ApacheCon), 
plus a meetup, a two-day training on Lucene, and a one-day training 
on Solr. The Lucene training will cover Lucene 2.9, and I'm sure 
Erik's Solr one will cover Solr 1.4. I also know there will be quite 
a few Lucene (et al.) committers at ApacheCon this year, so it should 
be a good year to interact and discuss your favorite projects.


ApacheCon US is in Oakland (near San Francisco) the week of November  
2nd.  The trainings are on the 2nd and 3rd, and the main conference  
starts on the 4th.


You can register at http://www.us.apachecon.com/c/acus2009/

Hope to see you there,
Grant


Re: mapred.ReduceTask - java.io.FileNotFoundException

2009-10-07 Thread bhavin pandya
Hi,

I don't know the exact cause of the exception, but it is resolved.

Earlier I had made some changes to the IP settings, but the server was still
using the old configuration.
I restarted the service and the problem is resolved.

Thanks.
Bhavin

On Tue, Oct 6, 2009 at 4:48 PM, tittutomen subasmahapa...@gmail.com wrote:



 bhavin pandya-3 wrote:

 Hi,

 I am trying to configure Nutch and Hadoop on 2 nodes, but while trying
 to fetch, I am getting this exception. (I sometimes get the same exception
 while injecting a new seed.)

 2009-10-06 14:56:51,609 WARN  mapred.ReduceTask -
 java.io.FileNotFoundException: http://127.0.0.1:50060/mapOutput?job=job_200910061454_0001&map=attempt_200910061454_0001_m_00_0&reduce=3
         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
         at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1345)
         at java.security.AccessController.doPrivileged(Native Method)
         at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1339)
         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:993)
         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1293)
         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1231)
         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1144)
         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1084)
 Caused by: java.io.FileNotFoundException: http://127.0.0.1:50060/mapOutput?job=job_200910061454_0001&map


 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
 taskTracker/jobcache/job_200910061454_0001/attempt_200910061454_0001_m_00_0/output/file.out.index in any of the configured local directories
         at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:381)
         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
         at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2840)
         at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
         at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
         at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
         at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
         at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
         at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
         at org.mortbay.http.HttpServer.service(HttpServer.java:954)
         at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
         at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
         at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
         at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
         at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
         at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)


 And then there are continuous messages in hadoop.log like:
 2009-10-06 15:56:43,918 WARN  mapred.ReduceTask -
 attempt_200910061538_0005_r_01_0 adding host 127.0.0.1 to penalty
 box, next contact in 150 seconds


 Here is my hadoop-site.xml content:
 <property>
   <name>fs.default.name</name>
   <value>hdfs://crawler1.mydomain.com:9000/</value>
   <description>
     The name of the default file system. Either the literal string
     local or a host:port for NDFS.
   </description>
 </property>

 <property>
   <name>mapred.job.tracker</name>
   <value>crawler1.mydomain.com:9001</value>
   <description>
     The host and port that the MapReduce job tracker runs at. If
     local, then jobs are run in-process as a single map and
     reduce task.
   </description>
 </property>

 <property>
   <name>mapred.map.tasks</name>
   <value>2</value>
   <description>
     define mapred.map tasks to be number of slave hosts
   </description>
 </property>

 <property>
   <name>mapred.reduce.tasks</name>
   <value>2</value>
   <description>
     define mapred.reduce tasks to be number of slave hosts
   </description>
 </property>

 <property>
   <name>dfs.name.dir</name>
   <value>/nutch/filesystem/name</value>
 </property>

 <property>
   <name>dfs.data.dir</name>
   <value>/nutch/filesystem/data</value>
 </property>

 <property>
 

Re: indexing just certain content

2009-10-07 Thread BELLINI ADAM


In this class, BasicIndexingFilter.java, I think that before adding the 
content to the document I could parse it again to filter out certain div tags?

String text = parse.getText();

// I have to parse and filter the text here before adding it to the document

String filteredText = myParser_New_method(text);

doc.add("content", filteredText);

What do you think about that?
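
For what it's worth, here is a rough, self-contained sketch of the kind of 
filtering helper described above. The class name, method name, and marker 
strings are made up for illustration, and since parse.getText() hands you plain 
extracted text, the filter has to match on text patterns rather than on literal 
div tags:

import java.util.regex.Pattern;

// Hypothetical helper: drop sections of the extracted text that fall between
// known begin/end markers before the text is added to the document.
public class ContentSectionFilter {

  // Made-up markers standing in for whatever identifies the unwanted sections.
  private static final Pattern UNWANTED =
      Pattern.compile("(?s)BEGIN_NAV.*?END_NAV");

  public static String filterText(String text) {
    return UNWANTED.matcher(text).replaceAll(" ");
  }

  public static void main(String[] args) {
    String text = "useful content BEGIN_NAV menu menu END_NAV more useful content";
    // Prints the text with the marked section replaced by a space.
    System.out.println(filterText(text));
  }
}

The indexing-filter line above would then become something like 
doc.add("content", ContentSectionFilter.filterText(parse.getText()));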
  
_
New! Faster Messenger access on the new MSN homepage
http://go.microsoft.com/?linkid=9677406