Merging issues!
I've tried to merge small indexes into an existing index. What I've observed is that after merging, the index size is decreasing! I started with an index of 85K documents, then merged in 5K docs at a time. After 3 cycles, instead of growing to 100K, it has strangely become an index of only 80K documents. Any idea what is going wrong? -TT
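[One way to narrow this down: a merge physically reclaims deleted documents, so if the count being watched is maxDoc() (total document slots, including deletions, e.g. from dedup) rather than numDocs() (live documents), the number can drop after a merge even while documents are being added. A minimal diagnostic sketch, assuming plain Lucene 2.9 APIs and a hypothetical index path passed as the first argument:]

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;

    // Diagnostic sketch: print live vs. total document counts for an index.
    // Run it before and after each merge cycle to see whether deletions
    // are being reclaimed.
    public class CountDocs {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        int deleted = reader.maxDoc() - reader.numDocs();
        System.out.println("live docs = " + reader.numDocs()
            + ", total slots (maxDoc) = " + reader.maxDoc()
            + ", deleted = " + deleted);
        reader.close();
      }
    }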
Re: Targeting Specific Links
Eric Osgood wrote:

> Andrzej, how would I check for a flag during fetch?

You would check for a flag during generation - please check ScoringFilter.generatorSortValue(); that's where you can check for a flag and set the sort value to Float.MIN_VALUE - this way the link will never be selected for fetching. And you would put the flag in CrawlDatum metadata when ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().

> Maybe this explanation can shed some light: ideally, I would like to check
> the list of links for each page, but still needing a total of X links per
> page. If I find the links I want, I add them to the list up until X; if I
> don't reach X, I add other links until X is reached. This way, I don't
> waste crawl time on non-relevant links.

You can modify the collection of target links passed to distributeScoreToOutlinks() - this way you can affect both which links are stored and what kind of metadata each of them gets.

As I said, you can also use plain URLFilters to filter out unwanted links, but that API gives you much less control because it's a simple yes/no decision that considers just the URL string. The advantage is that it's much easier to implement than a ScoringFilter.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
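[To make the mechanics concrete, here is a minimal sketch of the approach described above, assuming Nutch 1.0's ScoringFilter API. The metadata key and the isRelevant() helper are hypothetical, and the remaining ScoringFilter methods (and configuration plumbing) are omitted; they would pass their inputs through unchanged.]

    import java.util.Collection;
    import java.util.Map.Entry;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.scoring.ScoringFilterException;

    // Sketch: flag unwanted outlinks at parse time, then hide them
    // from the generator so they are never fetched.
    public class LinkTargetingScoringFilter /* implements ScoringFilter */ {

      private static final Text SKIP_FLAG = new Text("_skip_"); // hypothetical key

      // Called when ParseOutputFormat distributes scores: inspect (and
      // optionally prune) the outlink targets, flagging the unwanted ones
      // in each target's CrawlDatum metadata.
      public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
          Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
          throws ScoringFilterException {
        for (Entry<Text, CrawlDatum> target : targets) {
          if (!isRelevant(target.getKey().toString())) {
            target.getValue().getMetaData().put(SKIP_FLAG, new IntWritable(1));
          }
        }
        return adjust;
      }

      // Called by the Generator: a flagged link gets the lowest possible
      // sort value, so it is never selected for fetching.
      public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
          throws ScoringFilterException {
        if (datum.getMetaData().containsKey(SKIP_FLAG)) {
          return Float.MIN_VALUE;
        }
        return initSort;
      }

      private boolean isRelevant(String url) {
        return true; // placeholder for the actual relevance test
      }
    }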
ApacheCon US
Just a friendly reminder to all about Lucene ecosystem events at ApacheCon US this year. We have two days of talks on pretty much every project under Lucene (see http://lucene.apache.org/#14+August+2009+-+Lucene+at+US+ApacheCon ), plus a meetup, a two-day training on Lucene, and a one-day training on Solr. The Lucene training will cover Lucene 2.9, and I'm sure Erik's Solr one will cover Solr 1.4. I also know there will be quite a few Lucene et al. committers at ApacheCon this year, so it should be a good year to interact and discuss your favorite projects. ApacheCon US is in Oakland (near San Francisco) the week of November 2nd. The trainings are on the 2nd and 3rd, and the main conference starts on the 4th. You can register at http://www.us.apachecon.com/c/acus2009/ Hope to see you there, Grant
Re: mapred.ReduceTask - java.io.FileNotFoundException
Hi,

I don't know the exact cause of the exception, but it is resolved. I had earlier made some changes to the IP settings, and the server was still using the old configuration. I restarted the service and the problem went away.

Thanks.
Bhavin

On Tue, Oct 6, 2009 at 4:48 PM, tittutomen subasmahapa...@gmail.com wrote:

bhavin pandya-3 wrote:

Hi, I am trying to configure Nutch and Hadoop on 2 nodes, but while trying to fetch I am getting this exception (I sometimes get the same exception while injecting a new seed):

2009-10-06 14:56:51,609 WARN mapred.ReduceTask - java.io.FileNotFoundException: http://127.0.0.1:50060/mapOutput?job=job_200910061454_0001&map=attempt_200910061454_0001_m_00_0&reduce=3
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1345)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1339)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:993)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1293)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1231)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1144)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1084)
Caused by: java.io.FileNotFoundException: http://127.0.0.1:50060/mapOutput?job=job_200910061454_0001&map
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200910061454_0001/attempt_200910061454_0001_m_00_0/output/file.out.index in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:381)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2840)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

And then a continuous stream of messages in hadoop.log like:

2009-10-06 15:56:43,918 WARN mapred.ReduceTask - attempt_200910061538_0005_r_01_0 adding host 127.0.0.1 to penalty box, next contact in 150 seconds

Here is my hadoop-site.xml content:

<property>
  <name>fs.default.name</name>
  <value>hdfs://crawler1.mydomain.com:9000/</value>
  <description>The name of the default file system. Either the literal string
  "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>crawler1.mydomain.com:9001</value>
  <description>The host and port that the MapReduce job tracker runs at. If
  "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>Define mapred.map.tasks to be the number of slave hosts.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>Define mapred.reduce.tasks to be the number of slave hosts.</description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/nutch/filesystem/name</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/nutch/filesystem/data</value>
</property>

<property>
Re: indexing just certain content
In the BasicIndexingFilter.java class, I think that before adding the content to the document I could parse it again to filter out certain div tags:

    text = parse.getText();
    // parse and filter the text here before adding it to the document
    String newFilteredText = myParser_New_method(text);
    doc.add("content", newFilteredText);

What do you think about that?
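[For reference, a minimal sketch of what that could look like as a standalone indexing filter rather than an edit to BasicIndexingFilter, assuming Nutch 1.0's IndexingFilter API; filterText() stands in for the hypothetical myParser_New_method() above, and the configuration plumbing is omitted.]

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    // Sketch: filter the parsed text before it reaches the "content" field.
    public class FilteredContentIndexingFilter /* implements IndexingFilter */ {

      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        String text = parse.getText();       // plain text produced at parse time
        String filtered = filterText(text);  // drop the unwanted parts here
        doc.add("content", filtered);        // index the filtered text instead
        return doc;
      }

      private String filterText(String text) {
        // Placeholder for the actual filtering logic.
        return text;
      }
    }

[One caveat: by the time an indexing filter runs, parse.getText() is already tag-stripped plain text, so anything keyed to div tags would have to be handled earlier in the pipeline, e.g. in an HtmlParseFilter at parse time.]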