Re: Next Generation Nutch
Hi Otis,

Thanks for your comments. My responses are inline below:

> Hm, I have to say I'm not sure if I agree 100% with part 1. I think it would be great to have such flexibility, but I wonder if trying to achieve it would be over-engineering. Do people really need that? I don't know, maybe! If they do, then ignore my comment. :)

Well, in the past, at least in my experience, this is exactly what has paid off for us: having the flexibility to architect a system that isn't tied to the underlying technology. We once had a situation at JPL where a software product was using CORBA as its underlying middleware implementation framework. This (previously free) CORBA solution turned into a 30K/year licensed solution, at the direction of the vendor, within a 1 week timeframe. Because we had architected and engineered our software system to be independent of the underlying middleware substrate, we were able to switch over to a free, Java RMI-based solution in a matter of a weekend.

Of course, this is typically bound to certain classes of underlying substrates and middleware solutions (e.g., it would be difficult to switch between middlewares with vastly different architectural styles, say, from CORBA to a P2P-based solution like JXTA). All I'm saying is that it would be great if we didn't have to dictate to a potential Nutch 2.0 user that, to use our scalable, open source search engine solution, they have to change from a JMS house to a Hadoop house. It would be nice to say that we've architected Nutch 2.0 to be independent of the underlying middleware provider. Of course, we can provide a default implementation based on the existing Hadoop substrate, but we should provide interfaces, data components, and architectural guidelines that make it possible to change to, say, a Nutch solution over XML-RPC, or Web Services, or JMS, without breaking the core architecture.

Right now, I'm convinced that can't be done, or in other words, it's too hard to tease the Hadoop notions out of Nutch as it exists today.

> I'm curious about 2. - could you please explain a little what you mean by "too tied to the underlying orchestration process and infrastructure"?

What I mean by this is that the Fetcher/Fetcher2 dictates the orchestration process for crawling: there is no separate, independent Nutch crawler. Fetcher2 itself is a MapRunnable job (i.e., a term from the Hadoop vocabulary). In my mind, the crawler process needs to be a separate subsystem in Nutch, independent of the underlying middleware substrate (kind of like I'm suggesting above). As an example: how would we take the existing Nutch Fetcher2 and run it over JMS? Or XML-RPC? Or RMI?

So, I guess that's all I'm saying -- the Nutch 2.0 architecture should be clearly insulated from the underlying middleware technology. That's my main concern moving forward.

Hope that helps to explain my point of view. :) If not, let me know and I would be happy to chat more about it.

Thanks!

Cheers,
Chris

> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message -----
> From: Chris Mattmann [EMAIL PROTECTED]
> To: nutch-user@lucene.apache.org
> Sent: Friday, April 11, 2008 9:10:30 PM
> Subject: Re: Next Generation Nutch
>
> Hi Dennis,
>
> Thanks for putting this together. I think that it's also important to add to this list the ability to cleanly separate out the following major components:
>
> 1. The underlying distributed computing infrastructure (e.g., why does it have to be Hadoop? Shouldn't Nutch be able to run on top of JMS, or XML-RPC, or what about even grid computing technologies, and web services? Hadoop can certainly be _the_ core implementation of the underlying substrate, but the ability to change this out should be a lot easier than it currently is. Read on below to see what I mean.)
>
> 2. The crawler. Right now I think it's much too tied to the underlying orchestration process and infrastructure.
>
> 3. The data structures. You do mention this below, but I would add to it that the data structures for Nutch should be simple POJOs and not have any tie to the underlying infrastructure (e.g., no need for Writable methods, etc.)
>
> I think that with these types of guiding principles above, along with what you mention below, there is the potential here to generate a really flexible, reusable architecture, so that when folks come along and mention, "Oh, I've written Crawler XXX, how do I integrate it into Nutch?", we don't have to come back and say that the entire system has to be changed; or even worse, that it cannot be done at all.
>
> My 2 cents,
> Chris
>
> On 4/11/08 2:59 PM, Dennis Kubes [EMAIL PROTECTED] wrote:
>
> > I have been thinking about a next generation Nutch for a while now, had some talks with some of the other committers, and have gotten around to putting some thoughts / requirements down on paper. I wanted to run these by the community and get feedback. This message
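
The separation argued for in points 1 to 3 above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `Substrate`, `LocalSubstrate`, `FetchTask`, and `Crawler` are hypothetical names invented here, and none of them exist in Nutch or Hadoop. The point is that the crawler is written against one small interface, the data object is a plain POJO with no Writable methods, and a Hadoop, JMS, or XML-RPC backed substrate could in principle be swapped in behind the same interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch; none of these types exist in Nutch or Hadoop. */
final class SubstrateSketch {

    /** Plain data object: no Hadoop Writable methods, no transport concerns. */
    static final class FetchTask {
        private final String url;
        FetchTask(String url) { this.url = url; }
        String getUrl() { return url; }
    }

    /** The only thing the crawler knows about the middleware underneath it. */
    interface Substrate {
        <I, O> List<O> runAll(List<I> inputs, Function<I, O> work);
    }

    /** Trivial in-process substrate, standing in for a Hadoop, JMS, or XML-RPC one. */
    static final class LocalSubstrate implements Substrate {
        @Override
        public <I, O> List<O> runAll(List<I> inputs, Function<I, O> work) {
            List<O> out = new ArrayList<>();
            for (I input : inputs) {
                out.add(work.apply(input));
            }
            return out;
        }
    }

    /** Crawler logic written only against the Substrate interface. */
    static final class Crawler {
        private final Substrate substrate;
        Crawler(Substrate substrate) { this.substrate = substrate; }

        List<String> crawl(List<FetchTask> tasks) {
            // Placeholder "fetch": a real crawler would download the page here.
            return substrate.runAll(tasks, t -> "fetched " + t.getUrl());
        }
    }

    public static void main(String[] args) {
        Crawler crawler = new Crawler(new LocalSubstrate());
        List<FetchTask> tasks = Arrays.asList(new FetchTask("http://example.com/"));
        System.out.println(crawler.crawl(tasks));
    }
}
```

Swapping the middleware then means writing another `Substrate` implementation; `Crawler` and `FetchTask` stay untouched, which is the property the CORBA-to-RMI anecdote relies on.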
Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml
Hi Bradford,

> I'm running Nutch 0.9 and Hadoop on 5 new, fast servers connected to a multiple T-3 line. Although it works fine, the fetch portion of the crawls seems to be awfully slow. The status message at one point is "157 pages, 1 errors, 1.7 pages/s, 487 kb/s". Less than one page a second seems to be awfully slow, given the environment I'm in. Is it a configuration issue? I'm using 200 threads per fetcher. I've also tried only 10 threads :)

There are other parameters that control the speed of the fetch. What is your value for speculative execution? I remember seeing something on the list saying that this parameter should be turned off to optimize fetch speed. Give that a try, and let me know how it works out.

> I'm also seeing my hadoop.logs rapidly filled with the error message mentioned in [NUTCH-618], which states:
>
> 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader: Invalid media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: text/xml
>
> Is this impacting the performance? I've tried removing conf/tika-mimetypes.xml on all my machines, but that doesn't seem to resolve the error message.

Though definitely annoying, I am fairly sure it's not directly affecting your performance, since the message is a simple WARNING that a detected media type has been added multiple times to the mime types registry. I certainly need to address this issue though, so thanks for giving me some motivation.

Let me know what the results of the speculative execution adjustment are. Also, it may help to vocalize (here on the list) any other configuration adjustments you have made (or will make).

HTH,
Chris

> Much thanks in advance :)
>
> Cheers,
> Bradford

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B Mailstop: 171-246
___
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
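
To illustrate why a "Media type alias already exists" warning is noisy rather than harmful, here is a minimal sketch of a registry that skips a repeated alias and logs a warning. It is not Tika's code: `AliasRegistrySketch` and its methods are hypothetical, and the pairing of `text/xml` with `application/xml` is assumed purely for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a media type registry; these are not Tika's classes. */
final class AliasRegistrySketch {
    private final Map<String, String> aliasToType = new HashMap<>();
    private int warnings = 0;

    /** Registers alias -> type; a repeated alias is reported and skipped, not fatal. */
    void addAlias(String type, String alias) {
        if (aliasToType.containsKey(alias)) {
            warnings++;
            System.err.println("WARN Invalid media type alias: " + alias
                    + " (alias already exists)");
            return; // the first registration stays in effect
        }
        aliasToType.put(alias, type);
    }

    String resolve(String alias) { return aliasToType.get(alias); }

    int warningCount() { return warnings; }

    public static void main(String[] args) {
        AliasRegistrySketch registry = new AliasRegistrySketch();
        registry.addAlias("application/xml", "text/xml");
        registry.addAlias("application/xml", "text/xml"); // same entry read a second time
        System.out.println(registry.resolve("text/xml")
                + ", warnings=" + registry.warningCount());
    }
}
```

In a model like this, lookups keep working after the duplicate, so the visible cost is the log volume, which matches the assessment in the reply above.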
Re: Tika Error ?
Hi Emmanuel,

Could you please post your /data/sengine/search/conf/tika-mimetypes.xml file?

Thanks,
Chris

On 2/14/08 6:07 AM, Emmanuel [EMAIL PROTECTED] wrote:

> Hi Guys,
>
> I've updated my nutch version to use the latest trunk with the new TIKA jar. I ran a crawl and got a lot of errors like this:
>
> 2008-02-14 22:02:51,494 INFO conf.Configuration - found resource tika-mimetypes.xml at file:/data/sengine/search/conf/tika-mimetypes.xml
> 2008-02-14 22:02:51,499 WARN mime.MimeTypesReader - Invalid media type alias: text/xml
> org.apache.tika.mime.MimeTypeException: Media type alias already exists: text/xml
>     at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
>     at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
>     at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>     at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>     at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>     at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>     at org.apache.nutch.util.MimeUtil.init(MimeUtil.java:58)
>     at org.apache.nutch.protocol.Content.init(Content.java:85)
>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>     at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
> 2008-02-14 22:02:51,500 WARN mime.MimeTypesReader - Invalid media type alias: application/x-dosexec;exe
> org.apache.tika.mime.MimeTypeException: Invalid media type alias: application/x-dosexec;exe
>     at org.apache.tika.mime.MimeType.addAlias(MimeType.java:242)
>     at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:168)
>     at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
>     at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
>     at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
>     at org.apache.nutch.util.MimeUtil.init(MimeUtil.java:58)
>     at org.apache.nutch.protocol.Content.init(Content.java:85)
>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:226)
>     at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java:523)
>
> Is that normal? Am I missing something?
Re: java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException in cached.jsp
Hi Mubey,

I think that this has been identified as a potential bug. Please file a JIRA issue:

http://issues.apache.org/jira/browse/NUTCH

I (or any of the other developers) would be happy to investigate it for you. I saw some chatter on the mailing lists the other day regarding this, and one of the other developers suggested that the tika jar is probably not being copied over into the Nutch WAR file. I'll check this out, but in the meantime please file the bug report so that we have a record of it moving forward.

Thanks!

Cheers,
Chris

On 2/4/08 11:10 AM, Mubey N. [EMAIL PROTECTED] wrote:

> I am using the latest trunk. Whenever I search something in it and click on the cached link, I get this error from cached.jsp:
>
> java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException
>     java.lang.Class.forName0(Native Method)
>     java.lang.Class.forName(Class.java:247)
>     org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524)
>     org.apache.hadoop.io.WritableName.getClass(WritableName.java:72)
>     org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1405)
>     org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1360)
>     org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1349)
>     org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1344)
>     org.apache.hadoop.io.MapFile$Reader.init(MapFile.java:254)
>     org.apache.hadoop.io.MapFile$Reader.init(MapFile.java:242)
>     org.apache.hadoop.mapred.MapFileOutputFormat.getReaders(MapFileOutputFormat.java:91)
>     org.apache.nutch.searcher.FetchedSegments$Segment.getReaders(FetchedSegments.java:90)
>     org.apache.nutch.searcher.FetchedSegments$Segment.getContent(FetchedSegments.java:68)
>     org.apache.nutch.searcher.FetchedSegments.getContent(FetchedSegments.java:139)
>     org.apache.nutch.searcher.NutchBean.getContent(NutchBean.java:346)
>     org.apache.jsp.cached_jsp._jspService(cached_jsp.java:112)
>     org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>     org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
>     org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
>     org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
>     javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>
> Is this a known bug in the Nutch-1.0 development version or is it a mistake at my end?
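
One way to test the "jar missing from the WAR" theory is to ask a class loader whether it can see the class named in the error. The helper below is a small diagnostic sketch, not part of Nutch; `ClasspathCheck` is a hypothetical name. To be meaningful for this report it would have to run inside the web application (for example from a scratch JSP), since only then does it use the webapp's class loader, which reads jars from the WAR's WEB-INF/lib directory.

```java
/** Small diagnostic sketch: can this class's class loader see a given class? */
final class ClasspathCheck {

    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.apache.tika.mime.MimeTypeException";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

If the check reports `false` inside the webapp but `true` from the command-line Nutch classpath, that would point at the packaging of the WAR rather than at the index or segment data.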
Re: java.lang.NoClassDefFoundError Nutch 0.9
Hi Karthik,

The default ant target for Nutch is job. Do the following:

  - type 'ant clean' first, to remove your working class information
  - type 'ant' to call the default target ('job'), or explicitly call 'ant job'

That should fix your issue.

Thanks!

Cheers,
Chris

On 11/8/07 12:12 PM, karthik085 [EMAIL PROTECTED] wrote:

> Hi,
>
> I got nutch from svn tags - release0.9 - but can't get rid of this problem. I did:
>
>   ant compile
>   ant jar
>   ant war
>
> All of them build successfully with different versions of ant - 1.6.5 and 1.7.0.
>
> When running nutch crawl, I get:
>
>   Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
>
> I even tried some solutions explained in the forum - changing ant versions, adding a classpath (doesn't matter - the nutch script overrides it) - but none of them worked.
>
> How do I get rid of this problem?
>
> Thanks,
> Karthik
Re: Indexing Feeds Blog Posts with Nutch
Hi Pike,

Parse-rss indexes the whole feed as a single record, whereas the feed plugin takes advantage of NUTCH-443, which allows Parsers to return multiple Parse objects, so that each item in the feed is indexed as its own record.

HTH,
Chris

On 10/15/07 7:25 AM, Pike [EMAIL PROTECTED] wrote:

> Hi
>
> I have this with all results: what is indexed seems to be 1 record per feed, containing a parsed version of the content including all its items, with sometimes bits of xml and html markup in it. I was assuming this is the intended behaviour?
>
> It may well be the intended behaviour, but it's not the behaviour I want.
>
> Indeed, it is not what I expected either. Chris, can you confirm this is the idea? Did you ever consider indexing separate items?
>
> curious,
> *pike
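
The difference between the two plugins can be pictured as the difference between returning one record and returning a map of records. The sketch below is illustrative only: `FeedItem`, `parseAsOneRecord`, and `parseAsManyRecords` are hypothetical names, and this is not the Nutch Parser API introduced by NUTCH-443.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; FeedItem and both methods are illustrative, not Nutch's Parser API. */
final class FeedParseSketch {

    static final class FeedItem {
        final String link;
        final String text;
        FeedItem(String link, String text) { this.link = link; this.text = text; }
    }

    /** parse-rss style: one record, keyed by the feed URL, holding all item text. */
    static Map<String, String> parseAsOneRecord(String feedUrl, List<FeedItem> items) {
        StringBuilder all = new StringBuilder();
        for (FeedItem item : items) {
            if (all.length() > 0) {
                all.append(' ');
            }
            all.append(item.text);
        }
        Map<String, String> records = new LinkedHashMap<>();
        records.put(feedUrl, all.toString());
        return records;
    }

    /** feed-plugin style: one record per item, keyed by the item's own link. */
    static Map<String, String> parseAsManyRecords(List<FeedItem> items) {
        Map<String, String> records = new LinkedHashMap<>();
        for (FeedItem item : items) {
            records.put(item.link, item.text);
        }
        return records;
    }

    public static void main(String[] args) {
        List<FeedItem> items = Arrays.asList(
                new FeedItem("http://example.com/post/1", "first post"),
                new FeedItem("http://example.com/post/2", "second post"));
        System.out.println(parseAsOneRecord("http://example.com/feed", items));
        System.out.println(parseAsManyRecords(items));
    }
}
```

With the second shape, a search hit points at the matching item's own link rather than at the feed URL, which is the behaviour Pike was expecting.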
Re: Indexing Feeds Blog Posts with Nutch
Hi Brian,

Sorry for taking so long to reply. Here ya go:

> Do you have any URLs for feeds that are reliably parsed and indexed by the feed parser?

I haven't tested/used this plugin in quite a while. There was someone on the nutch-user list before, nutch.newbie, who was doing quite a bit of feed parsing. Nutch.newbie, if you're still around, could you send Brian a list of feeds that you were testing on?

> Does it actually index atom at present? There's something in the code that looks for application/rss+xml as the mime type.

AFAIK, the plugin does in fact index Atom. The plugin itself is built on top of the underlying ROME toolkit, if I remember correctly.

HTH,
Chris

> Brian Ulicny
>
> On Thu, 11 Oct 2007 15:23:04 -0700, Chris Mattmann [EMAIL PROTECTED] said:
>
> > Hi Rick,
> >
> > Glad to hear that you're interested in using Nutch! There are currently 2 plugins that parse feeds and get them indexed:
> >
> >   parse-rss - older, but gets the job done
> >   feed - newer, and takes advantage of the ability to parse/index feeds in one step, rather than in many
> >
> > There are other idiosyncrasies about each of these plugins, so feel free to ask specific questions to the main developers of each of them. The parse-rss plugin was primarily developed by me, and the feed plugin was primarily developed by Doğacan Güney, another Nutch committer like myself.
> >
> > As for the error that you're getting below, it's due to the fact that Nutch can't reliably differentiate between the mime types of different XML content. So, to Nutch, even though it's a .rss file, its mime type is application/xml. Because the mime type, though a true mime type of the file, is not the preferred mime type (application/rss+xml, or the like), Nutch has trouble finding the appropriate parser to parse the content. For instance, according to parse-plugins.xml (a file in your $NUTCH_HOME/conf directory), the parse-rss plugin and the feed plugin are registered to parse application/rss+xml, but not application/xml.
> >
> > The current trunk version of Nutch recently had a fix committed for this very issue (http://issues.apache.org/jira/browse/NUTCH-562). If you have any more specific questions, I'd be happy to answer them.
> >
> > Thanks!
> >
> > Cheers,
> > Chris
> >
> > On 10/11/07 9:14 AM, Rick Moynihan [EMAIL PROTECTED] wrote:
> >
> > > Hi all,
> > >
> > > I've recently downloaded Nutch v0.9, to experiment in searching blog posts and RSS/Atom feeds. So far I have managed to get it to successfully crawl, index and search some websites.
> > >
> > > I am now starting my investigations to use Nutch to crawl/index/search news/blog feeds, and have included the parse-rss plugin, which appears to ship in the plugins/ directory, by pasting the following into my nutch-site.xml file:
> > >
> > >   <property>
> > >     <name>plugin.includes</name>
> > >     <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >   </property>
> > >
> > > However, some feeds appear to return the following error (apparently because they are being returned with a mime-type of application/xml):
> > >
> > >   Error parsing: http://example.com/.rss: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/xml url=http://example.com/.rss
> > >
> > > It also appears when searching that the returned results point to the matching feed rather than the matching item. Is there a way around this? Or am I best parsing out the item urls (e.g. via a shell script), somehow adding them to the crawl list and indexing the HTML as normal?
> > >
> > > Also, if anyone is using Nutch to index blogs/feeds, then I'd be interested in how you have it configured.
> > >
> > > Thanks again,
Re: Indexing Feeds Blog Posts with Nutch
Hi Rick,

Glad to hear that you're interested in using Nutch! There are currently 2 plugins that parse feeds and get them indexed:

  parse-rss - older, but gets the job done
  feed - newer, and takes advantage of the ability to parse/index feeds in one step, rather than in many

There are other idiosyncrasies about each of these plugins, so feel free to ask specific questions to the main developers of each of them. The parse-rss plugin was primarily developed by me, and the feed plugin was primarily developed by Doğacan Güney, another Nutch committer like myself.

As for the error that you're getting below, it's due to the fact that Nutch can't reliably differentiate between the mime types of different XML content. So, to Nutch, even though it's a .rss file, its mime type is application/xml. Because the mime type, though a true mime type of the file, is not the preferred mime type (application/rss+xml, or the like), Nutch has trouble finding the appropriate parser to parse the content. For instance, according to parse-plugins.xml (a file in your $NUTCH_HOME/conf directory), the parse-rss plugin and the feed plugin are registered to parse application/rss+xml, but not application/xml.

The current trunk version of Nutch recently had a fix committed for this very issue (http://issues.apache.org/jira/browse/NUTCH-562). If you have any more specific questions, I'd be happy to answer them.

Thanks!

Cheers,
Chris

On 10/11/07 9:14 AM, Rick Moynihan [EMAIL PROTECTED] wrote:

> Hi all,
>
> I've recently downloaded Nutch v0.9, to experiment in searching blog posts and RSS/Atom feeds. So far I have managed to get it to successfully crawl, index and search some websites.
>
> I am now starting my investigations to use Nutch to crawl/index/search news/blog feeds, and have included the parse-rss plugin, which appears to ship in the plugins/ directory, by pasting the following into my nutch-site.xml file:
>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   </property>
>
> However, some feeds appear to return the following error (apparently because they are being returned with a mime-type of application/xml):
>
>   Error parsing: http://example.com/.rss: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/xml url=http://example.com/.rss
>
> It also appears when searching that the returned results point to the matching feed rather than the matching item. Is there a way around this? Or am I best parsing out the item urls (e.g. via a shell script), somehow adding them to the crawl list and indexing the HTML as normal?
>
> Also, if anyone is using Nutch to index blogs/feeds, then I'd be interested in how you have it configured.
>
> Thanks again,
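
The "parser not found" failure follows directly from a lookup keyed on the detected mime type. The sketch below models that lookup with a plain map; `ParserLookupSketch` is a hypothetical class written for this note, not Nutch's parser factory, and the registrations simply mirror what the message says about parse-plugins.xml.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a mime-type to parser-plugin table; not Nutch's own classes. */
final class ParserLookupSketch {
    private final Map<String, List<String>> pluginsByMimeType = new HashMap<>();

    void register(String mimeType, String... pluginIds) {
        pluginsByMimeType.put(mimeType, Arrays.asList(pluginIds));
    }

    /** Returns the plugins registered for the detected type, or fails when none are. */
    List<String> findParsers(String contentType, String url) {
        List<String> plugins = pluginsByMimeType.get(contentType);
        if (plugins == null) {
            throw new IllegalStateException(
                    "parser not found for contentType=" + contentType + " url=" + url);
        }
        return plugins;
    }

    public static void main(String[] args) {
        ParserLookupSketch table = new ParserLookupSketch();
        table.register("application/rss+xml", "parse-rss", "feed");

        System.out.println(table.findParsers("application/rss+xml", "http://example.com/.rss"));
        try {
            // The same document detected as generic XML finds no registered parser.
            table.findParsers("application/xml", "http://example.com/.rss");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model there are two ways out: make detection return the more specific type, or register a feed parser for `application/xml` as well. Which of these NUTCH-562 takes is not stated in the thread.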
Re: Suggested fixes to http://wiki.apache.org/nutch/WritingPluginExample-0.9
Hi Jasper,

As I understand it, you can make these updates yourself. Sign up for a wiki account, log in with your username/password, and you can edit the page.

Thanks!

Cheers,
Chris

On 7/19/07 10:10 AM, Jasper Kamperman [EMAIL PROTECTED] wrote:

> Hi,
>
> I spent several hours hunting down two small issues in the otherwise very well done example. To prevent others from running into this I'd like to share them here. I don't know how to get in touch with RicardoJMendez, who maintains the page -- if anyone does, please forward this to him.
>
> I did the exercise after checking out lucene/nutch/branches/branch-0.9 from SVN.
>
> 1. The package declaration at the top of TestRecommendedParser.java is wrong. It reads:
>
>      package org.apache.nutch;
>
>    but it should be:
>
>      package org.apache.nutch.parse.recommended;
>
> 2. Per ../build-plugin.xml, the property for the location of the test data is not test.input but test.data, so the line that initializes testDir should read:
>
>      private static final File testDir = new File(System.getProperty("test.data"));
>
> Hope this helps,
> Jasper
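
A side note on why the wrong property name is painful to debug: `System.getProperty` returns null for an unset property, and `new File((String) null)` throws a NullPointerException. The sketch below shows a defensive variant; `TestDataDir` is a hypothetical helper, and the fallback to the current directory is an assumption made for the example, not something the wiki page or build-plugin.xml prescribes.

```java
import java.io.File;

/** Sketch: resolving a test-data directory from a system property. */
final class TestDataDir {

    /** Uses the given property if set, otherwise falls back to the current directory. */
    static File resolve(String propertyName) {
        // System.getProperty(name, default) avoids the NullPointerException that
        // new File((String) null) would throw when the property is not set.
        return new File(System.getProperty(propertyName, "."));
    }

    public static void main(String[] args) {
        System.out.println(resolve("test.data").getAbsolutePath());
    }
}
```

Failing fast with a clear message is an equally reasonable choice; the fallback is shown only because it keeps the example short.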
Re: nutch-09 start problem
Hi Ratnesh,

I'm not sure that declaring Nutch 0.9 an unstable version is an entirely appropriate label -- it's been through several stress tests by the committers so far, and it seems to be performing well enough -- so much so that we decided it was worthwhile to make a release of it :). I believe that the user's problem below had to do with not running Nutch using JDK 5 (now a requirement).

Cheers,
Chris

On 4/12/07 6:13 AM, Ratnesh,V2Solutions India [EMAIL PROTECTED] wrote:

> I think that nutch-0.9 is an unstable version, and it's not for development purposes, but I am not sure enough. We have used nutch-0.8.1 and it's working fine without an error.
>
> What I feel is that you will get enough support from the list if you have nutch-0.8, since most of us have used this version.
>
> But it's encouraging to work with a new version, so I will appreciate it if you update the list with your work if you come up with any solution, so that others can get help from this.
>
> Thanks
> Ratnesh,V2Solutions India
>
> Dima Mazmanov wrote:
>
> > Hi,
> >
> > I tried to set up nutch 0.9, but when I execute my script I get the following error.
> >
> > Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/hadoop/util/PlatformName (Unsupported major.minor version 49.0)
> >     at java.lang.ClassLoader.defineClass0(Native Method)
> >     at java.lang.ClassLoader.defineClass(ClassLoader.java:537)
> >     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
> >     at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
> >     at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
> >     at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
> >     at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
> >     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
> >     at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
> >     at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
> > Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/nutch/indexer/IndexMerger (Unsupported major.minor version 49.0)
> >     at java.lang.ClassLoader.defineClass0(Native Method)
> >     at java.lang.ClassLoader.defineClass(ClassLoader.java:537)
> >     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
> >     at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
> >     at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
> >     at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
> >     at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
> >     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
> >     at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
> >     at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
> >
> > How can I solve it?
> >
> > Thanks

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B Mailstop: 171-246
___
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
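
The "major.minor version 49.0" in the error is the class-file format version: major version 49 corresponds to Java 5, and a 1.4 JVM accepts nothing newer than 48. The sketch below reads that header and maps it to a release. `ClassVersion` is a hypothetical helper written for this note; it relies only on the class-file layout (a 4-byte magic number, then a 2-byte minor and a 2-byte major version).

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Sketch: reads the class-file version header and maps it to a Java release. */
final class ClassVersion {

    /** Maps a class-file major version to the Java release that introduced it. */
    static String javaRelease(int major) {
        if (major < 45) {
            throw new IllegalArgumentException("not a valid major version: " + major);
        }
        if (major <= 48) {
            return "1." + (major - 44);   // 45 -> 1.1, 46 -> 1.2, 47 -> 1.3, 48 -> 1.4
        }
        return String.valueOf(major - 44); // 49 -> 5, 50 -> 6, ...
    }

    /** Returns the major version stored at the start of a .class stream. */
    static int readMajor(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        data.readUnsignedShort();          // minor version, ignored here
        return data.readUnsignedShort();   // major version
    }

    public static void main(String[] args) throws IOException {
        System.out.println("49.0 -> Java " + javaRelease(49));
        try (InputStream in = ClassVersion.class.getResourceAsStream("ClassVersion.class")) {
            if (in != null) {
                System.out.println("this class: major " + readMajor(in));
            }
        }
    }
}
```

Checking `java -version` on the machine that runs the script is the quicker test; the sketch is mainly useful for confirming which release a given jar was compiled for.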
Nutch 0.9 officially released!
Hi Folks,

After some hard work from all folks involved, we've managed to push out Apache Nutch, release 0.9. This is the second release of Nutch based entirely on the underlying Hadoop platform. This release includes several critical bug fixes, as well as key speedups described in more detail at Sami Siren's blog:

http://blog.foofactory.fi/2007/03/twice-speed-half-size.html

See the list of changes made in this version:

http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt

The release is available here:

http://www.apache.org/dyn/closer.cgi/lucene/nutch/

Special thanks to (in no particular order): Andrzej Bialecki, Dennis Kubes, Sami Siren, and the rest of the Nutch development team for providing lots of help along the way, and for allowing me to be the release manager!

Enjoy the new release!

Cheers,
Chris
Re: Problem crawling/fetching using https
Folks,

I've gone ahead and added the following comment in nutch-default.xml:

  In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.

Hopefully this will help folks in the future with this.

Thanks!

Cheers,
Chris

On 1/24/07 3:29 PM, Chris Mattmann [EMAIL PROTECTED] wrote:

> Hi Guys,
>
> Yep, I couldn't remember exactly what the issues were. Thanks for digging that up, Andrzej. So, yeah, anyways it may make sense to update nutch-site.xml with the comment below, with "performance problems" replaced with "intermittent problems with the underlying commons-httpclient library". If you guys agree, I'll add the comment to nutch-site...
>
> Cheers,
> Chris
>
> On 1/24/07 3:10 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
>
> > Michael Wechner wrote:
> >
> > > ok. So what about adding a comment to nutch-site.xml, e.g.
> > >
> > > <!-- NOTE: In order to use https please add protocol-httpclient, but be aware of possible performance problems! -->
> >
> > They were not performance problems. There were some issues related to using multiple threads, which would sometimes cause the httpclient library to fail. There was also a logging message produced in the internals of httpclient that was difficult to turn off - but now that we are using log4j this should be straightforward. There was a bug in chunked encoding handling that would cause hangs. There were also other intermittent problems with this library, so after much deliberation we decided to leave the simpler plugin as the default ...
> >
> > These issues may have been solved in a newer version of the httpclient library.
Re: Problem crawling/fetching using https
Hi Michi,

I am pretty sure that in order to support https, you need to enable the protocol-httpclient plugin, which is based on commons-httpclient. There isn't a protocol-https plugin as far as I know. Try that and see if that fixes your issue.

Thanks!

Cheers,
Chris

On 1/24/07 2:29 PM, Michael Wechner [EMAIL PROTECTED] wrote:

> Hi
>
> I try to fetch data from a website using https, whereas I have added
>
>   <value>nutch-extensionpoints|protocol-file|protocol-http|protocol-https
>
> to nutch-site.xml but still receive the following error
>
>   fetch of https://www.foo.bar/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
>
> Is there anything else one has to do? I am using Nutch 0.8.x
>
> Thanks
>
> Michi
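
Why listing protocol-https has no effect can be seen by treating the configured value as what it is, a regular expression over plugin directory names. The sketch below assumes that a plugin is selected only when its whole name matches the expression; `PluginIncludesSketch` is a hypothetical helper, and the list of available plugins is an illustrative subset rather than a complete directory listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: which plugin directory names does a plugin.includes expression select? */
final class PluginIncludesSketch {

    /** Assumes each plugin id must match the whole expression (full match, not substring). */
    static List<String> selected(String includesRegex, List<String> availablePluginIds) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> result = new ArrayList<>();
        for (String id : availablePluginIds) {
            if (includes.matcher(id).matches()) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative subset of plugin directory names; note there is no "protocol-https".
        List<String> available = Arrays.asList(
                "nutch-extensionpoints", "protocol-file", "protocol-http", "protocol-httpclient");

        System.out.println(selected(
                "nutch-extensionpoints|protocol-file|protocol-http|protocol-https", available));
        System.out.println(selected(
                "nutch-extensionpoints|protocol-file|protocol-httpclient", available));
    }
}
```

Under that assumption, the first expression selects no plugin that could handle https, because the `protocol-https` alternative matches no existing directory; the second selects protocol-httpclient, as suggested in the reply.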
Re: Problem crawling/fetching using https
Hi Michi,

> Btw, wouldn't it make sense to add protocol-httpclient as default, because I guess I am not the only one trying to fetch pages using https?

Indeed. The issue with this was in fact that some time ago, the powers that be decided that it probably made sense to make protocol-httpclient the default. However, due to some performance issues with the underlying commons-httpclient Apache library (I think), it was decided to go with protocol-http, which turned out to be much faster/more reliable, etc., at the expense of not natively supporting HTTPS.

I wonder what the user community thinks about this now though? What do other people think? Have the issues with protocol-httpclient gone away, such that it makes sense to enable it again?

Cheers,
Chris

> Thanks again
>
> Michi
>
> > Thanks!
> >
> > Cheers,
> > Chris
> >
> > On 1/24/07 2:29 PM, Michael Wechner [EMAIL PROTECTED] wrote:
> >
> > > Hi
> > >
> > > I try to fetch data from a website using https, whereas I have added
> > >
> > >   <value>nutch-extensionpoints|protocol-file|protocol-http|protocol-https
> > >
> > > to nutch-site.xml but still receive the following error
> > >
> > >   fetch of https://www.foo.bar/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
> > >
> > > Is there anything else one has to do? I am using Nutch 0.8.x
> > >
> > > Thanks
> > >
> > > Michi
Re: Problem crawling/fetching using https
Hi Guys,

Yep, I couldn't remember exactly what the issues were. Thanks for digging that up, Andrzej. So, yeah, anyways it may make sense to update nutch-site.xml with the comment below, with "performance problems" replaced with "intermittent problems with the underlying commons-httpclient library". If you guys agree, I'll add the comment to nutch-site...

Cheers,
Chris

On 1/24/07 3:10 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

> Michael Wechner wrote:
>
> > ok. So what about adding a comment to nutch-site.xml, e.g.
> >
> > <!-- NOTE: In order to use https please add protocol-httpclient, but be aware of possible performance problems! -->
>
> They were not performance problems. There were some issues related to using multiple threads, which would sometimes cause the httpclient library to fail. There was also a logging message produced in the internals of httpclient that was difficult to turn off - but now that we are using log4j this should be straightforward. There was a bug in chunked encoding handling that would cause hangs. There were also other intermittent problems with this library, so after much deliberation we decided to leave the simpler plugin as the default ...
>
> These issues may have been solved in a newer version of the httpclient library.
Re: Indexing xml documents on local file system
Hi Thorsten On 11/27/06 4:00 AM, Thorsten Scherler [EMAIL PROTECTED] wrote: Reading the wiki and the docu I get the impression I need to write my own implementation of an indexer/searcher plugin, which is able to filter/index crucial filter information such as summary year=2006 number=209 date=27-10-2006 section=1, organisation name=Consejería de Economia y Hacienda and disposition type=Resolución . Yes, you may need to write your own parse, indexer and searcher plugins, however, I am currently working on getting the parse-xml plugin into the Nutch sources. The parse-xml plugin includes an indexing filter for the fields that are extracted by the xml parser. The XML parser is configurable to custom schemas and fields that need to be extracted. This plugin is available currently in JIRA, attached to this issue: http://issues.apache.org/jira/browse/NUTCH-185 I am working hard to get this plugin ported to the latest trunk source, and ready to be committed to the sources. I hope to attach a patch within the next week that brings this plugin up to date, and gets the code ready for prime-time (formatting, public javadocs, etc.). Once I attach the patch, you may find that you only need to write your searcher plugin. Then again, in the interest of time, you may go the route for writing your own set of plugins. In that case, you can find examples of how to write the parse/index/query plugins, by looking at the Nutch source, in subversion, available here: Parse plugins: http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/parse-* Index plugins: http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/index-* Query plugins: http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/query-* Still being a newbie to nutch I would appreciate the opinion of experienced devs whether nutch is the right choice and if so how I should start. 
I think that you could do this with Nutch, and if you do, for free, you get:

- Crawling
- Parsing/Indexing
- Search Webapp, and RSS based OpenSearch servlet

You could also do this with Lucene, but I think you may find that you end up maintaining more code, and having to rewrite existing functionality available within Nutch. Just my 2 cents...

Cheers, Chris
Re: java.lang.NullPointerException
Hi there, You need to set your http.agent.name property within $NUTCH_HOME/conf/nutch-default.xml. HTH, Chris On 10/11/06 3:57 AM, Guruprasad Iyer [EMAIL PROTECTED] wrote: Hello, I have Nutch 0.8.1 installed on linux (FC3) along with java 1.5.0_07. When I run the crawl command I get the above error. Here is a snapshot of the log file- 2006-10-11 15:39:16,234 FATAL api.RobotRulesParser - Agent we advertise (null) not listed first in 'http.robots.agents' property! and it says fetcher.Fetcher - fetch of the site failed with: java.lang.NullPointerException Can anybody help? Thanks.
Re: java.lang.NullPointerException
Hi Guruprasad, The property should be set to the agent name that you would like to appear identifying your organization when your Nutch crawling agent visits websites during its crawl. You could set it to foo/bar and it would work fine, but you probably want to think of an appropriate identifying name and then set it to that. Cheers, Chris On 10/11/06 8:36 AM, Guruprasad Iyer [EMAIL PROTECTED] wrote: Hi Chris, Thanks for the reply. But, what value should I set it to? Can you help me on this? Thanks once again. Cheers, Guruprasad On 10/11/06, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, You need to set your http.agent.name property within $NUTCH_HOME/conf/nutch-default.xml. HTH, Chris On 10/11/06 3:57 AM, Guruprasad Iyer [EMAIL PROTECTED] wrote: Hello, I have Nutch 0.8.1 installed on linux (FC3) along with java 1.5.0_07. When I run the crawl command I get the above error. Here is a snapshot of the log file- 2006-10-11 15:39:16,234 FATAL api.RobotRulesParser - Agent we advertise (null) not listed first in 'http.robots.agents' property! and it says fetcher.Fetcher - fetch of the site failed with: java.lang.NullPointerException Can anybody help? Thanks.
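A minimal sketch of what the property might look like is below. The agent name "MyOrgCrawler" is a made-up placeholder, not a recommended value. The reply above points at nutch-default.xml; the same property element can, I believe, also be placed in $NUTCH_HOME/conf/nutch-site.xml, which overrides the defaults. The second property is an assumption based on the FATAL log line quoted above, which complains that the advertised agent is not listed first in http.robots.agents.

```xml
<property>
  <name>http.agent.name</name>
  <!-- Placeholder: choose a name that identifies your organization. -->
  <value>MyOrgCrawler</value>
</property>

<property>
  <name>http.robots.agents</name>
  <!-- Assumption: list the agent name first, followed by the wildcard. -->
  <value>MyOrgCrawler,*</value>
</property>
```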
Re: rss integration
Hi Ernesto,

You need to make sure that the links inside of the RSS files that are getting indexed are not filtered out by your url filter. For instance, say you had an RSS file that had the following links:

http://foo.com/news/
http://foo.bar.com/sports/
http://bar.foo.com/breaking/news/highlights

You would need to add support in your url filter for each of the different host names and paths that you would be indexing. In your example below, I'm pretty sure that your URL filter limits you to only those 2 hosts, rss.cnn.com and www.cnn.com. I think that if you changed your filter, for example to:

+^http://([a-z0-9]*\.)*cnn.com/

that might help. Ensure that the links present in the CNN RSS files fall within the *.cnn.com domain; otherwise, update your url filter accordingly. More specific comments below:

On 9/10/06 11:23 PM, Ernesto De Santis [EMAIL PROTECTED] wrote:

Hi Chris

Thanks for your response. But I can't get it to work. Every time, it indexes the whole channel as one Document. I did these steps (to index a cnn channel):

1- write in my seed file, with just one seed: http://rss.cnn.com/rss/cnn_topstories.rss

Good, that's the right thing to do.

2- include the parser: In the file nutch-default.xml, tag plugin.includes, I include the rss parser:

<value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|index-url-category</value>

Perfect.

3- Accept cnn hosts: In the file crawl.urlfilter.txt I wrote:

+^http://rss.cnn.com/
+^http://www.cnn.com/

See my comments above here. I think that you need to change these.

Then I run the crawler, but I always get an index with one Document. I tried some more things, without success... (like setting db.ignore.internal.links to false, and changing the mimetype parsers order; I read about some problem with that in a post of yours). Do you know what I'm forgetting? How can I be sure that parse-rss is parsing some content? Can I get some log about that?
Yup, there should be some information in the nutch.log file. Do a grep for parse-rss or RSSParser in the log file.

About outlinks, I don't understand what I must do with them. Do I need to do something with outlinks after parse-rss works?

Nope. Outlinks are links coming out of a page of content. So, if there are 5 links in a web page, or an RSS document, then there are 5 so-called Outlinks in Nutch terminology. During the parsing phase, as content is parsed individually, Nutch requires a parser to collect any Outlinks found in a particular piece of content and return them to the Fetcher so that they too can be crawled.

HTH, Chris

Thanks a lot ... again. Ernesto.

Chris Mattmann wrote:

Hi Ernesto, The RSSParser in Nutch does in fact index the individual item links: they are added as Outlinks during each iteration in which the RSSParser is called. Both the channel text and the item text are indexed. Also, since each Item link is added as an Outlink to the list of returned Outlinks, Nutch is able to crawl many urls that can come out of a single RSS feed. HTH, Chris

On 9/10/06 5:54 PM, Ernesto De Santis [EMAIL PROTECTED] wrote:

Hi all

I'm trying to integrate an rss and atom source into my nutch index. I see that nutch has a RSSParser, but it seems that it indexes the whole source as one source, right? I want to index each item separately. Has somebody done it? What's the best approach? I am thinking about an external process that adds Documents to the nutch (lucene) index using an rss fetcher like Rome. The negative point about it is that it isn't integrated with nutch. I don't know the details of the nutch core well enough to hack it, and I don't know if it is possible to integrate it in nutch.

Thanks a lot! Ernesto.
Re: intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop
Hi there Tomi, On 8/30/06 12:25 PM, Tomi NA [EMAIL PROTECTED] wrote: I'm attempting to crawl a single samba mounted share. During testing, I'm crawling like this: ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20 I'm using luke 0.6 to query and analyze the index. PROBLEMS 1.) search by file type doesn't work I expected that a search file type:pdf would have returned a list of files on the local filesystem, but it does not. I believe that the keyword is type, so your query should be type:pdf (without the quotes). I'm not positive about this either, but I believe you have to give the fully qualified mimeType, as in application/pdf. Not definitely sure about that though so you should experiment. Additionally, in order for the mimeTypes to be indexed properly, you need to have the index-more plugin enabled. Check your $NUTCH_HOME/conf/nutch-site.xml, and look for the property plugin.includes and make sure that the index-more plugin is enabled there. 2.) invalid nutch file type detection I see the following in the hadoop.log: --- 2006-08-30 15:12:07,766 WARN parse.ParseUtil - Unable to successfully parse content file:/mnt/bobdocs/acta.zip of type application/zip 2006-08-30 15:12:07,766 WARN fetcher.Fetcher - Error parsing: file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at 1024000 bytes. Parser can't handle incomplete pdf file. --- acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens. This may result from the contentType returned by the web server for acta.zip. Check the web server that the file is hosted on, and see what the server responds for the contentType for that file. Additionally, you may want to check if magic is enabled for mimeTypes. This allows the mimeType to be sensed through the use of hex codes compared with the beginning of each file. 3.) Why is the TextParser mapped to application/pdf and what has that have to do with indexing a .txt file? 
---
2006-08-30 15:12:02,593 INFO fetcher.Fetcher - fetching file:/mnt/bobdocs/popis-vg-procisceni.txt
2006-08-30 15:12:02,916 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType application/pdf via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/pdf
---

The TextParser *was* enabled as a last-resort means of extracting *some* content from a PDF file, that is, if the parse-pdf plugin wasn't enabled, or it failed for some reason. Since parse-text is the 2nd option for parsing PDF files, there most likely was some sort of error in the original PDF parser. The way that the ParserFactory works now is that it iterates through a preference list of parsers (specified in $NUTCH_HOME/conf/parse-plugins.xml), and tries to parse the underlying content. The first successful parse is returned back to the Fetcher.

4.) Some .doc files can't be indexed, although I can open them via openoffice 2 with no problems

---
2006-08-30 15:12:02,991 WARN parse.ParseUtil - Unable to successfully parse content file:/mnt/bobdocs/cards2005.doc of type application/msword
2006-08-30 15:12:02,991 WARN fetcher.Fetcher - Error parsing: file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as micrsosoft document. java.lang.StringIndexOutOfBoundsException: String index out of range: -1024
---

What version of MS Word were you trying to index? I believe that the POI library used by the word parser can only handle certain versions of MS Word documents, although I'm not positive about this.

As for 5 and 6, I'm not entirely sure about those problems. I wish you luck in solving both of them though, and hope what I said above helps you out.

Thanks! Cheers, Chris

5.)
MoreIndexingFilter doesn't seem to work

The relevant part of the hadoop.log file:

---
2006-08-30 15:13:40,235 WARN more.MoreIndexingFilter - file:/mnt/bobdocs/EU2007-2013.pdf org.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty
---

This happens with other file types, as well:

---
2006-08-30 15:13:54,697 WARN more.MoreIndexingFilter - file:/mnt/bobdocs/popis-vg-procisceni.txt org.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty
---

6.) At the moment, I'm crawling the same directory (/mnt/bobdocs), the crawl process seems to be stuck in an infinite loop and I have no way of knowing what's going on as the .log isn't flushed until the process finishes.

ENVIRONMENT

logs/hadoop.log inspection reveals things like this: My (relevant) crawl settings are:

---
<name>db.max.anchor.length</name> <value>511</value>
<name>db.max.outlinks.per.page</name> <value>-1</value>
<name>fetcher.server.delay</name> <value>0</value>
<name>fetcher.threads.fetch</name> <value>5</value>
<name>fetcher.verbose</name> <value>true</value>
<name>file.content.limit</name>
Re: RSS search by nutch
Hi there Dima,

I'm not exactly sure what you mean by real time, but there is an RSS Parsing plugin in Nutch that can parse RSS feeds that Nutch encounters during its crawl. You can enable parse-rss by opening up $NUTCH_HOME/conf/nutch-site.xml, and searching for the property plugin.includes. For the value of plugin.includes, ensure that there is an entry for parse-rss somewhere in that property value.

HTH, Chris

On 8/28/06 10:44 AM, Dima Gritsenko [EMAIL PROTECTED] wrote:

Hi, Does nutch have a class for searching incoming RSS feeds in real time? Thank you. Dima.
Re: RSS search by nutch
Hi Jeremy, On 8/28/06 10:18 AM, HUYLEBROECK Jeremy RD-ILAB-SSF [EMAIL PROTECTED] wrote: The Nutch Feed/RSS plugin (parse-rss) only allows you to search the entire channel/feed text, not items individually. Actually, this isn't entirely the case. parse-rss actually indexes the item text (see line 148 in RSSParser.java) as well. Additionally, parse-rss adds the individual item links to the Outlinks (see lines 161 and 163 in RSSParser.java) , and they get crawled as well, in addition to the channel text (see line 123 in RSSParser.java) and channel outlink (see lines 130 and 132 in RSSParser.java). You'll have to develop your own if it's what you are trying to do. I also found that the feedparse library used by parse-rss doesn't read properly all formats and I myself moved to the ROME library for now. I haven't really noticed any formats not really handled by commons-feedparser. What formats have you noticed that it doesn't handle? Cheers, Chris -Original Message- From: Dima Gritsenko [mailto:[EMAIL PROTECTED] Sent: Monday, August 28, 2006 10:44 AM To: nutch-user@lucene.apache.org Subject: RSS search by nutch Hi, Does nutch have a class for searching incoming RSS feeds in real time? Thank you. Dima.
Re: Speeding up compilation without compiling plugins
Hi Michael, I believe that there is an ant task called compile-core. If you just type: # ant compile-core Rather than: # ant You should be good to go. HTH, Chris On 8/25/06 5:48 AM, Michael Wechner [EMAIL PROTECTED] wrote: Hi How can I disable the compiling of all plugins such that I can speedup overall compile when I just did changes within the core? Thanks Michi
Re: [Nutch-0.8] Missing WAR file
Hi Guys,

On 8/12/06 9:27 AM, Hou Keat Lee [EMAIL PROTECTED] wrote:

Hi, Maybe I'm missing something here. If the packaged WAR file is supposed to be used, how does nutch link back to my crawling results and indexes?

Another option for this would be to use the generated nutch.xml file that appears in the build directory (e.g., $NUTCH_HOME/build) when you run the ant war command. Since NUTCH-210, this context.xml file is generated and allows you to adapt the runtime parameters (e.g., index dir) without touching the nutch.war file. Instead of placing nutch.war in /path/to/tomcat/webapps/, place nutch.xml in there (for Tomcat 4.x), or in /path/to/tomcat/conf/Catalina/localhost/ (for Tomcat 5.x).

Cheers, Chris

Also, after deploying the WAR file, I've encountered some permission error when trying to do a search. What are the permissions required for the search? Thanks. See below the exception thrown:

==
*exception*
org.apache.jasper.JasperException: access denied (java.util.PropertyPermission user.dir read)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:372)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:585)
    org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
    java.security.AccessController.doPrivileged(Native Method)
    javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
    org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)
    org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)

*root cause*
java.security.AccessControlException: access denied (java.util.PropertyPermission user.dir read)
    java.security.AccessControlContext.checkPermission(AccessControlContext.java:264)
    java.security.AccessController.checkPermission(AccessController.java:427)
    java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
    java.lang.SecurityManager.checkPropertyAccess(SecurityManager.java:1285)
    java.lang.System.getProperty(System.java:627)
    org.apache.hadoop.fs.LocalFileSystem.<init>(LocalFileSystem.java:31)
    org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:99)
    org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
    org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:94)
    org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:83)
    org.apache.nutch.searcher.NutchBean.get(NutchBean.java:70)
    org.apache.jsp.search_jsp._jspService(search_jsp.java:104)
    org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:585)
    org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
    java.security.AccessController.doPrivileged(Native Method)
    javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
    org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)
    org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)
==

On 8/13/06, Sami Siren [EMAIL PROTECTED] wrote:

Hou Keat Lee wrote:

Hi all, I'm trying out the nutch on my Ubuntu box.
I've managed to follow the tutorial for Nutch v0.8 and manage to follow the steps to perform crawling. However, when the crawl completed I didn't see the expected WAR. Is there something wrong with the crawling and thus the WAR file is not created automatically? I've taken a look at the log and didn't see anything wrong. The .war file is not generated during crawling but is distributed as part of the released nutch-0.8.tar.gz package. Location of file is nutch-0.8/nutch-0.8.war -- Sami Siren
Re: Feedparser 0.6 fork source code
Hi Jeremy, I've uploaded the fork-src to my USC website. Here is the URL: http://www-scf.usc.edu/~mattmann/feedparser-src-fork.tar.gz I'll leave the file up there for a few days at least, so feel free to grab it at your leisure. Thanks, Chris On 8/8/06 4:55 PM, HUYLEBROECK Jeremy RD-ILAB-SSF [EMAIL PROTECTED] wrote: Chris (or anyone having it), could you share again the source code of the common-feedparser fork used in nutch? The zip file you shared a year ago is not on your site anymore. Thanks! Jeremy.
Re: Starting Nutch in init.d?
Guys,

Sorry, I misspoke: the issue was actually NUTCH-210, not NUTCH-245. You can view the issue at: http://issues.apache.org/jira/browse/NUTCH-210

Cheers, Chris

On 7/28/06 10:29 AM, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi Guys,

In 0.8, it's even easier than that: Since NUTCH-245, we now have an official context.xml file that is built when the war target is executed. So, check the build directory for a nutch.xml file. Copy /path/to/nutch/build/nutch.xml to /path/to/tomcat/home/conf/Catalina/localhost/ (in Tomcat 5.x), or to /path/to/tomcat/home/webapps (in Tomcat 4.x). Then, edit the nutch.xml file to point to the location of your nutch WAR. Inside of nutch.xml, you'll be able to set the dynamic properties of the application without having to worry about the unpacked WAR file, or anything else.

Hope that helps! Cheers, Chris

On 7/28/06 9:50 AM, Matthew Holt [EMAIL PROTECTED] wrote:

You don't need to cd to the nutch directory for the startup script. All you need to do is edit the nutch-site.xml that is found within the nutch servlet and include a searcher directory property that tells tomcat where to look for the crawl db. So if you have nutch 0.8, edit the file TOMCAT_PATH/webapps/NUTCH_DIR/WEB-INF/classes/nutch-site.xml and include the following:

<property>
  <name>searcher.dir</name>
  <value>/your_index_folder_path</value>
</property>

I believe the your_index_folder_path is the path to your crawl directory. However, if that doesn't work, make it the path to the index folder within your crawl directory. Now, save that and make sure your script just starts tomcat on init and everything should work fine for you.

Matt

Bill Goffe wrote:

I'd like to start Nutch automatically when I reboot. I wrote a real rough script (see below) that works on my Debian system when the system is up, but I get nothing on a reboot (and the links are set to the /etc/init.d/nutch). Any hints, ideas, or suggestions? I checked the FAQ and the archive but didn't see anything.
In addition, it would be great to get messages going into /var/log to help figure out what is going on but I've had no luck doing that.

Thanks, Bill

## Start and stop Nutch. Note how specific it is to
## (i) Tomcat (typically $CATALINA_HOME/bin/shutdown.sh
## or $CATALINA_HOME/bin/startup.sh) and (ii) the
## directory with the most recent fetch results.

## PATH stuff
PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
PATH=$PATH:/usr/local/SUNWappserver/bin
CLASSPATH=/usr/local/SUNWappserver/jdk/jre/lib
JAVA_HOME=/usr/local/SUNWappserver/jdk
CATALINA_HOME=/usr/local/jakarta-tomcat-5
JAVA_OPTS="-Xmx1024m -Xms512m"

case "$1" in
  start)
    cd /home/bgoffe/nc/40 ## start in correct directory
    /usr/local/jakarta-tomcat-5/bin/startup.sh
    ;;
  stop)
    /usr/local/jakarta-tomcat-5/bin/shutdown.sh
    ;;
  force-reload|restart)
    /usr/local/jakarta-tomcat-5/bin/shutdown.sh
    cd /home/bgoffe/nc/40
    /usr/local/jakarta-tomcat-5/bin/startup.sh
    ;;
  *)
    echo "Usage: /etc/init.d/nutch {start|stop|force-reload|restart}"
    exit 1
    ;;
esac
exit 0
Re: Blogger RSS Parsing Error
Hi Mike, The RSS parser for Nutch is based on Kevin Burton's commons-feedparser in the Jakarta Sandbox. Here is the documentation for that feedparser: http://jakarta.apache.org/commons/sandbox/feedparser/ You might want to post to the commons-feedparser email list asking him about your RSS question: he's the real RSS guru, and I bet you he could help you out. As for your guess that it's probably an unrecognized tag, I think you're probably right. Now the question is, your fetch isn't failing because of this, right? I mean, I see in the RSS parser that line 116 (the call to the parse function) is within a try/catch block, so what you are pasting below is just the output of the stack trace, right? Anyways, good luck on your problem! Cheers, Chris -- View this message in context: http://www.nabble.com/Blogger-RSS-Parsing-Error-t1462722.html#a3953532 Sent from the Nutch - User forum at Nabble.com.
Re: Same Error (Version 0.8)
Hi Mike,

Well, one thing that I notice off the bat is that you specify the alias tag in nutch-site.xml (or maybe this was a typo when you posted the message). If it wasn't, the alias tag should go into $NUTCH_HOME/conf/parse-plugins.xml, the same place where you mapped the mimeTypes to plugin ids.

Second, I would ask that you verify that the following are true:

1. you have a plugin called microformats-hreview located in $NUTCH_HOME/src/plugin/microformats-hreview
2. the plugin microformats-hreview has a plugin.xml file
3. the implementation id attribute inside of the plugin.xml file for the microformats-hreview plugin is set to the value org.apache.nutch.microformats.hreview.HReviewParser

Check on those things and let me know what you find out. We'll get to the bottom of this.

Cheers, Chris

-- View this message in context: http://www.nabble.com/Xml--t1050112.html#a3882468 Sent from the Nutch - User forum at Nabble.com.
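To make the placement of the alias tag concrete, here is a rough sketch of how the two pieces might sit together in $NUTCH_HOME/conf/parse-plugins.xml. It follows the layout I recall from the stock 0.8 file, so compare it against your own copy. The text/html mime type is purely an assumption for illustration; the thread does not say which content type the plugin is mapped to.

```xml
<!-- Sketch of parse-plugins.xml entries; not a complete file. -->
<parse-plugins>

  <!-- Assumption: the plugin is meant to parse text/html content. -->
  <mimeType name="text/html">
    <plugin id="microformats-hreview" />
  </mimeType>

  <aliases>
    <!-- Ties the plugin id above to the implementation id declared in plugin.xml. -->
    <alias name="microformats-hreview"
           extension-id="org.apache.nutch.microformats.hreview.HReviewParser" />
  </aliases>

</parse-plugins>
```

The extension-id here has to match the implementation id from item 3 in the checklist above, otherwise the ParserFactory will not be able to resolve the plugin.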
Re: Same Error (Version 0.8)
Hi Mike,

Another thing is: are you making sure that your plugin is being built? That is, did you add an entry in $NUTCH_HOME/src/build.xml for your plugin, underneath the deploy tag (at least)? This will cause your plugin to be built when the rest of the plugins are built, and then copied to $NUTCH_HOME/build, which is where the plugin repository will look for plugins at runtime. Your plugin might not be loaded because of that. Please check and let us know.

Cheers, Chris

On 4/12/06 8:56 AM, mikeyc [EMAIL PROTECTED] wrote:

Chris / Jerome,

Ok. So, now the error message is gone, but my plugin doesn't seem to be getting called (not seeing any of my messages). As listed below, I updated my plugin.xml (similar to microformats-reltag) and removed any entries in the parse-plugins.xml file. Any ideas? Again, thanks for helping me work through these issues - didn't have half as many with version 0.7. ;)

-Mike

-- View this message in context: http://www.nabble.com/Xml--t1050112.html#a3884328 Sent from the Nutch - User forum at Nabble.com.
Re: Same Error (Version 0.8)
Hi Mike, Could you post the snippet from your nutch-site.xml where you enable plugin: org.apache.nutch.xxx.xxx.xxx. Could you also be more specific and post the entire name of the plugin that it printed in your log file? This warning message basically means that there was an entry in the parse-plugins.xml file for your plugin org.apache.nutch.xxx.xxx.xxx, but it was never enabled in nutch-site.xml, (or nutch-default.xml). Thanks, Chris -- View this message in context: http://www.nabble.com/Xml--t1050112.html#a3875572 Sent from the Nutch - User forum at Nabble.com.
Re: Nutch and Hadoop Tutorial Finished
Hi Dennis,

Thanks for your hard work. Where exactly on the wiki is the tutorial? I'm not seeing it.

Cheers, Chris

On 3/20/06 2:52 PM, Dennis Kubes [EMAIL PROTECTED] wrote:

The NutchHadoop tutorial is now up on the wiki. Dennis

-Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 12:49 PM To: nutch-user@lucene.apache.org Subject: RE: Nutch and Hadoop Tutorial Finished

Sorry. Go to http://wiki.apache.org/nutch/ and click on the login link at the top of the page. You'll have to create yourself an account and then when you go back to the wiki front page, you can edit it. I went ahead and created a link on the Front page called NutchHadoopTutorial (in the Administration section). If you click on that link, you'll be prompted to create a new page. Create a blank one and paste in your tutorial. You'll probably want to play with the formatting. There are help links on the wiki that explain how to format pages. If you have any trouble, just shout. Jake.

-Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 1:37 PM To: nutch-user@lucene.apache.org Subject: RE: Nutch and Hadoop Tutorial Finished

Not to act dumb, but how do I add it to the wiki? Dennis

-Original Message- From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 12:20 PM To: nutch-user@lucene.apache.org Subject: RE: Nutch and Hadoop Tutorial Finished

Dennis, How 'bout the wiki. Jake.

-Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Monday, March 20, 2006 1:01 PM To: nutch-user@lucene.apache.org Subject: Nutch and Hadoop Tutorial Finished

All, I have finished a lengthy tutorial on how to setup a distributed implementation of nutch and hadoop. Should I post it on this list or is there a better place for it? Dennis
Re: project vitality?
Hi Richard,

IMHO, if you don't parse something correctly, you cannot rely on the results.

Good, we're on the same page here.

We have all parsed things where you leave a comma out and the parse results are wrong. If there was a bug in Nutch's html parsing would that be a big deal?

Yes, it would be. HTML is the foundation for the web. Its content is the most pervasive out there (as you allude to below).

How about if it parsed the text in a particular tag out of order?

I'm wondering what that has to do with anything? You may want to read up on Lucene (http://lucene.apache.org/). Lucene is the underlying text search api (and index format) that Nutch is built on top of, and I'm wondering if it cares about the order in which a piece of text is given to it?

Pdf is unfortunately not html where you can parse the file sequentially and get an accurate result,

Gonna have to disagree with you on this. You're making a general statement that's not true across the board. I would assert that in many cases, you can still get an accurate result. What about a PDF research paper? Do you care about what order the text comes in if you're just doing general Google-like search? When I go to Google and type grid computing papers, do I care that grid computing comes before some text within the research paper? Possibly, but mainly I care that grid computing was an emphasized phrase within the text. Now, your definition of emphasized may not just be that it's the first text that appears in the paper, in the title say: you may just care that the frequency of grid computing in the paper is higher than a certain threshold compared to other terms. On the other hand, the fact that grid computing is in the title and comes first in the PDF may mean a lot to you. That's the nature of trying to extract structure out of inherently unstructured content.
I'm not saying that the structure or order of text within a document is never useful: I agree that in a lot of cases, it can help you to infer what values are associated with what fields you want to index, etc. All I'm saying is that it's certainly a subset of the greater functionality of just doing free text search, so you shouldn't generalize and say that you can't parse a PDF sequentially and obtain good results.

but its use is second most ubiquitous. PDFBox is not a PDF parsing framework either. It has some PDF parsing algorithms that aren't being used. Google does a good job parsing PDF; Nutch has to do so too if it's going to compete.

Can you show that Google's PDF parsing capability is any better than Nutch's using accepted evaluation methods for PDF? How about some real use cases and real results? Until we can see such numbers, I'm hesitant to believe what you're saying is true. If it is though, then I'm sure that the community would welcome any updates to the PDF parsing plugin that expedite its improvement.

Cheers,
Chris

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]
Sent: Saturday, March 04, 2006 4:10 PM
To: nutch-user@lucene.apache.org
Subject: Re: project vitality?

Hello,

I've been following this conversation for the past week and decided that I'd go ahead and chime in now. I think that honestly this whole thread of discussion needs to be taken off list, because it doesn't really have anything to do with the use of Nutch: what it boils down to is a list of complaints, requests for improvements and whatnot. Nutch's goal is to be a large-scale, open source search engine: it's not a PDF parsing framework, nor is it as thoroughly documented as some commercial software -- although I've run into many commercial software products that don't have the same quality of documentation that Nutch has even now in its nascent stages.
Now that I have said that, I want to express my feeling that it's hard when it takes a week to figure out that invertlinks only applies to version 0.8, and when you ask to become a volunteer, you are met with no response.

You don't need to ask to become a volunteer: just do it. As Doug said, create a patch, submit the patch to JIRA and let the community look at it. Change something on the Wiki if you don't think that the documentation is particularly good there. Use Nutch to do whatever you like, and if you feel that you contributed something that is applicable to a broader community outside of your domain, let people know about it. If it's really cool, I wouldn't worry about people ignoring you: they'll come around.

It's also frustrating when you share some hard-earned insights into something that Nutch needs to work on, like PDF parsing, and your comments don't get a single good response from the Nutch dev team.

The Nutch dev team isn't focused on PDF parsing. Nutch is a search engine framework, and to Nutch, a PDF parser is a black box that conforms to a standard parsing interface that can be swapped out as technology evolves. Right
Re: Which version of rss does parse-rss plugin support?
Hi,

the contentTitle will be a concatenation of the titles of the RSS Channels that we've parsed. So the titles of the RSS Channels are what is delivered for indexing, right?

They're certainly part of it, but not the only part. The concatenation of the titles of the RSS Channels is what is delivered for the title portion of indexing.

If I want the indexer to include more information about an RSS file (such as item descriptions), can I just concatenate them to the contentTitle?

They're already there. There is a variable called indexText: ultimately that variable includes the item descriptions, along with the channel descriptions. That, along with the title portion of indexing, is the full set of textual data delivered by the parser for indexing. So, it already includes that information. Check out lines 137 and 161 in the parser to see what I mean. Also, check out lines 204-207, which are:

    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
        contentTitle.toString(), outlinks, content.getMetadata());
    parseData.setConf(this.conf);
    return new ParseImpl(indexText.toString(), parseData);

You can see that the return from the Parser, i.e., the ParseImpl, includes both the indexText and the parse data (which contains the title text). Now, if you wanted to add any other metadata gleaned from the RSS to the title text, or the content text, you can always modify the code to do that in your own environment. The RSS Parser plugin returns a full channel model and item model that can be extended and used for those purposes.

Hope that helps!

Cheers,
Chris

On 06-2-6, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

That should work: however, the biggest problem will be making sure that text/xml is actually the content type of the RSS that you are parsing, which you'll have little or no control over. Check out this previous post of mine on the list to get a better idea of what the real issue is: http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html

G'luck!
Cheers,
Chris

-----Original Message-----
From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
Sent: Saturday, February 04, 2006 11:40 PM
To: nutch-user@lucene.apache.org
Subject: Re: Which version of rss does parse-rss plugin support?

Hi Chris,

How do I change the plugin.xml? For example, if I want to crawl RSS files that end with xml, do I just add a new element?

    <implementation id="org.apache.nutch.parse.rss.RSSParser" class="org.apache.nutch.parse.rss.RSSParser" contentType="application/rss+xml" pathSuffix="rss"/>
    <implementation id="org.apache.nutch.parse.rss.RSSParser" class="org.apache.nutch.parse.rss.RSSParser" contentType="application/rss+xml" pathSuffix="xml"/>

Am I right?

On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

Sure it will, you just have to configure it to do that. Pop over to $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there is an attribute called pathSuffix. Change that to handle whatever type of RSS file you want to crawl. That will work locally. For web-based crawls, you need to make sure that the content type being returned for your RSS content matches the content type specified in the plugin.xml file that parse-rss claims to support. Note that you might not have *a lot* of success with being able to control the content type for RSS files returned by web servers. I've seen a LOT of inconsistency in the way that they're configured by the administrators, etc. However, just to let you know, there are some people in the group who are working on a solution to this.

Hope that helps.
Cheers,
Chris

On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

Hi Chris,

The files of RSS 1.0 have a postfix of rdf. So will the parser recognize it automatically as an RSS file?

On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website: "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability..."

Hope that helps
RE: Which version of rss does parse-rss plugin support?
Hi there,

That should work: however, the biggest problem will be making sure that text/xml is actually the content type of the RSS that you are parsing, which you'll have little or no control over. Check out this previous post of mine on the list to get a better idea of what the real issue is: http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html

G'luck!

Cheers,
Chris

-----Original Message-----
From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
Sent: Saturday, February 04, 2006 11:40 PM
To: nutch-user@lucene.apache.org
Subject: Re: Which version of rss does parse-rss plugin support?

Hi Chris,

How do I change the plugin.xml? For example, if I want to crawl RSS files that end with xml, do I just add a new element?

    <implementation id="org.apache.nutch.parse.rss.RSSParser" class="org.apache.nutch.parse.rss.RSSParser" contentType="application/rss+xml" pathSuffix="rss"/>
    <implementation id="org.apache.nutch.parse.rss.RSSParser" class="org.apache.nutch.parse.rss.RSSParser" contentType="application/rss+xml" pathSuffix="xml"/>

Am I right?

On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

Sure it will, you just have to configure it to do that. Pop over to $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there is an attribute called pathSuffix. Change that to handle whatever type of RSS file you want to crawl. That will work locally. For web-based crawls, you need to make sure that the content type being returned for your RSS content matches the content type specified in the plugin.xml file that parse-rss claims to support.
Note that you might not have *a lot* of success with being able to control the content type for RSS files returned by web servers. I've seen a LOT of inconsistency in the way that they're configured by the administrators, etc. However, just to let you know, there are some people in the group who are working on a solution to this.

Hope that helps.

Cheers,
Chris

On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

Hi Chris,

The files of RSS 1.0 have a postfix of rdf. So will the parser recognize it automatically as an RSS file?

On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website: "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability..."

Hope that helps.

Thanks,
Chris

On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

I see the test file is of version 0.91. Does the plugin support higher versions like 1.0 or 2.0?

--
《盖世豪侠》 drew rave reviews and kept TVB's ratings high; TVB was pleased, yet still did not give him major roles. Stephen Chow was never one to stay in a small pond: with his comedic talent now evident, he was naturally unwilling to be sidelined, so he moved into film to show his talents on the big screen. Having gained a fine steed and then lost it, TVB naturally came to regret it.
Re: Which version of rss does parse-rss plugin support?
Hi there,

parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website: "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability..."

Hope that helps.

Thanks,
Chris

On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

I see the test file is of version 0.91. Does the plugin support higher versions like 1.0 or 2.0?

--
《盖世豪侠》 drew rave reviews and kept TVB's ratings high; TVB was pleased, yet still did not give him major roles. Stephen Chow was never one to stay in a small pond: with his comedic talent now evident, he was naturally unwilling to be sidelined, so he moved into film to show his talents on the big screen. Having gained a fine steed and then lost it, TVB naturally came to regret it.
RE: indexing issue
Hi Raghavendra,

Pop open your $NUTCH_HOME/conf/parse-plugins.xml file. Look for the <mimeType name="*"> portion of the file. Now, look at the parser tag underneath it. Change that parser id to the one you want to use for your default parser, i.e., in your case, parse-msword.

Hope that helps!

Cheers,
Chris

-----Original Message-----
From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 01, 2006 8:19 AM
To: nutch-user@lucene.apache.org
Subject: indexing issue

Hi,

I have got some files also. How do I use some parser as the default? Currently the text parser does not work well for the file type which I have. If I want to make the doc (Word) parser the default one (in the sense that if no parser is found, Word should be used as the default processor and not the text parser), how do I do it?

Rgds
Prabhu
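The advice above, changing the parser listed under the wildcard (*) mimeType entry in parse-plugins.xml, amounts to a lookup with a fallback: use the parsers registered for the exact content type, and otherwise use whatever is registered for "*". The following is a minimal sketch of that idea only. It is not Nutch's actual parser-selection code; the class ParserPreferences and its methods are hypothetical names introduced for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is NOT Nutch's parser factory.
class ParserPreferences {
    // Content type -> parser plugin ids, listed in order of preference.
    private final Map<String, List<String>> byMimeType = new LinkedHashMap<>();

    // Registers the ordered plugin ids for one content type ("*" is the default entry).
    void register(String mimeType, String... pluginIds) {
        byMimeType.put(mimeType, List.of(pluginIds));
    }

    // Returns the plugins for the exact type, or the "*" entry if the type is unknown.
    List<String> candidatesFor(String mimeType) {
        List<String> exact = byMimeType.get(mimeType);
        if (exact != null) {
            return exact;
        }
        return byMimeType.getOrDefault("*", List.of());
    }

    public static void main(String[] args) {
        ParserPreferences prefs = new ParserPreferences();
        prefs.register("text/html", "parse-html");
        prefs.register("*", "parse-msword"); // default changed from the text parser
        System.out.println(prefs.candidatesFor("text/html"));
        System.out.println(prefs.candidatesFor("application/x-unknown"));
    }
}
```

With the wildcard entry pointing at parse-msword, any content type without its own entry would be handed to the Word parser first, which is the behavior asked for in the original question.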
RE: indexing issue
Hi Prabhu,

And also in the cached page, I get frequent errors for the file system. Is it because of the content-type bug (which you are working on)?

Not sure, what errors are you getting? I fixed a bug in cached.jsp that had to do with an absolute versus relative link (see NUTCH-112). Jerome C committed that a while back. Was your problem with cached.jsp having to do with absolute versus relative links?

Thanks,
Chris

Rgds
Prabhu

On 2/1/06, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi Raghavendra,

Pop open your $NUTCH_HOME/conf/parse-plugins.xml file. Look for the <mimeType name="*"> portion of the file. Now, look at the parser tag underneath it. Change that parser id to the one you want to use for your default parser, i.e., in your case, parse-msword.

Hope that helps!

Cheers,
Chris

-----Original Message-----
From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 01, 2006 8:19 AM
To: nutch-user@lucene.apache.org
Subject: indexing issue

Hi,

I have got some files also. How do I use some parser as the default? Currently the text parser does not work well for the file type which I have. If I want to make the doc (Word) parser the default one (in the sense that if no parser is found, Word should be used as the default processor and not the text parser), how do I do it?

Rgds
Prabhu
Re: resource pool for nutchbean
Hi Raghavendra,

I think that this is a good idea. What about a commons-pool (http://jakarta.apache.org/commons/pool/) implementation? The Nutch bean pool could be built using the basic API classes from this package...

Cheers,
Chris

On 1/5/06 1:43 PM, Raghavendra Prabhu [EMAIL PROTECTED] wrote:

What I am saying is that NutchBean is not instantiated in the servlet context and garbage collected. The server has a way of allocating a NutchBean to users who request one from its base and giving it to them. It must also free the NutchBeans, either periodically or when the number of NutchBeans has reached a certain size.

Raghavendra Prabhu

On 1/6/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote:

No, I don't think so. What I am suggesting is that we have NutchBeans instantiated and we store them. Whenever a user comes and searches, he will be given a NutchBean. After he searches he returns it to the pool, and when someone else searches in the meantime he would get the same bean (note that a new bean is not created). Only if a bean is not available does a new bean get created. This makes it faster, as different users share the same NutchBean and a new NutchBean is not created.

Note: NutchBean is shared across different users, whereas right now it is only for a single user and garbage collected. Here we control the NutchBean instantiation and we have to come up with a way to free it.

On 1/6/06, Byron Miller [EMAIL PROTECTED] wrote:

If I'm not mistaken, doesn't the opensearch servlet get around this issue? You could then post-process the XML through a stylesheet/CSS or your favorite scripting language.
-byron

--- Raghavendra Prabhu [EMAIL PROTECTED] wrote:

Right now, whenever a user comes and searches, a NutchBean is created. We should have a mechanism where this NutchBean is pooled; I mean it is created and stored so that it can be given to the user immediately. After the user has used the NutchBean, he returns it (for example, at orkut, we get a message saying "doughnut not available"). This will make search results faster and more efficient. Only when there are parallel users will NutchBeans get created. Any comments on the above issue?

Rgds
Prabhu
Re: resource pool for nutchbean
Sounds great. Could you create an issue in JIRA (http://issues.apache.org/jira/browse/NUTCH) about this, and mark it as an improvement? That way we can track progress on it, and attach patches.

Thanks,
Chris

On 1/5/06 1:56 PM, Raghavendra Prabhu [EMAIL PROTECTED] wrote:

Yes, we should do this. It will considerably improve performance. We should start building upon this.

Rgds
Raghavendra Prabhu

On 1/6/06, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi Raghavendra,

I think that this is a good idea. What about a commons-pool (http://jakarta.apache.org/commons/pool/) implementation? The Nutch bean pool could be built using the basic API classes from this package...

Cheers,
Chris

On 1/5/06 1:43 PM, Raghavendra Prabhu [EMAIL PROTECTED] wrote:

What I am saying is that NutchBean is not instantiated in the servlet context and garbage collected. The server has a way of allocating a NutchBean to users who request one from its base and giving it to them. It must also free the NutchBeans, either periodically or when the number of NutchBeans has reached a certain size.

Raghavendra Prabhu

On 1/6/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote:

No, I don't think so. What I am suggesting is that we have NutchBeans instantiated and we store them. Whenever a user comes and searches, he will be given a NutchBean. After he searches he returns it to the pool, and when someone else searches in the meantime he would get the same bean (note that a new bean is not created). Only if a bean is not available does a new bean get created. This makes it faster, as different users share the same NutchBean and a new NutchBean is not created.

Note: NutchBean is shared across different users, whereas right now it is only for a single user and garbage collected. Here we control the NutchBean instantiation and we have to come up with a way to free it.

On 1/6/06, Byron Miller [EMAIL PROTECTED] wrote:

If I'm not mistaken, doesn't the opensearch servlet get around this issue?
You could then post-process the XML through a stylesheet/CSS or your favorite scripting language.

-byron

--- Raghavendra Prabhu [EMAIL PROTECTED] wrote:

Right now, whenever a user comes and searches, a NutchBean is created. We should have a mechanism where this NutchBean is pooled; I mean it is created and stored so that it can be given to the user immediately. After the user has used the NutchBean, he returns it (for example, at orkut, we get a message saying "doughnut not available"). This will make search results faster and more efficient. Only when there are parallel users will NutchBeans get created. Any comments on the above issue?

Rgds
Prabhu
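The pooling behavior described in this thread (hand out an idle bean if one exists, create a new one only when none is available, take beans back after use, and cap how many idle beans are kept) can be sketched in a few lines. The sketch below is an assumption-laden illustration: it deliberately does not use the commons-pool API suggested above, it is generic rather than tied to NutchBean, and the names SimplePool, borrow, giveBack, and maxIdle are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical, minimal object pool; a real deployment would more likely
// build on a pooling library such as commons-pool, as suggested in the thread.
class SimplePool<T> {
    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory; // creates a new object when the pool is empty
    private final int maxIdle;         // upper bound on idle objects kept around

    SimplePool(Supplier<T> factory, int maxIdle) {
        this.factory = factory;
        this.maxIdle = maxIdle;
    }

    // Hands out an idle object if one exists; otherwise creates a new one.
    synchronized T borrow() {
        T pooled = idle.pollFirst();
        return pooled != null ? pooled : factory.get();
    }

    // Takes an object back; objects beyond maxIdle are dropped and left to the GC.
    synchronized void giveBack(T obj) {
        if (idle.size() < maxIdle) {
            idle.addFirst(obj);
        }
    }

    synchronized int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) {
        // StringBuilder stands in for an expensive-to-create search bean.
        SimplePool<StringBuilder> pool = new SimplePool<>(StringBuilder::new, 2);
        StringBuilder first = pool.borrow();
        pool.giveBack(first);
        System.out.println(pool.borrow() == first); // the same instance is reused
    }
}
```

New objects are created only when concurrent callers outnumber the idle objects, which matches the "only when there are parallel users" point in the original proposal. Whether a given bean is safe to reuse across requests depends on the bean itself and is not addressed here.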
Re: Crawling blogs and RSS
Hi Miguel,

Actually it's not out of priority, unfortunately, because of the generic nature of the mime type text/xml. It turns out that a lot of RSS comes back, as configured by the web server, with the content type text/xml, even though it's recommended that application/rss+xml be used as the mime type for RSS. Most web server admins don't really spend the time configuring this mime type correctly in their web server. Further, if you go look at the IANA list of mime types, there really isn't a mime type specified for RSS (although RDF has application/rdf+xml, which is sometimes used to identify RSS as well).

So when I coded up the parse-plugins.xml file, I just noted the fact that text/xml isn't really the standard mime type for RSS; it's just the mime type for any type of XML document, i.e., something that starts out with <?xml version=..., which can conform to *any* XML Schema or DTD as specified, which means identifying a document as text/xml doesn't really get you anywhere, unfortunately. That's why I set the parse-text plugin to be the highest priority for text/xml, as in my mind it was most suited to handle the generic nature of XML documents. I listed parse-html as 2nd in priority because XHTML is becoming a more popular and pervasive form of content. Finally, parse-rss is last, well, because I think it should be. :-) If you think about it, parse-rss is really only meant to handle RSS feeds, which may, or may not, come back with the mime type text/xml.

So, to answer your question, yes, parse-rss is last in the default parse-plugins file. However, this doesn't mean it has to be that way in your file. You are free to modify this list. Remember that order matters: the order in which the plugins are listed underneath a mime type specifies their order of preference during parsing.
You can find the full specification of this at: http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/ which was authored jointly by myself, Jerome Charron, and Sebastien LeCallonec.

One part of fixing this problem is correct mime type identification for document types, which I know that Jerome is working on an update to; he will soon have a new mime type registry committed to Nutch. The other part of this, however, is deeper than just correct mime type identification. It has to do with understanding the appropriate DTD or XML Schema that an XML document conforms to. Only then will we understand the right parser to call for an XML document. This could be handled in a number of ways; off the top of my head, 2 ways come to mind:

1. Having a generic text/xml reading plugin that could parse out the DTD or XML Schema used by an XML document, and then call the right sub XML parsing plugin that knows how to handle that DTD or schema.

2. Adding an attribute to the plugin.xml file that specifies the DTD or Schema that an XML parsing plugin supports, and then doing the resolution in a decentralized fashion whenever the mime type text/xml is encountered.

Anyways, I have been thinking about this for a while, and will start working on a proposal and solution in the near future. For now, if you like, you could create a JIRA issue about this as a wish or improvement to be worked on in the (near) future.

FYI, here are a few interesting articles on the subject:
http://spazioinwind.libero.it/pierfederici/blog/56.html
http://www.rassoc.com/gregr/weblog/archive.aspx?post=662

Thanks,
Chris

On 10/18/05 9:36 AM, Miguel A Paraz [EMAIL PROTECTED] wrote:

Hi,

I'm trying to set up Nutch to crawl blogs. For nutch-site.xml, I added parse-rss to plugin.includes:

    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|rss)|index-more|query-(basic|site|url)</value>

and set db.ignore.internal.links to false.
I noticed that in parse-plugins.xml:

    <mimeType name="text/xml">
      <plugin id="parse-text" />
      <plugin id="parse-html" />
      <plugin id="parse-rss" />
    </mimeType>

is this by order of priority, and parse-rss is last? I tried injecting a single URL, my blog feed, which is text/xml: http://migs.paraz.com/w/feed/ It apparently isn't parsed.

Thanks in advance.
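The first approach Chris mentions in his reply, a generic text/xml step that inspects the document before choosing a parser, could start from something as simple as reading the root element. The sketch below is one possible reading of that idea, not the proposal's actual design and not Nutch code: the class XmlRootSniffer and its methods are hypothetical, and it checks the root element name rather than the DTD or schema. It relies on the standard Java StAX API (javax.xml.stream) and on the fact that RSS 0.9x/2.0 feeds use an rss root element, RSS 1.0 uses rdf:RDF, and Atom uses feed.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: decides whether a text/xml document looks like a feed.
class XmlRootSniffer {
    // Returns the local name of the root element, or null if the text is not parseable XML.
    static String rootElement(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not resolve DTDs or external entities while sniffing untrusted content.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName(); // first start tag is the root
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            return null;
        }
    }

    // "RDF" also matches RDF documents that are not RSS 1.0, so this is only a hint.
    static boolean looksLikeFeed(String xml) {
        String root = rootElement(xml);
        return "rss".equals(root) || "RDF".equals(root) || "feed".equals(root);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeFeed("<rss version=\"2.0\"><channel/></rss>"));
        System.out.println(looksLikeFeed("<html><body/></html>"));
    }
}
```

A dispatcher built on this would route documents that look like feeds to the RSS parser and leave everything else served as text/xml to the generic text or HTML parsers, independent of the order configured for text/xml.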
RE: [Nutch-general] RE: RSS Feed Parser
Hi Jeff,

Okay, here is the link to the commons-feedparser source that includes my modifications: http://www-scf.usc.edu/~mattmann/feedparser-0.6-fork-src.zip

Thanks!

Cheers,
Chris

-----Original Message-----
From: Jeff Bowden [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:45 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [Nutch-general] RE: RSS Feed Parser

Yes please, that would be great. I couldn't even figure out where to find the 0.6 version of feedparser, much less your patches to it.

Chris Mattmann wrote:

Hi Jeff,

commons-feedparser-fork was a branched-off version of the feedparser 0.6 base code that I made, which removed some of the specific jar files that were part of the standard 0.6 feedparser distro and that conflicted with the jar files included in Nutch's lib directory. Specifically, I changed it so that the core jaxen libraries that the feed parser relied on weren't dom4j, but in fact were jdom (see postings on the Nutch list around March 2005 between John X, Stefan G. and me). This required changing about 9 or 10 of the source files for the feedparser to use the jdom Node classes rather than the dom4j ones. If you like, I can put up a link to the feedparser forked code on my website, and post the link to the list.

Thanks,
Chris

On 8/24/05 2:04 PM, American Jeff Bowden [EMAIL PROTECTED] wrote:

Where can I obtain the source of commons-feedparser-0.6-fork.jar? It doesn't appear to be in commons svn or on the feedparser site.

Chris Mattmann wrote:

Hi Zaheed,

Thanks for the nice comments.
I've gone ahead and written an HTML page that summarizes what I sent to Zaheed with respect to installing the parse-rss plugin. You can find the small guide here: http://www-scf.usc.edu/~mattmann/parse-rss-install.html

Thanks,
Chris

-----Original Message-----
From: Zaheed Haque [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 11, 2005 11:49 AM
To: nutch-user@lucene.apache.org
Subject: RSS Feed Parser

Hello:

I am really hoping that Chris Mattmann's RSS parser will make it into release 0.7. http://issues.apache.org/jira/browse/NUTCH-30 I got it working from last night's SVN. I believe newbie users like me would benefit very much from having it as part of the distribution. +1 for this plugin! Thanks Chris for solving my problem!!

--
Best Regards
Zaheed Haque
RE: [Nutch-general] RE: RSS Feed Parser
Hi Jeff,

Yup, that's correct.

Thanks,
Chris

-----Original Message-----
From: American Jeff Bowden [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 25, 2005 12:37 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: [Nutch-general] RE: RSS Feed Parser

I notice that build.xml still creates commons-feedparser-0.5.0-RC1.jar, but I'll assume you're just renaming it manually to -0.6-fork. Thanks.

Chris Mattmann wrote:

Hi Jeff,

Okay, here is the link to the commons-feedparser source that includes my modifications: http://www-scf.usc.edu/~mattmann/feedparser-0.6-fork-src.zip

Thanks!

Cheers,
Chris

-----Original Message-----
From: Jeff Bowden [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 10:45 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [Nutch-general] RE: RSS Feed Parser

Yes please, that would be great. I couldn't even figure out where to find the 0.6 version of feedparser, much less your patches to it.

Chris Mattmann wrote:

Hi Jeff,

commons-feedparser-fork was a branched-off version of the feedparser 0.6 base code that I made, which removed some of the specific jar files that were part of the standard 0.6 feedparser distro and that conflicted with the jar files included in Nutch's lib directory.
Specifically, I changed it so that the core jaxen libraries that the feed parser relied on weren't dom4j, but in fact were jdom (see postings on the Nutch list around March 2005 between John X, Stefan G. and me). This required changing about 9 or 10 of the source files for the feedparser to use the jdom Node classes rather than the dom4j ones. If you like, I can put up a link to the feedparser forked code on my website, and post the link to the list.

Thanks,
Chris

On 8/24/05 2:04 PM, American Jeff Bowden [EMAIL PROTECTED] wrote:

Where can I obtain the source of commons-feedparser-0.6-fork.jar? It doesn't appear to be in commons svn or on the feedparser site.

Chris Mattmann wrote:

Hi Zaheed,

Thanks for the nice comments. I've gone ahead and written an HTML page that summarizes what I sent to Zaheed with respect to installing the parse-rss plugin. You can find the small guide here: http://www-scf.usc.edu/~mattmann/parse-rss-install.html

Thanks,
Chris

-----Original Message-----
From: Zaheed Haque [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 11, 2005 11:49 AM
To: nutch-user@lucene.apache.org
Subject: RSS Feed Parser

Hello:

I am really hoping that Chris Mattmann's RSS parser will make it into release 0.7. http://issues.apache.org/jira/browse/NUTCH-30 I got it working from last night's SVN. I believe newbie users like me would benefit very much from having it as part of the distribution. +1 for this plugin! Thanks Chris for solving my problem!!
--
Best Regards
Zaheed Haque
Re: Chris Mattmann's RSS plugin? NUTCH-30
Hi Andrzej,

At the time that I was working diligently on this plugin (April/May), I had done some thorough research into finding what I felt would be the most flexible, reliable way to parse RSS files. The RSS feed parser out of the jakarta-commons sandbox was what I found, and I stand by it. I understand your concerns, however, about its reliance on several libraries, but it just comes with the territory in this case. However, as noted in http://issues.apache.org/jira/browse/NUTCH-30 by Kevin Burton, when feedparser 2.0 comes out, the reliance on the external libraries will be removed, so I think that by adopting the feedparser-based plugin right now, we have a clear upgrade path that leads us to the plugin's independence from external libraries, without changing (much of) the underlying source code.

That's my two cents.

Thanks!

Cheers,
Chris Mattmann

On 7/20/05 11:58 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

[EMAIL PROTECTED] wrote:

Hi,

Does anyone know why Chris Mattmann's RSS plugin ( http://issues.apache.org/jira/browse/NUTCH-30 ) wasn't put in the repository, and whether there are plans to revive it and include it?

That's probably my fault. I was almost ready to import it, but then during the final review I hesitated - I'm wary of pulling in so many dependencies. Then other things got in the way, and I sort of dropped it for the moment... If there's no way to parse RSS reliably other than using these dozens of libraries, so be it. Is this the case?
RE: benchmarking
Hi there Jay, Here are some numbers that a colleague and I presented in my graduate computer science seminar class on search engines in the Spring '05 semester at USC. The numbers measure the efficiency and scalability of several of the plugin content extractors for Nutch (PDF, Word, RSS, etc.). The tests were performed on a Red Hat Linux 7.3 box with 1.3 GB RAM, a 10 GB HD, and a Pentium III 500 MHz processor. The presentation is geared towards the parse-rss plugin that I wrote, although it should give you an idea of the other content extractors too. Hope it helps. Here's the link to the presentation: http://baron.pagemewhen.com:8080/~chris/RSS-Nutch-Eval.ppt Cheers, Chris -Original Message- From: webmaster [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 20, 2005 8:02 PM To: nutch-user@lucene.apache.org Subject: benchmarking Hey, could some of you post your speeds (sorting, indexing, pages a sec/documents a sec) and system specs? I'm trying to compile a database of which of Nutch's functions are better suited to run on what hardware. Also, if any of you have a Sun box, could you post its specs and some of the info for database sorting speeds and indexing speed, anything that uses full CPU. What's everyone's pages a sec top score? E-mail me @ [EMAIL PROTECTED] and I'll post a webpage with the results. Thanks, -Jay Pound
RE: benchmarking
Hi Jay, One quick note on the presentation link that I sent out earlier. The presentation states that Nutch does not have a syndication feed capability. At the time of the presentation (April 2005), Nutch was in the early stages of having this capability through the OpenSearch API. As I understand it, Nutch has this capability now. If so, I just wanted to qualify that bullet in the presentation. Take care, Chris