Re: Next Generation Nutch

2008-04-11 Thread Chris Mattmann
Hi Otis,

Thanks for your comments. My responses inline below:

 
 Hm, I have to say I'm not sure if I agree 100% with part 1.  I think it would
 be great to have such flexibility, but I wonder if trying to achieve it would
 be over-engineering.  Do people really need that?  I don't know, maybe!  If
 they do, then ignore my comment. :)

Well, in the past, at least in my experience, this is exactly what has paid
off for us: having the flexibility to architect a system that isn't tied to
the underlying technology. We once had a situation at JPL where a software
product was using CORBA for its underlying middleware implementation
framework. This (previously free) CORBA solution turned into a $30K/year
licensed solution, at the vendor's direction, with one week's notice.
Because we had architected and engineered our software system to be
independent of the underlying middleware substrate, we were able to switch
over to a free, Java-RMI-based solution in the space of a weekend.

Of course, this is typically bound to certain classes of underlying
substrates and middleware solutions (e.g., it would be difficult to switch
out certain middlewares with vastly different architectural styles, say, if
we were trying to switch from CORBA to a P2P based solution like JXTA), but
all I'm saying is that it would be great if we didn't have to dictate to a
potential Nutch 2.0 user that to use our scalable, open source search engine
solution, you have to change from a JMS house to a Hadoop house. It would be
nice to say that we've architected Nutch 2.0 to be independent of the
underlying middleware provider. Of course, we can provide a default
implementation based on the existing Hadoop substrate, but we should provide
interfaces, data components, and architectural guidelines to be able to
change to, say, a Nutch solution over XML-RPC, or Web Services, or JMS,
without breaking the core architecture. Right now, I'm convinced that can't
be done, or in other words, it's too hard to tease the Hadoop notions out of
Nutch as it exists today.

 
 I'm curious about 2. - could you please explain a little what you mean by too
 tied to the underlying
 orchestration process and infrastructure.?

What I mean by this is that the Fetcher/Fetcher2 dictates the orchestration
process for crawling: there is no separate, independent Nutch crawler.
Fetcher2 itself is a MapRunnable job (a term from the Hadoop
vocabulary). In my mind, the crawler process needs to be a separate
subsystem in Nutch, independent of the underlying middleware substrate (kind
of like I'm suggesting above). As an example: how would we take the existing
Nutch Fetcher2, and run it over JMS? Or XML-RPC? Or RMI?
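
For reference, here is a rough, from-memory sketch of the Hadoop contract
that Fetcher2 is written against (the exact generics/signatures vary by
Hadoop version, so treat this as illustrative only); it shows how the crawl
loop itself is expressed in Hadoop's own terms:

  // Sketch of org.apache.hadoop.mapred.MapRunnable, from memory; the
  // exact signature differs across Hadoop versions. Fetcher2 implements
  // this directly, so its entire run loop is a Hadoop map task.
  public interface MapRunnable extends JobConfigurable {
    void run(RecordReader input, OutputCollector output, Reporter reporter)
        throws IOException;
  }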

So, I guess that's all I'm saying -- the Nutch 2.0 architecture should be
clearly insulated from the underlying middleware technology. That's my main
concern moving forward.

Hope that helps to explain my point of view. :) If not, let me know and I
would be happy to chat more about it. Thanks!

Cheers,
 Chris


 
 Thanks,
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 - Original Message 
 From: Chris Mattmann [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Friday, April 11, 2008 9:10:30 PM
 Subject: Re: Next Generation Nutch
 
 Hi Dennis,
 
 Thanks for putting this together. I think that it's also important to add to
 this list the ability to cleanly separate out the following major
 components:
 
 1. The underlying distributed computing infrastructure (e.g., why does it
 have to be Hadoop? Shouldn't Nutch be able to run on top of JMS, or XML-RPC,
 or even grid computing technologies and web services? Hadoop can
 certainly be _the_ core implementation of the underlying substrate, but the
 ability to change this out should be a lot easier than it currently is. Read
 on below to see what I mean.)
 
 2. The crawler. Right now I think it's much too tied to the underlying
 orchestration process and infrastructure.
 
 3. The data structures. You do mention this below, but I would add to it
 that the data structures for Nutch should be simple POJOs and not have any
 tie to the underlying infrastructure (e.g., no need for Writable methods,
 etc.)
 
 I think that with these types of guiding principles above, along with what
 you mention below, there is the potential here to generate a really
 flexible, reusable architecture, so that when folks come along and say,
 "Oh, I've written Crawler XXX; how do I integrate it into Nutch?", we don't
 have to come back and say that the entire system has to be changed, or even
 worse, that it cannot be done at all.
 
 My 2 cents,
  Chris
  
 
 
 On 4/11/08 2:59 PM, Dennis Kubes [EMAIL PROTECTED] wrote:
 
 I have been thinking about a next generation Nutch for a while now, had
 some talks with some of the other committers, and have gotten around to
 putting some thoughts / requirements down on paper.  I wanted to run
 these by the community and get feedback.  This message

Re: Slow Crawl Speed and Tika Error "Media type alias already exists: text/xml"

2008-04-04 Thread Chris Mattmann
Hi Bradford,

 I'm running Nutch 0.9 and Hadoop on 5 new, fast servers connected to a
 multiple T-3 line. Although it works fine, the fetch portion of the
 crawls seems to be awfully slow. The status message at one point is
 157 pages, 1 errors, 1.7 pages/s, 487 kb/s. Less than one page a
 second seems to be awfully slow, given the environment I'm in. Is it a
 configuration issue? I'm using 200 threads per fetcher. I've also
 tried only 10 threads :)

There are other parameters that control the speed of the fetch. What is your
value for speculative execution? I remember seeing something on the list
saying that this parameter should be turned off to optimize fetch speed.
Give that a try, and let me know how it works out.
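
If memory serves, the relevant property on the Hadoop version bundled with
Nutch 0.9 is mapred.speculative.execution (please double-check the name
against your hadoop-default.xml); something like the following in your
conf/hadoop-site.xml should turn it off:

  <property>
    <name>mapred.speculative.execution</name>
    <value>false</value>
  </property>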

 I'm also seeing my hadoop.logs rapidly filled with the error message
 mentioned in [NUTCH-618], which states:
 
 2008-03-06 08:07:20,659 WARN org.apache.tika.mime.MimeTypesReader:
 Invalid media type alias: text/xml
 org.apache.tika.mime.MimeTypeException: Media type alias already
 exists: text/xml
 
 Is this impacting the performance? I've tried removing
 conf/tika-mimetypes.xml on all my machines, but that doesn't seem to
 resolve the error message.

Though definitely annoying, I am fairly sure it's not directly affecting your
performance, since the message is a simple WARNING that a detected media type
has been added multiple times to the mime types registry. I certainly
need to address this issue though, so thanks for giving me some motivation.

Let me know what the results of the speculative execution adjustment are.
Also, it may help to describe (here on the list) any other configuration
adjustments you have made (or will make).

HTH,
 Chris

 
 Much thanks in advance :)
 
 Cheers,
 Bradford

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Tika Error ?

2008-02-14 Thread Chris Mattmann
Hi Emmanuel,

Could you please post your /data/sengine/search/conf/tika-mimetypes.xml
file?

Thanks, 
 Chris



On 2/14/08 6:07 AM, Emmanuel [EMAIL PROTECTED] wrote:

 Hi Guys,
 
 I've updated my nutch version to use the latest trunk with the new TIKA jar.
 
 I ran a crawl and I've got a lot of errors like this:
 2008-02-14 22:02:51,494 INFO  conf.Configuration - found resource
 tika-mimetypes.xml at file:/data/sengine/search/conf/tika-mimetypes.xml
 2008-02-14 22:02:51,499 WARN  mime.MimeTypesReader - Invalid media type
 alias: text/xml
 org.apache.tika.mime.MimeTypeException: Media type alias already exists:
 text/xml
 at org.apache.tika.mime.MimeTypes.addAlias(MimeTypes.java:312)
 at org.apache.tika.mime.MimeType.addAlias(MimeType.java:238)
 at org.apache.tika.mime.MimeTypesReader.readMimeType(
 MimeTypesReader.java:168)
 at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java
 :138)
 at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java
 :121)
 at org.apache.tika.mime.MimeTypesFactory.create(
 MimeTypesFactory.java:56)
 at org.apache.nutch.util.MimeUtil.init(MimeUtil.java:58)
 at org.apache.nutch.protocol.Content.init(Content.java:85)
 at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(
 HttpBase.java:226)
 at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java
 :523)
 2008-02-14 22:02:51,500 WARN  mime.MimeTypesReader - Invalid media type
 alias: application/x-dosexec;exe
 org.apache.tika.mime.MimeTypeException: Invalid media type alias:
 application/x-dosexec;exe
 at org.apache.tika.mime.MimeType.addAlias(MimeType.java:242)
 at org.apache.tika.mime.MimeTypesReader.readMimeType(
 MimeTypesReader.java:168)
 at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java
 :138)
 at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java
 :121)
 at org.apache.tika.mime.MimeTypesFactory.create(
 MimeTypesFactory.java:56)
 at org.apache.nutch.util.MimeUtil.init(MimeUtil.java:58)
 at org.apache.nutch.protocol.Content.init(Content.java:85)
 at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(
 HttpBase.java:226)
 at org.apache.nutch.fetcher.Fetcher2$FetcherThread.run(Fetcher2.java
 :523)
 
 Is that normal?
 Am I missing something?

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException in cached.jsp

2008-02-04 Thread Chris Mattmann
Hi Mubey,

I think that this has been identified as a potential bug. Please file a JIRA
issue:

http://issues.apache.org/jira/browse/NUTCH

And I (or any of the other developers) would be happy to investigate it for
you. I saw some chatter on the mailing lists the other day regarding this
and one of the other developers suggested that the tika jar is probably not
being copied over into the Nutch WAR file.

I'll check this out, but please, in the meantime, file the bug report so that
we have a record of it moving forward.

Thanks!

Cheers,
 Chris



On 2/4/08 11:10 AM, Mubey N. [EMAIL PROTECTED] wrote:

 I am using the latest trunk. Whenever I search something in it and
 click on the cached link, I get this error from cached.jsp:
 
 java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException
 java.lang.Class.forName0(Native Method)
 java.lang.Class.forName(Class.java:247)
 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:524)
 org.apache.hadoop.io.WritableName.getClass(WritableName.java:72)
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1405)
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1360)
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1349)
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1344)
 org.apache.hadoop.io.MapFile$Reader.init(MapFile.java:254)
 org.apache.hadoop.io.MapFile$Reader.init(MapFile.java:242)
 org.apache.hadoop.mapred.MapFileOutputFormat.getReaders(MapFileOutputFormat.ja
 va:91)
 org.apache.nutch.searcher.FetchedSegments$Segment.getReaders(FetchedSegments.j
 ava:90)
 org.apache.nutch.searcher.FetchedSegments$Segment.getContent(FetchedSegments.j
 ava:68)
 org.apache.nutch.searcher.FetchedSegments.getContent(FetchedSegments.java:139)
 org.apache.nutch.searcher.NutchBean.getContent(NutchBean.java:346)
 org.apache.jsp.cached_jsp._jspService(cached_jsp.java:112)
 org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393
)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
 
 Is this a known bug in the Nutch 1.0 development version, or is it a
 mistake on my end?

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: java.lang.NoClassDefFoundError Nutch 0.9

2007-11-08 Thread Chris Mattmann
Hi Karthik,

 The default ant target for Nutch is 'job'.

 You can do one of the following:

 type 'ant clean' first, to remove your working class information
 type 'ant' to call the default target ('job'), or explicitly call 'ant job'

 That should fix your issue.
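
 In other words, from the top of your Nutch checkout:

   ant clean
   ant job     (or simply 'ant', since 'job' is the default target)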

Thanks!

Cheers,
  Chris



On 11/8/07 12:12 PM, karthik085 [EMAIL PROTECTED] wrote:

 
 Hi, 
 
 I got nutch from svn tags - release0.9 - but can't get rid of this problem.
 I did
 ant compile
 ant jar
 ant war
 All of them build successfully with different versions of ant - 1.6.5 and
 1.7.0
 
 When running nutch crawl - I get
 Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/nutch/crawl/Crawl
 
 I even tried some solutions explained in the forum - changing ant versions,
 adding a classpath (it doesn't matter - the nutch script overrides it) - but
 none of them worked.
 
 How do I get rid of this problem?
 
 Thanks,
 Karthik

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann
Hi Pike,

 Parse-rss indexes the whole feed, whereas the feed plugin takes advantage
of NUTCH-443, which allows Parsers to return multiple Parse objects, so that
each item in the feed is indexed as its own record.

HTH,
  Chris



On 10/15/07 7:25 AM, Pike [EMAIL PROTECTED] wrote:

 Hi
 
 I have this with all results: what is indexed
 seems to be 1 record per feed, containing a
 parsed version of the content including all its items,
 sometimes with bits of xml and html markup in it.
 
 I was assuming this is the intended behaviour?
 
 It may well be the intended behaviour, but it's not the behaviour I
 want.
 
 Indeed, it is not what I expected either. Chris,
 can you confirm this is the idea? Did you ever
 consider indexing separate items?
 
 curious,
 *pike

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Indexing Feeds & Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann
Hi Brian,

 Sorry for taking so long to reply. Here ya go:

 Do you have any URLs for feeds that are reliably parsed and indexed by
 the feed parser? 

 I haven't tested/used this plugin in quite a while. There was someone on
the nutch-user list before, nutch.newbie, who was doing quite a bit of feed
parsing. Nutch.newbie, if you're still around, could you send Brian a list
of feeds that you were testing on?

 Does it actually index atom at present?  There's
 something in the code that looks for application/rss+xml as the mime
 type.

 AFAIK, the plugin does in fact index Atom. The plugin itself is built on
top of the underlying ROME toolkit, if I remember correctly.

HTH,
  Chris

 
 Brian Ulicny
 
 
 On Thu, 11 Oct 2007 15:23:04 -0700, Chris Mattmann
 [EMAIL PROTECTED] said:
 Hi Rick,
 
  Glad to hear that you're interested in using Nutch!
 
  There are currently 2 plugins that parse feeds and get them indexed:
 
  parse-rss - older, but gets the job done
  feed - newer, and takes advantage of the ability to parse/index feeds in
 one step, rather than in many
 
  There are other idiosyncrasies about each of these plugins so feel free
  to
 ask specific questions to the main developers of each of them. The
 parse-rss
 plugin was primarily developed by me, and the feed plugin was primarily
 developed by Doğacan Güney, another Nutch committer like myself.
 
  As for the error that you're getting below, it's due to the fact that
  Nutch
 can't reliably differentiate between the mime type of different XML
 content.
 So, to Nutch, even though it's a .rss file, its mime type is
 application/xml. Because the mime type, though a true mime type of the
 file,
 is not the preferred mime type (application/rss+xml, or the like), Nutch
 has
 trouble finding the appropriate parser to parse the content. For
 instance,
 according to parse-plugins.xml (a file in your $NUTCH_HOME/conf
 directory),
 the parse-rss plugin and the feed plugin are registered to parse
 application/rss+xml, but not application/xml.
 
 The current trunk version of Nutch recently had a fix committed for this
 very issue (http://issues.apache.org/jira/browse/NUTCH-562).
 
  If you have any more specific questions, I'd be happy to answer them.
 
 Thanks!
 
 Cheers,
   Chris
 
 
 
 On 10/11/07 9:14 AM, Rick Moynihan [EMAIL PROTECTED] wrote:
 
 Hi all,
 
 I've recently downloaded Nutch v0.9, to experiment in searching blog
 posts and RSS/Atom feeds.  So far I have managed to get it to
 successfully crawl, index and search some websites.
 
 I am now starting my investigations to use Nutch to crawl/index/search
 news/blog feeds.  I have included the parse-rss plugin, which appears
 to ship in the plugins/ directory, by pasting the following into my
 nutch-site.xml file:
 
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>
 
 However some feeds appear to return the following error (apparently
 because they are being returned with a mime-type of application/xml):
 
 Error parsing: http://example.com/.rss: failed(2,200):
 org.apache.nutch.parse.ParseException: parser not found for
 contentType=application/xml url=http://example.com/.rss
 
 It also appears when searching that the returned results point to the
 matching feed rather than the matching item.  Is there a way around
 this?  Or am I best parsing out the item urls (e.g. via a shell script),
 somehow adding them to the crawl list and indexing the HTML as normal?
 
 Also, if anyone is using Nutch to index blogs/feeds, then I'd be
 interested in how you have it configured.
 
 Thanks again,
 
 __
 Chris Mattmann, Ph.D.
 [EMAIL PROTECTED]
 Cognizant Development Engineer
 Early Detection Research Network Project
 
 _
 Jet Propulsion Laboratory    Pasadena, CA
 Office: 171-266B Mailstop:  171-246
 ___
 
 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Indexing Feeds & Blog Posts with Nutch

2007-10-11 Thread Chris Mattmann
Hi Rick,

 Glad to hear that you're interested in using Nutch!

 There are currently 2 plugins that parse feeds and get them indexed:

 parse-rss - older, but gets the job done
 feed - newer, and takes advantage of the ability to parse/index feeds in
one step, rather than in many

 There are other idiosyncrasies about each of these plugins so feel free to
ask specific questions to the main developers of each of them. The parse-rss
plugin was primarily developed by me, and the feed plugin was primarily
developed by Doğacan Güney, another Nutch committer like myself.

 As for the error that you're getting below, it's due to the fact that Nutch
can't reliably differentiate between the mime type of different XML content.
So, to Nutch, even though it's a .rss file, its mime type is
application/xml. Because the mime type, though a true mime type of the file,
is not the preferred mime type (application/rss+xml, or the like), Nutch has
trouble finding the appropriate parser to parse the content. For instance,
according to parse-plugins.xml (a file in your $NUTCH_HOME/conf directory),
the parse-rss plugin and the feed plugin are registered to parse
application/rss+xml, but not application/xml.

The current trunk version of Nutch recently had a fix committed for this
very issue (http://issues.apache.org/jira/browse/NUTCH-562).

 If you have any more specific questions, I'd be happy to answer them.

Thanks!

Cheers,
  Chris



On 10/11/07 9:14 AM, Rick Moynihan [EMAIL PROTECTED] wrote:

 Hi all,
 
 I've recently downloaded Nutch v0.9, to experiment in searching blog
 posts and RSS/Atom feeds.  So far I have managed to get it to
 successfully crawl, index and search some websites.
 
 I am now starting my investigations to use Nutch to crawl/index/search
 news/blog feeds.  I have included the parse-rss plugin, which appears
 to ship in the plugins/ directory, by pasting the following into my
 nutch-site.xml file:
 
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 </property>
 
 However some feeds appear to return the following error (apparently
 because they are being returned with a mime-type of application/xml):
 
 Error parsing: http://example.com/.rss: failed(2,200):
 org.apache.nutch.parse.ParseException: parser not found for
 contentType=application/xml url=http://example.com/.rss
 
 It also appears when searching that the returned results point to the
 matching feed rather than the matching item.  Is there a way around
 this?  Or am I best parsing out the item urls (e.g. via a shell script),
 somehow adding them to the crawl list and indexing the HTML as normal?
 
 Also, if anyone is using Nutch to index blogs/feeds, then I'd be
 interested in how you have it configured.
 
 Thanks again,

__
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Suggested fixes to http://wiki.apache.org/nutch/WritingPluginExample-0.9

2007-07-19 Thread Chris Mattmann
Hi Jasper,

 As I understand it, you can make these updates yourself. Sign up for a wiki
account, then log in with your username/password, and you can update the
page directly.

 Thanks!

Cheers,
 Chris



On 7/19/07 10:10 AM, Jasper Kamperman [EMAIL PROTECTED]
wrote:

 Hi,
 
 I spent several hours hunting down two small issues in the otherwise
 very well done example. To prevent others from running into this I'd
 like to share them here. I don't know how to get in touch with
 RicardoJMendez, who maintains the page -- if anyone does, please forward
 this to him.
 
 I did the exercise after checking out lucene/nutch/branches/
 branch-0.9 from SVN
 
 1.  The package declaration at the top of TestRecommendedParser.java
 is wrong, it reads:
 
 package org.apache.nutch;
 
 but it should be:
 
 package org.apache.nutch.parse.recommended;
 
 2.  Per ../build-plugin.xml the property for the location of the test
 data is not "test.input" but "test.data", so the line that initializes
 testDir should read:
 
 private static final File testDir = new File(System.getProperty("test.data"));
 
 Hope this helps,
 
 Jasper




Re: nutch-09 start problem

2007-04-12 Thread Chris Mattmann
Hi Ratnesh,

 I'm not sure that declaring Nutch 0.9 an unstable version is an entirely
appropriate label -- it's been through several stress tests by the
committers so far, and it seems to be performing well enough -- so much so
that we decided it was worthwhile to make a release of it :). I believe that
the user's problem below had to do with not running Nutch using JDK 5 (now a
requirement).

Cheers,
  Chris



On 4/12/07 6:13 AM, Ratnesh,V2Solutions India
[EMAIL PROTECTED] wrote:

 
 I think that nutch-0.9 is an unstable version, and that it's not for
 development purposes, but I am not sure.
 We have used nutch-0.8.1 and it's working fine without errors.
 I feel that you will get more support from the list if you use nutch-0.8,
 since most of us have worked with that version.
 
 But it's encouraging to work with the new version, so I would appreciate it
 if you post to the list any solution you come up with, so that others
 can benefit from it.
 
 Thnx
 Ratnesh,V2Solutions India
 
 
 Dima Mazmanov wrote:
 
 Hi,
 I tried to set up nutch 0.9, but when I execute my script I get the
 following error.
 
 
 Exception in thread "main" java.lang.UnsupportedClassVersionError:
 org/apache/hadoop/util/PlatformName (Unsupported major.minor version 49.0)
 at java.lang.ClassLoader.defineClass0(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:537)
 at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
 Exception in thread "main" java.lang.UnsupportedClassVersionError:
 org/apache/nutch/indexer/IndexMerger (Unsupported major.minor version
 49.0)
 at java.lang.ClassLoader.defineClass0(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:537)
 at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:123)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:251)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:55)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:194)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:187)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:289)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:274)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:235)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:302)
 
 How can I solve it?
 Thanks
 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B    Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Nutch 0.9 officially released!

2007-04-05 Thread Chris Mattmann
Hi Folks,

 After some hard work from all folks involved, we've managed to push out
Apache Nutch, release 0.9. This is the second release of Nutch based
entirely on the underlying Hadoop platform. This release includes several
critical bug fixes, as well as key speedups described in more detail at Sami
Siren's blog:

 http://blog.foofactory.fi/2007/03/twice-speed-half-size.html

 See the list of changes made in this version:

 http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt

The release is available here:

 http://www.apache.org/dyn/closer.cgi/lucene/nutch/

 Special thanks to (in no particular order): Andrzej Bialecki, Dennis Kubes,
Sami Siren, and the rest of the Nutch development team for providing lots of
help along the way, and for allowing me to be the release manager! Enjoy the
new release!

Cheers,
  Chris




Re: Problem crawling/fetching using https

2007-01-25 Thread Chris Mattmann
Folks,

 I've gone ahead and added the following comment in nutch-default.xml:

 "In order to use HTTPS please enable protocol-httpclient, but be aware of
possible intermittent problems with the underlying commons-httpclient
library."
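
 In nutch-default.xml the comment appears as something like the following (I'm
paraphrasing the exact markup from memory):

  <!-- In order to use HTTPS please enable protocol-httpclient, but be aware
       of possible intermittent problems with the underlying
       commons-httpclient library. -->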

 Hopefully this will help folks in the future with this.

Thanks!

Cheers,
  Chris



On 1/24/07 3:29 PM, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi Guys,
 
  Yep, I couldn't remember exactly what the issues were. Thanks for digging
 that up, Andrzej. So, yeah, anyways it may make sense to update
  nutch-site.xml with the comment below, with "performance problems" replaced
  with "intermittent problems with the underlying commons-httpclient library".
 
  If you guys agree, I'll add the comment to nutch-site...
 
 Cheers,
   Chris
 
 
 
 On 1/24/07 3:10 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 
 Michael Wechner wrote:
 ok. So what about adding a comment to nutch-site.xml, e.g.
 
  <!-- NOTE: In order to use https please add protocol-httpclient, but
  be aware of possible performance problems! -->
 
 They were not performance problems. There were some issues related to
 using multiple threads, which would sometimes cause the httpclient
  library to fail. There was also a logging message produced in the
 internals of httpclient that was difficult to turn off - but now that we
 are using log4j this should be straightforward. There was a bug in
 chunked encoding handling that would cause hangs.
 
 There were also other intermittent problems with this library, so after
 much deliberation we decided to leave the simpler plugin as the default ...
 
 These issues may have been solved in a newer version of httpclient library.
 
 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group
 
 _
  Jet Propulsion Laboratory    Pasadena, CA
  Office: 171-266B    Mailstop:  171-246
 ___
 
 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 




Re: Problem crawling/fetching using https

2007-01-24 Thread Chris Mattmann
Hi Michi,

 I am pretty sure that in order to support https, you need to enable the
protocol-httpclient plugin, which is based on commons-httpclient. There
isn't a protocol-https plugin as far as I know. Try that and see if that
fixes your issue.
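
As a sketch (adapt the value to whatever other plugins you already have
enabled), the plugin.includes property in your nutch-site.xml would then look
something like this, with protocol-httpclient in place of
protocol-http|protocol-https:

  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-file|protocol-httpclient</value>
  </property>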

Thanks!

Cheers,
 Chris



On 1/24/07 2:29 PM, Michael Wechner [EMAIL PROTECTED] wrote:

 Hi
 
 I try to fetch data from a website using https, and I have added
 
 <value>nutch-extensionpoints|protocol-file|protocol-http|protocol-https</value>
 
 to nutch-site.xml
 
 but still receive the following error
 
 fetch of https://www.foo.bar/ failed with:
 org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
 
 Is there anything else one has to do?
 
 I am using Nutch 0.8.x
 
 Thanks
 
 Michi




Re: Problem crawling/fetching using https

2007-01-24 Thread Chris Mattmann
Hi Michi,

 Btw, wouldn't it make sense to add protocol-httpclient as default,
 because I guess
 I am not the only one trying to fetch pages using https?

Indeed. The issue with this was in fact that some time ago, the powers that
be decided that it probably made sense to make protocol-httpclient the
default. However, due to some performance issues with the underlying
commons-httpclient Apache library (I think), it was decided to go with
protocol-http, which turned out to be much faster/more reliable, etc., at the
expense of not natively supporting HTTPS. I wonder what the user community
thinks about this now, though. What do other people think? Have the issues
with protocol-httpclient gone away, such that it makes sense to enable it
again? 


Cheers,
  Chris

 
 Thanks again
 
 Michi
 
 Thanks!
 
 Cheers,
 Chris
 
 
 
 On 1/24/07 2:29 PM, Michael Wechner [EMAIL PROTECTED] wrote:
 
  
 
 Hi
 
 I try to fetch data from a website using https, and I have added
 
 <value>nutch-extensionpoints|protocol-file|protocol-http|protocol-https</value>
 
 to nutch-site.xml
 
 but still receive the following error
 
 fetch of https://www.foo.bar/ failed with:
 org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https
 
 Is there anything else one has to do?
 
 I am using Nutch 0.8.x
 
 Thanks
 
 Michi

 
 
 
 
  
 
 




Re: Problem crawling/fetching using https

2007-01-24 Thread Chris Mattmann
Hi Guys,

 Yep, I couldn't remember exactly what the issues were. Thanks for digging
that up, Andrzej. So, yeah, anyways it may make sense to update
nutch-site.xml with the comment below, with "performance problems" replaced
with "intermittent problems with the underlying commons-httpclient library".

 If you guys agree, I'll add the comment to nutch-site...

Cheers,
  Chris



On 1/24/07 3:10 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Michael Wechner wrote:
 ok. So what about adding a comment to nutch-site.xml, e.g.
 
 <!-- NOTE: In order to use https please add protocol-httpclient, but
 be aware of possible performance problems! -->
 
 They were not performance problems. There were some issues related to
 using multiple threads, which would sometimes cause the httpclient
 library to fail. There was also a logging message produced in the
 internals of httpclient that was difficult to turn off - but now that we
 are using log4j this should be straightforward. There was a bug in
 chunked encoding handling that would cause hangs.
 
 There were also other intermittent problems with this library, so after
 much deliberation we decided to leave the simpler plugin as the default ...
 
 These issues may have been solved in a newer version of httpclient library.

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B    Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Indexing xml documents on local file system

2006-11-27 Thread Chris Mattmann
Hi Thorsten

On 11/27/06 4:00 AM, Thorsten Scherler
[EMAIL PROTECTED] wrote:

 
  Reading the wiki and the docs I get the impression I need to write my
  own implementation of an indexer/searcher plugin, which is able to
  filter/index crucial filter information such as <summary year="2006"
  number="209" date="27-10-2006" section="1">, <organisation
  name="Consejería de Economia y Hacienda"> and <disposition
  type="Resolución">.

 Yes, you may need to write your own parse, indexer and searcher plugins;
however, I am currently working on getting the parse-xml plugin into the
Nutch sources. The parse-xml plugin includes an indexing filter for the
fields that are extracted by the XML parser. The XML parser is configurable
for custom schemas and fields that need to be extracted.

 This plugin is currently available in JIRA, attached to this issue:

http://issues.apache.org/jira/browse/NUTCH-185

I am working hard to get this plugin ported to the latest trunk source, and
ready to be committed to the sources. I hope to attach a patch within the
next week that brings this plugin up to date, and gets the code ready for
prime-time (formatting, public javadocs, etc.). Once I attach the patch, you
may find that you only need to write your searcher plugin. Then again, in
the interest of time, you may go the route of writing your own set of
plugins. In that case, you can find examples of how to write the
parse/index/query plugins, by looking at the Nutch source, in subversion,
available here:

Parse plugins: 
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/parse-*
Index plugins: 
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/index-*
Query plugins: 
http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/plugin/query-*


 
  Still being a newbie to nutch, I would appreciate the opinion of
  experienced devs on whether nutch is the right choice and, if so, how I
  should start.

I think that you could do this with Nutch, and if you do, for free, you get:

Crawling
Parsing/Indexing
Search Webapp, and an RSS-based OpenSearch servlet

You could also do this with Lucene, but I think you may find that you end up
maintaining more code, and having to rewrite existing functionality
available within Nutch.

Just my 2 cents...

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B    Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: java.lang.NullPointerException

2006-10-11 Thread Chris Mattmann
Hi there,

 You need to set your http.agent.name property within
$NUTCH_HOME/conf/nutch-default.xml.

HTH,
  Chris



On 10/11/06 3:57 AM, Guruprasad Iyer [EMAIL PROTECTED] wrote:

 Hello,
 
 I have Nutch 0.8.1 installed on linux (FC3) along with java 1.5.0_07. When I
 run the crawl command I get the above error.
 
 Here is a snapshot of the log file-
 
 2006-10-11 15:39:16,234 FATAL api.RobotRulesParser - Agent we advertise
 (null) not listed first in 'http.robots.agents' property!
 
 and it says   fetcher.Fetcher - fetch of   the site  failed with:
 java.lang.NullPointerException
 
 Can anybody help? Thanks.




Re: java.lang.NullPointerException

2006-10-11 Thread Chris Mattmann
Hi Guruprasad,

  The property should be set to an agent name that identifies your
organization when your Nutch crawling agent visits websites during its
crawl. You could set it to foo/bar and it would work fine, but
you probably want to think of an appropriate identifying name and then set
it to that.
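
Concretely, that amounts to something like the following (the value here is
just a placeholder; substitute your own identifying name):

  <property>
    <name>http.agent.name</name>
    <value>foo/bar</value>
  </property>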

Cheers,
  Chris



On 10/11/06 8:36 AM, Guruprasad Iyer [EMAIL PROTECTED] wrote:

 Hi Chris,
 
 Thanks for the reply. But, what value should I set it to? Can you help me on
 this?
 
 Thanks once again.
 
 Cheers,
 Guruprasad
 
 On 10/11/06, Chris Mattmann [EMAIL PROTECTED] wrote:
 
 Hi there,
 
 You need to set your http.agent.name property within
 $NUTCH_HOME/conf/nutch-default.xml.
 
 HTH,
   Chris
 
 
 
 On 10/11/06 3:57 AM, Guruprasad Iyer [EMAIL PROTECTED] wrote:
 
 Hello,
 
 I have Nutch 0.8.1 installed on linux (FC3) along with java 1.5.0_07.
 When I
 run the crawl command I get the above error.
 
 Here is a snapshot of the log file-
 
 2006-10-11 15:39:16,234 FATAL api.RobotRulesParser - Agent we advertise
 (null) not listed first in 'http.robots.agents' property!
 
 and it says   fetcher.Fetcher - fetch of   the site  failed with:
 java.lang.NullPointerException
 
 Can anybody help? Thanks.
 
 
 
 




Re: rss integration

2006-09-11 Thread Chris Mattmann
Hi Ernesto,

  You need to make sure that the links inside of the RSS files that are
getting indexed are not filtered out by your url filter. For instance, say
you had an RSS file that had the following links:

http://foo.com/news/
http://foo.bar.com/sports/
http://bar.foo.com/breaking/news/highlights

Well, you would need to add support in your url filter for each of the
different host names and paths that you would be indexing. So, in your
example below, I'm pretty sure that your URL filter limits you to only
those 2 domains, rss.cnn.com and www.cnn.com. I think that if you changed
your filter, for example, to:

+^http://([a-z0-9]*\.)*cnn.com/

That might help. Ensure that the links present in the CNN RSS files fall
within the *.cnn.com domain; otherwise, update your url filter accordingly.

 More specific comments below:

On 9/10/06 11:23 PM, Ernesto De Santis [EMAIL PROTECTED]
wrote:

 Hi Chris
 
 Thanks for your response.
  But I can't get it to work.
 
  Every time, it indexes the whole channel as one Document.
 
 I did these steps (to index a cnn channel):
 
 1- write in my seed file, with just one seed:
 
 http://rss.cnn.com/rss/cnn_topstories.rss

Good, that's the right thing to do.

 
 2- include the parser:
 
 In the file nutch-default.xml, tag plugin.includes, I include the rss
 parser:
   
  <value>protocol-http|urlfilter-regex|parse-(rss|text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|index-url-category</value>
 

Perfect.

 3- Accept cnn hosts
 
 In the file crawl.urlfilter.txt I wrote:
 +^http://rss.cnn.com/
 +^http://www.cnn.com/

See my comments above here. I think that you need to change these.

 
  Then I run the crawler, but I always get an index with just one Document.
  I tried some more things, without success (like setting
  db.ignore.internal.links to false and changing the mime-type parser order;
  I read about a problem with that in a post of yours).
 
 Do you know what I'm forgetting?
 
  How can I be sure that parse-rss is parsing some content?
  Can I get some log output about that?

Yup, there should be some information in the nutch.log file. Do a grep for
"parse-rss" or "RSSParser" in the log file.

 
  About outlinks, I don't understand what I must do with them. Do I need to
  do something with outlinks after parse-rss runs?

Nope. Outlinks are links coming out of a page of content. So, if there are 5
links in a web page, or an RSS document, then there are 5 so-called
Outlinks in Nutch terminology. During the parsing phase, as content is
parsed individually, Nutch requires a parser to append any Outlinks found in
a particular piece of content and return them back to the Fetcher so that
they too can be crawled.


HTH,
  Chris

 
 Thanks a lot ... again.
 Ernesto.
 
  Chris Mattmann wrote:
 Hi Ernesto,
 
  The RSSParser in Nutch does in fact index the individual item links: they
 are added as Outlinks during each iteration in which the RSSParser is
 called. Both the channel text and the item text are indexed. Also, since
 each Item link is added as an Outlink to the list of returned Outlinks,
 Nutch is able to crawl many urls that can come out of a single RSS feed.
 
 HTH,
   Chris
 
 
 
 On 9/10/06 5:54 PM, Ernesto De Santis [EMAIL PROTECTED]
 wrote:
 
   
 Hi all
 
 I'm trying to integrate a rss and atom source to my nutch index.
  I see that nutch has a RSSParser, but it seems that it indexes the whole
  source as one record, right?
 
 I want to index each item separately.
  Has somebody done it? What's the best approach?
 
  I am thinking about doing an external process to add Documents to the
  nutch (lucene) index using an rss fetcher like Rome. The negative point
  about it is that it isn't integrated with nutch.
 
  I don't know the details of the nutch core well enough to hack it, and I
  don't know if it is possible to integrate this in nutch.
 
 Thanks a lot!
 Ernesto.
 
 
 
 
 
 
 
 
 
   
 
 
 
 
 




Re: intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop

2006-08-30 Thread Chris Mattmann
Hi there Tomi,


On 8/30/06 12:25 PM, Tomi NA [EMAIL PROTECTED] wrote:

 I'm attempting to crawl a single samba mounted share. During testing,
 I'm crawling like this:
 
 ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20
 
 I'm using luke 0.6 to query and analyze the index.
 
 PROBLEMS
 
 1.) search by file type doesn't work
 I expected that a search file type:pdf would have returned a list of
 files on the local filesystem, but it does not.

I believe that the keyword is type, so your query should be "type:pdf"
(without the quotes). I'm not positive about this either, but I believe you
may have to give the fully qualified mimeType, as in "application/pdf"; I'm
not entirely sure about that though, so you should experiment.

Additionally, in order for the mimeTypes to be indexed properly, you need to
have the index-more plugin enabled. Check your
$NUTCH_HOME/conf/nutch-site.xml, and look for the property plugin.includes
and make sure that the index-more plugin is enabled there.

 
 2.) invalid nutch file type detection
 I see the following in the hadoop.log:
 ---
 2006-08-30 15:12:07,766 WARN  parse.ParseUtil - Unable to successfully
 parse content file:/mnt/bobdocs/acta.zip of type application/zip
 2006-08-30 15:12:07,766 WARN  fetcher.Fetcher - Error parsing:
 file:/mnt/bobdocs/acta.zip: failed(2,202): Content truncated at
 1024000 bytes. Parser can't handle incomplete pdf file.
 ---
 acta.zip is a .zip file, not a .pdf. Don't have any idea why this happens.

This may result from the contentType returned by the web server for
acta.zip. Check the web server that the file is hosted on, and see what
contentType it returns for that file.

Additionally, you may want to check if magic is enabled for mimeTypes. This
allows the mimeType to be sensed through the use of hex codes compared with
the beginning of each file.
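
If I remember the property name correctly, magic detection is controlled by
something like the following (please verify against your nutch-default.xml):

  <property>
    <name>mime.type.magic</name>
    <value>true</value>
  </property>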

 
  3.) Why is the TextParser mapped to application/pdf, and what does that
  have to do with indexing a .txt file?
 -
 2006-08-30 15:12:02,593 INFO  fetcher.Fetcher - fetching
 file:/mnt/bobdocs/popis-vg-procisceni.txt
 2006-08-30 15:12:02,916 WARN  parse.ParserFactory -
 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
 contentType application/pdf via parse-plugins.xml, but its plugin.xml
 file does not claim to support contentType: application/pdf
 -

The TextParser *was* enabled as a last-resort means of extracting
*some* content from a PDF file, that is, if the parse-pdf plugin wasn't
enabled, or it failed for some reason. Since parse-text is the 2nd option
for parsing PDF files, there most likely was some sort of error in the
original PDF parser. The way that the ParserFactory works now is that it
iterates through a preference list of parsers (specified in
$NUTCH_HOME/conf/parse-plugins.xml), and tries to parse the underlying
content. The first successful parse is returned back to the Fetcher.
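
For illustration, the preference list for PDF in parse-plugins.xml looks
roughly like this (order matters: parse-pdf is tried first, with parse-text
as the fallback):

  <mimeType name="application/pdf">
    <plugin id="parse-pdf" />
    <plugin id="parse-text" />
  </mimeType>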

 
 4.) Some .doc files can't be indexed, although I can open them via
 openoffice 2 with no problems
 -
 2006-08-30 15:12:02,991 WARN  parse.ParseUtil - Unable to successfully
 parse content file:/mnt/bobdocs/cards2005.doc of type
 application/msword
 2006-08-30 15:12:02,991 WARN  fetcher.Fetcher - Error parsing:
 file:/mnt/bobdocs/cards2005.doc: failed(2,0): Can't be handled as
  micrsosoft document. java.lang.StringIndexOutOfBoundsException: String
  index out of range: -1024
 -

What version of MS Word were you trying to index? I believe that the POI
library used by the word parser can only handle certain versions of MS Word
documents, although I'm not positive about this.


As for 5 and 6, I'm not entirely sure about those problems. I wish you luck
in solving both of them though, and hope what I said above helps you out.

Thanks!

Cheers,
  Chris

 
 5.) MoreIndexingFilter doesn't seem to work
 The relevant part of the hadoop.log file:
 -
 2006-08-30 15:13:40,235 WARN  more.MoreIndexingFilter -
 file:/mnt/bobdocs/EU2007-2013.pdforg.apache.nutch.util.mime.MimeTypeException:
 The type can not be null or empty
 -
 This happens with other file types, as well:
 -
 2006-08-30 15:13:54,697 WARN  more.MoreIndexingFilter -
 file:/mnt/bobdocs/popis-vg-procisceni.txtorg.apache.nutch.util.mime.MimeTypeEx
 ception:
 The type can not be null or empty
 -
 
  6.) At the moment, I'm crawling the same directory (/mnt/bobdocs); the
  crawl process seems to be stuck in an infinite loop and I have no way
 of knowing what's going on as the .log isn't flushed until the process
 finishes.
 
 
 ENVIRONMENT
 
 logs/hadoop.log inspection reveals things like this:
 
 My (relevant) crawl settings are:
 
 -
    <name>db.max.anchor.length</name>
    <value>511</value>
  
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  
    <name>fetcher.server.delay</name>
    <value>0</value>
  
    <name>fetcher.threads.fetch</name>
    <value>5</value>
  
    <name>fetcher.verbose</name>
    <value>true</value>
  
    <name>file.content.limit</name>

Re: RSS search by nutch

2006-08-28 Thread Chris Mattmann
Hi there Dima,

  I'm not exactly sure what you mean by "real time", but there is an RSS
Parsing plugin in Nutch that can parse RSS feeds that Nutch encounters
during its crawl. You can enable parse-rss by opening up
$NUTCH_HOME/conf/nutch-site.xml, and searching for the property
plugin.includes. For the value of plugin.includes, ensure that there is
an entry for parse-rss somewhere in that property value.

HTH,
  Chris


On 8/28/06 10:44 AM, Dima Gritsenko [EMAIL PROTECTED] wrote:

 Hi, 
 
 Does nutch have a class for searching incoming RSS feeds in real time?
 Thank you. 
 Dima. 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B    Mailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: RSS search by nutch

2006-08-28 Thread Chris Mattmann
Hi Jeremy,

On 8/28/06 10:18 AM, HUYLEBROECK Jeremy RD-ILAB-SSF
[EMAIL PROTECTED] wrote:

 
 The Nutch Feed/RSS plugin (parse-rss) only allows you to search the
 entire channel/feed text, not items individually.

Actually, this isn't entirely the case. parse-rss actually indexes the item
text (see line 148 in RSSParser.java) as well. Additionally, parse-rss adds
the individual item links to the Outlinks (see lines 161 and 163 in
RSSParser.java) , and they get crawled as well, in addition to the channel
text (see line 123 in RSSParser.java) and channel outlink (see lines 130 and
132 in RSSParser.java).

 You'll have to develop your own if that's what you are trying to do.
 I also found that the feedparser library used by parse-rss doesn't
 properly read all formats, and I myself moved to the ROME library for now.

I haven't really noticed any formats that aren't handled by
commons-feedparser. What formats have you noticed that it doesn't handle?



Cheers,
  Chris


 
 
 -Original Message-
 From: Dima Gritsenko [mailto:[EMAIL PROTECTED]
 Sent: Monday, August 28, 2006 10:44 AM
 To: nutch-user@lucene.apache.org
 Subject: RSS search by nutch
 
 Hi, 
 
 Does nutch have a class for searching incoming RSS feeds in real time?
 Thank you. 
 Dima. 




Re: Speeding up compilation without compiling plugins

2006-08-25 Thread Chris Mattmann
Hi Michael,

 I believe that there is an ant task called compile-core. If you just
type:

# ant compile-core

Rather than:

# ant

You should be good to go.

HTH,
  Chris



On 8/25/06 5:48 AM, Michael Wechner [EMAIL PROTECTED] wrote:

 Hi
 
  How can I disable the compiling of all plugins so that I can speed up the
  overall compile when I've just made changes within the core?
 
 Thanks
 
 Michi




Re: [Nutch-0.8] Missing WAR file

2006-08-12 Thread Chris Mattmann
Hi Guys,

On 8/12/06 9:27 AM, Hou Keat Lee [EMAIL PROTECTED] wrote:

 Hi,
 
  Maybe I'm missing something here.
 
  If the packaged WAR file is supposed to be used, how does nutch link back
  to my crawling results and indexes?

Another option for this would be to use the generated nutch.xml file that
appears in the build directory (e.g., $NUTCH_HOME/build) when you run the
ant war command. Since NUTCH-210, this context.xml file is generated and
allows you to adapt the runtime parameters (e.g., index dir) without
touching the nutch.war file. Instead of placing nutch.war in
/path/to/tomcat/webapps/, place nutch.xml in there (for Tomcat 4.x), and in
/path/to/tomcat/conf/Catalina/localhost/ (for Tomcat 5.x).
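
For illustration only (check the generated build/nutch.xml for its exact
contents), a Tomcat 5.x context file is generally of the form:

  <Context path="/nutch" docBase="/path/to/nutch-0.8/nutch-0.8.war">
    <Parameter name="searcher.dir" value="/path/to/crawl" override="false"/>
  </Context>

where searcher.dir is shown here purely as an example of the kind of runtime
parameter such a file can carry without touching the WAR itself.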

Cheers,
  Chris


 
  Also, after deploying the WAR file, I've encountered a permission error
  when trying to do a search. What are the permissions required for the search?
 
 Thanks.
 
 See below the exception thrown:
 ==
 
 *exception*
 
 org.apache.jasper.JasperException: access denied
 (java.util.PropertyPermission user.dir read)
 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:372
)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j
 ava:25)
 java.lang.reflect.Method.invoke(Method.java:585)
 org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
 java.security.AccessController.doPrivileged(Native Method)
 javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
 org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)
 org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)
 
 *root cause*
 
 java.security.AccessControlException: access denied
 (java.util.PropertyPermission user.dir read)
 java.security.AccessControlContext.checkPermission(AccessControlContext.java:2
 64)
 java.security.AccessController.checkPermission(AccessController.java:427)
 java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
 java.lang.SecurityManager.checkPropertyAccess(SecurityManager.java:1285)
 java.lang.System.getProperty(System.java:627)
 org.apache.hadoop.fs.LocalFileSystem.init(LocalFileSystem.java:31)
 org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:99)
 org.apache.hadoop.fs.FileSystem.get(FileSystem.java:86)
 org.apache.nutch.searcher.NutchBean.init(NutchBean.java:94)
 org.apache.nutch.searcher.NutchBean.init(NutchBean.java:83)
 org.apache.nutch.searcher.NutchBean.get(NutchBean.java:70)
 org.apache.jsp.search_jsp._jspService(search_jsp.java:104)
 org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324
)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.j
 ava:25)
 java.lang.reflect.Method.invoke(Method.java:585)
 org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:243)
 java.security.AccessController.doPrivileged(Native Method)
 javax.security.auth.Subject.doAsPrivileged(Subject.java:517)
 org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:272)
 org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:161)
 
 
 ==
 
 On 8/13/06, Sami Siren [EMAIL PROTECTED] wrote:
 
 Hou Keat Lee wrote:
 Hi all,
 
 I'm trying out the nutch on my Ubuntu box.
 
  I've managed to follow the tutorial for Nutch v0.8 and to follow the
  steps to perform crawling.
 
  However, when the crawl completed I didn't see the expected WAR. Is there
  something wrong with the crawling and thus the WAR file is not created
  automatically? I've taken a look at the log and didn't see anything wrong.
 
 The .war file is not generated during crawling but is distributed as
 part of the released nutch-0.8.tar.gz package.
 
 Location of file is nutch-0.8/nutch-0.8.war
 
 --
   Sami Siren
 
 




Re: Feedparser 0.6 fork source code

2006-08-08 Thread Chris Mattmann
Hi Jeremy,

 I've uploaded the fork-src to my USC website. Here is the URL:

http://www-scf.usc.edu/~mattmann/feedparser-src-fork.tar.gz

I'll leave the file up there for a few days at least, so feel free to grab
it at your leisure.

Thanks,
  Chris


On 8/8/06 4:55 PM, HUYLEBROECK Jeremy RD-ILAB-SSF
[EMAIL PROTECTED] wrote:

  
 Chris (or anyone having it),
 could you share again the source code of the commons-feedparser fork used
 in nutch?
 The zip file you shared a year ago is not on your site anymore.
 
 Thanks!
 Jeremy.




Re: Starting Nutch in init.d?

2006-07-28 Thread Chris Mattmann
Guys,

 Sorry, I misspoke: the issue was actually NUTCH-210, not NUTCH-245.

You can view the issue at: http://issues.apache.org/jira/browse/NUTCH-210

Cheers,
  Chris



On 7/28/06 10:29 AM, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi Guys,
 
  In 0.8, it's even easier than that: Since NUTCH-245, we now have an
 official context.xml file that is built when the war target is executed. So,
 check the build directory for a nutch.xml file. Copy
 /path/to/nutch/build/nutch.xml to
 /path/to/tomcat/home/conf/Catalina/localhost/ (in Tomcat 5.x), or to
 /path/to/tomcat/home/webapps (in Tomcat 4.x). Then, edit the nutch.xml file
 to point to the location of your nutch WAR. Inside of nutch.xml, you'll be
 able to set the dynamic properties of the application without having to
 worry about the unpacked WAR file, or anything else.
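
For a concrete picture, here is a stripped-down context file along these
lines (a sketch: the paths are placeholders for your own install, not the
exact contents the war target generates):

<!-- nutch.xml: hypothetical Tomcat 5.x context entry; adjust docBase. -->
<Context path="/nutch" docBase="/path/to/nutch-0.8.war" reloadable="true"/>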
 
 Hope that helps!
 
 Cheers,
   Chris
 
 
 On 7/28/06 9:50 AM, Matthew Holt [EMAIL PROTECTED] wrote:
 
 You don't need to cd to the nutch directory for the startup script. All
 you need to do is edit the nutch-site.xml that is found within the nutch
 servlet and include a searcher directory property that tells Tomcat
 where to look for the crawl db.
 
 So if you have nutch 0.8, edit the file
 TOMCAT_PATH/webapps/NUTCH_DIR/WEB-INF/classes/nutch-site.xml and include
 the following:
 
 <property>
   <name>searcher.dir</name>
   <value>/your_index_folder_path</value>
 </property>
 
 
 I believe the your_index_folder_path is the path to your crawl
 directory.  However, if that doesn't work, make it the path to the index
 folder within your crawl directory.
 
 Now, save that and make sure your script just starts tomcat on init and
 everything should work fine for you.
 
 Matt
 
 
 Bill Goffe wrote:
 I'd like to start Nutch automatically when I reboot. I wrote a real rough
 script (see below) that works on my Debian system when the system is up,
 but I get nothing on a reboot (and the links are set to the
 /etc/init.d/nutch).  Any hints, ideas, or suggestions? I checked the FAQ
 and the archive but didn't see anything. In addition, it would be great to
 get messages going into /var/log to help figure out what is going on but
 I've had no luck doing that.
 
 Thanks,
 
Bill
 
 ## Start and stop Nutch. Note how specific it is to
 ## (i) Tomcat (typically $CATALINA_HOME/bin/shutdown.sh
 ## or $CATALINA_HOME/bin/startup.sh) and (ii) the
 ## directory with the most recent fetch results.
 
 ## PATH stuff
 PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games
 PATH=$PATH:/usr/local/SUNWappserver/bin
 CLASSPATH=/usr/local/SUNWappserver/jdk/jre/lib
 JAVA_HOME=/usr/local/SUNWappserver/jdk
 CATALINA_HOME=/usr/local/jakarta-tomcat-5
 JAVA_OPTS="-Xmx1024m -Xms512m"
 ## Export these so Tomcat's startup/shutdown scripts can actually see them.
 export PATH CLASSPATH JAVA_HOME CATALINA_HOME JAVA_OPTS
 
 case "$1" in
 start)
   cd /home/bgoffe/nc/40  ## start in correct directory
   /usr/local/jakarta-tomcat-5/bin/startup.sh
   ;;
 
 stop)
  /usr/local/jakarta-tomcat-5/bin/shutdown.sh
  ;;
 
 force-reload|restart)
   /usr/local/jakarta-tomcat-5/bin/shutdown.sh
   cd /home/bgoffe/nc/40
   /usr/local/jakarta-tomcat-5/bin/startup.sh
   ;;
 
 *)
 echo "Usage: /etc/init.d/nutch {start|stop|force-reload|restart}"
 exit 1
 ;;
 
 esac
 
 exit 0
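
 One likely reason a script like this works by hand but does nothing at boot
 is that the rc links or environment were never set up. A minimal sketch,
 assuming a Debian system and that the script is installed as
 /etc/init.d/nutch:

 # Make the script executable and register it at the default runlevels:
 chmod +x /etc/init.d/nutch
 update-rc.d nutch defaults

 # To get messages into /var/log, redirect Tomcat's output in the script:
 /usr/local/jakarta-tomcat-5/bin/startup.sh >> /var/log/nutch-tomcat.log 2>&1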
 
   
 
 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group
 
 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 ___
 
 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Blogger RSS Parsing Error

2006-04-17 Thread Chris Mattmann

Hi Mike,

  The RSS parser for Nutch is based on Kevin Burton's commons-feedparser in
the Jakarta Sandbox. Here is the documentation for that feedparser:

http://jakarta.apache.org/commons/sandbox/feedparser/

You might want to post to the commons-feedparser email list asking him about
your RSS question: he's the real RSS guru, and I bet you he could help you
out.

  As for your guess that it's probably an unrecognized tag, I think you're
probably right. Now the question is, your fetch isn't failing because of
this, right? I mean, I see in the RSS parser that line 116 (the call to the
parse function) is within a try/catch block, so what you are pasting below
is just the output of the stack trace, right?

Anyways, good luck on your problem!

Cheers,
  Chris

--
View this message in context: 
http://www.nabble.com/Blogger-RSS-Parsing-Error-t1462722.html#a3953532
Sent from the Nutch - User forum at Nabble.com.



Re: Same Error (Version 0.8)

2006-04-12 Thread Chris Mattmann

Hi Mike,

 Well, one thing that I notice off the bat is that you specify the <alias> tag
in nutch-site.xml (or maybe this was a typo when you posted the message). If
it wasn't, the <alias> tag should go into $NUTCH_HOME/conf/parse-plugins.xml,
the same place where you mapped the mimeTypes to plugin ids. Second, I would
ask that you verify that the following are true:

1. you have a plugin called microformats-hreview located in
$NUTCH_HOME/src/plugin/microformats-hreview

2. the plugin microformats-hreview has a plugin.xml file

3. the implementation id attribute inside of the plugin.xml file for the
microformats-hreview plugin is set to the value
org.apache.nutch.microformats.hreview.HReviewParser

Check on those things and let me know what you find out. We'll get to the
bottom of this.
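
For reference, a minimal plugin.xml along those lines might look roughly
like the following (a sketch patterned after the stock parse plugins; the
ids come from the checklist above):

<plugin id="microformats-hreview" name="HReview Microformat Parser"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="microformats-hreview.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.apache.nutch.microformats.hreview"
             name="HReview Parser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="org.apache.nutch.microformats.hreview.HReviewParser"
                    class="org.apache.nutch.microformats.hreview.HReviewParser"/>
  </extension>
</plugin>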

Cheers,
  Chris

--
View this message in context: http://www.nabble.com/Xml--t1050112.html#a3882468
Sent from the Nutch - User forum at Nabble.com.



Re: Same Error (Version 0.8)

2006-04-12 Thread Chris Mattmann
Hi Mike,

 Another thing is: are you making sure that your plugin is being built? That
is, did you add an entry in $NUTCH_HOME/src/build.xml for your plugin,
underneath the deploy target (at least)? This will cause your plugin to
be built when the rest of the plugins are built, and then copied to
$NUTCH_HOME/build, which is where the plugin repository will look for the
runtime for plugins. Your plugin might not be loaded because of that. Please
check and let us know.
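
If it helps, the kind of entry meant here looks roughly like the following
(a sketch; in some trees the file is src/plugin/build.xml rather than
src/build.xml):

<!-- Inside the deploy target, alongside the entries for the other plugins: -->
<ant dir="microformats-hreview" target="deploy"/>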

Cheers,
  Chris



On 4/12/06 8:56 AM, mikeyc [EMAIL PROTECTED] wrote:

 
 Chris / Jerome,
 Ok.  So, now the error message is gone, but my plugin doesn't seem to be
 getting called (not seeing any of my messages).  As listed below, I updated
 my plugin.xml (similar to microformats-reltag) and removed any entries in
 the parse-plugins.xml file.
 
 Any ideas?  
 
 Again, thanks for helping me work through these issues - didn't have half as
 many with version 0.7. ;)
 
 -Mike
 --
 View this message in context:
 http://www.nabble.com/Xml--t1050112.html#a3884328
 Sent from the Nutch - User forum at Nabble.com.
 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Same Error (Version 0.8)

2006-04-11 Thread Chris Mattmann

Hi Mike,

   Could you post the snippet from your nutch-site.xml where you enable the
plugin org.apache.nutch.xxx.xxx.xxx? Could you also be more specific and
post the entire name of the plugin that it printed in your log file? This
warning message basically means that there was an entry in the
parse-plugins.xml file for your plugin org.apache.nutch.xxx.xxx.xxx, but it
was never enabled in nutch-site.xml, (or nutch-default.xml).

Thanks,
  Chris

--
View this message in context: http://www.nabble.com/Xml--t1050112.html#a3875572
Sent from the Nutch - User forum at Nabble.com.



Re: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Chris Mattmann
Hi Dennis,

 Thanks for your hard work. Where exactly on the wiki is the tutorial? I'm
not seeing it.

Cheers,
  Chris



On 3/20/06 2:52 PM, Dennis Kubes [EMAIL PROTECTED] wrote:

 The NutchHadoop tutorial is now up on the wiki.
 
 Dennis 
 
 -Original Message-
 From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 20, 2006 12:49 PM
 To: nutch-user@lucene.apache.org
 Subject: RE: Nutch and Hadoop Tutorial Finished
 
 Sorry.  Go to http://wiki.apache.org/nutch/ and click on the
 login link at the top of the page.  You'll have to create yourself an
 account and then when you go back to the wiki front page, you can edit
 it.
 
 I went ahead and created a link on the Front page called
 NutchHadoopTutorial (in the Administration section).  If you click on
 that link, you'll be prompted to create a new page.  Create a blank one
 and paste in your tutorial.  You'll probably want to play with the
 formatting.  There are help links on the wiki that explain how to format
 pages.
 
 If you have any trouble, just shout.
 Jake.
 
 -Original Message-
 From: Dennis Kubes [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 20, 2006 1:37 PM
 To: nutch-user@lucene.apache.org
 Subject: RE: Nutch and Hadoop Tutorial Finished
 
 Not to act dumb, but how do I add it to the wiki?
 
 Dennis 
 
 -Original Message-
 From: Vanderdray, Jacob [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 20, 2006 12:20 PM
 To: nutch-user@lucene.apache.org
 Subject: RE: Nutch and Hadoop Tutorial Finished
 
 Dennis,
 
 How 'bout the wiki.
 
 Jake.
 
 -Original Message-
 From: Dennis Kubes [mailto:[EMAIL PROTECTED]
 Sent: Monday, March 20, 2006 1:01 PM
 To: nutch-user@lucene.apache.org
 Subject: Nutch and Hadoop Tutorial Finished
 
  All,
 
 I have finished a lengthy tutorial on how to setup a distributed
 implementation of nutch and hadoop.  Should I post it on this list or is
 there a better place for it?
 
 Dennis
 
 
 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: project vitality?

2006-03-04 Thread Chris Mattmann
Hi Richard,

 IMHO, if you don't parse something correctly, you cannot rely on the
 results.  

Good, we're on the same page here.

 We have all parsed things where you leave a comma out and the parse
 results are wrong.  If there was a bug in Nutch's HTML parsing would
 that be a big deal?

Yes, it would be. HTML is the foundation for the web. Its content is the
most pervasive out there (as you allude to below).

 How about if it parsed the text in a particular tag
 out of order?  

I'm wondering what that has to do with anything? You may want to read up on
Lucene (http://lucene.apache.org/). Lucene is the underlying text search api
(and index format) that Nutch is built on top of, and I'm wondering if it
cares about the order in which a piece of text is given to it?

 PDF is unfortunately not HTML where you can parse the
 file sequentially and get an accurate result,

Gonna have to disagree with you on this. You're making a general statement
that's not true across the board. I would assert that in many cases, you can
still get an accurate result. What about a PDF research paper? Do you care
about what order the text comes in if you're just doing general, Google-like
search? When I go to Google and type "grid computing papers", do I care
that "grid computing" comes before some other text within the research paper?
Possibly, but mainly I care that "grid computing" was an emphasized phrase
within the text. Now, your definition of emphasized may not just be that
it's the first text that appears in the paper in the title, say: you may just
care that the frequency of "grid computing" in the paper is relatively
higher than a certain threshold compared to other terms. On the other hand,
the fact that "grid computing" is in the title and comes first in the PDF
may mean a lot to you. That's the nature of trying to extract structure
out of inherently unstructured content. I'm not saying that the structure or
order of text within a document is never useful: I agree that in a lot of
cases, it can help you to infer what values are associated with what fields
you want to index, etc. All I'm saying is that it's certainly a subset of
the greater functionality of just doing free text search, so you shouldn't
generalize and say that you can't parse a PDF sequentially and obtain good
results.

 but its use is second most
 ubiquitous.  PDFBox is not a PDF parsing framework either.  It has some
 PDF parsing algorithms that aren't being used.  Google does a good job
 parsing PDF; Nutch has to as well if it's going to compete.

Can you show that Google's PDF parsing capability is any better than Nutch's
using accepted evaluation methods for PDF? How about some real use cases and
real results? Until we could see such numbers, I'm hesitant to believe what
you're saying is true. If it is though, then I'm sure that the community
would welcome any updates to the PDF parsing plugin that expedite its
improvement.

Cheers,
  Chris



 
 
 
 
 -Original Message-
 From: Chris Mattmann [mailto:[EMAIL PROTECTED]
 Sent: Saturday, March 04, 2006 4:10 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: project vitality?
 
 
 Hello,
 
  I've been following this conversation for the past week and decided
 that I'd go ahead and chime in now. I think that honestly this whole
 thread of discussion needs to be taken off list, because it doesn't
 really have anything to do with the use of Nutch: what it boils down
 to is a list of complaints, requests for improvements and what not.
 Nutch's goal is to be a large-scale, open source search engine: it's not
 a PDF parsing framework, nor is it as thoroughly documented as some
 commercial software -- although I've run into many commercial software
 products that don't have the same quality of documentation that Nutch
 even has now in its nascent stages.
 
 Now that I have said that, I want to express my feeling that it's hard
 
 when it takes a week to figure out that invertlinks only applies to
 version 0.8. and when you ask to become a volunteer, you are met with
 no response.
 
 You don't need to ask to become a volunteer: just do it. As Doug said,
 create a patch, submit the patch to JIRA and let the community look at
 it. Change something on the Wiki if you don't think that the
 documentation is particularly well there. Use Nutch to do whatever you
 like, and if you feel that you contributed something that is applicable
 to a broader community outside of your domain, let people know about it.
 If it's really cool, I wouldn't worry about people ignoring you: they'll
 come around.
 
 It's also frustrating when you share some hard-earned insights into
 something that nutch needs to work on, like pdf parsing, and your
 comments don't get a single good response from the nutch dev team.
 
 The nutch dev team isn't focused on PDF parsing. Nutch is a search
 engine framework, and to Nutch, a PDF parser is a black box that
 conforms to a standard parsing interface that can be swapped out as
 technology evolves. Right

Re: Which version of rss does parse-rss plugin support?

2006-02-10 Thread Chris Mattmann
Hi,


   the contentTitle will be a concatenation of the titles of the RSS Channels
 that we've parsed.
   So the titles of the RSS Channels are what is delivered for indexing, right?

They're certainly part of it, but not the only part. The concatenation of
the titles of the RSS Channels are what is delivered for the title portion
of indexing.

   If I want the indexer to include more information about a rss file (such
 as item descriptions), can I just concatenate them to the contentTitle?

They're already there. There is a variable called indexText: ultimately
that variable includes the item descriptions, along with the channel
descriptions. That, along with the title portion of indexing, is the full
set of textual data delivered by the parser for indexing. So, it already
includes that information. Check out lines 137 and 161 in the parser to see
what I mean. Also, check out lines 204-207, which are:

ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
    contentTitle.toString(), outlinks, content.getMetadata());
parseData.setConf(this.conf);
return new ParseImpl(indexText.toString(), parseData);

You can see that the return from the Parser, i.e., the ParseImpl, includes
both the indexText, along with the parse data (that contains the title
text).

Now, if you wanted to add any other metadata gleaned from the RSS to the
title text, or the content text, you can always modify the code to do that
in your own environment. The RSS Parser plugin returns a full channel model
and item model that can be extended and used for those purposes.

Hope that helps!

Cheers,
  Chris


 
 
 On 2006-02-06, Chris Mattmann [EMAIL PROTECTED] wrote:
 
 Hi there,
 
   That should work: however, the biggest problem will be making sure that
 text/xml is actually the content type of the RSS that you are parsing,
 which you'll have little or no control over.
 
 Check out this previous post of mine on the list to get a better idea of
 what the real issue is:
 
 http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
 
 G'luck!
 
 Cheers,
 Chris
 
 
 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group
 
 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 Phone:  818-354-8810
 ___
 
 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 -Original Message-
 From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
 Sent: Saturday, February 04, 2006 11:40 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Which version of rss does parse-rss plugin support?
 
 Hi Chris
 
 
 How do I change the plugin.xml? For example, if I want to crawl rss files
 ending with xml, do I just add a new element?
 
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="application/rss+xml"
                   pathSuffix="rss"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="application/rss+xml"
                   pathSuffix="xml"/>
 
 Am I right?
 
 
 
 On 2006-02-03, Chris Mattmann [EMAIL PROTECTED] wrote:
 
 Hi there,
 Sure it will, you just have to configure it to do that. Pop over to
 $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there is
 an attribute called pathSuffix. Change that to handle whatever type of rss
 file you want to crawl. That will work locally. For web-based crawls, you
 need to make sure that the content type being returned for your RSS content
 matches the content type specified in the plugin.xml file that parse-rss
 claims to support.
 
 Note that you might not have *a lot* of success with being able to
 control the content type for rss files returned by web servers. I've seen a
 LOT of inconsistency in the way that they're configured by the
 administrators, etc. However, just to let you know, there are some people in
 the group that are working on a solution for addressing this.
 
 Hope that helps.
 
 Cheers,
 Chris
 
 
 
 On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
 
 Hi *Chris,*
 
 The files of RSS 1.0 have a postfix of rdf. So will the parser recognize
 it automatically as an rss file?
 
 
 On 2006-02-03, Chris Mattmann [EMAIL PROTECTED] wrote:
 
 Hi there,
 
 parse-rss is based on commons-feedparser
 (http://jakarta.apache.org/commons/sandbox/feedparser). From the
 feedparser
 website:
 
 ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0,
 and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension
 and RSS 1.0 modules capability...
 
 Hope that helps

RE: Which version of rss does parse-rss plugin support?

2006-02-05 Thread Chris Mattmann
Hi there,

   That should work: however, the biggest problem will be making sure that
text/xml is actually the content type of the RSS that you are parsing,
which you'll have little or no control over. 

Check out this previous post of mine on the list to get a better idea of
what the real issue is:

http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html

G'luck!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
Phone:  818-354-8810
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 -Original Message-
 From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
 Sent: Saturday, February 04, 2006 11:40 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Which version of rss does parse-rss plugin support?
 
 Hi Chris
 
 
 How do I change the plugin.xml? For example, if I want to crawl rss files
 ending with xml, do I just add a new element?
 
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="application/rss+xml"
                   pathSuffix="rss"/>
   <implementation id="org.apache.nutch.parse.rss.RSSParser"
                   class="org.apache.nutch.parse.rss.RSSParser"
                   contentType="application/rss+xml"
                   pathSuffix="xml"/>
 
 Am I right?
 
 
 
 On 2006-02-03, Chris Mattmann [EMAIL PROTECTED] wrote:
 
  Hi there,
  Sure it will, you just have to configure it to do that. Pop over to
  $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there is
  an attribute called pathSuffix. Change that to handle whatever type of rss
  file you want to crawl. That will work locally. For web-based crawls, you
  need to make sure that the content type being returned for your RSS content
  matches the content type specified in the plugin.xml file that parse-rss
  claims to support.
 
  Note that you might not have *a lot* of success with being able to
  control the content type for rss files returned by web servers. I've seen a
  LOT of inconsistency in the way that they're configured by the
  administrators, etc. However, just to let you know, there are some people in
  the group that are working on a solution for addressing this.
 
  Hope that helps.
 
  Cheers,
  Chris
 
 
 
  On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
 
   Hi *Chris,*
  
   The files of RSS 1.0 have a postfix of rdf. So will the parser recognize
   it automatically as an rss file?
  
  
  On 2006-02-03, Chris Mattmann [EMAIL PROTECTED] wrote:
  
   Hi there,
  
   parse-rss is based on commons-feedparser
   (http://jakarta.apache.org/commons/sandbox/feedparser). From the
   feedparser
   website:
  
   ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92,
  1.0,
   and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc
  extension
   and RSS 1.0 modules capability...
  
   Hope that helps.
  
   Thanks,
   Chris
  
  
   On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
  
   I see the test file is of version 0.91.
   Does the plugin support higher versions like 1.0 or 2.0?
  



Re: Which version of rss does parse-rss plugin support?

2006-02-03 Thread Chris Mattmann
Hi there,

  parse-rss is based on commons-feedparser
(http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser
website:

...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0,
and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension
and RSS 1.0 modules capability...

Hope that helps.

Thanks,
  Chris


On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

 I see the test file is of version 0.91.
 Does the plugin support higher versions like 1.0 or 2.0?
 




RE: indexing issue

2006-02-01 Thread Chris Mattmann
Hi Raghavendra,

  Pop open your $NUTCH_HOME/conf/parse-plugins.xml file. Look for the
<mimeType name="*"> portion of the file. Now, look at the <plugin> tag
underneath it. Change that plugin id to the one you want to use for your
default parser, i.e., in your case, parse-msword.
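
In other words, the resulting stanza would look something like this (a
sketch; double-check the element names against the stock file shipped with
your version):

<mimeType name="*">
  <plugin id="parse-msword" />
</mimeType>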

Hope that helps!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


 -Original Message-
 From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, February 01, 2006 8:19 AM
 To: nutch-user@lucene.apache.org
 Subject: indexing issue
 
 Hi
 
 I have got some files also.
 
 How do I use some parser as the default?
 
 Currently the text parser does not work fine for the file type which I have.
 
 If I want to make the doc (Word) parser the default one (in the sense that
 if no parser is found, Word should be used as the default processor and not
 the text parser), how do I do it?
 
 Rgds
 Prabhu



RE: indexing issue

2006-02-01 Thread Chris Mattmann
Hi Prabhu,

 And also in the cached page, I get frequent errors for the file system.
 
 Is it because of the content-type bug (which you are working on)?

Not sure, what errors are you getting? I fixed a bug in cached.jsp that had
to do with an absolute versus relative link (see NUTCH-112). Jerome C
committed that a while back. Was your problem with cached.jsp having to do
with absolute versus relative links?

Thanks,
  Chris

 
 
 Rgds
 
 Prabhu
 
 On 2/1/06, Chris Mattmann [EMAIL PROTECTED] wrote:
 
  Hi Raghavendra,
 
  Pop open your $NUTCH_HOME/conf/parse-plugins.xml file. Look for the
  mimeType name=* portion of the file. Now, look at the parser tag
  underneath it. Change that parser id to the one you want to use for your
  default parser, i.e., in your case, parse-msword.
 
  Hope that helps!
 
  Cheers,
  Chris
 
 
  __
  Chris A. Mattmann
  [EMAIL PROTECTED]
  Staff Member
  Modeling and Data Management Systems Section (387)
  Data Management Systems and Technologies Group
 
  _
  Jet Propulsion LaboratoryPasadena, CA
  Office: 171-266BMailstop:  171-246
  ___
 
  Disclaimer:  The opinions presented within are my own and do not reflect
  those of either NASA, JPL, or the California Institute of Technology.
 
 
   -Original Message-
   From: Raghavendra Prabhu [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, February 01, 2006 8:19 AM
   To: nutch-user@lucene.apache.org
   Subject: indexing issue
  
   Hi
  
   I have got some files also
  
   How do i use some parser as the default
  
   Currently the text parser does not work fine for the file type which i
   have
  
   If i want to make the doc (word) parser as the default one (In a sense
  if
   no
   parser is found ,word should be used as the default processor and not
  the
   text parse)
  
   How do i do it ?
  
   Rgds
   Prabhu
 
 



Re: resource pool for nutchbean

2006-01-05 Thread Chris Mattmann
Hi Raghavendra, 

I think that this is a good idea. What about a commons-pool
(http://jakarta.apache.org/commons/pool/) implementation? The NutchBean
pool could be built using the basic API classes from this package...
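
To sketch what that might look like (assuming commons-pool 1.x and the
no-arg NutchBean constructor; adjust to however your Nutch version builds
the bean):

import org.apache.commons.pool.BasePoolableObjectFactory;
import org.apache.commons.pool.impl.GenericObjectPool;
import org.apache.nutch.searcher.NutchBean;

/** Minimal sketch of a NutchBean pool built on commons-pool 1.x. */
public class NutchBeanPool {

  private final GenericObjectPool pool;

  public NutchBeanPool(int maxBeans) {
    pool = new GenericObjectPool(new BasePoolableObjectFactory() {
      public Object makeObject() throws Exception {
        // Assumes the no-arg constructor; substitute whatever
        // construction your Nutch version requires.
        return new NutchBean();
      }
    });
    pool.setMaxActive(maxBeans); // cap the number of concurrent beans
  }

  /** Called at the start of a search request. */
  public NutchBean borrow() throws Exception {
    return (NutchBean) pool.borrowObject();
  }

  /** Called when the request is done, making the bean reusable. */
  public void release(NutchBean bean) throws Exception {
    pool.returnObject(bean);
  }
}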

Cheers,
  Chris


On 1/5/06 1:43 PM, Raghavendra Prabhu [EMAIL PROTECTED] wrote:

 What I am saying is the NutchBean is not instantiated in the servlet context
 and then garbage collected.
 
 The server has a way of allocating a NutchBean to users who request one from
 its pool and giving it to them.
 
 It must also free the NutchBeans either periodically or when the number of
 NutchBeans has reached a limit.
 
 Raghavendra Prabhu
 
 
 On 1/6/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote:
 
 No, I don't think so.
 
 What I am suggesting is we have NutchBeans instantiated and we store them.
 
 Whenever a user comes and searches, he will be given a NutchBean.
 
 After he searches he returns it to the pool, and when someone else searches
 at the same time he would get the same bean (note a new bean is not created).
 
 Only if a bean is not available does a new bean get created.
 
 This makes it faster, as different users share the same NutchBean and it
 does not create a new NutchBean.
 
 Note: the NutchBean is shared across different users, whereas right now it
 is only for a single user and garbage collected.
 
 Here we control the NutchBean instantiation, and we have to come up with a
 way to free it.
 
 
 
 
 On 1/6/06, Byron Miller [EMAIL PROTECTED] wrote:
 
 If I'm not mistaken, doesn't the opensearch servlet get
 around this issue? You could then post process the xml
 through a stylesheet/css or your favorite scripting
 language.
 
 -byron
 
 --- Raghavendra Prabhu [EMAIL PROTECTED] wrote:
 
 Right now
 
 Whenever a user comes and searches, a NutchBean is created.
 
 We should have a mechanism where this NutchBean is pooled. I mean it is
 created and stored so that it can be given to the user.
 
 Immediately after the user has used the NutchBean, he returns it back.
 
 (For example, at orkut we get a message saying "doughnut not available".)
 
 This will make search results faster and more efficient.
 
 Only when parallel users are there will new NutchBeans get created.
 
 Any comments on the above issue?
 
 
 Rgds
 Prabhu
 
 
 
 





Re: resource pool for nutchbean

2006-01-05 Thread Chris Mattmann
Sounds great. Could you create an issue in JIRA
(http://issues.apache.org/jira/browse/NUTCH) issue about this, and mark it
as an improvement. That way we can track progress on it, and attach
patches and progress.

Thanks,
  Chris



On 1/5/06 1:56 PM, Raghavendra Prabhu [EMAIL PROTECTED] wrote:

 Yeah, we should do this.
 
 It will considerably improve performance.
 
 We should start building upon this.
 
 Rgds
 Raghavendra Prabhu
 
 
 On 1/6/06, Chris Mattmann [EMAIL PROTECTED] wrote:
 
 Hi Raghavendra,
 
 I think that this is a good idea. What about a commons-pool
 (http://jakarta.apache.org/commmons/pool/) implementation? The nutch bean
 pool could be built using the basic API classes from this package...
 
 Cheers,
 Chris
 
 
 On 1/5/06 1:43 PM, Raghavendra Prabhu [EMAIL PROTECTED] wrote:
 
 What i am saying is NutchBean is not instantiated in the servlet context
 and
 garbage collection.
 
 The Server has a way of allocation NutchBean to users who request from
 its
 base and give it to them .
 
 It must also free the NutchBeans either periodically or when the number
 of
 nutchbeans have reached a size
 
 Raghavendra Prabhu
 
 
 On 1/6/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote:
 
 No i dont think so
 
 What i am suggesting is we have nutch beans instantiated and we store
 it .
 
 Whenever an user comes and searches ,he will be given a NutchBean .
 
 
 After he searches he returns it to the pool and during the same time
 when
 some one searches he would get the same bean (note new bean is not
 created )
 
 Only if a bean is not available ,does a new bean get created .
 
 This makes it faster as different users share the same NutchBean and it
 does not create a new nutchbean
 
 Note:NutchBean is shared across different users whereas right now it is
 only for a single user and garabage collected
 
 Here we control the NutchBean instantiation and we have to come up with
 a
 way to free it .
 
 
 
 
 On 1/6/06, Byron Miller [EMAIL PROTECTED] wrote:
 
 If i'm not mistaken doesn't the opensearch servlet get
 around this issue? You could then post process the xml
 through a stylesheet/css or your favorite scripting
 language.
 
 -byron
 
 --- Raghavendra Prabhu [EMAIL PROTECTED] wrote:
 
 Right now
 
 Whenever an user comes and searches ,a  NutchBean is
 created
 
 We should have a mechanism where this nutchbean is
 pooled .I mean is created
 and stored so that it can be given to the user
 
 Immediately after the  user has used the Nutch Bean
 ,he returns it back
 
 (example at orkut ,we get a message saying doughnut
 not available)
 
 This will make search result faster and more
 efficient
 
 Only when paraller users are there will nutchbeans
 get created
 
 Any comments on the above issue
 
 
 Rgds
 Prabhu
 
 
 
 
 
 
 
 




Re: Crawling blogs and RSS

2005-10-18 Thread Chris Mattmann
Hi Miguel,

 Actually it's not out of priority, unfortunately because of the generic
nature of the mime type text/xml. Turns out that a lot of RSS comes back
as configured by the web server with the content type text/xml, even
though it's recommended that application/rss+xml be used as the mime type
for RSS. Most web server admins don't really spend the time configuring this
mime type correctly in their web server. Further, if you go look at the IANA
list of mime types, there really isn't a mime type specified for RSS
(although RDF has application/rdf+xml, which is sometimes used to identify
RSS as well). 

 So when I coded up the parse-plugins.xml file, I just noted the fact that
text/xml isn't really the standard mime type for rss, it's just the mime
type for any type of XML document, i.e., something that starts out with
<?xml version=...?>, which can conform to *any* XML Schema or DTD as
specified, which means identifying a document as text/xml doesn't really get
you anywhere, unfortunately. That's why I set the parse-text plugin to be
the highest priority for text/xml, as in my mind it was most suited to
handle the generic nature of XML documents. I listed parse-html as 2nd in
priority because XHTML is becoming more popular and a pervasive form of
content. Finally, parse-rss is last, well, because, I think it should be.
:-) If you think about it, parse-rss is really only meant to handle RSS
feeds, which may, or may not, come back with the mime type text/xml.

So, to answer your question, yes, parse-rss is last in the default
parse-plugins file. However, this doesn't mean it has to be that way in your
file. You are free to modify this list. Remember that order matters, in
fact, the order that the plugin comes underneath a mime type specifies its
order of preference to be used during parsing. You can find the full
specification of this at:

http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

which was authored by myself, Jerome Charron, and Sebastien LeCallonec
jointly. 

One part of fixing this problem is correct mime type identification for
document types, which I know that Jerome is working on an update to, and
will soon have a new mime type registry committed to Nutch. The other part
of this however, is deeper than just correct mime type identification. It
has to do with understanding the appropriate DTD or XML Schema that an XML
document conforms to. Only then will we understand the right parser to
call for an XML document. This could be handled in a number of ways, off the
top of my head, 2 ways come to mind:

1. Having a generic text/xml reading plugin than could parse out the
DTD/or XML Schema used by an XML document, and then call the right sub XML
parsing plugin, that knew how to handle that DTD or schema

2. Adding an attribute to the plugin.xml file that specifies the DTD or
Schema that an XML Parsing Plugin supports, and then doing the resolution in
a decentralized fashion whenever the mime type text/xml is encountered

Anyways, I have been thinking about this for a while, and will start working
on a proposal and solution in the near future. For now, if you like, you
could create a JIRA issue about this as a wish or improvement to be
worked on in the (near) future.
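
Just to make option 1 above concrete, here is a rough sketch of the kind of
root-element sniffing I have in mind (the class and method names are made up
for illustration; only the plugin ids are real):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical dispatcher: peek at a text/xml document's root element
 *  and choose a more specific parsing plugin id. */
public class XmlRootDispatcher {

  public String pickParserId(byte[] content) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(content));
    String root = doc.getDocumentElement().getTagName();
    if ("rss".equals(root) || root.endsWith("RDF")) {
      return "parse-rss";  // RSS 0.9x/2.0 use <rss>; RSS 1.0 uses <rdf:RDF>
    } else if ("html".equals(root)) {
      return "parse-html"; // XHTML served as text/xml
    }
    return "parse-text";   // generic fallback
  }
}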


FYI, here are a few interesting articles on the subject:

http://spazioinwind.libero.it/pierfederici/blog/56.html
http://www.rassoc.com/gregr/weblog/archive.aspx?post=662

Thanks,
  Chris



On 10/18/05 9:36 AM, Miguel A Paraz [EMAIL PROTECTED] wrote:

 Hi,
 I'm trying to set up Nutch to crawl blogs.
 
 For nutch-site.xml, I added parse-rss to plugin.includes:
 <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|rss)|index-more|query-(basic|site|url)</value>
 
 
 and set db.ignore.internal.links to false.
 
 I noticed that in parse-plugins.xml:
 
 <mimeType name="text/xml">
   <plugin id="parse-text" />
   <plugin id="parse-html" />
   <plugin id="parse-rss" />
 </mimeType>
 
 is this by order of priority, and parse-rss is last?
 
 I tried injecting a single URL, my blog feed which is text/xml:
 http://migs.paraz.com/w/feed/
 
 It apparently isn't parsed.
 
 Thanks in advance.

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 





RE: [Nutch-general] RE: RSS Feed Parser

2005-08-25 Thread Chris Mattmann
Hi Jeff,

 Okay, here is the link to commons-feedparser source that includes my
modifications:

http://www-scf.usc.edu/~mattmann/feedparser-0.6-fork-src.zip


Thanks!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



 -Original Message-
 From: Jeff Bowden [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 24, 2005 10:45 PM
 To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: [Nutch-general] RE: RSS Feed Parser
 
 Yes please, that would be great.  I couldn't even figure out where to
 find the 0.6 version of feedparser, much less your patches to it.
 
 Chris Mattmann wrote:
 
 Hi Jeff,
 
   commons-feedparser-fork was a branched-off version of the feedparser 0.6
 base code that I made, which removed some of the specific jar files that
 were part of the standard 0.6 feedparser distro that conflicted with the jar
 files included in Nutch's lib directory. Specifically, I changed it so
 that
 the core jaxen libraries that the feed parser relied on weren't dom4j,
 but
 in fact were jdom (see postings on the Nutch list around March 2005
 between
 John X, Stefan G. and I). This required changing about 9 or 10 of the
 source
 files for the feedparser to use the jdom Node classes rather than the
 dom4j.
 
 If you like, I can put up a link to the feedparser forked code on my
 website, and post the link to the list.
 
 Thanks,
   Chris
 
 
 
 On 8/24/05 2:04 PM, American Jeff Bowden [EMAIL PROTECTED]
 wrote:
 
 
 
 Where can I obtain the source of commons-feedparser-0.6-fork.jar?  It
 doesn't appear to be in commons svn or on the feedparser site.
 
 Chris Mattmann wrote:
 
 
 
 Hi Zaheed,
 
 Thanks for the nice comments. I went ahead and wrote an HTML page that
 summarizes what I sent to Zaheed with respect to installing the parse-rss
 plugin. You can find the small guide here:
 
 http://www-scf.usc.edu/~mattmann/parse-rss-install.html
 
 
 Thanks,
  Chris
 
 
 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group
 
 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 ___
 
 Disclaimer:  The opinions presented within are my own and do not
 reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 
 
 
 
 
 
 -Original Message-
 From: Zaheed Haque [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 11, 2005 11:49 AM
 To: nutch-user@lucene.apache.org
 Subject: RSS Feed Parser
 
 Hello:
 
 I am really hoping that Chris Mattmann's RSS parser will make it into the
 0.7 release.
 
 http://issues.apache.org/jira/browse/NUTCH-30
 
 I got it working from last night's SVN. I believe newbie users like me
 would benefit very much from having it as part of the distribution. +1
 for this plugin!
 
 Thanks Chris for solving my problem!!
 --
 Best Regards
 Zaheed Haque
 
 
 
 
 
 
 
 
 
 
 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group
 
 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 ___
 
 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 
 
 
 
 
 

RE: [Nutch-general] RE: RSS Feed Parser

2005-08-25 Thread Chris Mattmann
Hi Jeff,
 
  Yup, that's correct.

Thanks,
 Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


 -Original Message-
 From: American Jeff Bowden [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 25, 2005 12:37 PM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: [Nutch-general] RE: RSS Feed Parser
 
 I notice that build.xml still creates commons-feedparser-0.5.0-RC1.jar
 but I'll assume you're just renaming it manually to -0.6-fork.
 
 Thanks.
 
 
 Chris Mattmann wrote:
 
 Hi Jeff,
 
  Okay, here is the link to commons-feedparser source that includes my
 modifications:
 
 http://www-scf.usc.edu/~mattmann/feedparser-0.6-fork-src.zip
 
 
 Thanks!
 
 Cheers,
   Chris
 
 
 __
 Chris A. Mattmann
 [EMAIL PROTECTED]
 Staff Member
 Modeling and Data Management Systems Section (387)
 Data Management Systems and Technologies Group
 
 _
 Jet Propulsion LaboratoryPasadena, CA
 Office: 171-266BMailstop:  171-246
 ___
 
 Disclaimer:  The opinions presented within are my own and do not reflect
 those of either NASA, JPL, or the California Institute of Technology.
 
 
 
 
 
 -Original Message-
 From: Jeff Bowden [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 24, 2005 10:45 PM
 To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: [Nutch-general] RE: RSS Feed Parser
 
 Yes please, that would be great.  I couldn't even figure out where to
 find the 0.6 version of feedparser, much less your patches to it.
 
 Chris Mattmann wrote:
 
 Hi Jeff,
 
   commons-feedparser-fork was a branched-off version of the feedparser 0.6
 base code that I made, which removed some of the specific jar files that
 were part of the standard 0.6 feedparser distro that conflicted with the jar
 files included in Nutch's lib directory. Specifically, I changed it so that
 the core jaxen libraries that the feed parser relied on weren't dom4j, but
 in fact were jdom (see postings on the Nutch list around March 2005 between
 John X, Stefan G. and I). This required changing about 9 or 10 of the source
 files for the feedparser to use the jdom Node classes rather than the dom4j.
 
 If you like, I can put up a link to the feedparser forked code on my
 website, and post the link to the list.
 
 Thanks,
  Chris
 
 On 8/24/05 2:04 PM, American Jeff Bowden [EMAIL PROTECTED] wrote:
 
 Where can I obtain the source of commons-feedparser-0.6-fork.jar?  It
 doesn't appear to be in commons svn or on the feedparser site.
 
 Chris Mattmann wrote:
 
 Hi Zaheed,
 
 Thanks for the nice comments. I went ahead and wrote an HTML page that
 summarizes what I sent to Zaheed with respect to installing the parse-rss
 plugin. You can find the small guide here:
 
 http://www-scf.usc.edu/~mattmann/parse-rss-install.html
 
 Thanks,
  Chris
 
 -Original Message-
 From: Zaheed Haque [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 11, 2005 11:49 AM
 To: nutch-user@lucene.apache.org
 Subject: RSS Feed Parser
 
 Hello:
 
 I am really hoping that Chris Mattmann's RSS parser will make it into the
 0.7 release.
 
 http://issues.apache.org/jira/browse/NUTCH-30
 
 I got it working from last night's SVN. I believe newbie users like me
 would benefit very much from having it as part of the distribution. +1
 for this plugin!
 
 Thanks Chris for solving my problem!!
 --
 Best Regards
 Zaheed Haque
 
Re: Chris Mattmann's RSS plugin? NUTCH-30

2005-07-21 Thread Chris Mattmann
Hi Andrzej,

  At the time that I was working diligently on this plugin (April/May), I
had done some thorough research into finding what I felt would be the most
flexible, reliable way to parse RSS files. The RSS feed parser out of the
jakarta-commons sandbox was what I found, and I stand by it. I understand
your concerns however about its reliance on several libraries, but it just
comes with the territory in this case. However, as noted in:
http://issues.apache.org/jira/browse/NUTCH-30  by Kevin Burton, when
feedparser 2.0 comes out, the reliance on the external libraries will be
removed, so I think that by adopting the feedparser based plugin right now,
we have a clear upgrade path that leads us to the plugin's independence of
external libraries, without changing (much of) the underlying source code.

That's my two cents.

Thanks!

Cheers,
  Chris Mattmann



On 7/20/05 11:58 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote:

 [EMAIL PROTECTED] wrote:
 Hi,
 
 Does anyone know why Chris Mattmann's RSS plugin (
 http://issues.apache.org/jira/browse/NUTCH-30 ) wasn't put in the
 repository, and whether there are plans to revive it and include it?
 
 That's probably my fault. I was almost ready to import it, but then
 during the final review I hesitated - I'm wary of pulling in so many
 dependencies. Then other things got in the way, and I sort of dropped it
 for the moment...
 
 If there's no way to parse RSS reliably other than using these dozens of
 libraries, so be it. Is this the case?

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 





RE: benchmarking

2005-07-20 Thread Chris Mattmann
Hi there Jay,

 Here are some numbers that a colleague and I presented in my graduate
computer science seminar class on search engines in the Spring 05' semester
at USC. The numbers measure the efficiency and scalability of several of the
plugin content extractors for Nutch (PDF, WORD, RSS, etc.). The tests were
performed on a RedHat Linux 7.3 Box, with 1.3 GB RAM, and a 10 GB HD, and a
Pentium III 500 Mhz processor. 

 The presentation is geared towards the parse-rss plugin that I wrote,
although they should give you an idea of the other content extractors too.

Hope they help, here's the link to the presentation:

http://baron.pagemewhen.com:8080/~chris/RSS-Nutch-Eval.ppt

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory   Pasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 -Original Message-
 From: webmaster [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, July 20, 2005 8:02 PM
 To: nutch-user@lucene.apache.org
 Subject: benchmarking
 
 hey, could some of you post your speeds (sorting, indexing, pages a
 sec/documents a sec) and system specs? I'm trying to compile a database of
 which of Nutch's functions are better suited to run on what hardware. also,
 if any of you have a Sun box, could you post its specs and some of the info
 for database sorting speeds and indexing speed, anything that uses full cpu.
 what's everyone's pages-a-sec top score??? e-mail me @
 [EMAIL PROTECTED]
 I'll post a webpage with the results
 Thanks,
 -Jay Pound



RE: benchmarking

2005-07-20 Thread Chris Mattmann
Hi Jay,

 One quick note on the presentation link that I sent out: it mentions that
Nutch does not have a syndication feed capability. At the time of the
presentation (April 2005), Nutch was in the early stages of gaining this
capability through the opensearch API. As I understand it, Nutch has this
capability now, so if it does, I just wanted to qualify that bullet in the
presentation.

Take care,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory   Pasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


 -Original Message-
 From: webmaster [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, July 20, 2005 8:03 PM
 To: nutch-user@lucene.apache.org
 Subject: benchmarking
 
 hey, could some of you post your speeds (sorting, indexing, pages a
 sec/documents a sec) and system specs? I'm trying to compile a database of
 which of Nutch's functions are better suited to run on what hardware. also,
 if any of you have a Sun box, could you post its specs and some of the info
 for database sorting speeds and indexing speed, anything that uses full cpu.
 what's everyone's pages-a-sec top score??? e-mail me @
 [EMAIL PROTECTED]
 I'll post a webpage with the results
 Thanks,
 -Jay Pound