RE: benchmarking

2005-07-20 Thread Chris Mattmann
Hi there Jay, Here are some numbers that a colleague and I presented in my graduate computer science seminar class on search engines in the Spring '05 semester at USC. The numbers measure the efficiency and scalability of several of the plugin content extractors for Nutch (PDF, WORD, RSS, etc.).

RE: benchmarking

2005-07-20 Thread Chris Mattmann
Hi Jay, One quick note on the previous presentation link that I sent out. It mentions in the presentation that Nutch does not have a syndication feed capability. At the time of the presentation (April 2005), Nutch was in the early stages of having this capability through the opensearch API. As I

Re: Chris Mattmann's RSS plugin? NUTCH-30

2005-07-21 Thread Chris Mattmann
, so I think that by adopting the feedparser based plugin right now, we have a clear upgrade path that leads us to the plugin's independence of external libraries, without changing (much of) the underlying source code. That's my two cents. Thanks! Cheers, Chris Mattmann On 7/20/05 11:58 PM

RE: [Nutch-general] RE: RSS Feed Parser

2005-08-25 Thread Chris Mattmann
PROTECTED] Subject: Re: [Nutch-general] RE: RSS Feed Parser Yes please, that would be great. I couldn't even figure out where to find the 0.6 version of feedparser, much less your patches to it. Chris Mattmann wrote: Hi Jeff, commons-feedparser-fork was a branched off version

RE: [Nutch-general] RE: RSS Feed Parser

2005-08-25 Thread Chris Mattmann
.jar but I'll assume you're just renaming it manually to -0.6-fork. Thanks. Chris Mattmann wrote: Hi Jeff, Okay, here is the link to commons-feedparser source that includes my modifications: http://www-scf.usc.edu/~mattmann/feedparser-0.6-fork-src.zip Thanks! Cheers

Re: Crawling blogs and RSS

2005-10-18 Thread Chris Mattmann
Hi Miguel, Actually it's not out of priority, unfortunately because of the generic nature of the mime type text/xml. Turns out that a lot of RSS comes back as configured by the web server with the content type text/xml, even though it's recommended that application/rss+xml be used as the mime

Re: resource pool for nutchbean

2006-01-05 Thread Chris Mattmann
Hi Raghavendra, I think that this is a good idea. What about a commons-pool (http://jakarta.apache.org/commons/pool/) implementation? The nutch bean pool could be built using the basic API classes from this package... Cheers, Chris On 1/5/06 1:43 PM, Raghavendra Prabhu [EMAIL PROTECTED]

Re: resource pool for nutchbean

2006-01-05 Thread Chris Mattmann
: Ya we shud do this . It will considerably improve performance We shud start building upon this . Rgds Raghavendra Prabhu On 1/6/06, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Raghavendra, I think that this is a good idea. What about a commons-pool (http://jakarta.apache.org

RE: indexing issue

2006-02-01 Thread Chris Mattmann
Hi Raghavendra, Pop open your $NUTCH_HOME/conf/parse-plugins.xml file. Look for the mimeType name=* portion of the file. Now, look at the parser tag underneath it. Change that parser id to the one you want to use for your default parser, i.e., in your case, parse-msword. Hope that helps!
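As a rough illustration of the change described above, the catch-all entry in parse-plugins.xml might look like the fragment below. This is a sketch from memory of the Nutch 0.8-era layout, not a copy of any shipped file. In particular, the child element appears as plugin (with an id attribute) in the copies I am aware of, although the message calls it the parser tag, so check the element name against your own file before editing.

```xml
<!-- $NUTCH_HOME/conf/parse-plugins.xml (fragment; layout is an assumption) -->
<parse-plugins>
  <!-- Catch-all entry: consulted when no more specific mimeType matches -->
  <mimeType name="*">
    <!-- Default parser switched to the MS Word parser, as suggested above -->
    <plugin id="parse-msword" />
  </mimeType>
</parse-plugins>
```

The plugin named here presumably also has to be enabled through the plugin.includes property, otherwise the mapping points at a parser that is never loaded.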

RE: indexing issue

2006-02-01 Thread Chris Mattmann
committed that a while back. Was your problem with cached.jsp having to do with absolute versus relative links? Thanks, Chris Rgds Prabhu On 2/1/06, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Raghavendra, Pop open your $NUTCH_HOME/conf/parse-plugins.xml file. Look

Re: Which version of rss does parse-rss plugin support?

2006-02-03 Thread Chris Mattmann
Hi there, parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website: ...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS

RE: Which version of rss does parse-rss plugin support?

2006-02-05 Thread Chris Mattmann
that helps. Cheers, Chris On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote: Hi *Chris,* The files of RSS 1.0 have a postfix of rdf. So will the parser recognize it automatically as a rss file? On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, parse

Re: Which version of rss does parse-rss plugin support?

2006-02-10 Thread Chris Mattmann
a full channel model and item model that can be extended and used for those purposes. Hope that helps! Cheers, Chris On 06-2-6, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, That should work: however, the biggest problem will be making sure that text/xml is actually the content type

Re: project vitality?

2006-03-04 Thread Chris Mattmann
? Until we could see such numbers, I'm hesitant to believe what you're saying is true. If it is though, then I'm sure that the community would welcome any updates to the PDF parsing plugin that expedite its improvement. Cheers, Chris -Original Message- From: Chris Mattmann

Re: Nutch and Hadoop Tutorial Finished

2006-03-20 Thread Chris Mattmann
Hi Dennis, Thanks for your hard work. Where exactly on the wiki is the tutorial? I'm not seeing it. Cheers, Chris On 3/20/06 2:52 PM, Dennis Kubes [EMAIL PROTECTED] wrote: The NutchHadoop tutorial is now up on the wiki. Dennis -Original Message- From: Vanderdray, Jacob

Re: Same Error (Version 0.8)

2006-04-11 Thread Chris Mattmann
Hi Mike, Could you post the snippet from your nutch-site.xml where you enable plugin: org.apache.nutch.xxx.xxx.xxx. Could you also be more specific and post the entire name of the plugin that it printed in your log file? This warning message basically means that there was an entry in the

Re: Same Error (Version 0.8)

2006-04-12 Thread Chris Mattmann
Hi Mike, Well one thing that I notice off the bat is that you specify the alias tag in nutch-site.xml (or maybe this was a typo when you posted the message). If it wasn't, the alias tag should go into $NUTCH_HOME/conf/parse-plugins.xml, the same place where you mapped the mimeTypes to plugin
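To make the placement concrete, here is a minimal sketch of a parse-plugins.xml that holds both the mimeType mapping and the alias. Every name in it (the mime type, the plugin id, and the extension id) is hypothetical and stands in for whatever your own plugin declares. The overall layout is recalled from the 0.8-era file and should be checked against your copy.

```xml
<!-- $NUTCH_HOME/conf/parse-plugins.xml (fragment; all names are hypothetical) -->
<parse-plugins>
  <!-- Map a mime type to the plugin that should parse it -->
  <mimeType name="application/x-example">
    <plugin id="parse-example" />
  </mimeType>

  <!-- Aliases live in this file, not in nutch-site.xml -->
  <aliases>
    <!-- Ties the plugin id above to the extension id in the plugin's plugin.xml -->
    <alias name="parse-example"
           extension-id="org.example.nutch.parse.ExampleParser" />
  </aliases>
</parse-plugins>
```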

Re: Same Error (Version 0.8)

2006-04-12 Thread Chris Mattmann
Hi Mike, Another thing is: are you making sure that your plugin is being built? That is, did you add an entry in $NUTCH_HOME/src/build.xml for your plugin, underneath the deploy tag (at least)? This will cause your plugin to be built when the rest of the plugins are built, and then copied to
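A build entry of the kind mentioned above could look roughly like this. It assumes the plugin build file delegates to each plugin directory with one Ant ant task per plugin, which is how I recall the Nutch plugin build being organised. The directory name parse-example is hypothetical, and the exact location of the build file should be confirmed in your checkout.

```xml
<!-- Plugin build file (fragment; structure is an assumption) -->
<target name="deploy">
  <!-- Existing entries: one per plugin directory -->
  <ant dir="parse-text" target="deploy"/>
  <!-- Hypothetical new plugin: without a line like this it is never built or copied -->
  <ant dir="parse-example" target="deploy"/>
</target>
```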

Re: Blogger RSS Parsing Error

2006-04-17 Thread Chris Mattmann
Hi Mike, The RSS parser for Nutch is based on Kevin Burton's commons-feedparser in the Jakarta Sandbox. Here is the documentation for that feedparser: http://jakarta.apache.org/commons/sandbox/feedparser/ You might want to post to the commons-feedparser email list asking him about your RSS

Re: Starting Nutch in init.d?

2006-07-28 Thread Chris Mattmann
Guys, Sorry, I misspoke: the issue was actually: NUTCH-210, not NUTCH-245. You can view the issue at: http://issues.apache.org/jira/browse/NUTCH-210 Cheers, Chris On 7/28/06 10:29 AM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Guys, In 0.8, it's even easier than that: Since NUTCH

Re: Feedparser 0.6 fork source code

2006-08-08 Thread Chris Mattmann
Hi Jeremy, I've uploaded the fork-src to my USC website. Here is the URL: http://www-scf.usc.edu/~mattmann/feedparser-src-fork.tar.gz I'll leave the file up there for a few days at least, so feel free to grab it at your leisure. Thanks, Chris On 8/8/06 4:55 PM, HUYLEBROECK Jeremy

Re: [Nutch-0.8] Missing WAR file

2006-08-12 Thread Chris Mattmann
Hi Guys, On 8/12/06 9:27 AM, Hou Keat Lee [EMAIL PROTECTED] wrote: Hi, Maybe I'm missing something here. If the packaged WAR file is supposed to be used, how does Nutch link back to my crawling results and indexes? Another option for this would be to use the generated nutch.xml file

Re: Speeding up compilation without compiling plugins

2006-08-25 Thread Chris Mattmann
Hi Michael, I believe that there is an ant task called compile-core. If you just type: # ant compile-core Rather than: # ant You should be good to go. HTH, Chris On 8/25/06 5:48 AM, Michael Wechner [EMAIL PROTECTED] wrote: Hi How can I disable the compiling of all plugins such

Re: RSS search by nutch

2006-08-28 Thread Chris Mattmann
Hi there Dima, I'm not exactly sure what you mean by real time, but there is an RSS Parsing plugin in Nutch that can parse RSS feeds that Nutch encounters during its crawl. You can enable parse-rss by opening up $NUTCH_HOME/conf/nutch-site.xml, and searching for the property plugin.includes.
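For reference, enabling parse-rss through plugin.includes might look like the sketch below. The value is a regular expression over plugin ids. The particular list shown is illustrative only and is not guaranteed to match the default of any release, so start from the value in your own configuration and add rss to the parse group. The configuration root element is assumed, as in the 0.8-era files.

```xml
<!-- $NUTCH_HOME/conf/nutch-site.xml (fragment; plugin list is illustrative) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- "rss" added to the parse-(...) group so that parse-rss is loaded -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|rss)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```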

Re: RSS search by nutch

2006-08-28 Thread Chris Mattmann
Hi Jeremy, On 8/28/06 10:18 AM, HUYLEBROECK Jeremy RD-ILAB-SSF [EMAIL PROTECTED] wrote: The Nutch Feed/RSS plugin (parse-rss) only allows you to search the entire channel/feed text, not items individually. Actually, this isn't entirely the case. parse-rss actually indexes the item text (see

Re: intranet crawl problems: mime types; .doc-related exceptions; really, really slow crawl + possible infinite loop

2006-08-30 Thread Chris Mattmann
Hi there Tomi, On 8/30/06 12:25 PM, Tomi NA [EMAIL PROTECTED] wrote: I'm attempting to crawl a single samba mounted share. During testing, I'm crawling like this: ./bin/nutch crawl urls -dir crawldir4 -depth 2 -topN 20 I'm using luke 0.6 to query and analyze the index. PROBLEMS

Re: rss integration

2006-09-11 Thread Chris Mattmann
Thanks a lot ... again. Ernesto. Chris Mattmann wrote: Hi Ernesto, The RSSParser in Nutch does in fact index the individual item links: they are added as Outlinks during each iteration in which the RSSParser is called. Both the channel text and the item text are indexed. Also

Re: java.lang.NullPointerException

2006-10-11 Thread Chris Mattmann
Hi there, You need to set your http.agent.name property within $NUTCH_HOME/conf/nutch-default.xml. HTH, Chris On 10/11/06 3:57 AM, Guruprasad Iyer [EMAIL PROTECTED] wrote: Hello, I have Nutch 0.8.1 installed on linux (FC3) along with java 1.5.0_07. When I run the crawl command I get
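A minimal sketch of the property in question follows. The agent name shown is a hypothetical placeholder, so substitute a name that identifies your own crawler. The message above points at nutch-default.xml. The same property block can also be placed in nutch-site.xml, whose values override the defaults.

```xml
<!-- Fragment for the Nutch configuration file; the value is a placeholder -->
<property>
  <name>http.agent.name</name>
  <!-- Hypothetical crawler name; an empty value is what triggers the failure described above -->
  <value>MyTestCrawler</value>
</property>
```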

Re: java.lang.NullPointerException

2006-10-11 Thread Chris Mattmann
identifying name and then set it to that. Cheers, Chris On 10/11/06 8:36 AM, Guruprasad Iyer [EMAIL PROTECTED] wrote: Hi Chris, Thanks for the reply. But, what value should I set it to? Can you help me on this? Thanks once again. Cheers, Guruprasad On 10/11/06, Chris Mattmann [EMAIL

Re: Indexing xml documents on local file system

2006-11-27 Thread Chris Mattmann
Hi Thorsten On 11/27/06 4:00 AM, Thorsten Scherler [EMAIL PROTECTED] wrote: Reading the wiki and the docu I get the impression I need to write my own implementation of an indexer/searcher plugin, which is able to filter/index crucial filter information such as summary year=2006 number=209

Re: Problem crawling/fetching using https

2007-01-24 Thread Chris Mattmann
Hi Michi, I am pretty sure that in order to support https, you need to enable the protocol-httpclient plugin, which is based on commons-httpclient. There isn't a protocol-https plugin as far as I know. Try that and see if that fixes your issue. Thanks! Cheers, Chris On 1/24/07 2:29 PM,
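The switch suggested above amounts to swapping the protocol plugin inside plugin.includes. The fragment below is a sketch under that assumption. The rest of the plugin list is illustrative and should be replaced with whatever your configuration already contains.

```xml
<!-- $NUTCH_HOME/conf/nutch-site.xml (fragment; plugin list is illustrative) -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- protocol-httpclient in place of protocol-http, to pick up https support -->
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
```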

Re: Problem crawling/fetching using https

2007-01-24 Thread Chris Mattmann
Hi Michi, Btw, wouldn't it make sense to add protocol-httpclient as default, because I guess I am not the only one trying to fetch pages using https? Indeed. The issue with this was in fact that some time ago, the powers that be decided that it probably made sense to make protocol-httpclient

Re: Problem crawling/fetching using https

2007-01-24 Thread Chris Mattmann
Hi Guys, Yep, I couldn't remember exactly what the issues were. Thanks for digging that up, Andrzej. So, yeah, anyways it may make sense to update nutch-site.xml with the comment below, with performance problems replaced with intermittent problems with the underlying commons-httpclient library.

Re: Problem crawling/fetching using https

2007-01-25 Thread Chris Mattmann
! Cheers, Chris On 1/24/07 3:29 PM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Guys, Yep, I couldn't remember exactly what the issues were. Thanks for digging that up, Andrzej. So, yeah, anyways it may make sense to update nutch-site.xml with the comment below, with performance

Nutch 0.9 officially released!

2007-04-05 Thread Chris Mattmann
Hi Folks, After some hard work from all folks involved, we've managed to push out Apache Nutch, release 0.9. This is the second release of Nutch based entirely on the underlying Hadoop platform. This release includes several critical bug fixes, as well as key speedups described in more detail at

Re: nutch-09 start problem

2007-04-12 Thread Chris Mattmann
Hi Ratnesh, I'm not sure that declaring Nutch 0.9 an unstable version is an entirely appropriate label -- it's been through several stress tests by the committers so far, and it seems to be performing well enough -- so much so that we decided it was worthwhile to make a release of it :). I

Re: Suggested fixes to http://wiki.apache.org/nutch/WritingPluginExample-0.9

2007-07-19 Thread Chris Mattmann
Hi Jasper, As I understand it, you can make these updates yourself. Sign up for a wiki account and then login with your username/password and you can update the page yourself. Thanks! Cheers, Chris On 7/19/07 10:10 AM, Jasper Kamperman [EMAIL PROTECTED] wrote: Hi, I spent several

Re: Indexing Feeds Blog Posts with Nutch

2007-10-11 Thread Chris Mattmann
them to the crawlist and indexing the HTML as normal? Also, if anyone is using Nutch to index blogs/feeds, then I'd be interested in how you have it configured. Thanks again, __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer

Re: Indexing Feeds Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann
not the behaviour I want. Indeed, it is not what I expected either. Chris, can you confirm this is the idea ? Did you ever consider indexing separate items ? curious, *pike

Re: Indexing Feeds Blog Posts with Nutch

2007-10-15 Thread Chris Mattmann
itself is built on top of the underlying ROME toolkit if I remember correctly. HTH, Chris Brian Ulicny On Thu, 11 Oct 2007 15:23:04 -0700, Chris Mattmann [EMAIL PROTECTED] said: Hi Rick, Glad to hear that you're interested in using Nutch! There are currently 2 plugins

Re: java.lang.NoClassDefFoundError Nutch 0.9

2007-11-08 Thread Chris Mattmann
, Karthik __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B

Re: java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException in cached.jsp

2008-02-04 Thread Chris Mattmann
or is it a mistake at my end?

Re: Tika Error ?

2008-02-14 Thread Chris Mattmann
?

Re: Slow Crawl Speed and Tika Error Media type alias already exists: text/xml

2008-04-04 Thread Chris Mattmann

Re: Next Generation Nutch

2008-04-11 Thread Chris Mattmann
! Cheers, Chris Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Chris Mattmann [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Friday, April 11, 2008 9:10:30 PM Subject: Re: Next Generation Nutch Hi Dennis, Thanks