Re: Question about crawl expectations

2006-04-25 Thread Shawn Gervais
Jason Camp wrote: Hi, I'm trying to gauge whether one crawl server is performing well, and I'm having a tough time determining whether I could increase settings to gain faster crawls, or whether I'm approaching the max the server can handle. The server is a dual AMD Athlon 2200 with 2 GB of RAM
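The crawl-speed knobs usually live in nutch-site.xml. A sketch, assuming a stock configuration; the property names below come from nutch-default.xml, but the values are illustrative only and depend on the sites being crawled:

```xml
<!-- nutch-site.xml: common fetcher tuning properties (illustrative values) -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>40</value> <!-- more threads = more parallel fetches, more RAM/CPU -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value> <!-- per-host politeness delay in seconds; a high value
                          caps throughput when crawling few hosts -->
</property>
```

Watching CPU, memory, and network saturation while raising `fetcher.threads.fetch` is the usual way to find the point where the box, rather than the config, is the bottleneck.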

Re: java.io.IOException: No input directories specified in

2006-04-25 Thread Peter Swoboda
Seems to be a bit better, doesn't it? bash-3.00$ bin/nutch crawl urls -dir crawled -depht 2 060425 110124 parsing jar:file:/home/../nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml 060425 110124 parsing file:/home/../nutch-nightly/conf/nutch-default.xml 060425 110124 parsing

Re: java.io.IOException: No input directories specified in

2006-04-25 Thread Zaheed Haque
jar:file:/home/../nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml Somehow you are still using hadoop-0.1 and not 0.1.1. I am not sure if this update will solve your problem, but it might. With the config I sent you I could crawl-index-search, so there must be something else... I am not

Re: java.io.IOException: No input directories specified in

2006-04-25 Thread Peter Swoboda
Sorry, my mistake. Changed to 0.1.1; results: bash-3.00$ bin/nutch crawl urls -dir crawled -depht 2 060425 113831 parsing jar:file:/home/../nutch-nightly/lib/hadoop-0.1.1.jar!/hadoop-default.xml 060425 113831 parsing file:/home/../nutch-nightly/conf/nutch-default.xml 060425 113832 parsing

Re: Restrictive searching approaches?

2006-04-25 Thread Andrew Libby
I've applied the patch in the ticket linked to below. I browsed the patch to try to figure out how to use this plugin, and I'm having trouble getting it working. Before I get into the details, if someone has a source of information describing how nutch starts up and initializes plugins

Re: Restrictive searching approaches?

2006-04-25 Thread jay jiang
Shouldn't that be subcollection:wiki instead? Also, I assumed you had subcollection added to plugin.includes in the config file (nutch-site.xml). Andrew Libby wrote: I've applied the patch in the ticket linked to below. I browsed the patch to try to figure out how to use this plugin, and
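The enabling step jay mentions would look roughly like this. A sketch, assuming the plugin from the patch registers under the id "subcollection"; the rest of the plugin.includes value is the stock default of that era and may differ between Nutch versions:

```xml
<!-- nutch-site.xml: add the subcollection plugin to the active set -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|subcollection</value>
</property>
```

Without the plugin listed here, a `subcollection:wiki` clause in a query is silently ignored, which matches the "can't get it working" symptom.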

Re: Restrictive searching approaches?

2006-04-25 Thread Andrew Libby
Okay, I sort of answered my own question on this. From looking at the index with Luke, it seemed that I can use the url field to restrict searches. I found that the main categories in my site had url field values equal to the top-level path in the URL. So I just add search terms
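The approach Andrew describes relies on Nutch's url query filter (the query-url plugin), which lets a query clause match terms drawn from the page URL. A hypothetical example, with made-up path and terms:

```
url:blog crawler tuning
```

This would restrict the search to pages whose URL contains the term "blog" (e.g. a top-level /blog/ path), then rank them by the remaining terms as usual.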

How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
Can somebody direct me on how to get the stored text and parse metadata for a given url? Dennis

Re: How to get Text and Parse data for URL

2006-04-25 Thread Doug Cutting
NutchBean.getContent() and NutchBean.getParseData() do this, but require a HitDetails instance. In the non-distributed case, the only required field of the HitDetails for these calls is url. In the distributed case, the segment field must also be provided, so that the request can be routed
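The routing rule Doug describes (url alone is enough locally; url plus segment is needed when distributed) can be sketched as a self-contained toy. This is NOT the real Nutch API: `HitDetails`, the maps, and the lookup methods below are simplified stand-ins that only mimic the shape of `NutchBean.getContent(HitDetails)` so the role of the two fields is visible.

```java
import java.util.HashMap;
import java.util.Map;

public class HitDetailsSketch {
    // Minimal stand-in for org.apache.nutch.searcher.HitDetails:
    // just a bag of named fields.
    static class HitDetails {
        final Map<String, String> fields = new HashMap<>();
        HitDetails set(String name, String value) { fields.put(name, value); return this; }
        String get(String name) { return fields.get(name); }
    }

    // Non-distributed case: the url field alone locates the content.
    static String getContentLocal(Map<String, String> store, HitDetails d) {
        return store.get(d.get("url"));
    }

    // Distributed case: the segment field first routes the request to the
    // right segment (in real Nutch, to the right search server), then the
    // url field locates the content within it.
    static String getContentDistributed(Map<String, Map<String, String>> segments,
                                        HitDetails d) {
        Map<String, String> seg = segments.get(d.get("segment"));
        return seg == null ? null : seg.get(d.get("url"));
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> segments = new HashMap<>();
        Map<String, String> seg1 = new HashMap<>();
        seg1.put("http://example.com/", "raw page text");
        segments.put("20060425113831", seg1);

        HitDetails d = new HitDetails()
                .set("url", "http://example.com/")
                .set("segment", "20060425113831");
        System.out.println(getContentDistributed(segments, d));
    }
}
```

A wrong or missing segment field makes the distributed lookup fail even for a url that exists somewhere, which is why both fields are required in that case.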

Re: How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
That got me started. I think that I am not fully understanding the role the segments directory and its contents play. It looks like it holds parse text and parse data in map files, but what is the content folder (also a map file)? And are the segments' contents used once the index is created?

Re: How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
Truly I am just not understanding the concept of a segment. Dennis Kubes wrote: That got me started. I think that I am not fully understanding the role the segments directory and its contents play. It looks like it holds parse text and parse data in map files, but what is the content folder

Re: How to get Text and Parse data for URL

2006-04-25 Thread Andrzej Bialecki
Dennis Kubes wrote: Can somebody direct me on how to get the stored text and parse metadata for a given url? From a single segment, or from a set of segments? From a single segment: please see how SegmentReader.get() does this (although it's a bit obscured by the fact that it uses multiple

Re: yes, a European nutch meeting is also planned :)

2006-04-25 Thread Ian Holsman
Any other Australians out there? I created a frappr map http://www.frappr.com/nutch if people are interested.. regards Ian On 24/04/2006, at 8:17 AM, Stefan Groschupf wrote: Berlin would work for me too. Just so we plan it well in advance ... ( 4 weeks). The wizards of open source

Re: How to get Text and Parse data for URL

2006-04-25 Thread Doug Cutting
Dennis Kubes wrote: I think that I am not fully understanding the role the segments directory and its contents play. A segment is simply a set of urls fetched in the same round, and data associated with these urls. The content subdirectory contains the raw http content. The parse-text
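For orientation, a segment on disk in the 0.8-dev nightlies of that era is a directory of MapFiles, roughly laid out as follows (the directory name is the fetch timestamp; subdirectory names are from that codebase, annotations are a summary, not quoted from the thread):

```
segments/20060425113831/
  crawl_generate/   urls selected for fetching in this round
  crawl_fetch/      fetch status for each url
  content/          raw http content, keyed by url
  parse_text/       extracted plain text
  parse_data/       outlinks and parse metadata
  crawl_parse/      outlink entries used to update the crawldb
```

The index is built from these files, but the segment is still consulted at search time for cached content and summaries, so it is not disposable after indexing.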