incremental nutch crawl on remote machine

2010-04-21 Thread Piet van Remortel
Hi all, I'm new to Nutch and turned to it to obtain a setup along the following lines: we want a remote machine, running Nutch (?), that we can incrementally feed URLs to, and access the index and raw content of the crawled versions of those URLs. It seems to me that Nutch is what we need, but I

AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Hi, I am running the latest version of Nutch. While crawling one particular site I get an AbstractMethodError in the cyberneko plugin for all of its pages when doing a fetch. As I understand it, this happens because of a difference between the runtime and compile-time versions. However, I am running it

Re: Retrieving the term vectors of a document in Nutch

2010-04-21 Thread voltman
House Less wrote: Hello everyone, I am quite new to development with Nutch, so you must forgive my question if it is amateurish. I asked it at the Lucene Java user mailing list and Grant Ingersoll referred me to this list. After some reading of Luke's source code, I found to my

Re: AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Replacing the current xercesImpl.jar with the one from Nutch 1.0 seems to fix the problem. On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch harrynu...@gmail.com wrote: Hi, I am running the latest version of Nutch. While crawling one particular site I get an AbstractMethodError in the cyberneko

Re: how to parse html files while crawling

2010-04-21 Thread Ankit Dangi
To convert Nutch's crawled data, which is stored in segments, into human-readable and interpretable form, you will have to look at the 'readseg' command (which was earlier 'segread'). It reads and exports the segment data. Details at the Nutch Wiki:
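A minimal sketch of what that looks like on the command line, assuming a Nutch 1.x install and a local crawl directory (the segment name below is a placeholder; substitute one of the timestamped directories under your crawl's segments/ folder):

    # dump one segment's data (fetch status, content, parse text) to a readable file
    bin/nutch readseg -dump crawl/segments/20100421000000 segdump

    # the exported data lands in a plain-text file named 'dump'
    less segdump/dump

Harry Nutch's reply further down in this digest shows the same invocation against a concrete segment.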

Re: AbstractMethodError for cyberneko parser

2010-04-21 Thread Julien Nioche
Hi Harry, Could you try using parse-tika instead and see if you are getting the same problem? I gather from your email that you are using Nutch 1.1 or the SVN version, so parse-tika should be used by default. Have you deactivated it? Thanks Julien On 21 April 2010 11:58, Harry Nutch

Re: Format of the Nutch Results

2010-04-21 Thread nachonieto3
Thanks a lot! Now I'm working on that, but I have some more doubts... I'm not able to run the readseg command... I've been consulting some help forums, and the basic syntax is: readseg <path of the file with the segments>. I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments

Re: how to parse html files while crawling

2010-04-21 Thread nachonieto3
Thanks a lot! Now I'm working on that, but I have some more doubts... I'm not able to run the readseg command... I've been consulting some help forums, and the basic syntax is: readseg <path of the file with the segments>. I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments

RE: Hadoop Disk Error

2010-04-21 Thread Joshua J Pavel
I get the same error on a filesystem with 10 GB (disk space is a scarce commodity here). The final crawl, when it succeeds on my Windows machine, is 93 MB, so I really hope it doesn't need more than 10 GB to even pull down and parse the first URL. Is there something concerning threading that could

Re: Hadoop Disk Error

2010-04-21 Thread Julien Nioche
Joshua, Could you try using Nutch 1.1 RC1 (see http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/)? Could you also try separating the fetching and parsing steps? E.g. fetch first as you already do, then parse the fetched segment (instead of parsing while fetching). Your crawl is fairly
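A sketch of the separated fetch/parse cycle Julien suggests, assuming a Nutch 1.x local crawl layout (the directory names are placeholders, and fetcher.parse must be false in nutch-site.xml so the fetch step skips parsing):

    # generate a fetch list; note the timestamped segment directory this creates
    bin/nutch generate crawl/crawldb crawl/segments
    SEGMENT=crawl/segments/20100421000000   # substitute the directory generate just created

    # fetch only, with parsing disabled via fetcher.parse=false
    bin/nutch fetch $SEGMENT

    # parse the fetched segment as a separate step
    bin/nutch parse $SEGMENT

    # fold the results back into the crawldb as usual
    bin/nutch updatedb crawl/crawldb $SEGMENT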

Is there some arbitrary limit on content stored for use by summaries?

2010-04-21 Thread Tim Redding
Hey, We have a long page that appears in the search results, but the summary never contains the search terms. Why is this? If we move the text containing the search terms up the page, they get displayed in the summary, so it's obviously related to some limit imposed somewhere. I've looked

specify nutchConfiguration File

2010-04-21 Thread Jan Philippe Wimmer
Hi, how do I set up a specific crawldb in my own Java app? I tried to do it like the following snippet: Configuration nutchConf = NutchConfiguration.create(); //nutchConf.addResource(new Path(prop.getProperty(nutchPath))); Path configPath = new
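The snippet above is cut off, but a minimal self-contained sketch of the same idea looks like this (the file name nutch-custom.xml and the property read back at the end are illustrative assumptions, not part of the original question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.nutch.util.NutchConfiguration;

    public class CustomConfigDemo {
        public static void main(String[] args) {
            // start from the standard Nutch configuration
            // (nutch-default.xml overlaid with nutch-site.xml)
            Configuration nutchConf = NutchConfiguration.create();

            // layer an extra resource on top; properties in later
            // resources override earlier ones
            nutchConf.addResource(new Path("/path/to/nutch-custom.xml"));

            // read a property back to confirm the override took effect
            System.out.println(nutchConf.get("http.agent.name"));
        }
    }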

Re: Hadoop Disk Error

2010-04-21 Thread Joshua J Pavel
Using 1.1, it looks like the same error at first:

    threads = 10
    depth = 5
    indexer=lucene
    Injector: starting
    Injector: crawlDb: crawl-20100421175011/crawldb
    Injector: urlDir: /projects/events/search/apache-nutch-1.1/cmrolg-even/urls
    Injector: Converting injected urls to crawl db entries.
    Exception

Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com

2010-04-21 Thread joshua paul
YES - I forgot to include that... robots.txt is fine. It is wide open:

    ###
    #
    # sample robots.txt file for this website
    #
    # addresses all robots by using wild card *
    User-agent: *
    #
    # list folders robots are not allowed to index
    #Disallow: /tutorials/404redirect/

RE: Is there some arbitrary limit on content stored for use by summaries?

2010-04-21 Thread Arkadi.Kosmynin
Hi Tim, I would think that this parameter is related to the problem you describe, but the default value should allow indexing pages of the size you mention. Did you change this parameter? Regards, Arkadi

    <property>
      <name>indexer.max.tokens</name>
      <value>10000</value>
      <description>The maximum
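If that limit turns out to be the cause, the usual remedy is to override the property in conf/nutch-site.xml; the value below is an arbitrary illustration, not a recommended setting:

    <property>
      <name>indexer.max.tokens</name>
      <value>50000</value>
      <description>Raised so tokens further down long pages are indexed
      and can appear in summaries.</description>
    </property>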

April Seattle Hadoop/Scalability/NoSQL Meetup: Cassandra, Science, More!

2010-04-21 Thread Bradford Stephens
Hey there! Wanted to let you all know about our next meetup, April 28th. We've got a killer new venue thanks to Amazon. Check out the details at the link: http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/calendar/13072272/ Our Speakers this month: 1. Nick Dimiduk, Drawn to Scale: Intro to

Re: AbstractMethodError for cyberneko parser

2010-04-21 Thread Harry Nutch
Thanks Julien. I have changed nutch-site.xml to have only parse-(tika) instead of parse-(text | html | js | tika) in the plugin.includes property. It works now, as it doesn't pick up any other parser besides tika. On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
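For reference, the override Harry describes would sit in conf/nutch-site.xml along these lines (the plugin list outside the parse-(...) group is a guess at a typical 1.1 setup; copy the actual default from nutch-default.xml and change only the parse group):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(tika)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>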

Re: Format of the Nutch Results

2010-04-21 Thread Harry Nutch
I think you need to specify the individual segment: bin/nutch readseg -dump crawl-20100420112025/segments/20100422092816 dumpSegmentDirectory On Wed, Apr 21, 2010 at 9:38 PM, nachonieto3 jinietosanc...@gmail.com wrote: Thanks a lot! Now I'm working on that, but I have some more doubts... I'm not