I think you need to specify the individual segment:
bin/nutch readseg -dump crawl-20100420112025/segments/20100422092816 dumpSegmentDirectory
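For reference, the general form of the command (this should hold for 0.9 and
1.x alike) is:

  bin/nutch readseg -dump <segment_dir> <output_dir>

where <output_dir> is where the plain-text dump of the segment gets written.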
On Wed, Apr 21, 2010 at 9:38 PM, nachonieto3 wrote:
>
> Thank you a lot! Now I'm working on that, but I have some more doubts... I'm
> not able to run the readseg command...
Thanks Julien.
I have changed nutch-site.xml to use only parse-(tika) instead of
parse-(text | html | js | tika) in the plugin.includes property.
It works now, as it no longer picks up any parser besides tika.
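For anyone following along, the property in conf/nutch-site.xml ends up
looking roughly like this; the plugins around parse-(tika) are illustrative
and depend on your setup:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(tika)|index-basic|query-(basic|site|url)|urlnormalizer-(pass|regex|basic)</value>
  </property>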
On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>
Hey there! Wanted to let you all know about our next meetup, April
28th. We've got a killer new venue thanks to Amazon.
Check out the details at the link:
http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/calendar/13072272/
Our Speakers this month:
1. Nick Dimiduk, Drawn to Scale: Intro to
Hi Tim,
I would think that this parameter is related to the problem you describe, but
the default value should allow indexing pages of the size you mention. Did you
change this parameter?
Regards,
Arkadi
indexer.max.tokens
10000
The maximum number of tokens that will be indexed for a single field in a
document.
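If it did get changed somewhere, restoring it in conf/nutch-site.xml would
look something like this (10000 being the stock default from
nutch-default.xml, as far as I can tell):

  <property>
    <name>indexer.max.tokens</name>
    <value>10000</value>
  </property>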
YES - I forgot to include that... robots.txt is fine. It is wide open:
###
#
# sample robots.txt file for this website
#
# addresses all robots by using wild card *
User-agent: *
#
# list folders robots are not allowed to index
#Disallow: /tutorials/404redirect/
Disall
Using 1.1, it looks like the same error at first:
threads = 10
depth = 5
indexer=lucene
Injector: starting
Injector: crawlDb: crawl-20100421175011/crawldb
Injector: urlDir: /projects/events/search/apache-nutch-1.1/cmrolg-even/urls
Injector: Converting injected urls to crawl db entries.
Exception i
Hi,
How do I set up a specific crawldb in my own Java app?
I tried to do it with the following snippet:
Configuration nutchConf = NutchConfiguration.create();
//nutchConf.addResource(new Path(prop.getProperty("nutchPath")));
Path configPath = new Path("/cygdrive/f/Workspaces/Nu
Hey,
We have a long page that appears in the search results but the summary
never contains the search terms. Why is this?
If we move the text containing the search terms up the page, they get
displayed in the summary, so it's obviously related to some limit imposed
somewhere. I've looked through
Joshua,
Could you try using Nutch 1.1 RC1 (see
http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/)?
Could you also try separating the fetching and parsing steps? E.g. fetch
first as you already do, then parse the fetched segment (instead of parsing
while refetching).
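For reference, the two-step version would look roughly like this; the segment
path is a placeholder, and -noParsing assumes the stock Fetcher options:

  bin/nutch fetch crawl/segments/20100421175011 -noParsing
  bin/nutch parse crawl/segments/20100421175011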
Your crawl is fairly small
I get the same error on a filesystem with 10 GB (disk space is a commodity
here). The final crawl when it succeeds on my Windows machine is 93 MB, so
I really hope it doesn't need more than 10 GB to even pull down and parse
the first URL. Is there something concerning threading that could
intro
Thank you a lot! Now I'm working on that, but I have some more doubts... I'm
not able to run the readseg command... I've been consulting some help forums,
and the basic syntax is:
readseg
I have the segments in this path: D:\nutch-0.9\crawl-20100420112025\segments
The file named crawl-20100420112025
Hi Harry,
Could you try using parse-tika instead and see if you are getting the same
problem? I gather from your email that you are using Nutch 1.1 or the SVN
version, so parse-tika should be used by default. Have you deactivated it?
Thanks
Julien
On 21 April 2010 11:58, Harry Nutch wrote:
>
To convert Nutch's crawled data, which is stored in segments, into
human-readable and interpretable forms, you will have to look at the
'readseg' command (which was earlier 'segread'). It reads and exports the
segment data.
Details at Nutch Wiki:
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutc
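Besides -dump, the tool has a couple of other modes worth knowing; assuming
the 0.8+ SegmentReader, something like:

  bin/nutch readseg -list <segment_dir>          (summary statistics)
  bin/nutch readseg -get <segment_dir> <url>     (dump the record for one URL)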
Replacing the current xercesimpl.jar with the one from nutch 1.0 seems to
fix the problem.
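Concretely that swap is just a jar replacement under lib/; the file names
below are illustrative, so check what your checkout actually ships:

  rm $NUTCH_HOME/lib/xercesImpl*.jar
  cp apache-nutch-1.0/lib/xercesImpl.jar $NUTCH_HOME/lib/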
On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch wrote:
> Hi,
>
> I am running the latest version of Nutch. While crawling one particular
> site I get an AbstractMethodError in the cyberneko plugin for all of
House Less wrote:
>
>
> Hello everyone,
>
> I am quite new to development with Nutch, so you must forgive my question
> if it is amateurish. I asked it at the Lucene Java user mailing list and
> Grant Ingersoll referred me to this list.
>
> After some reading of Luke's source code, I found
Hi,
I am running the latest version of Nutch. While crawling one particular
site I get an AbstractMethodError in the cyberneko plugin for all of its
pages when doing a fetch.
As I understand it, this happens because of a difference between the runtime
and compile-time versions. However, I am running it afre