Nutch crawled databases

2010-05-04 Thread Renbyna
We are looking for already-crawled data for the Nutch search engine. Is anyone willing to share their crawled data? We just want to use it to test the performance of the Nutch search engine. Even stale data is fine, as long as it is large. We want to avoid the crawling step, which seems

Re: Parsing .ppt, .xls, .rtf and .doc

2010-05-04 Thread nachonieto3
Finally I solved it. It was a problem with the URLs I was trying to analyze: I was trying to crawl and parse links with spaces in them, i.e. links like http://nutch user/nutch.doc. I solved the problem by changing some settings in the URL filter. Thanks, by the way.
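(Editor's note: the poster does not say which filter rule was changed; as a hedged sketch, Nutch's `conf/regex-urlfilter.txt` is an ordered list of `+`/`-` regex rules, and a rule rejecting URLs that contain whitespace could look like the following. The exact rule is an assumption, not the poster's actual change.)

```
# conf/regex-urlfilter.txt (excerpt, illustrative only)

# Reject any URL containing whitespace; such links usually come from
# malformed HTML and can break fetching/parsing.
-.*\s

# Accept anything else (keep this as the final rule).
+.
```

Rules are applied top to bottom, and the first matching rule wins, so the rejection line must appear before the catch-all `+.`.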

Parsing html

2010-05-04 Thread nachonieto3
Good afternoon. Having solved my problem with the other formats, I'm now trying to figure out another one. I'm able to parse the .html format, but I get the ParseText all on one line. I would like to preserve at least the paragraphs of the original document. Does anyone know how to do it? Thank

Re: nutch crawl issue

2010-05-04 Thread matthew a. grisius
Hi Chris, It appears to me that parse-tika has trouble with HTML framesets/frames and/or JavaScript. Using the suggested parse-html workaround, I am able to process my simple test cases, such as javadoc, which does include simple embedded JavaScript (of course I can't verify that it is actually
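(Editor's note: the "parse-html workaround" referred to here is typically a change to the `plugin.includes` property in `conf/nutch-site.xml` so that HTML documents are handled by the legacy parse-html plugin instead of parse-tika. A minimal, illustrative sketch follows; the plugin list shown is abbreviated and not the poster's exact configuration.)

```xml
<!-- conf/nutch-site.xml (excerpt, illustrative only) -->
<property>
  <name>plugin.includes</name>
  <!-- list parse-html rather than parse-tika so HTML pages,
       including framesets, go through the legacy HTML parser -->
  <value>protocol-http|urlfilter-regex|parse-html|index-basic|query-(basic|site|url)</value>
</property>
```

This property overrides the default in `nutch-default.xml`, so the value must enumerate every plugin the crawl still needs, not just the parser being swapped.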

Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)
Hi Matthew, I think Julien may have a fix for this in TIKA-379 [1]. I'll take a look at Julien's patch and see if there is a way to get it committed sooner rather than later. One way to help me do that, since you already have an environment and set of use cases where this is reproducible, can