Nutch crawled databases

2010-05-04 Thread Renbyna
We are looking for already-crawled data for the Nutch search engine. Is anyone willing to share their crawled data? We just want to use it to test the performance of the Nutch search engine. Even stale data is fine, as long as it is large. We want to avoid the crawling step, which seems

Re: Parsing .ppt, .xls, .rtf and .doc

2010-05-04 Thread nachonieto3
Finally I solved it. It was a problem with the URLs I was trying to analyze: I was trying to crawl and parse links with spaces in them, i.e. links like http://nutch user/nutch.doc. I solved the problem by changing some settings in the URL filter. Thanks, by the way.
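(Editor's note: the poster does not say which filter rule was changed; as a hedged sketch, Nutch's `conf/regex-urlfilter.txt` is an ordered list of `+`/`-` regex rules, and a rule rejecting URLs that contain whitespace could look like the following. The exact rule is an assumption, not the poster's actual change.)

```
# conf/regex-urlfilter.txt (excerpt, illustrative only)

# Reject any URL containing whitespace; such links usually come from
# malformed HTML and can break fetching/parsing.
-.*\s

# Accept anything else (keep this as the final rule).
+.
```

Rules are applied top to bottom, and the first matching rule wins, so the rejection line must appear before the catch-all `+.`.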

Parsing html

2010-05-04 Thread nachonieto3
Good afternoon. Having solved my problem with the other formats, I'm now trying to figure out another one. I'm able to parse the .html format, but I get the ParseText all on one line. I would like to preserve at least the paragraphs of the original document. Does anyone know how to do it? Thank

Re: nutch crawl issue

2010-05-04 Thread matthew a. grisius
Hi Chris, It appears to me that parse-tika has trouble with HTML framesets/frames and/or JavaScript. Using the suggested parse-html workaround, I am able to process my simple test cases, such as javadoc, which does include simple embedded JavaScript (of course I can't verify that it is actually
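(Editor's note: the "parse-html workaround" referred to here is typically a change to the `plugin.includes` property in `conf/nutch-site.xml` so that HTML documents are handled by the legacy parse-html plugin instead of parse-tika. A minimal, illustrative sketch follows; the plugin list shown is abbreviated and not the poster's exact configuration.)

```xml
<!-- conf/nutch-site.xml (excerpt, illustrative only) -->
<property>
  <name>plugin.includes</name>
  <!-- list parse-html rather than parse-tika so HTML pages,
       including framesets, go through the legacy HTML parser -->
  <value>protocol-http|urlfilter-regex|parse-html|index-basic|query-(basic|site|url)</value>
</property>
```

This property overrides the default in `nutch-default.xml`, so the value must enumerate every plugin the crawl still needs, not just the parser being swapped.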

Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)
Hi Matthew, I think Julien may have a fix for this in TIKA-379 [1]. I'll take a look at Julien's patch and see if there is a way to get it committed sooner rather than later. One way to help me do that, since you already have an environment and set of use cases where this is reproducible, can