We are looking for already-crawled data for the Nutch search engine. Is there
anyone willing to share their crawled data? We just want to use the data to
test the performance of the Nutch search engine. Even stale data is fine, as
long as it is large. We want to avoid the crawling step, which seems
Finally I solved it. The problem was with the URLs I was trying to analyze:
I was trying to crawl and parse links with spaces in them, i.e. links like
http://nutch user/nutch.doc.
I solved the problem by changing some settings in the URL filter.
Thanks anyway.
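For reference, the kind of URL-filter change described above can be sketched as a rule in Nutch's conf/regex-urlfilter.txt. This is an assumption about the exact rule used, not the poster's actual change; rules are tried in order and the first matching pattern decides.

```
# Sketch for conf/regex-urlfilter.txt (illustrative, not the poster's exact rule):
# reject any URL containing whitespace, then accept everything else.
-\s
+.
```

The skip rule must appear before the final catch-all accept rule, otherwise the `+.` line matches first and the whitespace URLs get through.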
Good afternoon,
I have now solved my problem with the other formats, and I'm trying to figure
out how to solve another one.
I am able to parse the .html format, but I get the ParseText as a single
line. I would like to preserve at least the paragraphs of the original
document. Does anyone know how to do this?
Thanks
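The general idea behind what is being asked for can be illustrated outside Nutch: when extracting text from HTML, emit a line break whenever a block-level tag opens instead of flattening everything. The sketch below is a standalone Python illustration using the standard-library html.parser, not Nutch's ParseText code; the class and tag list are hypothetical choices.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Extract text from HTML, inserting a newline at paragraph-level tags
    so the output keeps the original document's paragraph structure."""

    # Illustrative set of tags treated as paragraph boundaries.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text + " ")

    def get_text(self):
        return "".join(self.parts).strip()


html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
extractor = ParagraphTextExtractor()
extractor.feed(html)
print(extractor.get_text())  # the two paragraphs come out on separate lines
```

A Nutch-specific fix would need to hook into the parser plugin that builds the ParseText, but the whitespace-handling idea is the same.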
Hi Chris,
It appears to me that parse-tika has trouble with HTML FRAMESETs/FRAMEs
and/or JavaScript. Using the suggested parse-html workaround, I am able
to process my simple test cases, such as javadoc, which does include
simple embedded JavaScript (of course I can't verify that it is actually
Hi Matthew,
I think Julien may have a fix for this in TIKA-379 [1]. I'll take a look at
Julien's patch and see if there is a way to get it committed sooner rather
than later.
One way to help me do that, since you already have an environment and a set
of use cases where this is reproducible, can