Re: nutch crawl issue

2010-05-05 Thread matthew a. grisius
Hi Chris, The 'maven install package' produced this for each target/maven-shared-archive-resources/... file. ... [INFO] [bundle:bundle {execution: default-bundle}] [ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist:

Re: nutch crawl issue

2010-05-04 Thread matthew a. grisius
Hi Chris, It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES and/or javascript. Using the parse-html suggested work around I am able to process my simple test cases such as javadoc which does include simple embedded javascript (of course I can't verify that it is actually

Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)
Hi Matthew, I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at Julien’s patch and see if there is a way to get it committed sooner rather than later. One way to help me do that, since you already have an environment and set of use cases where this is reproducible can

Re: nutch crawl issue

2010-05-03 Thread matthew a. grisius
Hi Chris, Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m. On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote: Hi Matthew, Hi Matthew, There is an open issue with Tika (e.g.

Re: nutch crawl issue

2010-05-03 Thread Mattmann, Chris A (388J)
Hi Matthew, Awesome! Glad it worked. Now my next question: how often are you seeing that parse-tika doesn't work on HTML files? Is it all HTML that you are trying to process? Or just some of them? Or particular ones (categories of them). The reason I ask is that I'm trying to determine whether I

Re: nutch crawl issue

2010-05-01 Thread Phil Barnett
This sounds exactly like what I have been experiencing. On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius mgris...@comcast.net wrote: using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single html file generated by javadoc and no links are followed. I

Re: nutch crawl issue

2010-05-01 Thread matthew a. grisius
Hi Julien, On Thu, 2010-04-29 at 18:36 +0100, Julien Nioche wrote: Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify: parse-(html|pdf) in order

Re: nutch crawl issue

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Matthew, Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify: parse-(html|pdf) in order to get both HTML and PDF files. The reason that I

Re: nutch crawl issue

2010-04-29 Thread matthew a. grisius
In nutch-site.xml I modified plugin.includes: parse-(html) works, parse-(tika) does not. I need to also parse PDFs, so I need both features. I tried parse-(html| tika) to see if html would be selected before tika, and that did not work. On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:

Re: nutch crawl issue

2010-04-29 Thread arpit khurdiya
If you are using the nightly build, did you change the same in parse-plugin.xml? Uncomment this: <mimeType name="text/html"> <plugin id="parse-html" /> </mimeType> Hopefully this helps you. On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius mgris...@comcast.net wrote: in nutch-site.xml I modified

Re: nutch crawl issue

2010-04-29 Thread Julien Nioche
Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify: parse-(html|pdf) in order to get both HTML and PDF files. Could you please open an issue in JIRA
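
Archive note: the plugin.includes values quoted in this thread, such as parse-(html|pdf), are regular expressions over plugin ids. The sketch below is not Nutch code. It is a small Python illustration, assuming the pattern must match a whole plugin id, of which ids a given pattern would select. The plugin id list and the helper name `included` are hypothetical and exist only for this example.

```python
import re

# Hypothetical plugin ids, used only for illustration (not read from any Nutch install).
PLUGIN_IDS = ["parse-html", "parse-pdf", "parse-tika", "index-basic"]


def included(pattern, plugin_ids=PLUGIN_IDS):
    """Return the ids that the pattern matches in full (assumption: whole-id match)."""
    regex = re.compile(pattern)
    return [pid for pid in plugin_ids if regex.fullmatch(pid)]


if __name__ == "__main__":
    # Alternation selects both the HTML and the PDF parser ids.
    print(included(r"parse-(html|pdf)"))
    # The same form with tika in place of pdf.
    print(included(r"parse-(html|tika)"))
    # A space inside the group is a literal character in the regex.
    print(included(r"parse-(html| tika)"))
```

If the space in parse-(html| tika) was really present in the configuration and is not just a line-wrap artifact of the archive, a plain regex match like the one above would not select parse-tika, because the pattern would require a literal space before "tika". Whether that played any part in the behaviour reported in this thread is not something the previews confirm.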

Re: nutch crawl issue

2010-04-28 Thread matthew a. grisius
My subject should've been clearer, e.g. it should've read Nutch 1.1 nightly build crawl issue. Also, I did verify that Nutch 1.0 successfully completes crawling the javadoc html file and can be verified with luke-1.0.1 and searched using command line bin/nutch org.apache.nutch.searcher.NutchBean