Re: nutch crawl issue

2010-05-05 Thread matthew a. grisius
Hi Chris, The 'maven install package' produced this for each target/maven-shared-archive-resources/... file. ... [INFO] [bundle:bundle {execution: default-bundle}] [ERROR] Error building bundle org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not exist: target/maven-shared-archive-r

Re: nutch crawl issue

2010-05-05 Thread Julien Nioche
Hi Matthew, As you can see from the error messages, Tika does not know how to parse JavaScript. There is a legacy JavaScript parser in Nutch which you can activate in the usual way, i.e. specify parse-js in plugin.includes. It generates a lot of spurious URLs but you should give it a try and see if

Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)
Hi Matthew, I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at Julien’s patch and see if there is a way to get it committed sooner rather than later. One way to help me do that: since you already have an environment and set of use cases where this is reproducible, can yo

Re: nutch crawl issue

2010-05-04 Thread matthew a. grisius
Hi Chris, It appears to me that parse-tika has trouble with HTML framesets/frames and/or JavaScript. Using the suggested parse-html workaround I am able to process my simple test cases such as javadoc, which does include simple embedded JavaScript (of course I can't verify that it is actually pars

Re: nutch crawl issue

2010-05-03 Thread Mattmann, Chris A (388J)
Hi Matthew, Awesome! Glad it worked. Now my next question: how often are you seeing that parse-tika doesn't work on HTML files? Is it all HTML that you are trying to process? Or just some of them? Or particular ones (categories of them)? The reason I ask is that I'm trying to determine whether I

Re: nutch crawl issue

2010-05-03 Thread matthew a. grisius
Hi Chris, Yes, that worked. I caught up on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m. On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote: > Hi Matthew, > > >> Hi Matthew, > >> > >> There is an open issue with Tika (e

Re: nutch crawl issue

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Matthew, >> Hi Matthew, >> >> There is an open issue with Tika (e.g. >> https://issues.apache.org/jira/browse/TIKA-379) that could explain the >> differences between parse-html and parse-tika. Note that you can specify : >> *parse-(html|pdf) *in order to get both HTML and PDF files. > > The re

Re: nutch crawl issue

2010-05-01 Thread matthew a. grisius
Hi Julien, On Thu, 2010-04-29 at 18:36 +0100, Julien Nioche wrote: > Hi Matthew, > > There is an open issue with Tika (e.g. > https://issues.apache.org/jira/browse/TIKA-379) that could explain the > differences between parse-html and parse-tika. Note that you can specify : > *parse-(html|pdf) *in

Re: nutch crawl issue

2010-04-30 Thread Phil Barnett
This sounds exactly like what I have been experiencing. On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius wrote: > using Nutch nightly build nutch-2010-04-27_04-00-28: > > I am trying to bin/nutch crawl a single html file generated by javadoc > and no links are followed. I verified this with b

Re: nutch crawl issue

2010-04-29 Thread Julien Nioche
Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify *parse-(html|pdf)* in order to get both HTML and PDF files. Could you please open an issue in JIRA ht

Re: nutch crawl issue

2010-04-29 Thread arpit khurdiya
If you are using the nightly build, did you change the same in parse-plugin.xml? Uncomment this: Hopefully this helps you. On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius wrote: > in nutch-site.xml I modified plugin.includes > > parse-(html) works > parse-(tika) does not > > I need

Re: nutch crawl issue

2010-04-29 Thread matthew a. grisius
In nutch-site.xml I modified plugin.includes: parse-(html) works, parse-(tika) does not. I need to also parse PDFs, so I need both features. I tried parse-(html| tika) to see if html would be selected before tika, and that did not work. On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote: > u
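
Editor's sketch, not part of the original message: the plugin.includes patterns tried above can be checked by hand, assuming the value is treated as a regular expression that must match a whole plugin id. The plugin ids listed here are illustrative, and the space in `parse-(html| tika)` may simply be an artifact of e-mail line wrapping. The sketch only shows which ids a pattern would select; it says nothing about which parser Nutch then picks for a given content type.

```python
import re

# Illustrative plugin ids only (assumption); a real install has many more.
PLUGIN_IDS = ["parse-html", "parse-tika", "parse-js", "protocol-http"]


def included(pattern, plugin_ids):
    """Return the ids that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [pid for pid in plugin_ids if compiled.fullmatch(pid)]


# The three parse patterns mentioned in the thread.
print(included(r"parse-(html)", PLUGIN_IDS))
print(included(r"parse-(html|tika)", PLUGIN_IDS))
# With a literal space after the bar, "parse-tika" no longer matches.
print(included(r"parse-(html| tika)", PLUGIN_IDS))
```

If the space really was present in the configuration value, only parse-html would have been selected under this assumption, which would look the same as the parse-(html) case.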

Re: nutch crawl issue

2010-04-28 Thread matthew a. grisius
My subject should've been clearer, e.g. it should've read Nutch 1.1 nightly build crawl issue. Also, I did verify that Nutch 1.0 successfully completes crawling the javadoc html file and can be verified with luke-1.0.1 and searched using command line bin/nutch org.apache.nutch.searcher.NutchBean j