Hi Chris,
The 'maven install package' produced this for each
target/maven-shared-archive-resources/... file.
...
[INFO] [bundle:bundle {execution: default-bundle}]
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-r
Hi Matthew,
As you can see from the error messages, Tika does not know how to parse
JavaScript. There is a legacy JavaScript parser in Nutch which you can
activate in the usual way, i.e. by specifying parse-js in plugin.includes. It
generates a lot of spurious URLs but you should give it a try and see if
Hi Matthew,
I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
Julien’s patch and see if there is a way to get it committed sooner rather
than later.
One way to help me do that - since you already have an environment and set
of use cases where this is reproducible can yo
Hi Chris,
It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
and/or JavaScript. Using the suggested parse-html workaround, I am able
to process my simple test cases such as javadoc, which does include
simple embedded JavaScript (of course I can't verify that it is actually
pars
Hi Matthew,
Awesome! Glad it worked. Now my next question: how often are you seeing
that parse-tika doesn't work on HTML files? Is it all HTML that you are
trying to process? Or just some of them? Or particular ones (categories of
them)? The reason I ask is that I'm trying to determine whether I
Hi Chris,
Yes, that worked. I caught up on email and noticed that Arpit also
mentioned the same thing. Sorry I missed it, thanks to both of you!
-m.
On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
> Hi Matthew,
>
> >> Hi Matthew,
> >>
> >> There is an open issue with Tika (e
Hi Matthew,
>> Hi Matthew,
>>
>> There is an open issue with Tika (e.g.
>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>> differences between parse-html and parse-tika. Note that you can specify
>> *parse-(html|pdf)* in order to get both HTML and PDF files.
>
> The re
Hi Julien,
On Thu, 2010-04-29 at 18:36 +0100, Julien Nioche wrote:
> Hi Matthew,
>
> There is an open issue with Tika (e.g.
> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
> differences between parse-html and parse-tika. Note that you can specify
> *parse-(html|pdf)* in
This sounds exactly like what I have been experiencing.
On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius
wrote:
> using Nutch nightly build nutch-2010-04-27_04-00-28:
>
> I am trying to bin/nutch crawl a single html file generated by javadoc
> and no links are followed. I verified this with b
Hi Matthew,
There is an open issue with Tika (e.g.
https://issues.apache.org/jira/browse/TIKA-379) that could explain the
differences between parse-html and parse-tika. Note that you can specify
*parse-(html|pdf)* in order to get both HTML and PDF files.
Could you please open an issue in JIRA
ht
If you are using the nightly build, did you make the same change in parse-plugins.xml?
Uncomment this:
Hopefully this helps.
On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius
wrote:
> in nutch-site.xml I modified plugin.includes
>
> parse-(html) works
> parse-(tika) does not
>
> I need
In nutch-site.xml I modified plugin.includes:
parse-(html) works
parse-(tika) does not
I also need to parse PDFs, so I need both features. I tried
parse-(html|tika) to see if html would be selected before tika, and that
did not work.
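For illustration, here is a minimal Python sketch of why the order inside parse-(html|tika) may not matter. It assumes plugin.includes is applied as a regular expression that must match a whole plugin id, so either ordering enables the same set of plugins; which parser then handles a given content type would be decided elsewhere (for example in parse-plugins.xml). The plugin ids listed and the included() helper are hypothetical, not Nutch code.

```python
import re

# Hypothetical plugin ids, used only for illustration.
plugin_ids = ["parse-html", "parse-tika", "parse-js", "index-basic"]


def included(pattern, ids):
    """Return the ids that the pattern matches in full, in input order.

    Assumption: the include pattern must match the entire plugin id.
    """
    rx = re.compile(pattern)
    return [plugin_id for plugin_id in ids if rx.fullmatch(plugin_id)]


# The two orderings of the alternation select the same plugins.
a = included(r"parse-(html|tika)", plugin_ids)
b = included(r"parse-(tika|html)", plugin_ids)
print(a)
print(b)
```

Under that assumption both patterns enable parse-html and parse-tika alike, so reordering the alternation would not by itself make one parser win over the other.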
On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> u
My subject should've been clearer, e.g. it should've read Nutch 1.1
nightly build crawl issue.
Also, I did verify that Nutch 1.0 successfully completes crawling the
javadoc html file and can be verified with luke-1.0.1 and searched using
command line bin/nutch org.apache.nutch.searcher.NutchBean j