subject:"Re\: nutch crawl issue"

Re: nutch crawl issue

2010-05-05 Thread matthew a. grisius

Hi Chris,

The 'maven install package' produced this for each
target/maven-shared-archive-resources/... file.

...
[INFO] [bundle:bundle {execution: default-bundle}]
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/NOTICE~
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/DEPENDENCIES~
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/LICENSE~
[ERROR] Error(s) found in bundle configuration
[INFO]

[ERROR] BUILD ERROR
[INFO]

[INFO] Error(s) found in bundle configuration

[INFO]

[INFO] For more information, run Maven with the -e switch
[INFO]

[INFO] Total time: 1 minute 24 seconds
[INFO] Finished at: Wed May 05 23:38:56 EDT 2010
[INFO] Final Memory: 40M/271M
[INFO]


Assuming this was the right thing to do, I renamed each file to match
the missing filename, e.g. rename DEPENDENCIES to DEPENDENCIES~ (and
NOTICE, LICENSE) in each 'target' and re-ran to generate the new jars.
and produce this:

[INFO]

[INFO] Reactor Summary:
[INFO]

[INFO] Apache Tika parent  SUCCESS
[2.261s]
[INFO] Apache Tika core .. SUCCESS
[14.429s]
[INFO] Apache Tika parsers ... SUCCESS
[32.370s]
[INFO] Apache Tika application ... SUCCESS
[34.179s]
[INFO] Apache Tika OSGi bundle ... SUCCESS
[16.081s]
[INFO] Apache Tika ... SUCCESS
[0.237s]
[INFO]

[INFO]

[INFO] BUILD SUCCESSFUL
[INFO]

[INFO] Total time: 1 minute 41 seconds
[INFO] Finished at: Wed May 05 23:43:56 EDT 2010
[INFO] Final Memory: 37M/278M
[INFO]


in plugin/parse-tika I replaced parse-tika.jar and tika-parsers-0.7.jar
with tika-core-0.8-SNAPSHOT.jar and tika-parsers-0.8-SNAPSHOT.jar

in lib/ I replaced tika-core-0.7.jar with tika-core-0.8-SNAPSHOT.jar

I ran bin/nutch crawl and it completed w/o error. All of the javascript
was fetched and appeared to be parsed w/o error. However, and I'm not
sure the correct terminology to use,  no more urls were generated to
fetch than before. So the patch appears to be a step in the right
direction. The problem is with the FRAMESET/FRAME and how javascript is
used to generate content in the FRAMES.

As Julien suggested, I will read the archive deeper and look at the
legacy parse-js. I suppose as a really ugly 'brute force' work around I
could walk the directory tree w/ a perl script and generate a 'seed
list' of html URLs to fetch. Ugh. If you have any more ideas please let
me know how I can help. Thanks.

-m.

On Tue, 2010-05-04 at 21:50 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
 Julien’s patch and see if there is a way to get it committed sooner rather
 than later.
 
 One way to help me do that ― since you already have an environment and set
 of use cases where this is reproduceable can you apply TIKA-379 to a local
 checkout of tika trunk (I’ll show you how) and then let me know if that
 fixes parse-tika for you?
 
 Here are the steps:
 
 svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
 cd tika
 wget http://bit.ly/bXeLkf; (if you don't have SSL support, then manually
 download the linked file)
 patch -p0  TIKA-379-3.patch
 mvn install package
 
 Then grab tika-parsers and tika-core out of the respective tika-core/target
 and tika-parsers/target directories and drop those jars in your
 parse-tika/lib folder, replacing their originals. Then, try your nutch crawl
 again.
 
 See if that works. In the meanwhile, I'll inspect Julien's patch.
 
 Thanks!
 
 Cheers,
 Chris
 
 On 5/4/10 9:02 PM, matthew a. grisius mgris...@comcast.net wrote:
 
  Hi Chris,
  
  It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
  and/or javascript. Using the parse-html suggested work around I am able
  to process my simple test cases such as javadoc which does

Re: nutch crawl issue

2010-05-04 Thread matthew a. grisius

Hi Chris,

It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
and/or javascript. Using the parse-html suggested work around I am able
to process my simple test cases such as javadoc which does include
simple embedded javascript (of course I can't verify that it is actually
parsing it though). I expanded my testing to include two more complex
examples that heavily use HTML FRAMESET/FRAME and more complex
javascript:

134 mb, 11,269 files
1.9 gb, 133,978 files

They both fail at the top level with the similar errors such as:

fetching
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
fetching
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
Error parsing:
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js:
UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
Attempting to finish item from unknown queue:
org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
fetch of
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
failed with: java.lang.ArrayIndexOutOfBoundsException: -56
-finishing thread FetcherThread, activeThreads=2

I tried several property settings to mimic the previous work around and
could not solve it. Any suggestions?

So, I'm not sure how to categorize the issues more accurately. I have
many javadoc sets and lots of simple HTML that will now parse, but I
have other examples such as the two mentioned above that won't parse and
therefore can't be crawled. It seems to me to be systematic rather than
exceptional. I cannot believe that I'm the only one who will experience
these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
for asking.

-m.

On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
Hi Matthew,

Awesome! Glad it worked. Now my next question how often are you seeing
that parse-tika doesn¹t work on HTML files? Is it all HTML that you are
trying to process? Or just some of them? Or particular ones (categories of
them). The reason I ask is that I¹m trying to determine whether I should
commit the update below to 1.1 so it goes out with the 1.1 RC and if it¹s a
systematic thing versus an exception.

Let me know and thanks!

Cheers,
Chris

On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote:

Hi Chris,

Yes, that worked. I caught up on email and noticed that Arpit also
mentioned the same thing. Sorry I missed it, thanks to both of you!

-m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
Hi Matthew,

Hi Matthew,

There is an open issue with Tika (e.g.
https://issues.apache.org/jira/browse/TIKA-379) that could explain the
differences betwen parse-html and parse-tika. Note that you can specify :
*parse-(html|pdf) *in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
PDFs, but has problems with some html. Nutch 1.1 includes more current
PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well one solution comes to mind. Can you test this out?

* uncomment the lines:

mimeType name=text/html
plugin id=parse-html /
/mimeType

In conf/parse-plugins.xml.

* try your crawl again.

I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
with the attached file

Thanks! Let me know what happens after you uncomment the line above.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)

Hi Matthew,

I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
Julien’s patch and see if there is a way to get it committed sooner rather
than later.

One way to help me do that ― since you already have an environment and set
of use cases where this is reproduceable can you apply TIKA-379 to a local
checkout of tika trunk (I’ll show you how) and then let me know if that
fixes parse-tika for you?

Here are the steps:

svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
cd tika
wget http://bit.ly/bXeLkf; (if you don't have SSL support, then manually
download the linked file)
patch -p0  TIKA-379-3.patch
mvn install package

Then grab tika-parsers and tika-core out of the respective tika-core/target
and tika-parsers/target directories and drop those jars in your
parse-tika/lib folder, replacing their originals. Then, try your nutch crawl
again.

See if that works. In the meanwhile, I'll inspect Julien's patch.

Thanks!

Cheers,
Chris

On 5/4/10 9:02 PM, matthew a. grisius mgris...@comcast.net wrote:

 Hi Chris,
 
 It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
 and/or javascript. Using the parse-html suggested work around I am able
 to process my simple test cases such as javadoc which does include
 simple embedded javascript (of course I can't verify that it is actually
 parsing it though). I expanded my testing to include two more complex
 examples that heavily use HTML FRAMESET/FRAME and more complex
 javascript:
 
 134 mb, 11,269 files
 1.9 gb, 133,978 files
 
 They both fail at the top level with the similar errors such as:
 
 fetching
 http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSD
 ocCommon.js
 fetching
 http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDo
 cBanner.htm
 -finishing thread FetcherThread, activeThreads=8
 -finishing thread FetcherThread, activeThreads=7
 -finishing thread FetcherThread, activeThreads=9
 -finishing thread FetcherThread, activeThreads=6
 -finishing thread FetcherThread, activeThreads=5
 -finishing thread FetcherThread, activeThreads=4
 -finishing thread FetcherThread, activeThreads=3
 Error parsing:
 http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSD
 ocCommon.js: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type
 text/javascript
 Attempting to finish item from unknown queue:
 org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
 fetch of
 http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSD
 ocCommon.js failed with: java.lang.ArrayIndexOutOfBoundsException: -56
 -finishing thread FetcherThread, activeThreads=2
 
 I tried several property settings to mimic the previous work around and
 could not solve it. Any suggestions?
 
 So, I'm not sure how to categorize the issues more accurately. I have
 many javadoc sets and lots of simple HTML that will now parse, but I
 have other examples such as the two mentioned above that won't parse and
 therefore can't be crawled. It seems to me to be systematic rather than
 exceptional. I cannot believe that I'm the only one who will experience
 these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
 for asking.
 
 -m.
 
 
 
 On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 Awesome! Glad it worked. Now my next question  how often are you seeing
 that parse-tika doesn¹t work on HTML files? Is it all HTML that you are
 trying to process? Or just some of them? Or particular ones (categories of
 them). The reason I ask is that I¹m trying to determine whether I should
 commit the update below to 1.1 so it goes out with the 1.1 RC and if it¹s a
 systematic thing versus an exception.
 
 Let me know and thanks!
 
 Cheers,
 Chris
 
 
 On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote:
 
 Hi Chris,
 
 Yes, that worked. I caught up on email and noticed that Arpit also
 mentioned the same thing. Sorry I missed it, thanks to both of you!
 
 -m.
 
 On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 Hi Matthew,
 
 There is an open issue with Tika (e.g.
 https://issues.apache.org/jira/browse/TIKA-379) that could explain the
 differences betwen parse-html and parse-tika. Note that you can specify :
 *parse-(html|pdf) *in order to get both HTML and PDF files.
 
 The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
 PDFs, but has problems with some html. Nutch 1.1 includes more current
 PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
 
 Interesting: well one solution comes to mind. Can you test this out?
 
 * uncomment the lines:
 
 mimeType name=text/html
 plugin id=parse-html /
 /mimeType
 
 In conf/parse-plugins.xml.
 
 * try your crawl again.
 
 
 I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
 with the attached file
 
 Thanks!

Re: nutch crawl issue

2010-05-03 Thread matthew a. grisius

Hi Chris,

Yes, that worked. I caught up on email and noticed that Arpit also
mentioned the same thing. Sorry I missed it, thanks to both of you!

-m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
  Hi Matthew,
  
  There is an open issue with Tika (e.g.
  https://issues.apache.org/jira/browse/TIKA-379) that could explain the
  differences betwen parse-html and parse-tika. Note that you can specify :
  *parse-(html|pdf) *in order to get both HTML and PDF files.
  
  The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
  rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
  PDFs, but has problems with some html. Nutch 1.1 includes more current
  PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
 
 Interesting: well one solution comes to mind. Can you test this out?
 
 * uncomment the lines:
 
 mimeType name=text/html
 plugin id=parse-html /
 /mimeType
 
 In conf/parse-plugins.xml.
 
 * try your crawl again.
 
  
  I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
  with the attached file
 
 Thanks! Let me know what happens after you uncomment the line above.
 
 Cheers,
 Chris
 
 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++

Re: nutch crawl issue

2010-05-03 Thread Mattmann, Chris A (388J)

Hi Matthew,

Awesome! Glad it worked. Now my next question  how often are you seeing
that parse-tika doesn¹t work on HTML files? Is it all HTML that you are
trying to process? Or just some of them? Or particular ones (categories of
them). The reason I ask is that I¹m trying to determine whether I should
commit the update below to 1.1 so it goes out with the 1.1 RC and if it¹s a
systematic thing versus an exception.

Let me know and thanks!

Cheers,
Chris


On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote:

 Hi Chris,
 
 Yes, that worked. I caught up on email and noticed that Arpit also
 mentioned the same thing. Sorry I missed it, thanks to both of you!
 
 -m.
 
 On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 Hi Matthew,
 
 There is an open issue with Tika (e.g.
 https://issues.apache.org/jira/browse/TIKA-379) that could explain the
 differences betwen parse-html and parse-tika. Note that you can specify :
 *parse-(html|pdf) *in order to get both HTML and PDF files.
 
 The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
 PDFs, but has problems with some html. Nutch 1.1 includes more current
 PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
 
 Interesting: well one solution comes to mind. Can you test this out?
 
 * uncomment the lines:
 
 mimeType name=text/html
 plugin id=parse-html /
 /mimeType
 
 In conf/parse-plugins.xml.
 
 * try your crawl again.
 
 
 I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
 with the attached file
 
 Thanks! Let me know what happens after you uncomment the line above.
 
 Cheers,
 Chris
 
 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++
 
 
 
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

Re: nutch crawl issue

2010-05-01 Thread Phil Barnett

This sounds exactly like what I have been experiencing.

On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius
mgris...@comcast.netwrote:

 using Nutch nightly build nutch-2010-04-27_04-00-28:

 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.

 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.

 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar to command line extract information about the
 html javadoc file. I am not familiar w/ tika but the command line
 version doesn't return any metadata, e.g. no 'src=' links from the html
 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
 nutch uses tika and maybe it's not related . . .

 Has anyone crawled javadoc files or have any suggestions? Thanks.

 -m.

Re: nutch crawl issue

2010-05-01 Thread Mattmann, Chris A (388J)

Hi Matthew,

 Hi Matthew,
 
 There is an open issue with Tika (e.g.
 https://issues.apache.org/jira/browse/TIKA-379) that could explain the
 differences betwen parse-html and parse-tika. Note that you can specify :
 *parse-(html|pdf) *in order to get both HTML and PDF files.
 
 The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
 PDFs, but has problems with some html. Nutch 1.1 includes more current
 PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well one solution comes to mind. Can you test this out?

* uncomment the lines:

mimeType name=text/html
plugin id=parse-html /
/mimeType

In conf/parse-plugins.xml.

* try your crawl again.

 
 I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
 with the attached file

Thanks! Let me know what happens after you uncomment the line above.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

Re: nutch crawl issue

2010-04-29 Thread matthew a. grisius

in nutch-site.xml I modified plugin.includes

parse-(html) works
parse-(tika) does not

I need to also parse pdfs so I need both features, I tried parse-(html|
tika) to see if html would be selected before tika and that did not
work.

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
 using Nutch nightly build nutch-2010-04-27_04-00-28:
 
 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.
 
 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.
 
 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar to command line extract information about the
 html javadoc file. I am not familiar w/ tika but the command line
 version doesn't return any metadata, e.g. no 'src=' links from the html
 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
 nutch uses tika and maybe it's not related . . .
 
 Has anyone crawled javadoc files or have any suggestions? Thanks.
 
 -m.

Re: nutch crawl issue

2010-04-29 Thread arpit khurdiya

 if u r using  nigthly build, Did u changed d same in parse-plugin.xml??
uncomment this:
 mimeType name=text/html
plugin id=parse-html /
/mimeType

hopefully this helps u

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius
mgris...@comcast.net wrote:
 in nutch-site.xml I modified plugin.includes

 parse-(html) works
 parse-(tika) does not

 I need to also parse pdfs so I need both features, I tried parse-(html|
 tika) to see if html would be selected before tika and that did not
 work.

 On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
 using Nutch nightly build nutch-2010-04-27_04-00-28:

 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.

 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.

 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar to command line extract information about the
 html javadoc file. I am not familiar w/ tika but the command line
 version doesn't return any metadata, e.g. no 'src=' links from the html
 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
 nutch uses tika and maybe it's not related . . .

 Has anyone crawled javadoc files or have any suggestions? Thanks.

 -m.






-- 
Regards,
Arpit Khurdiya

Re: nutch crawl issue

2010-04-29 Thread Julien Nioche

Hi Matthew,

There is an open issue with Tika (e.g.
https://issues.apache.org/jira/browse/TIKA-379) that could explain the
differences betwen parse-html and parse-tika. Note that you can specify :
*parse-(html|pdf) *in order to get both HTML and PDF files.

Could you please open an issue in JIRA
https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
trying to process? I'll have a look and see if it is related to TIKA-379.

Thanks

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 29 April 2010 17:02, matthew a. grisius mgris...@comcast.net wrote:

 in nutch-site.xml I modified plugin.includes

 parse-(html) works
 parse-(tika) does not

 I need to also parse pdfs so I need both features, I tried parse-(html|
 tika) to see if html would be selected before tika and that did not
 work.

 On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
  using Nutch nightly build nutch-2010-04-27_04-00-28:
 
  I am trying to bin/nutch crawl a single html file generated by javadoc
  and no links are followed. I verified this with bin/nutch readdb and
  bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
  seed doc specified is processed.
 
  I searched and reviewed the nutch-user archive and tried several
  different settings but none of the settings appear to have any effect.
 
  I then downloaded maven-2.2.1 so that I could mvn install tika and
  produce tika-app-0.7.jar to command line extract information about the
  html javadoc file. I am not familiar w/ tika but the command line
  version doesn't return any metadata, e.g. no 'src=' links from the html
  'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
  nutch uses tika and maybe it's not related . . .
 
  Has anyone crawled javadoc files or have any suggestions? Thanks.
 
  -m.

Re: nutch crawl issue

2010-04-28 Thread matthew a. grisius

My subject should've been clearer, e.g. it should've read Nutch 1.1
nightly build crawl issue.

Also, I did verify that Nutch 1.0 successfully completes crawling the
javadoc html file and can be verified with luke-1.0.1 and searched using
command line bin/nutch org.apache.nutch.searcher.NutchBean java

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
 using Nutch nightly build nutch-2010-04-27_04-00-28:
 
 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.
 
 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.
 
 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar to command line extract information about the
 html javadoc file. I am not familiar w/ tika but the command line
 version doesn't return any metadata, e.g. no 'src=' links from the html
 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
 nutch uses tika and maybe it's not related . . .
 
 Has anyone crawled javadoc files or have any suggestions? Thanks.
 
 -m.

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

Re: nutch crawl issue

11 matches

Site Navigation

Mail list logo

Footer information