Re: nutch crawl issue

2010-05-05 Thread matthew a. grisius
Hi Chris,

Running 'mvn install package' produced the following error for each
target/maven-shared-archive-resources/... file.

...
[INFO] [bundle:bundle {execution: default-bundle}]
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/NOTICE~
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/DEPENDENCIES~
[ERROR] Error building bundle
org.apache.tika:tika-app:bundle:0.8-SNAPSHOT : Input file does not
exist: target/maven-shared-archive-resources/META-INF/LICENSE~
[ERROR] Error(s) found in bundle configuration
[INFO]

[ERROR] BUILD ERROR
[INFO]

[INFO] Error(s) found in bundle configuration

[INFO]

[INFO] For more information, run Maven with the -e switch
[INFO]

[INFO] Total time: 1 minute 24 seconds
[INFO] Finished at: Wed May 05 23:38:56 EDT 2010
[INFO] Final Memory: 40M/271M
[INFO]


Assuming this was the right thing to do, I renamed each file to match
the missing filename, e.g. renamed DEPENDENCIES to DEPENDENCIES~ (and
likewise NOTICE and LICENSE) in each 'target' directory, then re-ran the
build to generate the new jars, which produced this:

[INFO]

[INFO] Reactor Summary:
[INFO]

[INFO] Apache Tika parent  SUCCESS
[2.261s]
[INFO] Apache Tika core .. SUCCESS
[14.429s]
[INFO] Apache Tika parsers ... SUCCESS
[32.370s]
[INFO] Apache Tika application ... SUCCESS
[34.179s]
[INFO] Apache Tika OSGi bundle ... SUCCESS
[16.081s]
[INFO] Apache Tika ... SUCCESS
[0.237s]
[INFO]

[INFO]

[INFO] BUILD SUCCESSFUL
[INFO]

[INFO] Total time: 1 minute 41 seconds
[INFO] Finished at: Wed May 05 23:43:56 EDT 2010
[INFO] Final Memory: 37M/278M
[INFO]


In plugin/parse-tika I replaced parse-tika.jar and tika-parsers-0.7.jar
with tika-core-0.8-SNAPSHOT.jar and tika-parsers-0.8-SNAPSHOT.jar.

In lib/ I replaced tika-core-0.7.jar with tika-core-0.8-SNAPSHOT.jar.

I ran bin/nutch crawl and it completed without error. All of the JavaScript
was fetched and appeared to be parsed without error. However (and I'm not
sure of the correct terminology to use), no more URLs were generated to
fetch than before. So the patch appears to be a step in the right
direction. The problem is with FRAMESET/FRAME and how JavaScript is
used to generate content in the frames.
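
For anyone wanting to check which outlinks a frameset page should yield, independent of Nutch or Tika, here is a minimal sketch using only the Python standard library. The class name, function name, sample markup, and base URL are all made up for illustration. It only sees src attributes that are written literally in the HTML, so frames whose content is generated by JavaScript would still be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameLinkParser(HTMLParser):
    """Collects the src attribute of every <frame> and <iframe> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative frame targets against the page URL.
                self.links.append(urljoin(self.base_url, src))


def frame_links(html, base_url):
    """Return absolute URLs for all frame sources found in html."""
    parser = FrameLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Illustrative javadoc-style frameset; not taken from the real test data.
SAMPLE = """
<HTML><FRAMESET cols="20%,80%">
  <FRAME src="overview-frame.html" name="packageListFrame">
  <FRAME src="overview-summary.html" name="classFrame">
</FRAMESET></HTML>
"""

for link in frame_links(SAMPLE, "http://localhost:8080/javadoc/index.html"):
    print(link)
```

If a parser reports fewer outlinks for a page than this sketch does, the frame sources are probably being dropped during parsing rather than being absent from the markup.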

As Julien suggested, I will read deeper into the archive and look at the
legacy parse-js. I suppose as a really ugly 'brute force' workaround I
could walk the directory tree with a Perl script and generate a 'seed
list' of HTML URLs to fetch. Ugh. If you have any more ideas please let
me know how I can help. Thanks.
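
The brute-force seed-list idea could look roughly like the sketch below. It is written in Python rather than Perl, and the function name, the document root, and the base URL are hypothetical. It assumes the files on disk are served under the base URL with the same relative paths.

```python
import os
from urllib.parse import quote


def build_seed_list(doc_root, base_url, extensions=(".html", ".htm")):
    """Return sorted URLs for every HTML file found under doc_root."""
    seeds = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            if name.lower().endswith(extensions):
                rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
                # Use forward slashes and percent-encode unsafe characters.
                url_path = quote(rel.replace(os.sep, "/"))
                seeds.append(base_url.rstrip("/") + "/" + url_path)
    return sorted(seeds)
```

Writing the returned URLs one per line into a text file in the seed directory should give the crawler every page directly, at the cost of bypassing link discovery altogether.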

-m.

On Tue, 2010-05-04 at 21:50 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
 Julien’s patch and see if there is a way to get it committed sooner rather
 than later.
 
 One way to help me do that: since you already have an environment and a set
 of use cases where this is reproducible, can you apply TIKA-379 to a local
 checkout of Tika trunk (I’ll show you how) and then let me know if that
 fixes parse-tika for you?
 
 Here are the steps:
 
 svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
 cd tika
 wget http://bit.ly/bXeLkf; (if you don't have SSL support, then manually
 download the linked file)
 patch -p0 < TIKA-379-3.patch
 mvn install package
 
 Then grab tika-parsers and tika-core out of the respective tika-core/target
 and tika-parsers/target directories and drop those jars in your
 parse-tika/lib folder, replacing their originals. Then, try your nutch crawl
 again.
 
 See if that works. In the meanwhile, I'll inspect Julien's patch.
 
 Thanks!
 
 Cheers,
 Chris
 
 On 5/4/10 9:02 PM, matthew a. grisius mgris...@comcast.net wrote:
 
  Hi Chris,
  
  It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
  and/or javascript. Using the parse-html suggested work around I am able
  to process my simple test cases such as javadoc which does 

Re: nutch crawl issue

2010-05-04 Thread matthew a. grisius
Hi Chris,

It appears to me that parse-tika has trouble with HTML FRAMESET/FRAME
and/or JavaScript. Using the suggested parse-html workaround I am able
to process my simple test cases, such as javadoc, which does include
simple embedded JavaScript (of course I can't verify that it is actually
being parsed). I expanded my testing to include two more complex
examples that make heavy use of HTML FRAMESET/FRAME and more complex
JavaScript:

134 MB, 11,269 files
1.9 GB, 133,978 files

They both fail at the top level with similar errors such as:

fetching
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
fetching
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
Error parsing:
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js:
 UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
Attempting to finish item from unknown queue:
org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
fetch of
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
 failed with: java.lang.ArrayIndexOutOfBoundsException: -56
-finishing thread FetcherThread, activeThreads=2
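
To see whether failures like this are systematic, one option is to tally the "Can't retrieve Tika parser" errors per mime type across a whole crawl log. The following is a small sketch under that assumption. The function name is invented, and the regular expression is copied from the error text above, so it would need adjusting if the message wording differs in other builds.

```python
import re
from collections import Counter

# Pattern copied from the error text in the log above; \s+ tolerates the
# line wrapping that mail clients introduce.
MISSING_PARSER = re.compile(r"Can't retrieve Tika parser for mime-type\s+(\S+)")


def missing_parser_counts(log_text):
    """Count 'Can't retrieve Tika parser' errors per mime type."""
    return Counter(MISSING_PARSER.findall(log_text))


SAMPLE_LOG = """\
Error parsing:
http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js:
 UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
-finishing thread FetcherThread, activeThreads=2
"""

print(missing_parser_counts(SAMPLE_LOG))
```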

I tried several property settings to mimic the previous workaround and
could not solve it. Any suggestions?

So, I'm not sure how to categorize the issues more accurately. I have
many javadoc sets and lots of simple HTML that will now parse, but I
have other examples such as the two mentioned above that won't parse and
therefore can't be crawled. It seems to me to be systematic rather than
exceptional. I cannot believe that I'm the only one who will experience
these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
for asking.

-m.



On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 Awesome! Glad it worked. Now my next question: how often are you seeing
 that parse-tika doesn't work on HTML files? Is it all HTML that you are
 trying to process? Or just some of them? Or particular ones (categories of
 them)? The reason I ask is that I'm trying to determine whether I should
 commit the update below to 1.1 so it goes out with the 1.1 RC, and whether
 it's a systematic thing versus an exception.
 
 Let me know and thanks!
 
 Cheers,
 Chris
 
 
 On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote:
 
  Hi Chris,
  
  Yes, that worked. I caught up on email and noticed that Arpit also
  mentioned the same thing. Sorry I missed it, thanks to both of you!
  
  -m.
  
  On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
  Hi Matthew,
  
  Hi Matthew,
  
  There is an open issue with Tika (e.g.
  https://issues.apache.org/jira/browse/TIKA-379) that could explain the
  differences between parse-html and parse-tika. Note that you can specify
  parse-(html|pdf) in order to get both HTML and PDF files.
  
  The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
  rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
  PDFs, but has problems with some html. Nutch 1.1 includes more current
  PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
  
  Interesting: well one solution comes to mind. Can you test this out?
  
  * uncomment the lines:
  
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  
  In conf/parse-plugins.xml.
  
  * try your crawl again.
  
  
  I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
  with the attached file
  
  Thanks! Let me know what happens after you uncomment the line above.
  
  Cheers,
  Chris
  
  ++
  Chris Mattmann, Ph.D.
  Senior Computer Scientist
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 171-266B, Mailstop: 171-246
  Email: chris.mattm...@jpl.nasa.gov
  WWW:   http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Assistant Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
  
  
  
  
 
 
 

Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)
Hi Matthew,

I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
Julien’s patch and see if there is a way to get it committed sooner rather
than later.

One way to help me do that: since you already have an environment and a set
of use cases where this is reproducible, can you apply TIKA-379 to a local
checkout of Tika trunk (I’ll show you how) and then let me know if that
fixes parse-tika for you?

Here are the steps:

svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
cd tika
wget http://bit.ly/bXeLkf; (if you don't have SSL support, then manually
download the linked file)
patch -p0 < TIKA-379-3.patch
mvn install package

Then grab tika-parsers and tika-core out of the respective tika-core/target
and tika-parsers/target directories and drop those jars in your
parse-tika/lib folder, replacing their originals. Then, try your nutch crawl
again.

See if that works. In the meanwhile, I'll inspect Julien's patch.

Thanks!

Cheers,
Chris

On 5/4/10 9:02 PM, matthew a. grisius mgris...@comcast.net wrote:

 Hi Chris,
 
 It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
 and/or javascript. Using the parse-html suggested work around I am able
 to process my simple test cases such as javadoc which does include
 simple embedded javascript (of course I can't verify that it is actually
 parsing it though). I expanded my testing to include two more complex
 examples that heavily use HTML FRAMESET/FRAME and more complex
 javascript:
 
 134 mb, 11,269 files
 1.9 gb, 133,978 files
 
 They both fail at the top level with similar errors such as:
 
 fetching
 http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
 fetching
 http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
 -finishing thread FetcherThread, activeThreads=8
 -finishing thread FetcherThread, activeThreads=7
 -finishing thread FetcherThread, activeThreads=9
 -finishing thread FetcherThread, activeThreads=6
 -finishing thread FetcherThread, activeThreads=5
 -finishing thread FetcherThread, activeThreads=4
 -finishing thread FetcherThread, activeThreads=3
 Error parsing:
 http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js:
  UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
 Attempting to finish item from unknown queue:
 org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
 fetch of
 http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
  failed with: java.lang.ArrayIndexOutOfBoundsException: -56
 -finishing thread FetcherThread, activeThreads=2
 
 I tried several property settings to mimic the previous work around and
 could not solve it. Any suggestions?
 
 So, I'm not sure how to categorize the issues more accurately. I have
 many javadoc sets and lots of simple HTML that will now parse, but I
 have other examples such as the two mentioned above that won't parse and
 therefore can't be crawled. It seems to me to be systematic rather than
 exceptional. I cannot believe that I'm the only one who will experience
 these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
 for asking.
 
 -m.
 
 
 
 On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 Awesome! Glad it worked. Now my next question: how often are you seeing
 that parse-tika doesn't work on HTML files? Is it all HTML that you are
 trying to process? Or just some of them? Or particular ones (categories of
 them)? The reason I ask is that I'm trying to determine whether I should
 commit the update below to 1.1 so it goes out with the 1.1 RC, and whether
 it's a systematic thing versus an exception.
 
 Let me know and thanks!
 
 Cheers,
 Chris
 
 
 On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote:
 
 Hi Chris,
 
 Yes, that worked. I caught up on email and noticed that Arpit also
 mentioned the same thing. Sorry I missed it, thanks to both of you!
 
 -m.
 
 On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 Hi Matthew,
 
 There is an open issue with Tika (e.g.
 https://issues.apache.org/jira/browse/TIKA-379) that could explain the
 differences between parse-html and parse-tika. Note that you can specify
 parse-(html|pdf) in order to get both HTML and PDF files.
 
 The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
 PDFs, but has problems with some html. Nutch 1.1 includes more current
 PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
 
 Interesting: well one solution comes to mind. Can you test this out?
 
 * uncomment the lines:
 
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 
 In conf/parse-plugins.xml.
 
 * try your crawl again.
 
 
 I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
 with the attached file
 
 Thanks! 

Re: nutch crawl issue

2010-05-03 Thread matthew a. grisius
Hi Chris,

Yes, that worked. I caught up on email and noticed that Arpit also
mentioned the same thing. Sorry I missed it, thanks to both of you!

-m.

On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
  Hi Matthew,
  
  There is an open issue with Tika (e.g.
  https://issues.apache.org/jira/browse/TIKA-379) that could explain the
  differences between parse-html and parse-tika. Note that you can specify
  parse-(html|pdf) in order to get both HTML and PDF files.
  
  The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
  rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
  PDFs, but has problems with some html. Nutch 1.1 includes more current
  PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
 
 Interesting: well one solution comes to mind. Can you test this out?
 
 * uncomment the lines:
 
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 
 In conf/parse-plugins.xml.
 
 * try your crawl again.
 
  
  I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
  with the attached file
 
 Thanks! Let me know what happens after you uncomment the line above.
 
 Cheers,
 Chris
 
 
 



Re: nutch crawl issue

2010-05-03 Thread Mattmann, Chris A (388J)
Hi Matthew,

Awesome! Glad it worked. Now my next question: how often are you seeing
that parse-tika doesn't work on HTML files? Is it all HTML that you are
trying to process? Or just some of them? Or particular ones (categories of
them)? The reason I ask is that I'm trying to determine whether I should
commit the update below to 1.1 so it goes out with the 1.1 RC, and whether
it's a systematic thing versus an exception.

Let me know and thanks!

Cheers,
Chris


On 5/3/10 9:04 AM, matthew a. grisius mgris...@comcast.net wrote:

 Hi Chris,
 
 Yes, that worked. I caught up on email and noticed that Arpit also
 mentioned the same thing. Sorry I missed it, thanks to both of you!
 
 -m.
 
 On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
 Hi Matthew,
 
 Hi Matthew,
 
 There is an open issue with Tika (e.g.
 https://issues.apache.org/jira/browse/TIKA-379) that could explain the
 differences between parse-html and parse-tika. Note that you can specify
 parse-(html|pdf) in order to get both HTML and PDF files.
 
 The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
 PDFs, but has problems with some html. Nutch 1.1 includes more current
 PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
 
 Interesting: well one solution comes to mind. Can you test this out?
 
 * uncomment the lines:
 
 <mimeType name="text/html">
   <plugin id="parse-html" />
 </mimeType>
 
 In conf/parse-plugins.xml.
 
 * try your crawl again.
 
 
 I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
 with the attached file
 
 Thanks! Let me know what happens after you uncomment the line above.
 
 Cheers,
 Chris
 
 
 
 
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: nutch crawl issue

2010-05-01 Thread Phil Barnett
This sounds exactly like what I have been experiencing.

On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius
mgris...@comcast.net wrote:

 using Nutch nightly build nutch-2010-04-27_04-00-28:

 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.

 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.

 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar to command line extract information about the
 html javadoc file. I am not familiar w/ tika but the command line
 version doesn't return any metadata, e.g. no 'src=' links from the html
 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
 nutch uses tika and maybe it's not related . . .

 Has anyone crawled javadoc files or have any suggestions? Thanks.

 -m.




Re: nutch crawl issue

2010-05-01 Thread matthew a. grisius
Hi Julien,

On Thu, 2010-04-29 at 18:36 +0100, Julien Nioche wrote:
 Hi Matthew,
 
 There is an open issue with Tika (e.g.
 https://issues.apache.org/jira/browse/TIKA-379) that could explain the
 differences between parse-html and parse-tika. Note that you can specify
 parse-(html|pdf) in order to get both HTML and PDF files.

The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
PDFs, but has problems with some html. Nutch 1.1 includes more current
PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

 
 Could you please open an issue in JIRA
 (https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
 trying to process? I'll have a look and see if it is related to TIKA-379.

I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
with the attached file

Thanks.

-m.

 
 Thanks
 
 Julien



Re: nutch crawl issue

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Matthew,

 Hi Matthew,
 
 There is an open issue with Tika (e.g.
 https://issues.apache.org/jira/browse/TIKA-379) that could explain the
 differences between parse-html and parse-tika. Note that you can specify
 parse-(html|pdf) in order to get both HTML and PDF files.
 
 The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
 rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
 PDFs, but has problems with some html. Nutch 1.1 includes more current
 PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.

Interesting: well one solution comes to mind. Can you test this out?

* uncomment the lines:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

In conf/parse-plugins.xml.

* try your crawl again.

 
 I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
 with the attached file

Thanks! Let me know what happens after you uncomment the line above.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: nutch crawl issue

2010-04-29 Thread matthew a. grisius
In nutch-site.xml I modified plugin.includes:

parse-(html) works
parse-(tika) does not

I also need to parse PDFs, so I need both features. I tried
parse-(html|tika) to see if html would be selected before tika, and that
did not work.
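
One possible reading of this result: plugin.includes is a regular expression over plugin ids, so the order of alternatives inside the parentheses changes nothing about which plugins are loaded, and it cannot express a preference between them. The sketch below illustrates only that regex behaviour in plain Python. The plugin id list and the helper name are illustrative, and the use of a full match is an assumption about how the pattern is applied, not Nutch's actual code.

```python
import re

# Plugin ids mentioned in this thread; the list itself is illustrative.
PLUGIN_IDS = ["parse-html", "parse-tika", "parse-pdf", "parse-js"]


def selected_plugins(includes_pattern, plugin_ids=PLUGIN_IDS):
    """Return the ids that fully match a plugin.includes-style pattern."""
    regex = re.compile(includes_pattern)
    return [pid for pid in plugin_ids if regex.fullmatch(pid)]


# Swapping the alternatives selects exactly the same set of plugins.
print(selected_plugins(r"parse-(html)"))
print(selected_plugins(r"parse-(html|tika)"))
print(selected_plugins(r"parse-(tika|html)"))
```

Under that reading, which parser handles text/html has to be decided elsewhere, which is consistent with the conf/parse-plugins.xml mapping discussed in the replies.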

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
 using Nutch nightly build nutch-2010-04-27_04-00-28:
 
 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.
 
 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.
 
 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar to command line extract information about the
 html javadoc file. I am not familiar w/ tika but the command line
 version doesn't return any metadata, e.g. no 'src=' links from the html
 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
 nutch uses tika and maybe it's not related . . .
 
 Has anyone crawled javadoc files or have any suggestions? Thanks.
 
 -m.
 



Re: nutch crawl issue

2010-04-29 Thread arpit khurdiya
If you are using the nightly build, did you make the same change in
parse-plugins.xml? Uncomment this:
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

Hopefully this helps.

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius
mgris...@comcast.net wrote:
 in nutch-site.xml I modified plugin.includes

 parse-(html) works
 parse-(tika) does not

 I need to also parse pdfs so I need both features, I tried
 parse-(html|tika) to see if html would be selected before tika and that
 did not work.

 On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
 using Nutch nightly build nutch-2010-04-27_04-00-28:

 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.

 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.

 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar to command line extract information about the
 html javadoc file. I am not familiar w/ tika but the command line
 version doesn't return any metadata, e.g. no 'src=' links from the html
 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
 nutch uses tika and maybe it's not related . . .

 Has anyone crawled javadoc files or have any suggestions? Thanks.

 -m.






-- 
Regards,
Arpit Khurdiya


Re: nutch crawl issue

2010-04-29 Thread Julien Nioche
Hi Matthew,

There is an open issue with Tika (e.g.
https://issues.apache.org/jira/browse/TIKA-379) that could explain the
differences between parse-html and parse-tika. Note that you can specify
parse-(html|pdf) in order to get both HTML and PDF files.

Could you please open an issue in JIRA
(https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
trying to process? I'll have a look and see if it is related to TIKA-379.

Thanks

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 29 April 2010 17:02, matthew a. grisius mgris...@comcast.net wrote:

 in nutch-site.xml I modified plugin.includes

 parse-(html) works
 parse-(tika) does not

 I need to also parse pdfs so I need both features, I tried
 parse-(html|tika) to see if html would be selected before tika and that
 did not work.

 On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
  using Nutch nightly build nutch-2010-04-27_04-00-28:
 
  I am trying to bin/nutch crawl a single html file generated by javadoc
  and no links are followed. I verified this with bin/nutch readdb and
  bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
  seed doc specified is processed.
 
  I searched and reviewed the nutch-user archive and tried several
  different settings but none of the settings appear to have any effect.
 
  I then downloaded maven-2.2.1 so that I could mvn install tika and
  produce tika-app-0.7.jar to command line extract information about the
  html javadoc file. I am not familiar w/ tika but the command line
  version doesn't return any metadata, e.g. no 'src=' links from the html
  'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
  nutch uses tika and maybe it's not related . . .
 
  Has anyone crawled javadoc files or have any suggestions? Thanks.
 
  -m.
 




Re: nutch crawl issue

2010-04-28 Thread matthew a. grisius
My subject should have been clearer, e.g. it should have read "Nutch 1.1
nightly build crawl issue".

Also, I did verify that Nutch 1.0 successfully completes crawling the
javadoc html file and can be verified with luke-1.0.1 and searched using
command line bin/nutch org.apache.nutch.searcher.NutchBean java

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
 using Nutch nightly build nutch-2010-04-27_04-00-28:
 
 I am trying to bin/nutch crawl a single html file generated by javadoc
 and no links are followed. I verified this with bin/nutch readdb and
 bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
 seed doc specified is processed.
 
 I searched and reviewed the nutch-user archive and tried several
 different settings but none of the settings appear to have any effect.
 
 I then downloaded maven-2.2.1 so that I could mvn install tika and
 produce tika-app-0.7.jar to command line extract information about the
 html javadoc file. I am not familiar w/ tika but the command line
 version doesn't return any metadata, e.g. no 'src=' links from the html
 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
 nutch uses tika and maybe it's not related . . .
 
 Has anyone crawled javadoc files or have any suggestions? Thanks.
 
 -m.