[jira] Commented: (NUTCH-766) Tika parser

2010-02-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834658#action_12834658
 ] 

Hudson commented on NUTCH-766:
--

Integrated in Nutch-trunk #1071 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1071/])


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, 
 sample.tar.gz, TikaParser.java


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing to Tika but can still coexist with the existing 
 parsing plugins, which is useful for formats that Tika handles only 
 partially (or not at all). Some of the elements below have already been 
 discussed on the mailing lists. Note that this is work in progress; your 
 feedback is welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers); in the work described here we decided 
 to put the libs in 2 different places:
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Since the core uses Tika only for its MimeType functionality, we only need 
 to put tika-core at the main lib level, whereas the tika plugin needs 
 tika-parsers.jar plus all the jars used internally by Tika.
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one Mime-type which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime-types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling. It also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore, as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
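
To illustrate the SAX-to-DOM step described above, here is a minimal sketch 
using only the JDK. It is not the code from the patch: the class name 
SaxToDomSketch, the method names and the sample events are all hypothetical, 
the events are emitted by hand instead of by a real Tika parser, and namespaces 
are left out for brevity. It assumes the default TransformerFactory supports 
SAX input, which is why the cast to SAXTransformerFactory is used.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class SaxToDomSketch {

    /** Hypothetical stand-in for a Tika parser: emits XHTML-like SAX events. */
    static void emitSampleEvents(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://example.org/");
        char[] text = "example".toCharArray();

        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "body", "body", none);
        handler.startElement("", "a", "a", href);
        handler.characters(text, 0, text.length);
        handler.endElement("", "a", "a");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
    }

    /** Feeds the sample events into an identity TransformerHandler that builds a DOM. */
    static Document buildDom() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler domBuilder = factory.newTransformerHandler();
        DOMResult result = new DOMResult();
        domBuilder.setResult(result);
        emitSampleEvents(domBuilder);
        // The result node is populated once endDocument() has been received.
        return (Document) result.getNode();
    }

    /** DOM-based link detection: collects the href of every "a" element. */
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractLinks(buildDom()));
    }
}
```

Once the events are in a DOM, the link-detection code does not need to know 
which format the original document had, which is the point made about the 
HTMLParseFilters.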
 The following libraries are required in the lib/ directory of the tika-parser:
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and if 
 so to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is a HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454
 ] 

Julien Nioche commented on NUTCH-766:
-

@Chris : I just did a fresh co from svn, applied the patch v3 and unzipped 
sample.tar.gz onto  the directory parse-tika and ran the test just as you did 
but could not reproduce the problem.  Could there be a difference between your 
version and the trunk?

@Sami :  

{quote} was there a reason not to use AutoDetect parser?  {quote} 
I suppose we could, as long as we give it a clue about the MimeType obtained 
from the Content.  As you pointed out, there could be a duplication with the 
detection done by Mime-Util. I suppose one way to do it would be to add a new 
version of the method: getParse(Content content, MimeType type). That's an 
interesting point.

{quote} Also was there a reason not to parse html with tika?  {quote} 
It is supposed to do so; if it does not, then it's a bug which needs urgent 
fixing.

Regarding parsing package formats, I think the plan is that Tika will handle 
that in the future but we could try to do that now if we find a relatively 
clean mechanism for doing so. BTW could you please send a diff and not the full 
code of the class you posted earlier, that would make the comparison much 
easier.





[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564
 ] 

Julien Nioche commented on NUTCH-766:
-

I had a closer look at the HTML parsing issue. What happens is that the 
association between the mime-type and the parser implementation is not 
explicitly set in parse-plugins.xml, so the ParserFactory goes through all the 
plugins and gets the ones with a matching mimetype (or * for Tika). The Tika 
parser takes no precedence over the default HTML parser, so the latter comes 
first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in 
plugin.includes or if an explicit mapping is set in parse-plugins.xml.  I don't 
think we want to have to specify explicitly that tika should be used in all 
the mappings; the mappings should be reserved for the cases where another 
parser must be used instead of Tika.

What we could do though is this: in the cases where no explicit mapping is set 
for a mimetype, Tika (or any parser marked as supporting any mimetype) would be 
put first in the list of discovered parsers, so it would remain the default 
choice unless an explicit mapping is set (even if a plugin is loaded and can 
handle the type).

Makes sense?
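
A rough sketch of the ordering proposed above, to make the discussion 
concrete. This is not the actual ParserFactory code: the ParserPlugin class, 
the orderedParsers method and the map of explicit mappings are hypothetical 
stand-ins, and the sketch assumes an explicit mapping is simply returned as-is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ParserOrderingSketch {

    /** Hypothetical plugin descriptor: an id and the mime type it declares ("*" = any). */
    static final class ParserPlugin {
        final String id;
        final String mimeType;

        ParserPlugin(String id, String mimeType) {
            this.id = id;
            this.mimeType = mimeType;
        }

        boolean isWildcard() {
            return "*".equals(mimeType);
        }
    }

    /**
     * Returns parser ids in the order they would be tried for a mime type.
     * An explicit mapping wins; otherwise wildcard parsers come before
     * parsers that declare the exact mime type.
     */
    static List<String> orderedParsers(String mimeType,
                                       Map<String, List<String>> explicitMappings,
                                       List<ParserPlugin> loadedPlugins) {
        List<String> explicit = explicitMappings.get(mimeType);
        if (explicit != null) {
            return new ArrayList<>(explicit);
        }
        List<String> wildcard = new ArrayList<>();
        List<String> specific = new ArrayList<>();
        for (ParserPlugin plugin : loadedPlugins) {
            if (plugin.isWildcard()) {
                wildcard.add(plugin.id);
            } else if (plugin.mimeType.equals(mimeType)) {
                specific.add(plugin.id);
            }
        }
        wildcard.addAll(specific);
        return wildcard;
    }

    public static void main(String[] args) {
        List<ParserPlugin> loaded = Arrays.asList(
                new ParserPlugin("parse-html", "text/html"),
                new ParserPlugin("parse-tika", "*"));

        // No explicit mapping: the wildcard parser is tried first.
        System.out.println(orderedParsers("text/html",
                Collections.<String, List<String>>emptyMap(), loaded));

        // Explicit mapping: only the mapped parser is used.
        System.out.println(orderedParsers("text/html",
                Collections.singletonMap("text/html", Arrays.asList("parse-html")),
                loaded));
    }
}
```

With both plugins loaded and no mapping, parse-tika comes before parse-html; 
adding a mapping for text/html switches the choice back to parse-html.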




[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832565#action_12832565
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

Hi Julien:

{quote}
@Chris : I just did a fresh co from svn, applied the patch v3 and unzipped 
sample.tar.gz onto the directory parse-tika and ran the test just as you did 
but could not reproduce the problem. Could there be a difference between your 
version and the trunk? 
{quote}

I tried this process last night:

1. SVN up to r908832
2. download patch v3
3. download sample.tgz
4. apply patch v3 to r908832
5. untar sample.tgz into src/plugin/parse-tika, creating a sample folder in 
that dir
6. ant clean compile-core test

Any idea why I'm seeing the error?

Cheers,
Chris



[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832583#action_12832583
 ] 

Julien Nioche commented on NUTCH-766:
-

@Chris : did you run 

ant -f src/plugin/parse-tika/build-ivy.xml 

between steps 5 and 6? This is required in order to populate the lib directory 
automatically.




[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832588#action_12832588
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

@Julien:

Sigh, no I didn't! :(

That's probably why! Thanks for the help. I'll try it later today. If that 
passes, my +1 to commit. 

@Sami, regarding your updates, would you be OK with me creating another issue 
to track them, attaching your diffs as patches against this issue, once 
committed to the trunk? That way we'll make sure they get into 1.1, but we 
won't block this issue anymore from getting in. Let me know what you think, 
thanks.

Cheers,
Chris



[jira] Commented: (NUTCH-766) Tika parser

2010-02-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832866#action_12832866
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

- forgot to add in dep libs, added in r909269. Thanks!




[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832250#action_12832250
 ] 

Andrzej Bialecki  commented on NUTCH-766:
-

+1 to commit this - please remember to update nutch-default.xml to switch to 
the tika plugin, and perhaps add a comment there about the deprecated parse-* 
plugins - most people look there and not in parse-plugins.xml, where this 
change is documented...

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz





[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832255#action_12832255
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

{quote}
+1 to commit this...
{quote}

Awesome, Andrzej. Will do so tonight, PST, if I don't hear any objections 
between now and then...

Thanks!

Cheers,
Chris


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz





[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832398#action_12832398
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

I'm going to hold off on committing this tonight. I've updated the docs per 
Andrzej, and I've also updated CHANGES.txt, but when running:

{code}
ant clean compile-core test
{code}

I'm seeing these messages during plugin testing for parse-tika:

{noformat}
2010-02-10 22:39:16,593 ERROR tika.TikaParser (TikaParser.java:getParse(63)) - 
Can't retrieve Tika parser for mime-type application/pdf
-  ---

Testcase: testIt took 2.684 sec
FAILED
null
junit.framework.AssertionFailedError
at org.apache.nutch.tika.TestPdfParser.testIt(TestPdfParser.java:79)
{noformat}

It seems that the TikaConfig is not being found? I was looking at 
TikaParser#setConf and it seems that a default config is being created for 
Tika, but maybe not being loaded correctly? I need to look into this more...

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz



[jira] Commented: (NUTCH-766) Tika parser

2010-02-10 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12832406#action_12832406
 ] 

Sami Siren commented on NUTCH-766:
--

I suggest that we still drive this a bit further: currently this patch does 
not use Tika for pkg formats or for html.

Julien: was there a reason not to use AutoDetectParser? The only thing that I 
could come up with was that the mime type detection would be done twice. We 
could get around this by implementing something similar to what the composite 
parser does (it uses a parser (AutoDetectParser) class from the context to do 
further parsing) to cover all supported pkg formats.
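
A rough sketch of that idea, for illustration only. Every type below is a hypothetical stand-in and none of it is Tika's or Nutch's API: a dispatcher routes on a mime type that is already known, and a package parser fetches the dispatcher from a shared context to parse each embedded entry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// All types here are hypothetical stand-ins, not Tika or Nutch classes.
class ContextDelegationSketch {

    // A document whose mime type is already known, plus any embedded entries.
    static class Doc {
        final String mimeType;
        final String text;
        final List<Doc> entries = new ArrayList<>();

        Doc(String mimeType, String text) {
            this.mimeType = mimeType;
            this.text = text;
        }
    }

    interface Parser {
        void parse(Doc doc, Map<Class<?>, Object> context, List<String> out);
    }

    // Routes on the supplied mime type, so no second detection pass is needed.
    static class Dispatcher implements Parser {
        private final Map<String, Parser> byType = new HashMap<>();

        void register(String mimeType, Parser parser) {
            byType.put(mimeType, parser);
        }

        public void parse(Doc doc, Map<Class<?>, Object> context, List<String> out) {
            Parser parser = byType.get(doc.mimeType);
            if (parser == null) {
                out.add("unsupported:" + doc.mimeType);
                return;
            }
            parser.parse(doc, context, out);
        }
    }

    static class PlainTextParser implements Parser {
        public void parse(Doc doc, Map<Class<?>, Object> context, List<String> out) {
            out.add(doc.text);
        }
    }

    // Package formats hand each entry back to whatever parser the context carries.
    static class PackageParser implements Parser {
        public void parse(Doc doc, Map<Class<?>, Object> context, List<String> out) {
            Parser delegate = (Parser) context.get(Parser.class);
            for (Doc entry : doc.entries) {
                delegate.parse(entry, context, out);
            }
        }
    }

    static List<String> demo() {
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.register("text/plain", new PlainTextParser());
        dispatcher.register("application/zip", new PackageParser());

        Map<Class<?>, Object> context = new HashMap<>();
        context.put(Parser.class, dispatcher);

        Doc zip = new Doc("application/zip", "");
        zip.entries.add(new Doc("text/plain", "first entry"));
        zip.entries.add(new Doc("image/png", ""));

        List<String> out = new ArrayList<>();
        dispatcher.parse(zip, context, out);
        return out;
    }
}
```

Because the package parser only knows about the context entry, the same mechanism would cover any package format without the package parser depending on a concrete delegate.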

Also, was there a reason not to parse html with Tika?

I have a patch nearby to demonstrate some of the improvements, which I will 
try to post shortly.

 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz



[jira] Commented: (NUTCH-766) Tika parser

2010-01-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805892#action_12805892
 ] 

Julien Nioche commented on NUTCH-766:
-

Here is a slightly better version of the patch which:
• fixes a small bug in the Tika parser (the API has changed slightly between 
1.5beta and 1.5)
• fixes a bug with the TestParserFactory
• adds the tika-plugin to the list of plugins to be built in 
src/plugin/build.xml
• limits public exposure of methods and classes (see Sami's comment)
• modifies parse-plugins.xml: adds parse-tika and comments out associations 
between some mime-types and the old parsers

I've also added an ANT script which uses IVY to pull the dependencies and 
copy them into the lib dir. Obviously this won't be needed once the plugin is 
committed, but it should simplify the initial testing. All you need to do after 
applying the patch is:

cd src/plugin/parse-tika/
ant -f build-ivy.xml

I am also attaching the content of the sample directory as an archive - just 
unpack it into src/plugin/parse-tika/ before calling ant test-plugins

Julien




 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch, 
 NUTCH-766.v2, sample.tar.gz



[jira] Commented: (NUTCH-766) Tika parser

2010-01-27 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12805661#action_12805661
 ] 

Sami Siren commented on NUTCH-766:
--

{quote}
Sure, it's more of a configuration backwards-compat issue. For those folks who 
have gone to the trouble of customizing their nutch configuration 
(nutch-site.xml, or nutch-default.xml, or even parse-plugins), removing the 
parsing plugins (e.g., basically saying they don't exist anymore and that you 
must update your deployed configuration to use the tika-plugin) means this 
patch would require a configuration update in their deployed environments. 
Because of that, why don't we ease them into that upgrade with at least one 
released version before the plugins go away? It would make it easier from a 
configuration backwards-compat perspective.
{quote}

Ok, so you mean that we need to have duplicate parser plugins because we don't 
want to ask people already using nutch to reconfigure the bits this involves 
now even though we have to do it later? How is postponing going to ease the 
task they need to do anyway at some point? I still don't understand the (longer 
term) benefit.

I am not strongly against the idea of keeping duplicate plugins, I mean it's 
just another ~20M in the .job, what I am worried about is that the history will 
repeat itself and we will end up having one more case of duplicate components 
(in this case many of them) doing the same work and no interest in cleaning up 
afterwards. Doing it the way I suggested would guarantee that this will not 
happen.


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch



[jira] Commented: (NUTCH-766) Tika parser

2010-01-25 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804448#action_12804448
 ] 

Sami Siren commented on NUTCH-766:
--

+1, I'm going to agree on this one here Julien. Other communities  have 
convinced me of the need for backwards compat and unobtrusiveness when 
bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving 
the old plugins (perhaps mentioning they should be deprecated and replaced by 
the Tika functionality) and then removing them in 1.2 or 1.3.

Chris, can you please explain to me how keeping two components doing identical 
work would be more backwards compatible than having only one?



 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles a lot of different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing mechanism to Tika but can still coexist with the 
 existing parsing plugins, which is useful for formats partially handled by 
 Tika (or not at all). Some of the elements below have already been discussed 
 on the mailing lists. Note that this is work in progress; your feedback is 
 welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 different jar files (core and parsers); in the work described here we decided 
 to put the libs in 2 different places:
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Since the core uses Tika only for its MimeType functionality, we only 
 need to put tika-core at the main lib level, whereas the tika plugin obviously 
 needs tika-parsers.jar plus all the jars used internally by Tika.
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one mime type, which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are useful only for the cases where we want to handle a 
 mime type with a different parser than Tika. 
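
The fallback rule described above can be sketched as follows. This is only an illustration of the lookup order, not the actual ParserFactory code: the class name, the method names and the plugin ids ("parse-tika", "parse-html") are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: names and plugin ids are hypothetical, not Nutch's ParserFactory API. */
public class ParserSelectionSketch {
    /** Assumed id of the catch-all Tika plugin (registered with mimetype "*"). */
    public static final String TIKA_PLUGIN = "parse-tika";

    /** Explicit mime type -> plugin id associations, as parse-plugins.xml would declare them. */
    private final Map<String, String> explicit = new HashMap<>();

    /** Records an explicit association for one mime type. */
    public void associate(String mimeType, String pluginId) {
        explicit.put(mimeType, pluginId);
    }

    /** Returns the explicitly associated plugin, or the Tika plugin when none is declared. */
    public String select(String mimeType) {
        return explicit.getOrDefault(mimeType, TIKA_PLUGIN);
    }

    public static void main(String[] args) {
        ParserSelectionSketch selection = new ParserSelectionSketch();
        selection.associate("text/html", "parse-html");
        System.out.println(selection.select("text/html"));
        System.out.println(selection.select("application/pdf"));
    }
}
```

With this rule an association is only needed for the mime types that should bypass Tika; everything else falls through to the default.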
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling. It also means 
 that we can use the HTMLParseFilters in exactly the same way. The main 
 difference though is that HTMLParseFilters are not limited to HTML documents 
 anymore, as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
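
As a rough illustration of the SAX-to-DOM step, the sketch below builds a DOM tree from SAX callbacks using only JDK classes. It is not the code from the patch: the class name is made up, and a plain JDK SAX parser reading an XHTML string stands in for the events a Tika parser would emit.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch only: a hypothetical handler that turns SAX events into a DOM tree. */
public class SaxToDomSketch extends DefaultHandler {
    private final Document doc;
    /** Stack of currently open nodes; the document itself sits at the bottom. */
    private final Deque<Node> open = new ArrayDeque<>();

    private SaxToDomSketch(Document doc) {
        this.doc = doc;
        open.push(doc);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        open.peek().appendChild(element);
        open.push(element);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        open.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
    }

    /** Parses well-formed XHTML and returns the DOM built from the SAX events. */
    public static Document toDom(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            SaxToDomSketch handler = new SaxToDomSketch(doc);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build DOM from SAX events", e);
        }
    }

    public static void main(String[] args) {
        Document doc = toDom("<html><body><a href=\"http://example.org/\">link</a></body></html>");
        System.out.println(doc.getElementsByTagName("a").getLength());
    }
}
```

Once the events are in a DOM, utilities that walk a DOM tree (link detection, metatag handling, parse filters) can run on it regardless of the original document format.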
 The following libraries are required in the lib/ directory of the tika-parser: 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check that it is covered by Tika and, if 
 so, to the same extent; the Wiki is probably the right place for this. The 
 language identifier (which is an HTMLParseFilter) seemed to work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 

[jira] Commented: (NUTCH-766) Tika parser

2010-01-25 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804546#action_12804546
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

Hi Sami:

{quote}
Chris, can you please explain to me how keeping two components doing identical 
work would be more backwards compatible than having only one?
{quote}

Sure, it's more of a configuration backwards-compat issue. Some folks have 
gone to the trouble of customizing their Nutch configuration (nutch-site.xml, 
nutch-default.xml, or even parse-plugins.xml). If we remove the old parsing 
plugins (i.e., basically say they don't exist anymore and that you must update 
your deployed configuration to use the tika-plugin), this patch would require 
a configuration update in their deployed environments. Because of that, why 
don't we ease them into that upgrade with at least one released version before 
the plugins go away? It would make it easier from a configuration 
backwards-compat perspective.
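
One way such a transition release could help deployers is to flag legacy plugins that a customized configuration still enables. The sketch below assumes plugin ids are selected by a regular expression, in the style of the plugin.includes property; the class name, the full-match semantics and the list of legacy ids are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch only: a hypothetical check, not part of Nutch. */
public class LegacyPluginCheckSketch {
    /** Returns the legacy plugin ids that the given includes regex still enables. */
    public static List<String> enabledLegacy(String pluginIncludes, List<String> legacyIds) {
        Pattern includes = Pattern.compile(pluginIncludes);
        List<String> hits = new ArrayList<>();
        for (String id : legacyIds) {
            // Assumption: an id is enabled when the whole id matches the regex.
            if (includes.matcher(id).matches()) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Illustrative values; a real deployment would read them from its configuration.
        String includes = "protocol-http|parse-(text|html|pdf)|index-basic";
        List<String> legacy = Arrays.asList("parse-pdf", "parse-msword");
        for (String id : enabledLegacy(includes, legacy)) {
            System.err.println("WARN: plugin " + id + " is deprecated, consider the Tika plugin");
        }
    }
}
```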

HTH,
Chris



[jira] Commented: (NUTCH-766) Tika parser

2010-01-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12804558#action_12804558
 ] 

Andrzej Bialecki  commented on NUTCH-766:
-

I agree with Chris, +1 on keeping the old plugins in 1.1 with a prominent 
deprecation note, but I feel equally strongly that we should not prolong their 
life-cycle beyond what we can support, i.e. I'm +1 on removing them in 1.2/1.3. 
We simply don't have resources to maintain so many duplicate plugins, and 
instead we should direct our efforts to improve those in Tika.


[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803664#action_12803664
 ] 

Sami Siren commented on NUTCH-766:
--

I took a brief look at the proposed patch, some comments:

The public API footprint of new classes should be smaller, e.g. use private, 
package-private or protected methods/classes as much as possible.
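
A minimal sketch of what a smaller footprint looks like in practice, using a made-up class rather than anything from the patch: one public entry point, with the helpers kept package-private or private.

```java
/** Sketch only: a hypothetical class, not taken from the proposed patch. */
public class ApiFootprintSketch {
    /** The only method that callers outside the package are meant to use. */
    public String extractTitle(String markup) {
        return clean(between(markup, "<title>", "</title>"));
    }

    /** Package-private: reachable from tests in the same package, not part of the public API. */
    String clean(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    /** Private: pure implementation detail, free to change at any time. */
    private static String between(String text, String open, String close) {
        int from = text.indexOf(open);
        int to = text.indexOf(close);
        if (from < 0 || to < from + open.length()) {
            return "";
        }
        return text.substring(from + open.length(), to);
    }

    public static void main(String[] args) {
        System.out.println(new ApiFootprintSketch().extractTitle("<title> Tika   parser </title>"));
    }
}
```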

I think the end result of this plugin should be replacing all Tika supported 
parsers (or the parsers we choose to replace) with the TikaParser and not to 
build parallel ways to parse the same formats. So I think we need to copy all 
of the existing test files and move/adapt the existing testcases fully before 
committing this. That is a good way of seeing that the parse result is what is 
expected and also of finding out about possible differences between the old 
and the Tika version.
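
The comparison suggested here could look roughly like the sketch below, which runs an old parser and a candidate parser over the same samples and reports where the extracted text differs. Everything in it is hypothetical: parsers are modelled as plain functions from content to text, and whitespace is normalized before comparing, which a real test might not want.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: a hypothetical parity check between two parser implementations. */
public class ParserParitySketch {
    /** Returns the names of the samples for which the two parsers extract different text. */
    public static List<String> differences(Map<String, String> samples,
                                           Function<String, String> oldParser,
                                           Function<String, String> newParser) {
        List<String> differing = new ArrayList<>();
        for (Map.Entry<String, String> sample : samples.entrySet()) {
            String expected = normalize(oldParser.apply(sample.getValue()));
            String actual = normalize(newParser.apply(sample.getValue()));
            if (!expected.equals(actual)) {
                differing.add(sample.getKey());
            }
        }
        return differing;
    }

    /** Collapses runs of whitespace so that layout-only differences are ignored. */
    private static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("a.txt", "hello  world");
        samples.put("b.txt", "x");
        // Stand-in parsers: the second one deliberately disagrees on "x".
        Function<String, String> oldParser = content -> content;
        Function<String, String> newParser = content -> content.equals("x") ? "y" : content;
        System.out.println(differences(samples, oldParser, newParser));
    }
}
```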



[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803670#action_12803670
 ] 

Julien Nioche commented on NUTCH-766:
-

{quote}
I think the end result of this plugin should be replacing all Tika supported 
parsers (or the parsers we choose to replace) with the TikaParser and not to 
build parallel ways to parse the same formats. 
{quote}

That's how I see it - it's just that we have the option of choosing when to use 
Tika or not for a given mimetype. It is used by default unless an association 
is created between a parser implementation and a mimetype in 
parse-plugins.xml.

{quote}
So I think we need to copy all of the existing test files and move/adapt 
the existing testcases fully before committing this. That is a good way of 
seeing that the parse result is what is expected and also find out about 
possible differences with old vs. Tika version.
{quote}

Sure, but it would be silly to block the whole Tika plugin because Tika does 
not support such or such format as well as the original Nutch plugins. As I 
explained above we can configure which parser to use for which mimetype and use 
the Tika-plugin by default. Hopefully the Tika implementation will get better 
and better and there will be no need for keeping the old plugins.

BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the 
current version of Tika and the existing Nutch parsers.

Even if we decide to keep using the old plugins for some of the formats to 
start with, we'd still be able to use the Tika plugin by default for the ones 
which already have the same coverage.



[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803673#action_12803673
 ] 

Sami Siren commented on NUTCH-766:
--

{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does 
not support such or such format as well as the original Nutch plugins. As I 
explained above we can configure which parser to use for which mimetype and 
use the Tika-plugin by default. Hopefully the Tika implementation will get 
better and better and there will be no need for keeping the old plugins.
{quote}

I meant test files for the parsers we replace, not all of them.

{quote}
BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the 
current version of Tika and the existing Nutch parsers
{quote}

OK, I had missed that one. 


[jira] Commented: (NUTCH-766) Tika parser

2010-01-22 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12803709#action_12803709
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does 
not support such or such format as well as the original Nutch plugins. As I 
explained above we can configure which parser to use for which mimetype and use 
the Tika-plugin by default. Hopefully the Tika implementation will get better 
and better and there will be no need for keeping the old plugins.
{quote}

+1, I'm going to agree on this one here Julien. Other communities ;) have 
convinced me of the need for backwards compat and unobtrusiveness when bringing 
in new functionality or results. +1 to at least in Nutch 1.1 leaving the old 
plugins (perhaps mentioning they should be deprecated and replaced by the Tika 
functionality) and then removing them in 1.2 or 1.3.

I got bogged down with my paid job, but I found some Apache time recently so 
this is tops on my list to tackle.

Cheers,
Chris




[jira] Commented: (NUTCH-766) Tika parser

2010-01-11 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798718#action_12798718
 ] 

Chris A. Mattmann commented on NUTCH-766:
-

Hi Julien:

I have had a look and was trying to test it out but got sidetracked. Give me 
this week to try and put together a final reviewable/committable patch; 
otherwise, it's all yours.

Cheers,
Chris


 Tika parser
 ---

 Key: NUTCH-766
 URL: https://issues.apache.org/jira/browse/NUTCH-766
 Project: Nutch
  Issue Type: New Feature
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
 Fix For: 1.1

 Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


 Tika handles many different formats under the bonnet and exposes them 
 nicely via SAX events. What is described here is a tika-parser plugin which 
 delegates the parsing to Tika but can still coexist with the existing 
 parsing plugins, which is useful for formats that Tika handles only 
 partially (or not at all). Some of the elements below have already been 
 discussed on the mailing lists. Note that this is work in progress; your 
 feedback is welcome.
 Tika is already used by Nutch for its MimeType implementations. Tika comes as 
 separate jar files (core and parsers); in the work described here we decided 
 to put the libs in two different places:
 NUTCH_HOME/lib : tika-core.jar
 NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
 Since the core uses Tika only for its MimeType functionality, we only need 
 to put tika-core at the main lib level, whereas the tika plugin needs 
 tika-parsers.jar plus all the jars used internally by Tika.
 Due to limitations in the way Tika loads its classes, we had to duplicate the 
 TikaConfig class in the tika-plugin. This might be fixed in the future in 
 Tika itself or avoided by refactoring the mimetype part of Nutch using 
 extension points.
 Unlike most other parsers, Tika handles more than one mime type, which is why 
 we are using * as its mimetype value in the plugin descriptor and have 
 modified ParserFactory.java so that it considers the tika parser as 
 potentially suitable for all mime types. In practice this means that the 
 associations between a mime type and a parser plugin as defined in 
 parse-plugins.xml are needed only for the cases where we want to handle a 
 mime type with a parser other than Tika. 
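 The lookup rule described above can be illustrated with a minimal Java 
 sketch. It is not the actual ParserFactory code: the class 
 WildcardParserLookup, its methods and the plugin ids are hypothetical, and 
 the ordering (explicitly mapped parsers first, wildcard parsers after them) 
 is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Nutch's ParserFactory: wildcard-aware parser lookup. */
class WildcardParserLookup {
    static final String WILDCARD = "*";

    // Explicit mime type -> parser ids, in the spirit of parse-plugins.xml.
    private final Map<String, List<String>> explicit = new LinkedHashMap<>();
    // Parsers registered with "*" are candidates for every mime type.
    private final List<String> wildcardParsers = new ArrayList<>();

    void register(String mimeType, String parserId) {
        if (WILDCARD.equals(mimeType)) {
            wildcardParsers.add(parserId);
        } else {
            explicit.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(parserId);
        }
    }

    /** Explicitly mapped parsers first, then wildcard parsers (assumed order). */
    List<String> candidatesFor(String mimeType) {
        List<String> result = new ArrayList<>(
                explicit.getOrDefault(mimeType, Collections.<String>emptyList()));
        for (String parserId : wildcardParsers) {
            if (!result.contains(parserId)) {
                result.add(parserId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        WildcardParserLookup lookup = new WildcardParserLookup();
        lookup.register("*", "parse-tika");         // hypothetical plugin ids
        lookup.register("text/html", "parse-html");
        System.out.println(lookup.candidatesFor("text/html"));
        System.out.println(lookup.candidatesFor("application/pdf"));
    }
}
```

 With this rule a mime type needs an explicit entry only when a parser other 
 than the wildcard one should get the first attempt; every other mime type 
 falls through to the wildcard parser.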
 The general approach I chose was to convert the SAX events returned by the 
 Tika parsers into DOM objects and reuse the utilities that come with the 
 current HTML parser, i.e. link detection and metatag handling. This also 
 means that we can use the HTMLParseFilters in exactly the same way. The main 
 difference, though, is that HTMLParseFilters are no longer limited to HTML 
 documents, as the XHTML tags returned by Tika can correspond to a different 
 format for the original document. There is a duplication of code with the 
 html-plugin, which will be resolved by either a) getting rid of the 
 html-plugin altogether or b) exporting its jar and making the tika parser 
 depend on it.
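 As a rough illustration of the SAX-to-DOM idea, here is a self-contained 
 Java sketch that uses only JDK classes. It is not the code from the patch: 
 the JDK SAX parser stands in for Tika as the source of SAX events, and the 
 names SaxToDomSketch, DomBuilder, toDom and links are hypothetical. The 
 links method is a simplified stand-in for the link detection utilities 
 mentioned above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: build a DOM tree from SAX events, then walk it. */
class SaxToDomSketch {

    /** Receives SAX callbacks and appends matching nodes to a DOM document. */
    static class DomBuilder extends DefaultHandler {
        final Document doc;
        private Node current;

        DomBuilder() throws Exception {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            current = doc;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            Element e = doc.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                e.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(e);
            current = e; // descend into the new element
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            current = current.getParentNode(); // climb back up
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != doc) { // text cannot be a direct child of the document
                current.appendChild(doc.createTextNode(new String(ch, start, length)));
            }
        }
    }

    /** Here the JDK SAX parser plays the role Tika would play as event source. */
    static Document toDom(String xhtml) throws Exception {
        DomBuilder builder = new DomBuilder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), builder);
        return builder.doc;
    }

    /** Simplified link detection over the resulting DOM. */
    static List<String> links(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }
}
```

 Once the events are in DOM form, anything written against the DOM, such as 
 a filter that inspects anchors or meta tags, no longer needs to know which 
 format the original document had.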
 The following libraries are required in the lib/ directory of the 
 tika-parser: 
   <library name="asm-3.1.jar"/>
   <library name="bcmail-jdk15-144.jar"/>
   <library name="commons-compress-1.0.jar"/>
   <library name="commons-logging-1.1.1.jar"/>
   <library name="dom4j-1.6.1.jar"/>
   <library name="fontbox-0.8.0-incubator.jar"/>
   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
   <library name="hamcrest-core-1.1.jar"/>
   <library name="jce-jdk13-144.jar"/>
   <library name="jempbox-0.8.0-incubator.jar"/>
   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
   <library name="mockito-core-1.7.jar"/>
   <library name="objenesis-1.0.jar"/>
   <library name="ooxml-schemas-1.0.jar"/>
   <library name="pdfbox-0.8.0-incubating.jar"/>
   <library name="poi-3.5-FINAL.jar"/>
   <library name="poi-ooxml-3.5-FINAL.jar"/>
   <library name="poi-scratchpad-3.5-FINAL.jar"/>
   <library name="tagsoup-1.2.jar"/>
   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
   <library name="xml-apis-1.0.b2.jar"/>
   <library name="xmlbeans-2.3.0.jar"/>
 There is a small test suite which needs to be improved. We will need to have 
 a look at each individual format and check whether it is covered by Tika 
 and, if so, whether to the same extent; the Wiki is probably the right place 
 for this. The language identifier (which is an HTMLParseFilter) seemed to 
 work fine.
  
 Again, your comments are welcome. Please bear in mind that this is just a 
 first step. 
 Julien
 http://www.digitalpebble.com

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-766) Tika parser

2010-01-11 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798727#action_12798727
 ] 

Julien Nioche commented on NUTCH-766:
-

Hi Chris, 

No worries, I'd rather wait for you to have a look at it. It's quite a big 
change and it would be better if someone else had a look at it. Being the 
author, I might miss something obvious.

Thanks

J.
