[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834658#action_12834658 ]

Hudson commented on NUTCH-766:
------------------------------

Integrated in Nutch-trunk #1071 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1071/])

> Tika parser
> -----------
>
> Key: NUTCH-766
> URL: https://issues.apache.org/jira/browse/NUTCH-766
> Project: Nutch
> Issue Type: New Feature
> Reporter: Julien Nioche
> Assignee: Chris A. Mattmann
> Fix For: 1.1
>
> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>
> Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome.
>
> Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places:
>
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
>
> Since Tika is used by the core only for its MimeType functionality, we only need to put tika-core at the main lib level, whereas the tika plugin obviously needs tika-parsers.jar plus all the jars used internally by Tika.
>
> Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.
>
> Unlike most other parsers, Tika handles more than one mime type, which is why we are using "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika.
>
> The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling. It also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
>
> The following libraries are required in the lib/ directory of the tika-parser :
>
> There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine.
>
> Again, your comments are welcome. Please bear in mind that this is just a first step.
>
> Julien
> http://www.digitalpebble.com

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
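As an illustration of the SAX-to-DOM approach outlined in the issue description, here is a minimal sketch that uses only JDK classes. It is not the code from the attached patch: `SaxToDomSketch`, `SaxSource`, `emitSample` and `links` are hypothetical names, and the hand-written events stand in for what a Tika parser would emit (real Tika output uses the XHTML namespace, which this sketch leaves out for brevity).

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class SaxToDomSketch {

    /** Hypothetical stand-in for a Tika parser: anything that pushes SAX events into a handler. */
    interface SaxSource {
        void emit(ContentHandler handler) throws SAXException;
    }

    /** Collects the SAX events of the source into a DOM document via an identity TransformerHandler. */
    static Document toDom(SaxSource source) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.setResult(new DOMResult(doc)); // events are appended to doc
            source.emit(handler);
            return doc;
        } catch (ParserConfigurationException | TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("could not build DOM from SAX events", e);
        }
    }

    /** DOM-based link detection, the kind of utility that can then be shared across formats. */
    static List<String> links(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    /** Hand-written events imitating the XHTML-like output of a parser (assumption, no namespace). */
    static void emitSample(ContentHandler h) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://example.org/");
        char[] text = "example".toCharArray();
        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "body", "body", none);
        h.startElement("", "a", "a", href);
        h.characters(text, 0, text.length);
        h.endElement("", "a", "a");
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    public static void main(String[] args) {
        System.out.println(links(toDom(SaxToDomSketch::emitSample)));
    }
}
```

Because the DOM is built from generic SAX events, the same `links` helper would apply whether the original document was HTML, PDF or an office format, which is the point made about HTMLParseFilters no longer being limited to HTML.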
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1283#action_1283 ]

Hudson commented on NUTCH-766:
------------------------------

Integrated in Nutch-trunk #1067 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1067/])
- 2nd part of Tika parser
- fix for Tika parser
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832866#action_12832866 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

- forgot to add in dep libs, added in r909269. Thanks!
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832588#action_12832588 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

@Julien: Sigh, no I didn't! :( That's probably why! Thanks for the help. I'll try it later today. If that passes, my +1 to commit.

@Sami, regarding your updates, would you be OK with me creating another issue to track them, attaching your diffs as patches against this issue, once committed to the trunk? That way we'll make sure they get into 1.1, but we won't block this issue anymore from getting in. Let me know what you think, thanks.

Cheers,
Chris
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832583#action_12832583 ]

Julien Nioche commented on NUTCH-766:
-------------------------------------

@Chris : did you do

    ant -f src/plugin/parse-tika/build-ivy.xml

between 5 and 6? This is required in order to populate the lib directory automatically.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832565#action_12832565 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

Hi Julien:

{quote}
@Chris : I just did a fresh co from svn, applied the patch v3 and unzipped sample.tar.gz onto the directory parse-tika and ran the test just as you did but could not reproduce the problem. Could there be a difference between your version and the trunk?
{quote}

I tried this process last night:

1. SVN up to r908832
2. download patch v3
3. download sample.tgz
4. apply patch v3 to r908832
5. untar sample.tgz into src/plugin/parse-tika, creating a sample folder in that dir
6. ant clean compile-core test

Any idea why I'm seeing the error?

Cheers,
Chris
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ]

Julien Nioche commented on NUTCH-766:
-------------------------------------

I had a closer look at the HTML parsing issue. What happens is that the association between the mime type and the parser implementation is not explicitly set in parse-plugins.xml, so the ParserFactory goes through all the plugins and gets the ones with a matching mimetype (or * for Tika). The Tika parser takes no precedence over the default HTML parser, so the latter comes first in the list and is used for parsing. Of course that does not happen if parse-html is not specified in plugin.includes or if an explicit mapping is set in parse-plugins.xml.

I don't think we want to have to specify explicitly that Tika should be used in all the mappings; explicit mappings should be reserved for the cases where another parser must be used instead of Tika. What we could do though is that in the cases where no explicit mapping is set for a mimetype, Tika (or any parser marked as supporting any mimetype) is put first in the list of discovered parsers, so it would remain the default choice unless an explicit mapping is set (even if a plugin is loaded and can handle the type).

Makes sense?
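The ordering rule proposed in the comment above can be sketched as follows. This is only an illustration under stated assumptions, not the actual ParserFactory code: `ParserOrderingSketch`, `ParserPlugin` and `candidates` are hypothetical names, and the explicit mappings of parse-plugins.xml are modelled as a plain map from mime type to plugin ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ParserOrderingSketch {

    /** Hypothetical view of a loaded parser plugin: its id and the mime type it declares ("*" = any). */
    static final class ParserPlugin {
        final String id;
        final String declaredMimeType;

        ParserPlugin(String id, String declaredMimeType) {
            this.id = id;
            this.declaredMimeType = declaredMimeType;
        }

        boolean isWildcard() {
            return "*".equals(declaredMimeType);
        }

        boolean accepts(String mimeType) {
            return isWildcard() || declaredMimeType.equals(mimeType);
        }
    }

    /**
     * Returns plugin ids in the order they would be tried for a mime type.
     * An explicit mapping always wins; otherwise wildcard parsers (Tika)
     * are placed before parsers that merely match the mime type.
     */
    static List<String> candidates(String mimeType,
                                   List<ParserPlugin> loaded,
                                   Map<String, List<String>> explicitMappings) {
        List<String> mapped = explicitMappings.get(mimeType);
        if (mapped != null) {
            return new ArrayList<>(mapped);
        }
        List<String> wildcard = new ArrayList<>();
        List<String> specific = new ArrayList<>();
        for (ParserPlugin plugin : loaded) {
            if (!plugin.accepts(mimeType)) {
                continue;
            }
            (plugin.isWildcard() ? wildcard : specific).add(plugin.id);
        }
        wildcard.addAll(specific); // wildcard parsers first, then the others
        return wildcard;
    }

    public static void main(String[] args) {
        List<ParserPlugin> loaded = Arrays.asList(
                new ParserPlugin("parse-html", "text/html"),
                new ParserPlugin("parse-tika", "*"));
        // No explicit mapping: the wildcard parser is tried first.
        System.out.println(candidates("text/html", loaded,
                Collections.<String, List<String>>emptyMap()));
        // Explicit mapping: only the mapped parser is used.
        System.out.println(candidates("text/html", loaded,
                Collections.singletonMap("text/html", Arrays.asList("parse-html"))));
    }
}
```

With this rule, loading parse-html alongside the Tika plugin no longer changes which parser handles text/html; only an explicit mapping does.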
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454 ]

Julien Nioche commented on NUTCH-766:
-------------------------------------

@Chris : I just did a fresh co from svn, applied the patch v3 and unzipped sample.tar.gz onto the directory parse-tika and ran the test just as you did but could not reproduce the problem. Could there be a difference between your version and the trunk?

@Sami :

{quote}
was there a reason not to use AutoDetect parser?
{quote}

I suppose we could, as long as we give it a clue about the MimeType obtained from the Content. As you pointed out, there could be a duplication with the detection done by Mime-Util. I suppose one way to do that would be to add a new version of the method, getParse(Content content, MimeType type). That's an interesting point.

{quote}
Also was there a reason not to parse html with tika?
{quote}

It is supposed to do so; if it does not then it's a bug which needs urgent fixing.

Regarding parsing package formats, I think the plan is that Tika will handle that in the future, but we could try to do that now if we find a relatively clean mechanism for doing so.

BTW could you please send a diff and not the full code of the class you posted earlier? That would make the comparison much easier.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832406#action_12832406 ]

Sami Siren commented on NUTCH-766:
----------------------------------

I suggest that we drive this a bit further: currently this patch does not use Tika for pkg formats or for HTML.

Julien: was there a reason not to use AutoDetectParser? The only thing that I could come up with was that the MIME type detection would be done twice. We could get around this by implementing something similar to what the composite parser does (it uses a parser (AutoDetectParser) class from the context to do further parsing) to cover all supported pkg formats.

Also, was there a reason not to parse HTML with Tika?

I have a patch nearby to demonstrate some of the improvements that I will try to post shortly.

> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
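One possible reading of the suggestion in this comment: the top-level document is dispatched using the MIME type that is already known, while a detecting parser is placed in a context so that a package parser can hand each contained entry to it. The sketch below only illustrates that pattern. Every type in it (`Doc`, `DocParser`, `ArchiveParser`, `DetectingParser`) is hypothetical and deliberately not Tika's API, and the "detection" is a trivial placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these types exist in Tika or Nutch.
final class ContextDelegationSketch {

    /** Toy document: mimeType is null when nobody has detected it yet. */
    static final class Doc {
        final String mimeType;
        final String text;
        final List<Doc> entries;

        Doc(String mimeType, String text, List<Doc> entries) {
            this.mimeType = mimeType;
            this.text = text;
            this.entries = entries;
        }
    }

    /** Toy parser contract: returns the extracted text fragments. */
    interface DocParser {
        List<String> parse(Doc doc, Map<Class<?>, Object> context);
    }

    static final class TextParser implements DocParser {
        public List<String> parse(Doc doc, Map<Class<?>, Object> context) {
            return Collections.singletonList(doc.text);
        }
    }

    /** Package parser: looks up the delegate in the context for every contained entry. */
    static final class ArchiveParser implements DocParser {
        public List<String> parse(Doc doc, Map<Class<?>, Object> context) {
            DocParser delegate = (DocParser) context.get(DocParser.class);
            List<String> out = new ArrayList<>();
            for (Doc entry : doc.entries) {
                out.addAll(delegate.parse(entry, context));
            }
            return out;
        }
    }

    /** Detects the type only when it is unknown, then dispatches by type. */
    static final class DetectingParser implements DocParser {
        final Map<String, DocParser> byType = new HashMap<>();
        int detections = 0;

        public List<String> parse(Doc doc, Map<Class<?>, Object> context) {
            String type = doc.mimeType;
            if (type == null) {
                detections++;
                // Placeholder detection rule, purely for the example.
                type = doc.entries.isEmpty() ? "text/plain" : "application/zip";
            }
            return byType.get(type).parse(doc, context);
        }
    }

    static DetectingParser newDetector() {
        DetectingParser detector = new DetectingParser();
        detector.byType.put("text/plain", new TextParser());
        detector.byType.put("application/zip", new ArchiveParser());
        return detector;
    }

    /** The caller already knows the type, so the top-level document skips detection. */
    static List<String> parseKnownType(DetectingParser detector, Doc doc) {
        Map<Class<?>, Object> context = new HashMap<>();
        context.put(DocParser.class, detector);
        return detector.byType.get(doc.mimeType).parse(doc, context);
    }

    static Doc sampleArchive() {
        Doc a = new Doc(null, "a", Collections.<Doc>emptyList());
        Doc b = new Doc(null, "b", Collections.<Doc>emptyList());
        return new Doc("application/zip", "", Arrays.asList(a, b));
    }

    public static void main(String[] args) {
        DetectingParser detector = newDetector();
        System.out.println(parseKnownType(detector, sampleArchive()));
        System.out.println("detections: " + detector.detections);
    }
}
```

In this toy setup detection runs once per archive entry and never for the container itself, which is the duplication the comment wants to avoid.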
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832398#action_12832398 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

I'm going to hold off on committing this tonight. I've updated the docs per Andrzej, and I've also updated CHANGES.txt, but when running:

{code}
ant clean compile-core test
{code}

I'm seeing these messages during plugin testing for parse-tika:

{noformat}
2010-02-10 22:39:16,593 ERROR tika.TikaParser (TikaParser.java:getParse(63)) - Can't retrieve Tika parser for mime-type application/pdf
- ---
Testcase: testIt took 2.684 sec
FAILED
null
junit.framework.AssertionFailedError
at org.apache.nutch.tika.TestPdfParser.testIt(TestPdfParser.java:79)
{noformat}

It seems that the TikaConfig is not being found? I was looking at TikaParser#setConf and it seems that a default config is being created for Tika, but maybe not being loaded correctly? I need to look into this more...

> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832255#action_12832255 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

{quote}
+1 to commit this...
{quote}

Awesome, Andrzej. Will do so tonight, PST, if I don't hear any objections between now and then...

Thanks!

Cheers,
Chris

> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832250#action_12832250 ]

Andrzej Bialecki commented on NUTCH-766:
----------------------------------------

+1 to commit this - please remember to update nutch-default.xml to switch to the tika plugin, perhaps add a comment about the deprecated parse-* plugins - most people look here and not in the parse-plugins, where this change is documented...

> Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805892#action_12805892 ]

Julien Nioche commented on NUTCH-766:
-------------------------------------

Here is a slightly better version of the patch which:
• fixes a small bug in the Tika parser (the API has changed slightly between 1.5beta and 1.5)
• fixes a bug with the TestParserFactory
• adds the tika-plugin to the list of plugins to be built in src/plugin/build.xml
• limits public exposure of methods and classes (see Sami's comment)
• modified parse-plugins.xml: added parse-tika and commented out associations between some mime-types and the old parsers

I've also added an ANT script which uses IVY to pull the dependencies and copies them into the lib dir. Obviously this won't be needed when the plugin is committed but should simplify the initial testing. All you need to do after applying the patch is to:

cd src/plugin/parse-tika/
ant -f build-ivy.xml

Am also attaching the content of the sample directory as an archive - just unzip onto the src/plugin/parse-tika/ before calling ant test-plugins

Julien

> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch, NUTCH-766.v2, sample.tar.gz
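The issue description explains that the Tika plugin declares "*" as its mime type, so the associations in parse-plugins.xml only matter where another parser should handle a type instead of Tika. The sketch below illustrates that lookup order in plain Java. It is not the real ParserFactory code: the class `ParserLookupSketch`, its methods, and the `parse-rss` mapping are assumptions made for the example; only the `parse-tika` id and the "*" convention come from this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the actual Nutch ParserFactory.
final class ParserLookupSketch {
    static final String WILDCARD = "*";

    // mime type -> plugin ids explicitly associated with it (stand-in for parse-plugins.xml)
    private final Map<String, List<String>> explicit = new HashMap<>();
    // plugin ids that declared "*" in their descriptor
    private final List<String> wildcardPlugins = new ArrayList<>();

    /** Records that pluginId was declared for mimeType; "*" marks a catch-all plugin. */
    void register(String mimeType, String pluginId) {
        if (WILDCARD.equals(mimeType)) {
            wildcardPlugins.add(pluginId);
        } else {
            explicit.computeIfAbsent(mimeType, k -> new ArrayList<>()).add(pluginId);
        }
    }

    /** Explicit associations come first; catch-all plugins are appended as fallback. */
    List<String> candidates(String mimeType) {
        List<String> result =
            new ArrayList<>(explicit.getOrDefault(mimeType, Collections.<String>emptyList()));
        for (String id : wildcardPlugins) {
            if (!result.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ParserLookupSketch lookup = new ParserLookupSketch();
        lookup.register("*", "parse-tika");
        lookup.register("application/rss+xml", "parse-rss"); // illustrative override
        System.out.println(lookup.candidates("application/rss+xml"));
        System.out.println(lookup.candidates("application/pdf"));
    }
}
```

With this ordering, commenting out an association (as the patch does for some mime-types) simply lets the catch-all plugin take over for that type.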
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805661#action_12805661 ]

Sami Siren commented on NUTCH-766:
----------------------------------

{quote}
Sure, it's more of a configuration backwards-compat issue. For those folks who have gone to the trouble of customizing their Nutch configuration (nutch-site.xml, nutch-default.xml, or even parse-plugins.xml), removing the parsing plugins (i.e., basically saying they don't exist anymore, so update your deployed configuration to use the tika-plugin) would require a configuration update in their deployed environments. Because of that, why don't we ease them into that upgrade with at least one released version before the plugins go away? It would make things easier from a configuration backwards-compat perspective.
{quote}

Ok, so you mean that we need to have duplicate parser plugins because we don't want to ask people already using Nutch to reconfigure the bits this involves now, even though we have to do it later? How is postponing going to ease the task they need to do anyway at some point? I still don't understand the (longer term) benefit.

I am not strongly against the idea of keeping duplicate plugins; it's just another ~20M in the .job. What I am worried about is that history will repeat itself and we will end up having one more case of duplicate components (in this case many of them) doing the same work, with no interest in cleaning up afterwards. Doing it the way I suggested would guarantee that this will not happen.

> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804558#action_12804558 ]

Andrzej Bialecki commented on NUTCH-766:
----------------------------------------

I agree with Chris, +1 on keeping the old plugins in 1.1 with a prominent deprecation note, but I feel equally strongly that we should not prolong their life-cycle beyond what we can support, i.e. I'm +1 on removing them in 1.2/1.3. We simply don't have resources to maintain so many duplicate plugins, and instead we should direct our efforts to improve those in Tika.

> Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804546#action_12804546 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

Hi Sami:

{quote}
Chris, can you please explain to me how keeping two components doing identical work would be more backwards compatible than having only 1?
{quote}

Sure, it's more of a configuration backwards-compat issue. Some folks have gone to the trouble of customizing their Nutch configuration (nutch-site.xml, nutch-default.xml, or even parse-plugins.xml). If we remove the parsing plugins outright (basically saying "they don't exist anymore, update your deployed configuration to use the tika-plugin"), this patch would require a configuration update in their deployed environments. Because of that, why don't we ease them into that upgrade with at least one released version before the plugins go away? It would make things easier from a configuration backwards-compat perspective.

HTH,
Chris

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them
> nicely via SAX events. What is described here is a tika-parser plugin which
> delegates the parsing to Tika but can still coexist with the existing parsing
> plugins, which is useful for formats only partially handled by Tika (or not
> at all). Some of the elements below have already been discussed on the
> mailing lists. Note that this is work in progress; your feedback is welcome.
>
> Tika is already used by Nutch for its MimeType implementations. Tika comes as
> different jar files (core and parsers). In the work described here we decided
> to put the libs in 2 different places:
>   NUTCH_HOME/lib : tika-core.jar
>   NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since the core uses Tika only for its MimeType functionality, we only need to
> put tika-core at the main lib level, whereas the tika plugin obviously needs
> tika-parsers.jar plus all the jars used internally by Tika.
>
> Due to limitations in the way Tika loads its classes, we had to duplicate the
> TikaConfig class in the tika-plugin. This might be fixed in the future in
> Tika itself or avoided by refactoring the mimetype part of Nutch using
> extension points.
>
> Unlike most other parsers, Tika handles more than one mime type, which is why
> we are using "*" as its mimetype value in the plugin descriptor and have
> modified ParserFactory.java so that it considers the tika parser as
> potentially suitable for all mime types. In practice this means that the
> associations between a mime type and a parser plugin as defined in
> parse-plugins.xml are useful only for the cases where we want to handle a
> mime type with a different parser than Tika.
>
> The general approach I chose was to convert the SAX events returned by the
> Tika parsers into DOM objects and reuse the utilities that come with the
> current HTML parser, i.e. link detection and metatag handling. This also means
> that we can use the HTMLParseFilters in exactly the same way. The main
> difference, though, is that HTMLParseFilters are not limited to HTML documents
> anymore, as the XHTML tags returned by Tika can correspond to a different
> format for the original document. There is a duplication of code with the
> html-plugin which will be resolved by either a) getting rid of the
> html-plugin altogether or b) exporting its jar and making the tika parser
> depend on it.
>
> The following libraries are required in the lib/ directory of the tika-parser:
>
> There is a small test suite which needs to be improved. We will need to have
> a look at each individual format and check that it is covered by Tika and, if
> so, to the same extent; the Wiki is probably the right place for this. The
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>
> Again, your comments are welcome. Please bear in mind that this is just a
> first step.
>
> Julien
> http://www.digitalpebble.com

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
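The SAX-to-DOM step mentioned in the issue description can be illustrated with the JDK alone. The following is only a minimal sketch, not the code from the patch: the class name `SaxToDomSketch`, the hand-written SAX events (standing in for what a Tika parser would emit) and the `extractLinks` helper are all assumptions made for illustration. It uses the standard JAXP identity `TransformerHandler` to turn SAX events into a DOM tree, then walks the tree the way a link-detection utility might.

```java
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration class; not part of Nutch or Tika.
class SaxToDomSketch {

    /** Feeds a few hand-written XHTML-like SAX events into a handler that builds a DOM. */
    static Document buildDom() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

            // An identity TransformerHandler is a ContentHandler that copies SAX events to its Result.
            SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.setResult(new DOMResult(doc));

            AttributesImpl href = new AttributesImpl();
            href.addAttribute("", "href", "href", "CDATA", "http://example.org/");
            char[] text = "example link".toCharArray();

            // Stand-in for the events a real parser would produce.
            handler.startDocument();
            handler.startElement("", "html", "html", new AttributesImpl());
            handler.startElement("", "body", "body", new AttributesImpl());
            handler.startElement("", "a", "a", href);
            handler.characters(text, 0, text.length);
            handler.endElement("", "a", "a");
            handler.endElement("", "body", "body");
            handler.endElement("", "html", "html");
            handler.endDocument();
            return doc;
        } catch (ParserConfigurationException | TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("could not build DOM from SAX events", e);
        }
    }

    /** Collects href values from anchor elements, as a DOM-based link detector might. */
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(buildDom()));
    }
}
```

Because the DOM is built from generic SAX events, the same tree-walking code would apply whether the original document was HTML or some other format rendered as XHTML, which is the point made about HTMLParseFilters above.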
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804448#action_12804448 ]

Sami Siren commented on NUTCH-766:
----------------------------------

> +1, I'm going to agree on this one here Julien. Other communities have
> convinced me of the need for backwards compat and unobtrusiveness when
> bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving
> the old plugins (perhaps mentioning they should be deprecated and replaced by
> the Tika functionality) and then removing them in 1.2 or 1.3.

Chris, can you please explain to me how keeping two components doing identical work would be more backwards compatible than having only 1?
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803709#action_12803709 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins.
{quote}

+1, I'm going to agree on this one here Julien. Other communities ;) have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to at least leaving the old plugins in place for Nutch 1.1 (perhaps mentioning that they are deprecated and will be replaced by the Tika functionality) and then removing them in 1.2 or 1.3.

I got bogged down with my paid job, but I found some Apache time recently, so this is tops on my list to tackle.

Cheers,
Chris
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803673#action_12803673 ]

Sami Siren commented on NUTCH-766:
----------------------------------

> Sure, but it would be silly to block the whole Tika plugin because Tika does
> not support such or such format as well as the original Nutch plugins. As I
> explained above we can configure which parser to use for which mimetype and
> use the Tika-plugin by default. Hopefully the Tika implementation will get
> better and better and there will be no need for keeping the old plugins.

I meant the test files for the parsers we replace, not all of them.

> BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the
> current version of Tika and the existing Nutch parsers

OK, I had missed that one.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803670#action_12803670 ]

Julien Nioche commented on NUTCH-766:
-------------------------------------

> I think the end result of this plugin should be replacing all Tika supported
> parsers (or the parsers we choose to replace) with the TikaParser and not to
> build a parallel ways to parse same formats.

That's how I see it. It's just that we have the option of choosing when to use Tika or not for a given mimetype. Tika is used by default unless an association is created between a parser implementation and a mimetype in parse-plugins.xml.

> So I think we need to copy all of the the existing test files and move&adapt
> the existing testcases fully before committing this. That is a good way of
> seeing that the parse result is what is expected and also find out about
> possible differences with old vs. Tika version.

Sure, but it would be silly to block the whole Tika plugin because Tika does not support this or that format as well as the original Nutch plugins. As I explained above, we can configure which parser to use for which mimetype and use the Tika plugin by default. Hopefully the Tika implementation will get better and better and there will be no need to keep the old plugins.

BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the current version of Tika and the existing Nutch parsers.

Even if we decide to keep using the old plugins for some of the formats to start with, we'd still be able to use the Tika plugin by default for the ones which already have the same coverage.
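The selection rule Julien describes in this comment (Tika by default, unless parse-plugins.xml associates a mime type with another parser) amounts to a lookup with a fallback. The sketch below is an assumption-laden illustration, not the logic of ParserFactory.java: the class `ParserSelectionSketch`, its methods, and the plugin ids used as example values are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; the real selection happens in Nutch's ParserFactory.
class ParserSelectionSketch {

    /** Assumed id for the catch-all parser registered with the "*" mimetype. */
    static final String DEFAULT_PARSER = "parse-tika";

    /** Explicit mime type to parser associations, standing in for parse-plugins.xml entries. */
    private final Map<String, String> associations = new LinkedHashMap<>();

    /** Records that a mime type should be handled by a specific parser plugin. */
    void associate(String mimeType, String pluginId) {
        associations.put(mimeType, pluginId);
    }

    /** Returns the associated parser if there is one, otherwise the default. */
    String parserFor(String mimeType) {
        return associations.getOrDefault(mimeType, DEFAULT_PARSER);
    }

    public static void main(String[] args) {
        ParserSelectionSketch selection = new ParserSelectionSketch();
        // Example value only: keep an older plugin for one format.
        selection.associate("application/pdf", "parse-pdf");

        System.out.println(selection.parserFor("application/pdf"));
        System.out.println(selection.parserFor("application/msword"));
    }
}
```

With this shape, dropping an old plugin for a given format means deleting one association; every mime type without an entry already falls through to the default.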
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803664#action_12803664 ]

Sami Siren commented on NUTCH-766:
----------------------------------

I took a brief look at the proposed patch; some comments:

The public API footprint of the new classes should be smaller, e.g. use private, package-private or protected methods/classes as much as possible.

I think the end result of this plugin should be to replace all Tika-supported parsers (or the parsers we choose to replace) with the TikaParser, not to build parallel ways of parsing the same formats. So I think we need to copy all of the existing test files and move and adapt the existing test cases fully before committing this. That is a good way of checking that the parse result is what is expected, and of finding out about possible differences between the old and the Tika versions.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798727#action_12798727 ]

Julien Nioche commented on NUTCH-766:
-------------------------------------

Hi Chris,

No worries, I'd rather wait for you to have a look at it. It's quite a big change and it would be better if someone else reviewed it. Being the author, I might miss something obvious.

Thanks

J.
[jira] Commented: (NUTCH-766) Tika parser
[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798718#action_12798718 ]

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

Hi Julien:

I have had a look and was trying to test it out but got sidetracked. Give me this week to try and put together a final reviewable/committable patch; otherwise, it's all yours.

Cheers,
Chris