[jira] Created: (NUTCH-807) JSParseFilter produces weird URL
JSParseFilter produces weird URL

Key: NUTCH-807
URL: https://issues.apache.org/jira/browse/NUTCH-807
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.0.0
Environment: Redhat 2.6.18-128.1.6.el5PAE i686 i686 i386 GNU/Linux
Reporter: Minyao Zhu

This was found when crawling the site http://zhidao.baidu.com/ (a Chinese-language site). The page appears to contain JavaScript that confuses JSParseFilter, which then produces URLs like this:

http://zhidao.baidu.com/){if(A===46){baidu.hide(

I am not sure of the impact or scope of this issue in general. For this specific site, the observed effect is that far fewer pages were crawled.

Thanks.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
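To make the kind of false positive reported in NUTCH-807 concrete, here is a minimal Java sketch of a sanity check for candidate outlinks. It is not Nutch's JSParseFilter code: the class `OutlinkSanity` and the method `isPlausibleUrl` are hypothetical names, and the check relies only on `java.net.URI` from the JDK, which rejects characters such as `{` that are common in JavaScript source but not legal unescaped in a URI. It would not catch every script fragment (one without such characters would still pass), so treat it as an illustration of the problem rather than a fix.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch.
final class OutlinkSanity {

    // Returns true only if the candidate parses as an absolute http(s) URI with a host.
    static boolean isPlausibleUrl(String candidate) {
        if (candidate == null || candidate.isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(candidate);
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Characters such as '{' or '}' are illegal in a URI and end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPlausibleUrl("http://zhidao.baidu.com/"));
        System.out.println(isPlausibleUrl("http://zhidao.baidu.com/){if(A===46){baidu.hide("));
    }
}
```

A filter along these lines could be applied to extracted candidates before they are recorded as outlinks; whether that belongs in the parser or in a URL filter is a design decision this sketch does not address.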
[jira] Created: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs

Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar

We have an ORM layer in the NutchBase branch, which uses the Avro Specific Compiler to compile class definitions given in JSON. Before moving on with this, we might benefit from evaluating whether other frameworks suit our needs. We want at least the following capabilities:

- Using POJOs
- Able to persist objects to at least HBase, Cassandra, and RDBMs
- Able to efficiently serialize objects as task outputs from Hadoop jobs
- Allow native queries, along with standard queries

Any comments or suggestions for other frameworks are welcome.
Re: [VOTE] Apache Tika 0.7 Release Candidate #1
(apologies for the cross-post, but this impacts Nutch 1.1, so just wanted folks to see it)

* +1 on extending the deadline until Monday, April 5th. Right now, we have 3 +1s, so technically we could still do the 72 hrs and still be OK, but I'm fine with giving folks some more time to take a look
* Thanks to jzitting and gsingers for taking a look and voting so far
* Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That OK, Nutchers?
* Thanks for comments on the CHANGES from gsingers, and the mention to include the sha1 of the src archive from jzitting. Will do on both, going forward.
* +1 for having a direct link to tika-app on the website.

Cheers,
Chris

On 4/1/10 11:41 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:
> Hi,
>
> On Wed, Mar 31, 2010 at 10:01 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
>> Please vote on releasing these packages as Apache Tika 0.7.
>
> +1
>
> Thanks! Some minor notes:
>
> * It would be good to have also a SHA1 checksum for the release archive.
> * Perhaps we should start offering also the tika-app jar as a direct download from l.a.o/tika/download.html?
>
>> The vote is open for the next 72 hours.
>
> It looks like people.apache.org is not accessible at the moment (I downloaded the release candidate yesterday), so it might be a good idea to extend the vote period over the Easter holidays.
>
> BR,
> Jukka Zitting

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
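The SHA1 checksum requested in this thread can be produced in several ways; as one illustration, here is a minimal Java sketch that uses only the JDK's `java.security.MessageDigest`. The class name `Sha1Checksum` and the command-line usage are assumptions made for the example and are not part of the Tika release tooling.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the Tika or Nutch build.
final class Sha1Checksum {

    // Returns the SHA-1 digest of the given bytes as a lower-case hex string.
    static String sha1Hex(byte[] data) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a standard algorithm name that every Java platform must support.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(data)) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path of the archive to checksum (read fully into memory for simplicity).
        System.out.println(sha1Hex(Files.readAllBytes(Paths.get(args[0]))));
    }
}
```

Reading the whole archive into memory keeps the sketch short; for very large files a streaming loop over `MessageDigest.update` would be the more careful choice.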
[jira] Created: (NUTCH-809) Parse-metatags plugin
Parse-metatags plugin

Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*

To use the legacy HTML parser, specify in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names as its parameter, with '*' as the default value. The values are separated by ';'. To extract the values of the metatags description and keywords, you must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued.

The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com
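The `metatags.names` setting described in NUTCH-809 (a ';'-separated list of names, with '*' as the default) can be illustrated with a short Java sketch. This is not the code from NUTCH-809.patch: the class `MetatagSelector`, its methods, and the choice to compare names case-insensitively are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the configuration semantics; not the plugin's implementation.
final class MetatagSelector {

    // Parses a value such as "description;keywords" into a set of lower-cased names.
    // A missing or blank value falls back to "*", mirroring the documented default.
    static Set<String> parseNames(String configured) {
        String value = (configured == null || configured.trim().isEmpty()) ? "*" : configured;
        Set<String> names = new LinkedHashSet<>();
        for (String name : value.split(";")) {
            String trimmed = name.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    // Keeps only the metatags whose names were requested; "*" keeps everything.
    static Map<String, String> select(Map<String, String> metatags, String configured) {
        Set<String> wanted = parseNames(configured);
        boolean all = wanted.contains("*");
        Map<String, String> selected = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metatags.entrySet()) {
            if (all || wanted.contains(entry.getKey().toLowerCase(Locale.ROOT))) {
                selected.put(entry.getKey(), entry.getValue());
            }
        }
        return selected;
    }
}
```

With a page carrying description, keywords, and robots metatags, a configured value of `description;keywords` would keep the first two, while `*` or an unset value would keep all three.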
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
Attachment: NUTCH-809.patch

Parse-metatags plugin
Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch
Re: [VOTE] Apache Tika 0.7 Release Candidate #1
Hi Chris,

> * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That OK, Nutchers?

Great. I'll definitely give 0.7 a try and make sure it works in Nutch.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
Attachment: (was: NUTCH-809.patch)

Parse-metatags plugin
Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika

Parse-metatags plugin
Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
Description:

h2. Parse-metatags plugin

The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names as its parameter, with '*' as the default value. The values are separated by ';'. To extract the values of the metatags description and keywords, you must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued.

The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

was:

h2. Parse-metatags plugin

*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*

To use the legacy HTML parser, specify in parse-plugins.xml:

{code:xml}
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
{code}

The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names as its parameter, with '*' as the default value. The values are separated by ';'. To extract the values of the metatags description and keywords, you must specify in nutch-site.xml:

{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}

The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued.

The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.

This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com

Parse-metatags plugin
Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch
[jira] Commented: (NUTCH-808) Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
[ https://issues.apache.org/jira/browse/NUTCH-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852840#action_12852840 ]

Enis Soztutar commented on NUTCH-808:

A candidate framework is DataNucleus. It has the following benefits:

- Apache 2 license
- JDO support
- HBase, RDBMS, and XML persistence

I will further investigate whether we can integrate Hadoop Writables/Avro serialization so that objects can be passed from MapReduce jobs.

Evaluate ORM Frameworks which support non-relational column-oriented datastores and RDBMs
Key: NUTCH-808
URL: https://issues.apache.org/jira/browse/NUTCH-808
Project: Nutch
Issue Type: Task
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Re: [VOTE] Apache Tika 0.7 Release Candidate #1
On 2010-04-02 16:14, Mattmann, Chris A (388J) wrote:
> * Once Tika 0.7 is out the door, I will move forward on pushing out a Nutch 1.1 RC (after we upgrade Nutch to use Tika 0.7 -- Julien, help? :) ). That OK, Nutchers?

Yes - thanks!

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: [VOTE] Apache Tika 0.7 Release Candidate #1
Hey Jukka,

Sounds good to me then if no one else objects. I'll wait the 72 hrs (Sat, 4:01 PM EST) and then, assuming the VOTE passes, roll the releases out to the mirrors and then work on Nutch 1.1.

Cheers,
Chris

On 4/2/10 11:41 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:
> Hi,
>
> On Fri, Apr 2, 2010 at 4:14 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
>> +1s, so technically we could still do the 72 hrs and still be OK, but I'm fine with giving folks some more time to take a look
>
> I'm fine with closing the vote already at 72 hours since the p.a.o outage only seemed to last a few hours.
>
> Jukka