Move Nutch to Hadoop 2.X
Hi, My name is Dulaj Viduranga and I’m a 3rd year Computer Science and Engineering student at the University of Moratuwa, Sri Lanka. I’m excited about the Move Nutch to Hadoop 2.X project and I would like to contribute to it. Also, if you are willing, I would be very excited to have this as my GSoC 2015 project this summer. Please let me know how to get involved. Thank you. Dulaj Viduranga.
Re: Move Nutch to Hadoop 2.X
Great, Dulaj. I think one of the starting points would be to engage via JIRA, since I think Lewis has created a JIRA issue for this and tagged the appropriate issue as gsoc2015. We would welcome you via GSoC, and I recommend you begin engaging via JIRA to get started on your proposal ASAP. Cheers and welcome!
Cheers, Chris
++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
-Original Message- From: Dulaj Viduranga vidura...@icloud.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Wednesday, February 11, 2015 at 6:25 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Move Nutch to Hadoop 2.X
Re: org.mortbay.proxy package not found in nutch 1.x, Ref Class - ProxyTestbed
Hi,
the jetty-client-6.1.22.jar is a dependency needed only for testing. Consequently, it is placed in build/test/lib/, but only if you run the tests or call
% ant resolve-test
There is also a target
% ant eclipse
which writes a complete Eclipse project configuration. If dependencies change, you sometimes have to run it again. Of course, even with this config you have to run
% ant resolve-default resolve-test
after a clean to copy all dependencies into build/{lib,test/lib}/
Best, Sebastian
On 02/11/2015 05:00 AM, Preetam Pradeepkumar Shingavi wrote:
Hi, I am trying to configure Nutch 1.X in Eclipse, and I configured the build path to include all jars from the build-lib folder. There is a class, ProxyTestbed.java, which has an error importing the following package:
import org.mortbay.proxy.AsyncProxyServlet; (proxy package not found)
As far as I can tell, this class is supposed to be loaded from jetty-6.1.26.jar, but it is not actually present in that jar. Am I missing anything here? Do I need to download another jar? Thanks in advance!
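To check which jar actually provides a class such as org.mortbay.proxy.AsyncProxyServlet, the jars copied into build/lib and build/test/lib can be scanned directly. The following is only a minimal Python sketch and is not part of Nutch: the function name find_class_in_jars is hypothetical, and the two directory names are taken from the reply above.

```python
import zipfile
from pathlib import Path


def find_class_in_jars(class_name, lib_dirs):
    """Return the paths of all jars under lib_dirs that contain class_name.

    find_class_in_jars is a hypothetical helper for illustration only.
    """
    # A class a.b.C is stored in a jar as the entry a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for lib_dir in lib_dirs:
        directory = Path(lib_dir)
        if not directory.is_dir():
            # e.g. build/test/lib/ is missing until the test dependencies are resolved
            continue
        for jar in sorted(directory.glob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if entry in zf.namelist():
                        hits.append(str(jar))
            except zipfile.BadZipFile:
                # Skip files that are not valid jar/zip archives
                continue
    return hits


if __name__ == "__main__":
    # Directory names as mentioned in the message above
    for jar_path in find_class_in_jars(
        "org.mortbay.proxy.AsyncProxyServlet", ["build/lib", "build/test/lib"]
    ):
        print(jar_path)
```

If the script prints nothing when run from the Nutch source directory, the test dependencies have probably not been resolved yet.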
[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317186#comment-14317186 ]
Markus Jelsma commented on NUTCH-1925:
I'll check it out and check it in tomorrow.
Upgrade Tika to version 1.7
Key: NUTCH-1925
URL: https://issues.apache.org/jira/browse/NUTCH-1925
Project: Nutch
Issue Type: Improvement
Components: build
Reporter: Tyler Palsulich
Assignee: Markus Jelsma
Priority: Blocker
Fix For: 1.10, 2.3.1
Attachments: NUTCH-1925.palsulich.patch, NUTCH-1925.palsulich.v2.patch
Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant API changes between 1.6 and 1.7. So, this should be a one line update.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1925) Upgrade Tika to version 1.7
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich updated NUTCH-1925:
Attachment: NUTCH-1925.palsulich.v2.patch
Updated patch which includes an update to the instructions for how to upgrade Tika (sed script to format the required jars list). All tests pass on my computer (no tests commented out).
[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317132#comment-14317132 ]
Lewis John McGibbney commented on NUTCH-1925:
Any objection to commit folks?
[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317187#comment-14317187 ]
Markus Jelsma commented on NUTCH-1925:
I'll check it out, and check it in tomorrow.
[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type
[ https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1928:
Attachment: NUTCH-1928v4.patch
[~jorgelbg] please check out this new patch. It includes all of the necessary additions to build.xml as well as default.properties and the plugin build configuration. What we are missing is your configuration file key, value and description for the mimetype-filter.txt files within nutch-default.xml. Can you please add the latter? Once this is done, this patch is well and truly ready to make it in IMHO. Thanks Jorge.
Indexing filter of documents by the MIME type
Key: NUTCH-1928
URL: https://issues.apache.org/jira/browse/NUTCH-1928
Project: Nutch
Issue Type: Improvement
Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
Labels: filter, mime-type, plugin
Fix For: 1.10
Attachments: NUTCH-1928v4.patch, mimetype-patch-v3.patch
This allows filtering the indexed documents by the MIME type property of the crawled content. Basically, this lets you restrict the MIME types of the content that will be stored in the Solr/Elasticsearch index without the need to restrict the crawling/parsing process, so there is no need to use the URLFilter plugin family. This also addresses one particular corner case, where certain URLs have no suffix to filter on, such as some RSS feeds (http://www.awesomesite.com/feed), and end up in your index mixed with all your HTML content. A configuration file can be specified in the {{mimetype.filter.file}} property in {{nutch-site.xml}}. This file uses the same format as the {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an {{allow all}} policy is used instead, so all your crawled documents will be indexed.
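The policy described in the issue (use the rules from a filter file when one is configured, otherwise allow everything) can be illustrated with a small Python sketch. This is not the plugin's code. The rule format below is a simplified assumption, loosely modeled on a suffix-style filter file: an optional first line of `+` (allow by default) or `-` (deny by default), followed by MIME types that receive the opposite treatment. The names parse_rules and should_index are hypothetical.

```python
def parse_rules(text):
    """Parse a simplified, hypothetical rule file.

    Assumed format (not the plugin's actual parser):
      - lines starting with '#' and blank lines are ignored
      - an optional first rule line starting with '+' or '-' sets the default policy
      - every other line is a MIME type that gets the opposite of the default
    """
    default_allow = True
    listed = set()
    seen_first = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not seen_first:
            seen_first = True
            if line[0] in "+-":
                default_allow = line[0] == "+"
                continue
        listed.add(line.lower())
    return default_allow, listed


def should_index(mime_type, rules=None):
    """Decide whether a document with the given MIME type is indexed."""
    if rules is None:
        # No filter file configured: "allow all", as described in the issue
        return True
    default_allow, listed = rules
    if mime_type.lower() in listed:
        return not default_allow
    return default_allow


if __name__ == "__main__":
    # Deny by default, index only HTML
    rules = parse_rules("# index HTML only\n-\ntext/html\n")
    print(should_index("text/html", rules))
    print(should_index("application/rss+xml", rules))
```

With rules like these, an RSS feed served from a URL without a suffix would be kept out of the index by its MIME type while still being crawled and parsed.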