Move Nutch to Hadoop 2.X

2015-02-11 Thread Dulaj Viduranga
Hi,
My name is Dulaj Viduranga and I’m a 3rd year Computer Science and 
Engineering student at the University of Moratuwa, Sri Lanka.
I’m excited about the Move Nutch to Hadoop 2.X project and I would like 
to contribute to it. Also, if you are willing, I would be very excited to 
take this on as my GSoC 2015 project this summer.
Please let me know how to get involved.

Thank you.
Dulaj Viduranga.

Re: Move Nutch to Hadoop 2.X

2015-02-11 Thread Mattmann, Chris A (3980)
Great, Dulaj. I think one of the starting points would be to
work to engage via JIRA since I think Lewis has created a JIRA
issue for this and tagged the appropriate issue as gsoc2015.

We would welcome you via GSOC and I recommend you begin engaging
via JIRA to get started on your proposal ASAP.

Cheers and welcome!

Cheers,
Chris



++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++









Re: org.mortbay.proxy package not found in nutch 1.x, Ref Class - ProxyTestbed

2015-02-11 Thread Sebastian Nagel
Hi,

the jetty-client-6.1.22.jar
is a dependency needed only for testing.
Consequently, it's placed in
 build/test/lib/
but only if you run the tests or call
 % ant resolve-test

There is also a target
 % ant eclipse
which writes a complete Eclipse project configuration.
Sometimes, if dependencies change, you have to run it again.

Of course, even with this config you have to run
 % ant resolve-default resolve-test
after a clean to copy all dependencies into build/{lib,test/lib}/

Best,
Sebastian

On 02/11/2015 05:00 AM, Preetam Pradeepkumar Shingavi wrote:
 Hi,
 
 I am trying to configure Nutch 1.X in Eclipse, and configured the build path 
 to include all jars from the build-lib folder.
 
 There is a class, ProxyTestbed.java, which has an error importing the 
 following package:
 import org.mortbay.proxy.AsyncProxyServlet; (proxy package not found)
 
 As far as I can tell, this class is supposed to load from jetty-6.1.26.jar, 
 but it is not actually present in this jar.
 
 Am I missing anything here? Do I need to download another jar?
 
 Thanks in advance !



[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-11 Thread Markus Jelsma (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317186#comment-14317186 ]

Markus Jelsma commented on NUTCH-1925:
--------------------------------------

I'll check it out and check it in tomorrow.

 Upgrade Tika to version 1.7
 ---------------------------

                 Key: NUTCH-1925
                 URL: https://issues.apache.org/jira/browse/NUTCH-1925
             Project: Nutch
          Issue Type: Improvement
          Components: build
            Reporter: Tyler Palsulich
            Assignee: Markus Jelsma
            Priority: Blocker
             Fix For: 1.10, 2.3.1

         Attachments: NUTCH-1925.palsulich.patch, NUTCH-1925.palsulich.v2.patch


 Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant 
 API changes between 1.6 and 1.7, so this should be a one-line update.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-11 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated NUTCH-1925:
-----------------------------------
Attachment: NUTCH-1925.palsulich.v2.patch

Updated patch which includes an update to the instructions for how to upgrade 
Tika (sed script to format the required jars list).

All tests pass on my computer (no tests commented out).



[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-11 Thread Lewis John McGibbney (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317132#comment-14317132 ]

Lewis John McGibbney commented on NUTCH-1925:
---------------------------------------------

Any objection to committing this, folks?



[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-11 Thread Markus Jelsma (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317187#comment-14317187 ]

Markus Jelsma commented on NUTCH-1925:
--------------------------------------

I'll check it out and check it in tomorrow.




[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-11 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1928:
----------------------------------------
Attachment: NUTCH-1928v4.patch

[~jorgelbg] please check out this new patch. It includes all of the necessary 
additions to build.xml as well as default.properties and the plugin build 
configuration.
What we are missing is your configuration file key, value and description for 
the mimetype-filter.txt files within nutch-default.xml.
Can you please add the latter?
Once this is done this patch is well and truly ready to make it in IMHO.
Thanks Jorge.

 Indexing filter of documents by the MIME type
 ---------------------------------------------

                 Key: NUTCH-1928
                 URL: https://issues.apache.org/jira/browse/NUTCH-1928
             Project: Nutch
          Issue Type: Improvement
          Components: indexer, plugin
            Reporter: Jorge Luis Betancourt Gonzalez
            Assignee: Jorge Luis Betancourt Gonzalez
              Labels: filter, mime-type, plugin
             Fix For: 1.10

         Attachments: NUTCH-1928v4.patch, mimetype-patch-v3.patch


 This allows filtering the indexed documents by the MIME type property of the 
 crawled content. Basically, it lets you restrict the MIME types of the 
 content stored in the Solr/Elasticsearch index without restricting the 
 crawling/parsing process, so there is no need to use the URLFilter plugin 
 family. It also addresses one particular corner case: certain URLs, such as 
 some RSS feeds (http://www.awesomesite.com/feed), don't have any format to 
 filter on, so they end up in your index mixed with all your HTML content.
 A configuration file can be specified in the {{mimetype.filter.file}} property 
 in {{nutch-site.xml}}. This file uses the same format as the 
 {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
 {{allow all}} policy is used instead, so all your crawled documents will be 
 indexed.
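
 For illustration only, the property described above might be set in
 nutch-site.xml roughly as follows. The key name comes from the description;
 the file name mimetype-filter.txt is taken from the comment on this issue;
 the description text is an assumed wording, not the text shipped in
 nutch-default.xml.

```xml
<!-- Sketch of a nutch-site.xml entry. The value and the description text
     are assumptions for illustration, not the project's defaults. -->
<property>
  <name>mimetype.filter.file</name>
  <value>mimetype-filter.txt</value>
  <description>
    Assumed wording: name of the file holding the MIME type filtering
    rules, in the same format as the urlfilter-suffix plugin. If this
    key is absent, an allow-all policy applies.
  </description>
</property>
```

 Leaving the property out altogether should, per the description, fall back
 to the allow-all policy, so every crawled document is indexed.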


