Re: Nutch in production

2016-09-29 Thread Mattmann, Chris A (3980)
Yep also check out the work that Sujen Shah just merged (also on my team at JPL and USC) where you can publish events to an ActiveMQ queue from Nutch crawling. That should allow all sorts of production dashboards and analytics.

Re: Apache Nutch 2.x and Spark tutorial

2016-08-02 Thread Mattmann, Chris A (3980)
Hi Gaurav, It doesn’t exist yet. However my group at USC is working on a project called Sparkler [1] that does that, but we haven’t made a release yet. We are actively working on it though! Cheers, Chris [1] http://github.com/USCDataScience/sparkler.git

Re: Unable to find documentation for Nutch 1.12, Wiki is outdated

2016-08-01 Thread Mattmann, Chris A (3980)
tz On Mon, Aug 1, 2016, 1:04 PM Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote: Great work Sebastien thank you for this. Would you be willing to update the wiki with this info? Please let me know your username a

Re: [VOTE] Release Apache Nutch 1.12

2016-06-16 Thread Mattmann, Chris A (3980)
+1 from me, great job Lewis and team! SIGS pass, CHECKSUMS pass: LMC-053601:apache-nutch-1.12-rc1 mattmann$ $HOME/bin/stage_apache_rc apache-nutch 1.12-bin https://dist.apache.org/repos/dist/dev/nutch/1.12/ % Total % Received % Xferd Average Speed Time Time Time Current

Re: Robots.txt

2016-05-24 Thread Mattmann, Chris A (3980)
On 5/24/16, 3:24 PM, "BlackIce" <blackice...@gmail.com> wrote: I don't recall messing with anything to do with robots.txt, I want us to be as polite as possible. On May 25, 2016 12:22 AM, "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nas

Re: Robots.txt

2016-05-24 Thread Mattmann, Chris A (3980)
Hi, For security research, there is an option to whitelist robots.txt. It’s not enabled by default and must be explicitly enabled. The solution is: there isn’t one. People used to just hack Nutch and do the same thing by commenting out the line of code that performed the same check. Those

Re: Nutch Docker Images Available on Dockerhub

2016-05-15 Thread Mattmann, Chris A (3980)
great work! ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov

Re: Visualization Tool for Nutch

2016-05-02 Thread Mattmann, Chris A (3980)
Bin, I completely agree. My team built the following: 1. Memex Explorer (http://github.com/memex-explorer/memex-explorer), no longer actively developed, which used Bokeh.js and streaming publishing from Nutch (under development) to publish events and visualize crawls. 2. We are using D3.js in my

Re: nutch-selenium help

2016-04-12 Thread Mattmann, Chris A (3980)
i have used https://wiki.apache.org/nutch/AdvancedAjaxInteraction On Apr 13, 2016, at 1:30 AM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote: Hi, the plugin is now part of Nutch, so you don’t need to use the Github one and

Re: nutch-selenium help

2016-04-12 Thread Mattmann, Chris A (3980)
<sabah.kh...@wayne.edu> wrote: The link that I provided is the same as the one on the wiki page. On Apr 13, 2016, at 1:13 AM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote: Please use the selenium plugin that is part of Nut

Re: Best Practices for Plugin Dev and Deployment

2016-04-08 Thread Mattmann, Chris A (3980)
Hi Chris, thanks for the response, here are some elaborations of my initial questions on the basis of your reply. On Wed, Apr 6, 2016 at 2:12 PM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote: Hi Thiago, Welcome!

Re: Best Practices for Plugin Dev and Deployment

2016-04-06 Thread Mattmann, Chris A (3980)
Hi Thiago, Welcome! First thing to check out: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer I would follow that by checking out info on how to use our Source Code repo: http://wiki.apache.org/nutch/UsingGit OK now on to your specific questions: On 4/6/16, 8:48 AM, "Thiago

Re: [NOTICE] Nutch now using Writeable Git repos at the ASF

2016-03-01 Thread Mattmann, Chris A (3980)
Markus, Also have a look at git-svn which is a tool that allows SVN commands and git to work together. Cheers, Chris

[NOTICE] Nutch now using Writeable Git repos at the ASF

2016-02-26 Thread Mattmann, Chris A (3980)
Hi Team, Nutch now officially uses Git to manage its source repos. You can see the final elements to that here: https://issues.apache.org/jira/browse/INFRA-11300 I’ve written a guide for the wiki describing how to migrate your existing SVN checkout to Nutch if you are a user or a developer.

Re: Frontera: large-scale, distributed web crawling framework

2016-02-14 Thread Mattmann, Chris A (3980)
e cases, so we could compare. It’s up to us which direction to choose, but I think 1. and 2. options are most important. Currently, Frontera is moving towards the ease of use: ZeroMQ transport, transport layer abstraction, standalone Frontera/Scrapy based c

Re: [MASSMAIL]Extract Contact Information - Custom Parser

2016-02-12 Thread Mattmann, Chris A (3980)
That’s a cool idea, but how would we set up the redirect? Wouldn’t that have to occur at SO?

Re: [MASSMAIL]Extract Contact Information - Custom Parser

2016-02-12 Thread Mattmann, Chris A (3980)
I have a student working on this right now. One thing - Tika has a PhoneNumber Content Handler and it would be leveraged here in such a plugin type in Nutch. Tyler Palsulich worked on it from our DARPA work.

Re: [MASSMAIL]Extract Contact Information - Custom Parser

2016-02-12 Thread Mattmann, Chris A (3980)
ct Contact Information - Custom Parser we could create an account for the project at SO, give the user list as an email address and set up an alert so that any question tagged as [nutch] gets sent to user@nutch.apache.org. That should work, shouldn't it? On 12 February

Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-20 Thread Mattmann, Chris A (3980)
My bad I said I would do this! Here you go it’s done: +1 SIGS, checksums check out: [chipotle:~/tmp/apache-nutch-2.3.1-rc2] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1-src https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/ % Total % Received % Xferd Average Speed Time

Re: How to deploy Selenium on Server?

2015-12-22 Thread Mattmann, Chris A (3980)
you said this plugin is old, do you have some recommendations for me, which is easy to deploy as I am quite an inexperienced Nutch user? Thanks again, Mattmann. Best Regards, Byzen. Ma 2015-12-22 1:44 GMT+08:00 Mattmann, Chris A (3980) <chris.a.mattm...@jpl.

Re: How to deploy Selenium on Server?

2015-12-21 Thread Mattmann, Chris A (3980)
Hi Byzen, That’s the old plugin, we integrated it into Nutch trunk. Have a look at it integrated with Nutch here: https://wiki.apache.org/nutch/AdvancedAjaxInteraction Cheers, Chris

Re: Anthelion from Yahoo

2015-12-17 Thread Mattmann, Chris A (3980)
Got it. Seems like there is great overlap here with the work that Sujen and Asitang and our team at JPL already did directly in Nutch to allow focused crawling based on Naive Bayes and also scoring similarity using cosine similarity. A cool project would be to compare the approaches (at least
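
To make the "scoring similarity using cosine similarity" idea concrete, here is a minimal Python sketch. It is an illustration only, not the Nutch scoring plugin itself: the whitespace tokenization and the function name `cosine_similarity` are assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts.

    Assumption: naive lowercase whitespace tokenization, no stemming or
    stop-word removal. A real focused crawler would likely do more.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms the two documents share.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)
```

A focused crawler could compare each fetched page against a "gold standard" document and prioritize outlinks from pages that score above some threshold; the threshold would need tuning per crawl.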

Re: Anthelion from Yahoo

2015-12-16 Thread Mattmann, Chris A (3980)
What's it do? Sent from my iPhone On Dec 16, 2015, at 6:55 PM, Otis Gospodnetić wrote: Hi, FYI: https://github.com/yahoo/anthelion Anyone tried using it yet? Otis -- Monitoring - Log Management - Alerting - Anomaly Detection Solr &

Re: [RELEASE] Apache Nutch 1.11

2015-12-08 Thread Mattmann, Chris A (3980)
Booom Sent from my iPhone On Dec 8, 2015, at 3:18 PM, Michael Joyce wrote: Cheers for pushing this out Lewis. And great job everyone on the hard work!!! -- Jimmy On Tue, Dec 8, 2015 at 1:26 AM, Markus Jelsma

Re: [VOTE] Release Apache Nutch 1.11 RC#2

2015-12-04 Thread Mattmann, Chris A (3980)
Hi Lewis, +1 from me. SIGS and CHECKSUMS check out. bash-3.2$ for atype in bin src; do /Users/mattmann/bin/stage_apache_rc apache-nutch 1.11-$atype https://dist.apache.org/repos/dist/dev/nutch/1.11rc2/; done % Total % Received % Xferd Average Speed Time Time Time Current

Re: [ANNOUNCE] New Nutch committer and PMC - Michael Joyce

2015-11-10 Thread Mattmann, Chris A (3980)
welcome, Mike!

Re: Frontera: large-scale, distributed web crawling framework

2015-10-28 Thread Mattmann, Chris A (3980)
Hi Alex, I didn’t see any more traffic about this. Are you still looking for feedback? Are there any plans to make Frontera and Nutch work together? I’m still interested of course. Thanks, Chris

[VOTE] Apache Nutch 1.11 Release Candidate #1

2015-10-25 Thread Mattmann, Chris A (3980)
Hi Folks, A first candidate for the Nutch 1.11 release is available at: https://dist.apache.org/repos/dist/dev/nutch/1.11/ The release candidate is a zip archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.11-rc1/ The SHA1 checksum of the archive is
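
For readers who want to check a release candidate's SHA1 checksum themselves, the following Python sketch shows one way to do it. It is not the `stage_apache_rc` script used elsewhere in this archive; the function names and the idea of passing the published digest as a string are assumptions for illustration.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA1 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, published: str) -> bool:
    """Compare a local archive against the digest published with the vote.

    Assumption: `published` holds only the hex digest, possibly with
    surrounding whitespace or upper-case letters.
    """
    return sha1_of_file(path) == published.strip().lower()
```

A checksum only shows the download is intact; verifying the PGP signature against the project's KEYS file is a separate step.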

Re: Apache Nutch Output structure

2015-10-02 Thread Mattmann, Chris A (3980)
Please see: http://wiki.apache.org/nutch/NutchFileFormats

Re: Frontera: large-scale, distributed web crawling framework

2015-10-02 Thread Mattmann, Chris A (3980)
Hi, I don’t think Alexander is doing anything wrong. In fact, he’s asking for input on his web crawling framework on the Nutch user list which I imagine contains many people interested in distributed web crawling. There doesn’t appear to be a direct Nutch connection here in his framework,

Re: [VOTE] Release Apache Nutch 2.3.1

2015-09-30 Thread Mattmann, Chris A (3980)
+1 from me: [chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1 https://dist.apache.org/repos/dist/dev/nutch/2.3.1 [chipotle:~/tmp/nutch-2.3.1-rc1] mattmann% $HOME/bin/stage_apache_rc apache-nutch 2.3.1-src https://dist.apache.org/repos/dist/dev/nutch/2.3.1 %

Re: [VOTE] Release Apache Nutch 2.3.1

2015-09-30 Thread Mattmann, Chris A (3980)
I’ll download and VOTE on the release right now Lewis.

Re: Webcast : Apache Nutch on EMR

2015-09-23 Thread Mattmann, Chris A (3980)
Thanks Julien, great work

Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-09 Thread Mattmann, Chris A (3980)
welcome Asitang!!

Re: Nutch on the cloud

2015-07-23 Thread Mattmann, Chris A (3980)
maybe a repository of frequent problems? that sort? thanks for the heads up on the other guide. gave me a starting point. On Thu, Jul 23, 2015 at 6:24 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Ankit for the honest feedback. Would you be willing to update our wiki

Re: Nutch on the cloud

2015-07-22 Thread Mattmann, Chris A (3980)
Thanks Ankit for the honest feedback. Would you be willing to update our wiki and improve the instructions based on your experiences for our gotchas? We have a guide we have been working on ourselves for getting Nutch running and churning on Elastic MapReduce. That’s where I’d recommend starting.

RE: CXF dependency on 1.10

2015-06-23 Thread Mattmann, Chris A (3980)
Hey Markus, I wonder if the Nutch pom.xml was updated on release? Looks like it was b/c it refs CXF: http://repo1.maven.org/maven2/org/apache/nutch/nutch/1.10/nutch-1.10.pom Also 3.0.4 for CXF is available in Central: http://repo1.maven.org/maven2/org/apache/cxf/cxf/3.0.4/ Not sure why it's

RE: REST API for crawling

2015-06-12 Thread Mattmann, Chris A (3980)
Thank you Dzmitry! All, FYI too - Nutch 1.x has an actively developed REST API. We are targeting it for integration as a mechanism for both the Nutch admin GUI (GSoC Project last summer) and for Memex Explorer (http://github.com/memex-explorer/memex-explorer). We are also building a Nutch python

Re: AW: Deduplication -- custom Signature

2015-06-03 Thread Mattmann, Chris A (3980)
Yes, big time interest, Breno! Thanks and would appreciate your contribution. Instructions are here if you use Github: http://github.com/apache/nutch/#contributing, otherwise, JIRA and SVN patch would be fine too. Thanks! Cheers, Chris

Re: about language extraction for zip documents

2015-05-31 Thread Mattmann, Chris A (3980)
+1, agreed. This would be a welcome addition. Cheers, Chris

Re: Reverse Geocoding with Nutch 1.10

2015-05-06 Thread Mattmann, Chris A (3980)
awesome job Lewis

Re: [VOTE] Release Apache Nutch 1.10

2015-05-03 Thread Mattmann, Chris A (3980)
+1 from me! SIGS, CHECKSUMS check out, looks gr8. [chipotle:~/tmp/apache-nutch-1.10-rc1] mattmann% $HOME/bin/stage_apache_rc apache-nutch 1.10-src https://dist.apache.org/repos/dist/dev/nutch/1.10/ % Total % Received % Xferd Average Speed Time Time Time Current

Re: A bug in org.apache.nutch.parse.ParseUtil?

2015-04-20 Thread Mattmann, Chris A (3980)
Sounds great, Arkadi (isAnySuccess()). Please submit a pull request and/or patch when you get a chance. This sounds like a needed change for sure.

Re: HTTP Post Authentication

2015-04-07 Thread Mattmann, Chris A (3980)
Thanks Tizy - adding Tyler to this in case he didn’t see it. Tyler is this what you were running into? Thoughts?

Re: website structure discovery?

2015-03-30 Thread Mattmann, Chris A (3980)
Hi Scott, It’s a pretty good tool for that - it is a Web Crawler, which is used to discover the web graph of a domain or of the entire internet - from pages, to documents, to images, to other web resources. Nutch crawls, identifies URLs, fetches them, parses them, and indexes them for search. It

Re: [MASSMAIL]Re: website structure discovery?

2015-03-30 Thread Mattmann, Chris A (3980)
On Mar 30, 2015, at 10:28 AM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote: Hi Scott, It’s a pretty good tool for that - it is a Web Crawler, which is used

Re: Crawl images and store locally

2015-03-24 Thread Mattmann, Chris A (3980)
Hi Tizy, After you crawl the images, take a look at ./bin/nutch dump to get the images out. ./bin/nutch commoncrawldumper also will dump into the common crawl format. Cheers, Chris

Re: [ANNOUNCE] New Nutch committer and PMC - Mo Omer

2015-03-22 Thread Mattmann, Chris A (3980)
Welcome, Mo!

Re: Nutch with Selenium pops up Firefox window

2015-02-24 Thread Mattmann, Chris A (3980)
JayavanthShenoy On Feb 23, 2015 10:11 PM, Mattmann, Chris A (3980) [via Lucene] ml-node+s472066n4188439...@n3.nabble.com wrote: Thanks what’s your username?

Re: Nutch with Selenium pops up Firefox window

2015-02-23 Thread Mattmann, Chris A (3980)
registered on Apache Nutch wiki. Please add me. Thanks, Jay On Sun, Feb 22, 2015 at 4:35 PM, Mattmann, Chris A (3980) [via Lucene] ml-node+s472066n418810...@n3.nabble.com wrote: woot! Jay can you please add this to the wiki? https://wiki.apache.org/nutch/AdvancedAjaxInteraction

Re: URL filter plugins for nutch

2015-02-22 Thread Mattmann, Chris A (3980)
Hey Markus, We mean exact and near duplicates (defined by a similarity metric). Cheers, Chris
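
The distinction between exact and near duplicates can be sketched in Python as follows. This is an illustrative example under stated assumptions, not the approach used in the thread: exact duplicates are detected with an MD5 digest of the text, and near duplicates with Jaccard similarity over word shingles. The shingle size and the 0.8 threshold are arbitrary choices for the example.

```python
import hashlib


def exact_signature(text: str) -> str:
    """Digest of the full text; equal digests indicate exact duplicates."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text: str, size: int = 3) -> set:
    """Set of overlapping word n-grams (shingles) of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a: str, text_b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, in the range [0, 1]."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: near duplicate if similarity meets the threshold."""
    return jaccard(text_a, text_b) >= threshold
```

Any other similarity metric (cosine similarity, simhash distance) could be swapped in for `jaccard`; the point is only that "near duplicate" needs a metric plus a threshold, whereas "exact duplicate" needs only equality of signatures.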

Re: Nutch with Selenium pops up Firefox window

2015-02-22 Thread Mattmann, Chris A (3980)
Thanks Mo! Jay, would you be able to add these tips to: https://wiki.apache.org/nutch/AdvancedAjaxInteraction Would appreciate an FAQ section there. You will need to register on the wiki.apache.org/nutch/ site, then let me know afterwards so I can add you to the ContributorsGroup. Cheers, Chris

Re: Nutch with Selenium pops up Firefox window

2015-02-22 Thread Mattmann, Chris A (3980)
woot! Jay can you please add this to the wiki? https://wiki.apache.org/nutch/AdvancedAjaxInteraction

Re: URL filter plugins for nutch

2015-02-22 Thread Mattmann, Chris A (3980)
Thanks Markus, you are correct it would be a bit more straightforward. However, in the URLFilter, students can create a class that takes in a NutchConfiguration object, which provides paths to the relevant Databases, and then uses the associated Java reader classes, e.g., LinkDbReader (with the

Re: subscribe to the mailing list (CSCI572)

2015-02-22 Thread Mattmann, Chris A (3980)
Please send an email to user-subscr...@nutch.apache.org Cheers, Chris

Re: [ANNOUNCE] New Nutch committer and PMC - Jorge Luis Betancourt Gonzalez

2015-02-19 Thread Mattmann, Chris A (3980)
Welcome to the party, Jorge! Cheers, Chris

Re: Newbie

2015-02-08 Thread Mattmann, Chris A (3980)
Thanks Trevor. Moving user-owner@n.a.o to BCC since I think you meant to ask this on the user@n.a.o list. I think the best bet is to check out the Nutch wiki with several tutorials and other info on how to get started. Also we would welcome you to join the dev and user lists (by sending blank

Re: unsubscribe

2015-02-08 Thread Mattmann, Chris A (3980)
Please send an email to dev-unsubscr...@nutch.apache.org and user-unsubscr...@nutch.apache.org and follow the instructions from there. [moved dev@nutch.a.o and user@nutch.a.o to BCC]

Re: InvertLinks Performance Nutch 1.6

2015-02-05 Thread Mattmann, Chris A (3980)
WOW friggin awesome

Re: How to implement an own crawler for specific tasks with nutch?

2015-02-01 Thread Mattmann, Chris A (3980)
Hi Yusniel, Thanks for your question and for using Nutch! Yep it’s possible to implement a focused crawler, which is defined hopefully by the following criteria: 1. partitioned URL space (in Nutch you use URL filters and normalizers for this and seed lists and injection) 2. only certain content
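
The first criterion, partitioning the URL space with URL filters, can be sketched in Python. The example imitates the `+`/`-` rule style of Nutch's regex URL filter file, where the first matching rule decides and unmatched URLs are dropped. It is a simplified illustration, not the plugin's code, and the rules and the example.org domain are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical rules: '-' rejects, '+' accepts, '#' starts a comment.
RULES_TEXT = r"""
# skip common image suffixes
-\.(gif|jpg|png)$
# stay inside the example.org domain
+^https?://([a-z0-9-]+\.)*example\.org/
"""


def parse_rules(text: str) -> List[Tuple[bool, "re.Pattern"]]:
    """Parse rule lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def filter_url(url: str, rules) -> Optional[str]:
    """Return the URL if the first matching rule accepts it, else None."""
    for accept, pattern in rules:
        if pattern.search(url):
            return url if accept else None
    return None  # no rule matched, so the URL is dropped
```

With these rules, `filter_url("http://www.example.org/page", parse_rules(RULES_TEXT))` keeps the URL, while an image URL on the same host or any URL on another domain is dropped. Rule order matters because the first match wins.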

Re: [MASSMAIL]Re: How to implement an own crawler for specific tasks with nutch?

2015-02-01 Thread Mattmann, Chris A (3980)
- Original Message - From: Chris A Mattmann (3980) chris.a.mattm...@jpl.nasa.gov To: user@nutch.apache.org Sent: Sunday, February 1, 2015 11:44:45 Subject: [MASSMAIL]Re: How to implement an own crawler for specific tasks with nutch? Hi Yusniel, Thanks for your question and for using Nutch

Re: How to implement an own crawler for specific tasks with nutch?

2015-02-01 Thread Mattmann, Chris A (3980)
+1 thanks Markus

Re: Nutch IRI URIs

2015-01-30 Thread Mattmann, Chris A (3980)
-tabpanel#comment-13968762 2015-01-29 21:59 GMT+01:00 Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov: Thanks Talat, good question. So what you want are the URLs to actually come through with encoding and stuff like the 2nd example? I think that can be done via a URL filter

Re: Nutch IRI URIs

2015-01-29 Thread Mattmann, Chris A (3980)
support. IMHO We should add IRI support in urlnormalizer-basic. Wdyt ? Talat 2015-01-29 8:05 GMT+02:00 Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov: Hi Talat, What are these? I’m sorry but do you have a pointer (sorry if it’s obvious). Cheers, Chris
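
As background for the IRI discussion, the sketch below shows one way an IRI can be mapped to an ASCII-only URI in Python: the host is converted with IDNA and non-ASCII characters elsewhere are percent-encoded as UTF-8. This is only an illustration of the general idea, not how urlnormalizer-basic works or would work. The function name `iri_to_uri` is hypothetical, and the sketch deliberately ignores user-info in the authority part.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in paths; '%' is kept so existing escapes survive.
_SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to an ASCII URI (simplified; drops user-info if present)."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Internationalized host names are encoded with IDNA (Punycode).
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    # Non-ASCII characters are UTF-8 encoded, then percent-escaped.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

Whether a crawler should store the IRI form, the URI form, or both is a design choice; normalizing to one form at least avoids fetching the same page twice under two spellings.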

Re: Nutch IRI URIs

2015-01-28 Thread Mattmann, Chris A (3980)
Hi Talat, What are these? I’m sorry but do you have a pointer (sorry if it’s obvious). Cheers, Chris

Re: Differences between parse-html and parse-tika for generation of parse metadata

2015-01-10 Thread Mattmann, Chris A (3980)
aye!

Re: nutchpy

2015-01-09 Thread Mattmann, Chris A (3980)
Yep it's awesome work funded by the DARPA memex project and our team. Cc'ing Andy Terrel for awareness thanks Lewis! Sent from my iPhone On Jan 9, 2015, at 6:04 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Folks, Just wanted to make folk aware of some work Continuum

Fwd: nutchpy

2015-01-09 Thread Mattmann, Chris A (3980)
/seqreader-app-1.0-SNAPSHOT-jar-with-dependencies.jar' On Sat, Jan 10, 2015 at 3:21 AM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote: Yep it's awesome work funded by the DARPA memex project and our team. Cc'ing Andy Terrel for awareness thanks

Re: Nutch works on Hadoop 2.5.2 with Hbase 0.98.8

2015-01-05 Thread Mattmann, Chris A (3980)
Great work!

Re: question about robots.txt

2014-12-13 Thread Mattmann, Chris A (3980)
Hi Shane, They get it from the http.agent.* properties in your nutch-default.xml or your nutch-site.xml. You give your crawler an identifying name, description, URL, email and version. Cheers! Chris
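
As an illustration of the http.agent.* properties mentioned above, this Python sketch generates a configuration fragment in the Hadoop-style `<property>` layout that Nutch configuration files use. The helper name `agent_properties` is hypothetical, and the values passed in the usage note are placeholders, not recommendations.

```python
import xml.etree.ElementTree as ET


def agent_properties(name: str, description: str, url: str,
                     email: str, version: str) -> str:
    """Build a <configuration> fragment holding the http.agent.* properties."""
    root = ET.Element("configuration")
    values = {
        "http.agent.name": name,
        "http.agent.description": description,
        "http.agent.url": url,
        "http.agent.email": email,
        "http.agent.version": version,
    }
    for key, value in values.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = key
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `agent_properties("MyCrawler", "Research crawler", "http://example.org/bot", "bot@example.org", "1.0")` returns XML that could be merged by hand into nutch-site.xml. The agent name is what site owners match against in their robots.txt rules, so it should be stable and descriptive.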

Re: Nutch vs Lucidworks Fusion

2014-10-15 Thread Mattmann, Chris A (3980)
Thanks Andrzej. We have been doing some awesome stuff with Tika lately (OCR, GDAL and other things), and glad to hear you guys are integrating with that. If you guys have any good stuff (like NER, etc.), we would appreciate it being pushed upstream and collaborating on it. We are

Re: Nutch vs Lucidworks Fusion

2014-10-07 Thread Mattmann, Chris A (3980)
Thanks AB.

Re: Nutch vs Lucidworks Fusion

2014-10-05 Thread Mattmann, Chris A (3980)
Thanks for the info Grant. Hope to see more info about the crawler at some point and maybe even some day an ASF Fusion crawler (which you guys already contribute a ton to open source and maybe it will happen some day anyways). Lots of good stuff going on in Nutch, Tika, Solr, OODT, your guys

RE: Nutch not crawling deep enough into directory structure

2014-09-08 Thread Mattmann, Chris A (3980)
Hi Paul, Try expanding your last parameter (which is the # of crawling rounds). Also make sure to check these properties: <property> <name>db.ignore.internal.links</name> <value>false</value> <description>If true, when adding new links to a page, links from the same host are ignored. This is an
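
To check what value a property such as db.ignore.internal.links actually has in a configuration file, a small Python sketch like the one below can help. It parses the Hadoop-style `<property>` layout and is an illustration only: `read_property` is a hypothetical helper, and it does not reproduce how Nutch layers nutch-site.xml over nutch-default.xml.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_property(config_xml: str, name: str,
                  default: Optional[str] = None) -> Optional[str]:
    """Return the value of the named property, or `default` if it is absent."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else default
    return default


# Sample configuration text, written out here only for the example.
SAMPLE = """
<configuration>
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
</configuration>
"""
```

Here `read_property(SAMPLE, "db.ignore.internal.links")` yields the string "false", meaning links to the same host are kept, which is what a crawl that needs to descend into a site's directory structure requires.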

Re: Nutch FAQ

2014-09-01 Thread Mattmann, Chris A (3980)
Definitely needs updating. Some students with some documentation time would be a great help here. Cheers, Chris

Re: [ANNOUNCE] GSoC Create a Wicket-based Web Application for Nutch Project SUCCESSFUL

2014-09-01 Thread Mattmann, Chris A (3980)
yes please, list us as a friend of Wicket. Amazing work!!

Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Mattmann, Chris A (3980)
+1, great. I'd like to have a conversation about versioning. Since we're at 1.9, my suggestion would be to have the next in the trunk series (1.x) move to version 3.x post 1.9 for the release. Nutch2 remains Nutch and can be worked on there. That would give us a nice split in the diversionary

Re: [RELEASE] Apache Nutch 1.9

2014-08-20 Thread Mattmann, Chris A (3980)
Hear, hear, great job dudes

Re: Nutch @ApacheCon Europe 2014

2014-07-31 Thread Mattmann, Chris A (3980)
So awesome great to hear guys!

Re: [DISCUSS] [VOTE] Remove pom.xml from source

2014-07-15 Thread Mattmann, Chris A (3980)
nothing else and is fully regenerated from the template at every release. We can remove it. Thanks Julien On 15 July 2014 19:07, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: Hey Julien, Does the ant deploy generate a full POM though? I don't think it does I think it just

Re: Nutch survey

2014-05-22 Thread Mattmann, Chris A (3980)
Will do I will fill it out