Re: UrlRegexFilter is getting destroyed for unrealistically long links

2018-03-12 Thread Edward Capriolo
Some regular expressions (those with backtracking) can be very expensive for long strings: https://regular-expressions.mobi/catastrophic.html?wlr=1 Maybe that is your issue. On Monday, March 12, 2018, Sebastian Nagel wrote: > Good catch. It should be renamed to be consistent with other propertie
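
A minimal sketch of the point above, in Python rather than Nutch's own Java: a nested quantifier is the classic backtracking-prone shape, and a length check in front of the regex keeps unrealistically long links away from the engine entirely. The names `MAX_URL_LENGTH` and `accepts` and the cap of 2048 are assumptions made for illustration; none of them come from Nutch or from UrlRegexFilter.

```python
import re

# Hypothetical cap, chosen for illustration only (not a Nutch setting).
MAX_URL_LENGTH = 2048

# Nested quantifiers such as (a+)+ are the classic backtracking-prone shape:
# on a long run of "a" followed by a non-matching character, the engine may
# try exponentially many ways to split the run before giving up.
BACKTRACKING_PRONE = re.compile(r"^(a+)+$")

# Matches the same strings with a single quantifier, avoiding that blow-up.
LINEAR_EQUIVALENT = re.compile(r"^a+$")


def accepts(url, pattern, max_length=MAX_URL_LENGTH):
    """Return True if url is within max_length and matches pattern."""
    if len(url) > max_length:
        # Reject before the regex engine ever sees the string.
        return False
    return pattern.search(url) is not None
```

With the guard in place, an oversized input is rejected by the length check alone, so even the backtracking-prone pattern is never evaluated against it. Whether rejecting or truncating is the right policy for long links depends on the crawl.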

Re: I'm just going to throw this out there...

2017-08-20 Thread Edward Capriolo
On Wednesday, August 16, 2017, Michael Chen < yiningchen2...@u.northwestern.edu> wrote: > Hi Ray, > > Haha the documentations :) Let's hope that it'll get better or we'll all > need super human problem solving abilities. But perhaps you're on a better > path by making a cookbook and contributing a

Re: Stuck at Step One

2017-07-21 Thread Edward Capriolo
I have run into this. The Nutch shell scripts do not return an error status, so you assume bin/crawl has done something when it has actually failed. Sometimes the best way is to determine if you need a plugin.xml file and what the content should be. Possibly put a blank xml file in its place and see if the er

Re: Google Summer of Code Weekly Reports.

2017-07-12 Thread Edward Capriolo
Nice job and very diligent. On Wed, Jul 12, 2017 at 11:38 AM, Omkar Reddy wrote: > Hello all, > > Please find my updated weekly reports here[0]. Please feel free to provide > any suggestions. > > Thanks, > Omkar. > > [0] > https://wiki.apache.org/nutch/GoogleSummerOfCode/GraphGeneratorTool/ > We

What up with 2.3.1 ?

2017-06-03 Thread Edward Capriolo
Hello, In the past I had an awesome experience with Nutch. About 8 years ago I ran a process where I checked out each process in our SVN repo and ran doxygen/javadoc on them. Then I unleashed Nutch on them and set up a searchable front end. I am doing a video course '10 Hadoop able problems' and I want

Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

2012-05-05 Thread Edward Capriolo
ne a new input format, how do I use it in a hive table definition? > For the SequenceFileInputFormat, the table definition would read as "...STORED AS SEQUENCEFILE". > With the new one, how do I specify it in the definition? "STORED AS 'com.xyz.abc.MyInputFormat'?

Re: Problem writing SerDe to read Nutch crawldb because Hive seems to ignore the Key and only reads the Value from SequenceFiles.

2012-05-05 Thread Edward Capriolo
This is one of the things about Hive: the key is not easily available. You are going to need an input format that creates a new value which contains the key and the value. Like this: -> new MyKeyValue< > On Sat, May 5, 2012 at 4:05 PM, Ali Safdar Kureishy wrote: > Hi, > > I have attached
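
The wrapping idea in the reply above can be sketched in plain Python. This is only an illustration of the data flow, not Hadoop or Hive code: a real solution would be a Java InputFormat/RecordReader. `MyKeyValue` is the name used in the message; the field names, the `wrap_records` helper, and the sample record are assumptions.

```python
from typing import Iterable, Iterator, NamedTuple, Optional, Tuple


class MyKeyValue(NamedTuple):
    """Combined value that carries the original key alongside the value."""
    key: str
    value: str


def wrap_records(
    records: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[Optional[str], MyKeyValue]]:
    """Re-emit each (key, value) pair with the key folded into the value.

    A consumer that ignores the key position still sees the original key,
    because it now travels inside the value.
    """
    for key, value in records:
        yield None, MyKeyValue(key, value)


# Hypothetical sample record: a URL key and a status value.
wrapped = list(wrap_records([("http://example.com/", "fetched")]))
```

After wrapping, a reader that only looks at the value position can still recover the URL from `wrapped[0][1].key`.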