Hey Markus,

On Nov 24, 2011, at 10:23 AM, Markus Jelsma wrote:

> Hi Chris
> 
> https://issues.apache.org/jira/browse/NUTCH-1087

Thanks for the pointer. I'll check it out.

> 
> Use the org.apache.nutch.net.URLFilterChecker to test.

Sweet, I didn't know about this tool. OK, I tried it out; check it out (note
that this includes my instrumentation, hence the printlns):

echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file" \
  | ./bin/nutch org.apache.nutch.net.URLFilterChecker \
    -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file

So, it looks like the URL didn't match the first 2 rules but matched the 3rd
one, and thus the filter actually includes it. Here are my 3 relevant rules,
in order:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

+^http://([a-z0-9]*\.)*vault.fbi.gov/

So, that makes perfect sense. RegexURLFilter appears to be working normally, so 
that's fine. 
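
For reference, here is a minimal Python sketch of the first-match-wins evaluation that the instrumented output above suggests. It is not Nutch's code: RULES and check are hypothetical names, and the sketch assumes that each pattern is searched anywhere in the URL, that the first matching rule decides the outcome, and that a URL matching no rule is dropped.

```python
import re

# The three rules quoted above, in file order: (sign, pattern).
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP"
          r"|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG"
          r"|bmp|BMP|js|JS)$"),
    ("+", r"^http://([a-z0-9]*\.)*vault.fbi.gov/"),
]


def check(url):
    """Return (accepted, trace); trace holds one match flag per rule tried."""
    trace = []
    for sign, pattern in RULES:
        matched = re.search(pattern, url) is not None
        trace.append(matched)
        if matched:
            # The first rule that matches decides: "+" accepts, "-" rejects.
            return sign == "+", trace
    # Assumption: a URL that matches no rule at all is rejected.
    return False, trace


url = "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file"
accepted, trace = check(url)
print(trace, accepted)  # [False, False, True] True
```

The trace lines up with the three "matched?" lines in the output: two misses on the "-" rules, then a hit on the "+" rule.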

So... what's the deal, then? ParserChecker works fine; it shows that an
outlink from this URL:

http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view

is in fact the at_download link:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker \
    http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view | grep "download"
  outlink: toUrl: http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file anchor: watergat1summary.pdf
[chipotle:local/nutch/framework] mattmann% 

RegexURLFilter takes either of those URLs and says it is fine:

[chipotle:local/nutch/framework] mattmann% ./bin/nutch org.apache.nutch.parse.ParserChecker \
    http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view \
  | grep download | awk '{print $3}' \
  | ./bin/nutch org.apache.nutch.net.URLFilterChecker \
    -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [false]
@#((#(#@ EVALUATING at_download LINK!: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file]: matched? [true]
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file
[chipotle:local/nutch/framework] mattmann% 
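
The awk '{print $3}' step in that pipeline picks the third whitespace-separated field of the outlink line, which is the target URL. A minimal Python sketch of the same extraction, assuming the line layout shown in the ParserChecker output above (outlink_url is a hypothetical helper name):

```python
def outlink_url(line):
    """Return the target URL from an 'outlink: toUrl: <url> anchor: ...' line, else None."""
    fields = line.split()  # whitespace splitting, like awk's default field splitting
    if len(fields) >= 3 and fields[0] == "outlink:" and fields[1] == "toUrl:":
        return fields[2]
    return None


line = ("  outlink: toUrl: "
        "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/at_download/file "
        "anchor: watergat1summary.pdf")
print(outlink_url(line))
```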

[chipotle:local/nutch/framework] mattmann% echo "http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view" \
  | ./bin/nutch org.apache.nutch.net.URLFilterChecker \
    -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
Checking URLFilter org.apache.nutch.urlfilter.regex.RegexURLFilter
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
URL: [http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view] doesn't have at_download in it!
+http://vault.fbi.gov/watergate/watergate-summary-part-01-of-02/view
[chipotle:local/nutch/framework] mattmann% 

Any idea why I wouldn't be getting the at_download URLs downloaded, then?
Here are http.content.limit and db.max.outlinks.per.page from my Nutch conf:

[chipotle:local/nutch/framework] mattmann% egrep -i -A1 "db\.max\.outlinks|http\.content\.limit" conf/nutch-default.xml
  confuse this setting with the http.content.limit setting.
  </description>
--
  <name>http.content.limit</name>
  <value>-1</value>
--
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
--
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
[chipotle:local/nutch/framework] mattmann% 
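
To spell out what the db.max.outlinks.per.page description above says, here is a minimal Python sketch. limit_outlinks is a hypothetical name, not Nutch's API, and keeping the first N outlinks (rather than some other subset) is an assumption; the description only says "at most".

```python
def limit_outlinks(outlinks, max_per_page):
    """Keep at most max_per_page outlinks when the limit is >= 0, else keep all."""
    if max_per_page >= 0:
        # Assumption: the first max_per_page outlinks are the ones kept.
        return list(outlinks[:max_per_page])
    return list(outlinks)


links = ["a", "b", "c"]
print(limit_outlinks(links, -1))  # ['a', 'b', 'c']
print(limit_outlinks(links, 2))   # ['a', 'b']
```

With the value of -1 shown in the conf, nothing would be cut off under this reading, so the at_download outlink should not be lost at this step.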


Cheers,
Chris

> 
>> Hey Markus,
>> 
>> On Nov 24, 2011, at 9:50 AM, Markus Jelsma wrote:
>>>> Hey Markus,
>>>> 
>>>> On Nov 24, 2011, at 9:16 AM, Markus Jelsma wrote:
>>>>> I think Marek is right, the crawl-filter _is_ used in the crawl
>>>>> command. I don't know what happens if it isn't there.
>>>> 
>>>> Interesting. Where is the crawl-urlfilter.txt? It's not in my built
>>>> conf directory in runtime/local/conf from 1.4?
>>> 
>>> It's gone! I checked and last saw it in 1.2. Strange.
>>> 
>>>>> Good reasons to get rid of the crawl command and stuff in 1.5 if you
>>>>> ask me.
>>>> 
>>>> I'd be in favor of replacing the current Crawl command with a simple
>>>> Java driver that just calls the underlying Inject, Generate, and Fetch
>>>> tools. Would that work?
>>> 
>>> There's an open issue to replace it with a basic crawl shell script. It's
>>> easier to understand and uses the same commands. Non-Java users should
>>> be able to deal with it better, and provide us with better problem
>>> descriptions.
>> 
>> +1, that would be cool indeed. Do you know what issue it is?
>> 
>> BTW, I'm currently instrumenting urlfilter-regex to see if I can figure out
>> if it's dropping the at_download URLs for whatever reason. Sigh.
>> 
>> Cheers,
>> Chris
>> 
>>>> Cheers,
>>>> Chris
>>>> 
>>>>> On Thursday 24 November 2011 16:59:42 Mattmann, Chris A (388J) wrote:
>>>>>> Hi Marek,
>>>>>> 
>>>>>> On Nov 24, 2011, at 7:46 AM, Marek Bachmann wrote:
>>>>>>> I think when you use the crawl command instead of the single
>>>>>>> commands, you have to specify the regEx rules in the
>>>>>>> crawl-urlfilter.txt file. But I don't know if it is still the case
>>>>>>> in 1.4
>>>>>>> 
>>>>>>> Could that be the problem?
>>>>>> 
>>>>>> Doesn't look like there's a crawl-urlfilter.txt in the conf dir.
>>>>>> Also it looks like urlfilter-regex is the one that's enabled by
>>>>>> default and shipped with the basic config.
>>>>>> 
>>>>>> Thanks for trying to help though. I'm going to figure this out! Or,
>>>>>> someone is going to probably tell me what I'm doing wrong.
>>>>>> We'll see what happens first :-)
>>>>>> 
>>>>>> Cheers,
>>>>>> Chris
>>>>>> 
>>>>>>> On 24.11.2011 16:20, Mattmann, Chris A (388J) wrote:
>>>>>>>> On Nov 24, 2011, at 3:21 AM, Julien Nioche wrote:
>>>>>>>>>> OK, nm. This *is* different behavior from 1.3 apparently, but I
>>>>>>>>>> figured out how to make it work in 1.4 (instead of editing the
>>>>>>>>>> global, top-level conf/nutch-default.xml,
>>>>>>>>>> I needed to edit runtime/local/conf/nutch-default.xml). Crawling
>>>>>>>>>> is forging ahead.
>>>>>>>>> 
>>>>>>>>> yep, I think this is documented on the Wiki. It is partially why
>>>>>>>>> I suggested that we deliver the content of runtime/local as our
>>>>>>>>> binary release for next time. Most people use Nutch in local
>>>>>>>>> mode so this would make their lives easier, as for the advanced
>>>>>>>>> users (read pseudo or real distributed) they need to recompile
>>>>>>>>> the job file anyway and I'd expect them to use the src release
>>>>>>>> 
>>>>>>>> +1, I'll be happy to edit build.xml and make that happen for 1.5.
>>>>>>>> 
>>>>>>>> In the meanwhile, time to figure out why I still can't get it to
>>>>>>>> crawl the PDFs... :(
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Chris
>>>>>>>> 
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Chris Mattmann, Ph.D.
>>>>>>>> Senior Computer Scientist
>>>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>>>> Office: 171-266B, Mailstop: 171-246
>>>>>>>> Email: [email protected]
>>>>>>>> WWW:   http://sunset.usc.edu/~mattmann/
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>> Adjunct Assistant Professor, Computer Science Department
>>>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 
>>>> 
>> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
